Taming the Hydra (part 1)
How to take care of a large Jenkins installation and still keep your sanity
Sometimes maintaining our Jenkins infrastructure reminds me of fighting the fabled Hydra. Every time you slash off a problem in one of the jobs, ten more jobs have sprouted in the meantime, each of which will pose problems of its own in the future.
By the end of 2014, our hydra had grown almost 3000 heads. Something had to be done.
Where do all these Jenkins jobs come from?
How do people usually create a Jenkins job? Well, they use a technique that has been frowned upon in software development for decades:
- find an existing job that does roughly what you want
- copy it
- adjust the copy till it more closely does what you need
- forget about it
This, of course, is called Copy-Paste Programming, and that, as we all know, leads to unmaintainable source code, and likewise to an unmaintainable mess of Jenkins jobs.
Also, for us in the Engineering Support team, it led to a serious series of déjà vus. Didn’t I fix this problem in that job last week? Oh no, it was this job, which looks almost the same as that job but uses a different Node.js version, for no reason whatsoever, except that it’s wrong. And this job is embarrassingly slow, because it doesn’t run Maven in parallel, although it could. And yet another one doesn’t contain the fix in the Groovy post-build step that we made two months ago…
The Hydra grows more heads
And then something else happened. The Product Development team was fighting their own beast, commonly known as The Monolith:
The Monolith proved to be unmaintainable as well. Change one tiny thing over here, and a whole wall collapses over there. Build up the wall again, and a whole building collapses (an entirely different building, not the one you have been working on).
The solution, of course, is to break everything up into smaller pieces. These pieces are as independent of each other as possible, expose only their APIs, can be implemented in any language, and are much easier to understand and maintain, since each solves one specific problem instead of trying to ensure world peace.
Enter Micro Services
But Micro Services pose their own challenges. Developers have to juggle a lot more Git repositories. The Site Operations team has to provide the infrastructure to deploy and monitor all these small services. And we in the Engineering Support team wanted an easy way to maintain all the new jobs that would be needed.
Hercules’ solution fails us
The Jenkins Script Console told us that, even without any of these new jobs, we already had quite a beast to fight.
Jenkins.instance.items.size()
Result: 2936
That was at the end of 2014. We had several discussions back then, but one thing was clear from the start: Hercules’ solution wouldn’t work for us. We can’t kill the hydra. A lot of these jobs are actually needed, and with the advent of Micro Services, there will only be more of them.
So instead of killing, we have to tame the beast. Orchestrate the heads, so to speak.
Seeing that a lot of these heads, erm, jobs, are quite similar to each other, we envisioned a mechanism that would allow us to:
- administer the similar parts of the jobs in one central place (making them easy for us to maintain), and
- keep the job-specific configuration parts separate, ideally stored in the Git repository itself, alongside the source code (where they would be easy for developers to maintain).
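To make the second point a bit more concrete, a per-repository job configuration could look something like the following sketch. The file name, keys, and values here are all hypothetical illustrations, not an actual format we used:

```yaml
# Hypothetical job configuration, stored in the repository next to the source code.
# The central templating mechanism would read this; all keys are illustrative.
jobType: maven-service          # which shared job template this repository uses
jdk: "8"
mavenGoals: "clean verify"
mavenParallelThreads: 4         # run Maven in parallel where the build allows it
notify:
  - team-backend@example.com
```

The idea is that everything not listed here comes from the central template, so a fix made once in the template reaches every job of that type.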
Of course, not all of the jobs are similar to each other. In reality, there are several very different job types, each forming a cluster of similar jobs. Some of the clusters are big, some only contain a single job (one-of-a-kind jobs). We were most interested in the big clusters. If all the jobs were one-of-a-kind jobs, none of the following would have made any sense.
So we were looking for some kind of flexible templating mechanism. Jenkins doesn’t provide one out of the box, but there are some plugins to be found; Jenkins also allows Groovy system scripting and provides a Remote Access API.
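As an illustration of how the Remote Access API can help with an analysis like the one above, here is a hedged Python sketch: it fetches all job names via Jenkins’ JSON API and groups them into clusters by a naming prefix. The base URL and the `<type>-<project>` naming convention are assumptions made for this example, not a description of our actual setup:

```python
import json
from collections import Counter
from urllib.request import urlopen

def fetch_job_names(base_url):
    """Fetch all job names via the Jenkins Remote Access API (JSON flavour).

    The tree parameter limits the response to the job names only.
    """
    with urlopen(base_url + "/api/json?tree=jobs[name]") as response:
        data = json.load(response)
    return [job["name"] for job in data["jobs"]]

def cluster_sizes(job_names):
    """Group job names into clusters by the text before the first '-'.

    Assumes a hypothetical '<type>-<project>' naming convention.
    """
    return Counter(name.split("-", 1)[0] for name in job_names)

# Example with made-up job names instead of a live Jenkins instance:
names = ["build-foo", "build-bar", "deploy-foo", "deploy-bar", "one-of-a-kind"]
print(cluster_sizes(names).most_common())
# → [('build', 2), ('deploy', 2), ('one', 1)]
```

A quick count like this is how you would find the big clusters worth templating, while the singleton clusters point at the one-of-a-kind jobs best left alone.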
Tune in next week for a look at the alternatives, the failures, and the (quite cool, if I may say so) solution that we eventually came up with and have been using successfully for quite a while now (hosting 1228 jobs, which mostly take care of themselves)…