At BBC News, we've had a lot of issues with using CI environments for building and testing. Recently, we've invested a lot of time in solving them. We've created a new Jenkins setup to simplify our environment and to build & test our software inside containers, even if ultimately we're not deploying in a container. This workflow has saved us potentially days of lost developer time.
I’m Simon Thulbourn, a senior software engineer at …
… BBC News. I’ve been there for the past 3 years, working on tooling and developer-facing services to simplify processes and encourage automation.
A little bit of background into the BBC.
The BBC is a global media corporation that focuses on being creative in all aspects of its business. We’re probably mostly known for…
things like Top Gear (The BBC’s biggest export) or Doctor Who
This logo is from the Tom Baker period; he was known for his fantastic scarves.
- But we’re also known as a global News organisation… A pretty big one too…
- We’re pretty big for News in the UK, in the top 3. In the top 10 for the world. We publish more than 80,000 pieces of content per day in English.
The BBC has a World Service remit, which means BBC News is served in 28 different languages, each with its own editorial teams. If you recalculate the stats with those included, that puts us higher in the world rankings for News.
It’s worth noting that for some of these services, we’re the only unbiased source of News.
- The BBC has 500+ developers split across about 10 major products and a centralised operations department to handle everything for us on our primary stack.
- Up until early last year, our primary stack was based around PHP for the frontend and JVM-based things for the data layer. This is what we call the “Forge” platform.
The Forge comprises around 150 live servers, with two thirds dedicated to PHP. All servers are in mirrored datacentres in London.
We run all services on shared infrastructure, so a PHP package would be installed on all 100 PHP servers.
Our newer stack is based around using the leading cloud provider…
We're moving to the cloud in order to give developers the freedom to control their own architecture. We recognise that not all applications should be the same, so moving to the cloud is the best way to empower developers. In exchange, we take on much of the operational responsibility ourselves.
I’m obviously here to talk about CI environments. Not just to show you nice folks some pictures.
- At the BBC there are at least 10 different shared CI environments that are centrally managed by the Platforms Team.
- Running a mix of Bamboo, Hudson (from 2009) and Jenkins.
- It’s worth noting that all Hudson servers were split, so people had to build their applications separately for the INT and TEST runtime environments.
- We don’t tend to get a say in how they’re administered. No control over packages and configuration.
Between all environments, there were 26,817 jobs/projects.
Contention was an issue: we struggled for build time, testing time, and getting our packages installed in the INT environment.
BBC News isn't like most development teams in the BBC. We don't tend to take no as an answer for a lot of things. We have a way of colouring outside the lines and making a pretty picture that's adopted by the rest of the BBC (eventually, sometimes).
To get around our closed platform, I developed a sideloading technique that was essentially:
1. Take a blob of source code
2. Compile it
3. Tar it up
4. Download and extract it on CI inside a job
5. Hope it worked

If it didn't, start at 1 and tweak.
This could take a few days to create a working "package". But this was the best technique we had for a while. And shamefully, it was copied by other teams: once we had a working "package" of something, other departments would start using it. People were actually using our bad ideas.
Starting again with a CI just for News.
Our new CI is a single master node
- The master has 3 tasks: backup, restore and send jobs to slaves
with the potential to have a lot of slave nodes.
The slaves talk to the master using the Swarm plugin, since it allows them to auto-register with the master.
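A slave bootstrap with the Swarm client looks roughly like this — a sketch only; the master URL, credentials, and labels below are placeholders, not our actual configuration:

```shell
# Hypothetical slave bootstrap: the Swarm client connects out to the
# master and registers itself, so the master needs no per-slave config.
java -jar swarm-client.jar \
  -master https://jenkins.example.internal \
  -username ci-slave \
  -password "$SWARM_PASSWORD" \
  -executors 4 \
  -labels "docker centos6"
```

Because the slave initiates the connection, bringing up a new node is just a matter of booting a VM that runs this on startup.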
We utilise a private registry storing images on S3
Users are free to use our registry for other things unrelated to CI
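A registry backed by S3 is mostly a matter of configuration. This is a sketch in the style of the docker-registry config file, assuming the S3 storage driver; the bucket name and region are placeholders:

```yaml
# Hypothetical docker-registry config fragment: store image layers in S3
# rather than on the registry host's disk.
prod:
  storage: s3
  s3_region: eu-west-1
  s3_bucket: my-ci-registry-bucket
  storage_path: /registry
  s3_access_key: _env:AWS_ACCESS_KEY_ID
  s3_secret_key: _env:AWS_SECRET_ACCESS_KEY
```

With layers in S3, the registry host itself is disposable, which fits the rest of our throwaway infrastructure.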
Every VM in our architecture has only the packages it requires. Nothing more. Everything is installed on top of a base CentOS 6.5 install.
I decided very early on I didn't want to maintain CI forever. I’m not a sysadmin. I'm a developer. I wanted the easiest thing possible, so Docker was installed.
Docker gives us a lot of flexibility. You probably know most of the benefits already, because you’re sitting here listening to me.
We get rid of our very lengthy, time-consuming sideloading process. You could argue that we'd get that just by moving to our own infrastructure that we control, but Docker gives us an extra step: we don’t have to maintain languages at all. We push that load to Docker through the official language stacks (who in turn push it to the owners of that technology).
By pushing the language stacks off to Docker, BBC News has saved days, which will turn into weeks and months and then years, because as long as we keep with them, we don’t have to maintain anything about our environments. Our Jenkins jobs can just depend on these.
Although they’re not called language stacks anymore — they’re just official images, since they offer more than languages: entire parts of infrastructure like nginx, redis and so on.
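For example, a job's image can build directly on an official image, so the language runtime is maintained upstream rather than by us. A minimal sketch — the image tag and commands are illustrative, not our real job definition:

```dockerfile
# Illustrative: base a test image on an official language image,
# so we never maintain the language runtime ourselves.
FROM node:0.10

# Add only the application under test.
COPY . /app
WORKDIR /app

# Install dependencies and run the test suite as the container command.
RUN npm install
CMD ["npm", "test"]
```

When the official image is updated, rebuilding picks up the new runtime with no work on our side.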
Sometimes, the boring use cases are the best. Our use of Docker just works; we're using Docker for exactly what it was designed for: to keep code running in isolation so it doesn't affect anything else.
It's not a flawless system. But it's much better than what we had.
We wanted to auto-scale our slaves based on usage. The natural thing to do was scaling based on the average CPU usage of the cluster, but that has issues. With auto scaling groups in AWS, you can’t effectively specify which instance you want to terminate. You can choose either the instance nearest to the next billing hour or the oldest. Neither protects us from trashing an instance that’s running something.
Scaling up was simple, a host would come up and accept jobs straight away.
Scaling down became a sore point for us. We ran the risk of killing a container that was active, and a bug in Jenkins meant that the job would also lose its workspace in this race condition.
So at the moment, we mostly just add instances when all of the capacity is used up.
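The termination-policy options we had to choose from look roughly like this (an illustrative AWS CLI call; the group name is a placeholder). The point is that neither built-in policy knows whether a slave is mid-build:

```shell
# Illustrative: the two relevant built-in ASG termination policies are
# "OldestInstance" and "ClosestToNextInstanceHour"; neither checks
# whether a Jenkins slave is currently running a job.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name ci-slaves \
  --termination-policies "ClosestToNextInstanceHour"
```

Safe scale-down would need the scaler to mark a slave offline in Jenkins, wait for its jobs to drain, and only then terminate that specific instance.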
Build-time secrets in a Dockerfile don't exist at the moment. Which for BBC News is a shame, as all of our yum repos require a client-side certificate. So we have a somewhat awkward problem, which we automated with a Makefile.
Our Makefile performs a docker run against a build shell script, then commits and tags the resulting container.
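A minimal sketch of such a Makefile — image names, the build script, and certificate paths are all hypothetical. The trick is that the certificate is mounted at run time, so it never ends up in an image layer, and the finished container is committed as the build image:

```makefile
# Hypothetical Makefile: run the build with secrets volume-mounted,
# then commit the stopped container as a tagged image.
IMAGE := registry.example.internal/news/build-image
TAG   := latest

build:
	docker run --name build-tmp \
	  -v $(PWD)/build.sh:/build.sh \
	  -v $(PWD)/certs/client.pem:/etc/pki/client.pem \
	  base-image /build.sh
	docker commit build-tmp $(IMAGE):$(TAG)
	docker rm build-tmp

.PHONY: build
```

Since volume contents aren't captured by `docker commit`, the client certificate is available to yum during the build but absent from the resulting image.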
So this is where we are now.
From where we started, BBC News has come a long way, our new CI is a huge improvement.
Using Docker to lower the system-administration burden.
Using Docker to significantly lower the barrier to entry.
We have an open CI environment, that’s reproducible.
Being able to debug failed jobs locally, since we can use the same images.
Scalable infrastructure.
Other teams within the BBC have seen the way we’re using CI for BBC News and want to copy it; our CI system will be replicated by other product teams.
We think the key to a great enterprise CI isn’t to have one big CI, but lots of small ones where it makes sense.
GoCD was recently released as open source. It has a number of advantages over Jenkins for us, and we’re currently in the planning stage of a migration to it.
Despite GoCD’s complexity and the effort of adopting it, we feel it’s worth it.
We find ourselves wanting proper build pipelines instead of simple jobs. GoCD will give us the expressiveness we need to build and test our applications properly.
I look forward to using Docker clusters. I think they will offer people a way to run small-to-medium clusters efficiently. I have plans to replace our dedicated Jenkins slaves with Docker hosts, perhaps even sharing the hosts with other applications.