Relationship status: It’s complicated — Getting Docker into Production

Last September we wrote about the start of our relationship with Docker. Like most relationships at the start, it was all kissing, hand-holding, trips to the movies and about 3.9 selfies posted to Instagram per second just to let everyone know how in love we were.

Like most relationships, we got to know each other a little better and the prolific displays of our affection started to wane. We’re still “going steady”, but it’s complicated. With the benefit of experience, we’re now one of those couples that look down on young love in an “if only you knew” kind of way.

Getting Started

Like most things in software engineering, Docker began with setting up the dev environment. By the time we started to explore using Docker we already had several Python services running in production, plus our main product.

We made the decision to start slowly, moving services over to Docker one at a time so we didn’t cause too much upheaval for the rest of the team. Before long we had everything running in dev without any issues. This replaced a really horrible Vagrant + SaltStack setup that would eventually work, provided you ran it enough times. This was a major improvement: we had something that worked the first time, worked everywhere and was much easier to use. +1 for immutable infrastructure.
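To give a flavour of what this looks like (a minimal sketch; the service names and images here are made up, not our actual stack), a dev environment like this can be described in a single `docker-compose.yml` and brought up with one command:

```yaml
# docker-compose.yml — illustrative only; names and images are hypothetical
version: "2"
services:
  web:
    build: .               # build the app image from the local Dockerfile
    ports:
      - "8000:8000"        # expose the app on localhost:8000
    depends_on:
      - redis              # start redis before the app
  redis:
    image: redis:3.2       # pinned image, same everywhere it runs
```

A single `docker-compose up` then replaces what was previously several rounds of Vagrant provisioning.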

Launching the first service 🚀

Following our ethos of “measure twice, cut once”, we picked a fairly low-impact service as the first thing to migrate over to Docker in production. We mentioned in our last post on Docker that one of the things that had prompted our decision to jump on the proverbial bandwagon was the release of Docker Swarm.

We did not have a good time with Swarm at all. We hit significant stability issues from the start: the overlay network would suddenly stop working, services would lose the ability to talk to each other, and the whole cluster would eventually end up out of sync. Its ability to recover from these issues was also non-existent. This was our first bust-up! We cried, we ate chocolate and we hugged our mums. Eventually we decided to make up and give things another try.

It wouldn’t be fair to say that it was entirely Docker Swarm’s fault; we were almost certainly doing something wrong somewhere. The main issue was that Docker Swarm was so new that the usual ways of solving obscure problems like ours offered little help. We agreed that we needed something a bit more mature.

I need a man, not a boy…

We quickly drew up a short list of options that offered a more mature way of running and managing Docker. We felt that using a managed service that would do all the complicated stuff for us was a better idea. We wanted to focus on the things Docker would enable us to do rather than managing Docker itself.

Two clear options presented themselves: AWS ECS (Elastic Container Service) and Google Container Engine (GKE). ECS is essentially a proprietary Amazon orchestration tool on top of EC2, while GKE offers hosted Kubernetes. Both services were very competitive in terms of features: HA (high availability), auto-scaling, self-healing and easy deployments are included with both. As we were already running our infrastructure on AWS, using ECS seemed like a no-brainer.

Amazon ECS 💪

Amazon suggests using Elastic Load Balancers for your cluster’s service discovery. ECS makes use of the ELB’s health check to fail over services when new versions are deployed. At the time, AWS still didn’t allow you to route requests based on hostname with ELBs, which meant we would need an ELB per service, and they are not cheap. To keep costs down, we decided to look at other ways to manage service discovery across the cluster. This decision wasn’t easy to make given our desire to have a managed service deal with all the complex stuff for us.

After a bit of searching, we settled on using Nginx + Consul + Consul Template to manage service discovery across the cluster. Consul runs as a service within the cluster and can be queried about the state of the other services. As services are added or updated, Consul keeps track of those changes. Consul Template is responsible for re-writing the configs that point to services as Consul updates; in our case, these were the upstream proxies in Nginx.
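To sketch the idea (the service name here is hypothetical, not one of ours), a Consul Template file for an Nginx upstream block might look like this:

```nginx
# upstreams.ctmpl — rendered by Consul Template; "reports" is a made-up service
upstream reports {
  {{range service "reports"}}server {{.Address}}:{{.Port}};
  {{end}}
}
```

Consul Template watches Consul, re-renders the file whenever the set of healthy instances changes, and can run a command afterwards, e.g.:

```shell
consul-template \
  -template "upstreams.ctmpl:/etc/nginx/conf.d/upstreams.conf:nginx -s reload"
```

so Nginx always proxies to the tasks Consul currently knows about.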

We had to play around with various strategies to manage failover to new versions of services with Consul. Simply updating the service every time we deployed would mean that the ELBs might still route to old tasks that were being drained as new ones were launching. After plenty of trial and error, we essentially ended up with a “roll your own” version of what the ELBs were offering. Failover was not the only thing we spent plenty of time working on. If you take anything away from this post, make it this: memory allocation is everything on Amazon ECS.
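For context, memory is set per container in the ECS task definition. A sketch (the names and numbers here are illustrative, not our production settings):

```json
{
  "family": "reports",
  "containerDefinitions": [
    {
      "name": "reports",
      "image": "example/reports:latest",
      "cpu": 128,
      "memoryReservation": 256,
      "memory": 512
    }
  ]
}
```

`memory` is a hard limit: if the container exceeds it, it gets killed. `memoryReservation` is the soft amount the scheduler uses when deciding whether a task fits on an instance. Get these wrong in either direction and you end up with OOM-killed tasks or instances that the scheduler considers full long before their RAM actually is.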

Just as our deployments felt solid and the stability of the cluster had settled, Amazon announced support for hostname-based routing with ELBs, as well as the ability to route based on specific paths. We’re stoked that Amazon has finally shipped this much-needed feature (even if it is four months later than we would have liked). So what does this mean? We can now have a single ELB that is capable of routing to all of our services. We think this is going to give us more stability and also allow us to just let Amazon do the hard work.
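As an illustration of what that routing looks like (the hostname and ARNs here are placeholders, not real resources), a host-based rule can be attached to a single load balancer listener via the AWS CLI:

```shell
# Route requests for reports.example.com to the "reports" target group.
# Listener and target group ARNs are placeholders.
aws elbv2 create-rule \
  --listener-arn arn:aws:elasticloadbalancing:...:listener/... \
  --priority 10 \
  --conditions Field=host-header,Values=reports.example.com \
  --actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:...:targetgroup/...
```

One rule per service on one load balancer, instead of one load balancer per service.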

Using ECS over our existing infrastructure has also resulted in a 50% cost saving. We previously had instances for each of our environments; Docker containers have allowed us to run various configurations of the services on the same host. Happy engineers + happy wallets = a ton of awesome!

In conclusion

We’re still in love with Docker ❤️. We’ve learnt a huge amount transitioning our services to run inside containers. The time saved launching new services into production, compared with what we had with Salt, is immeasurable. Docker has enabled everyone in our team to work more closely with the infrastructure, and getting new team members set up is super quick and extremely reliable.

We’ve got an exciting list of improvements to the way we deploy planned that are sure to make us even happier with our choice to use Docker.


Mike Waites