Lectra is a company where 200+ developers are split across 22 teams, including one for CI and one for DevOps. In 2015, it started its transition to cloud and SaaS, which implies a progressive move to High Availability by shutting down incompatible infrastructure and migrating to microservices.
One idea came up right away: use Kubernetes, mostly on the Azure cloud. Azure allowed them to quickly deploy their infrastructure with Kubespray or Kubeadm, which are roughly similar. Today, their infrastructure is composed of seven clusters: five hosted in the Azure cloud and two on premises. Together they contain 50 nodes with 500 deployments and 1500 stateless pods. In two years, no major incident has held back production, although some smaller ones were noticed.
etcd is used by Kubernetes to store the cluster state. AKS, the managed Kubernetes solution from Azure, uses an obsolete version of etcd (obsolete might be a strong word since the technology is very recent), which saturated the disks at Lectra’s scale. Solution: manually migrate to a more recent version of etcd.
As a reminder, in Kubernetes, the smallest unit is the Pod, which is managed by a ReplicaSet, which is itself managed by a Deployment. Kubernetes was not placing pods optimally: some applications ended up on the same node simply because they were not demanding and that node had the most resources to spare at that moment. Solution: starting from Kubernetes 1.4, you can use pod affinity and anti-affinity to specify how you prefer your pods to be distributed.
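As an illustration of the anti-affinity idea, here is a minimal sketch that builds a Deployment manifest whose replicas prefer to be scheduled on different nodes. The application name, image and helper function are hypothetical; the manifest is emitted as JSON, which `kubectl apply -f` accepts just like YAML.

```python
import json


def deployment_with_anti_affinity(name: str, image: str, replicas: int) -> dict:
    """Build a Deployment whose pods prefer not to share a node."""
    labels = {"app": name}
    # "preferred..." is a soft rule: the scheduler tries to honour it but
    # still places the pod if no other node is available.
    anti_affinity = {
        "preferredDuringSchedulingIgnoredDuringExecution": [
            {
                "weight": 100,
                "podAffinityTerm": {
                    "labelSelector": {"matchLabels": labels},
                    # One "topology domain" per node hostname.
                    "topologyKey": "kubernetes.io/hostname",
                },
            }
        ]
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "affinity": {"podAntiAffinity": anti_affinity},
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


# Hypothetical application name and image, for illustration only.
manifest = deployment_with_anti_affinity("billing", "example.org/billing:1.0", 3)
print(json.dumps(manifest, indent=2))
```

A hard rule (`requiredDuringSchedulingIgnoredDuringExecution`) also exists, but it can leave pods unschedulable when there are fewer nodes than replicas, which is why the sketch uses the soft variant.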
You might be tempted to allocate all your infrastructure in the cheapest region or the one you are doing the most business with. While this might look like a sensible decision, it may leave you unprotected against datacenter outages and other large-scale issues. Serving your services from multiple Availability Zones is neither easy nor cheap, though: updates, maintenance and migrations get a lot more complex when distance comes into play.
When you need to plan a disruption, for a migration for instance, it is advisable to set a Pod Disruption Budget to control how many pods of your application may be unavailable at once during the transition.
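To make the Pod Disruption Budget more concrete, here is a minimal sketch that builds such a manifest and shows the arithmetic behind it. The application name and helper functions are hypothetical, and `policy/v1` is the current API version (older clusters used `policy/v1beta1`).

```python
import json


def pod_disruption_budget(app: str, min_available: int) -> dict:
    """Build a PodDisruptionBudget keeping at least `min_available` pods up."""
    return {
        "apiVersion": "policy/v1",  # policy/v1beta1 on older clusters
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{app}-pdb"},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app}},
        },
    }


def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Number of pods that may be evicted voluntarily right now."""
    return max(0, healthy_pods - min_available)


# Hypothetical application: 3 healthy replicas, at least 2 must stay up.
print(json.dumps(pod_disruption_budget("billing", 2), indent=2))
print(allowed_disruptions(healthy_pods=3, min_available=2))
```

With this budget, draining a node evicts the application's pods one at a time; an eviction that would drop the application below two healthy pods is refused until a replacement pod is ready.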
Deploying applications by hand can become tedious, especially when your JSON or YAML configurations get bigger. Helm and Tiller can help with that by creating, deploying, updating and rolling back configuration templates. Still, keep in mind that this solution is not highly available and that storing the history of all deployments is very expensive. To reduce the impact of both issues at once, you could run one Helm-Tiller pair per namespace and limit its deployment history. This solution is not perfect, though, as it requires more resources.
When your Kubernetes cluster is shared across many teams, it becomes hard to track the resource needs of every application. This situation can lead to the cluster resources getting hogged by an inefficient application. To avoid this, developers can set Requests and Limits to indicate the minimum amount of CPU and memory an application requires and the maximum it should be allowed to reach. What if those Requests and Limits are not present? First, it is recommended to set Resource Quotas to share resources evenly between namespaces. On top of that, a cluster administrator can set default limits for the applications running inside a namespace. This solution implies that each team manages its own namespace.
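The three mechanisms mentioned above can be sketched as follows: per-container Requests and Limits, a LimitRange providing namespace defaults, and a ResourceQuota capping the namespace as a whole. All names and quantities are hypothetical values chosen for illustration.

```python
import json


def container_resources(cpu_request: str, mem_request: str,
                        cpu_limit: str, mem_limit: str) -> dict:
    """The `resources` section of a container spec."""
    return {
        "requests": {"cpu": cpu_request, "memory": mem_request},
        "limits": {"cpu": cpu_limit, "memory": mem_limit},
    }


def default_limit_range(namespace: str) -> dict:
    """Defaults applied to containers that declare no Requests or Limits."""
    return {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "default-limits", "namespace": namespace},
        "spec": {
            "limits": [
                {
                    "type": "Container",
                    "defaultRequest": {"cpu": "100m", "memory": "128Mi"},
                    "default": {"cpu": "500m", "memory": "512Mi"},
                }
            ]
        },
    }


def resource_quota(namespace: str) -> dict:
    """Upper bound on what all pods of one namespace may claim together."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "team-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": "4",
                "requests.memory": "8Gi",
                "limits.cpu": "8",
                "limits.memory": "16Gi",
            }
        },
    }


# Hypothetical namespace owned by one team.
for manifest in (default_limit_range("team-a"), resource_quota("team-a")):
    print(json.dumps(manifest, indent=2))
print(json.dumps(container_resources("250m", "256Mi", "1", "1Gi"), indent=2))
```

In a LimitRange, `defaultRequest` fills in missing Requests and `default` fills in missing Limits, so a team that forgets to declare them still gets bounded containers.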
When you want to expose an application to the real world, you can create Services or rely on Ingresses. The latter is preferable as it allows more complete routing. NGINX is the de facto standard Ingress controller for Kubernetes, which is curious: it is not a CNCF-backed project, it does not support hot reload, and it destroys active connections and sockets whenever it needs to reload its vhost configuration. Kubernetes needs to reload its network configuration pretty often, especially on a cluster serving a large number of applications. Unfortunately, there is no perfect solution here; you can limit reloads, enable dynamic configuration at your own risk, run one Ingress controller per namespace, or switch the Ingress controller to Envoy or Traefik.
Always expose health check routes in your application. Better: standardize the routes that you use for these endpoints, like “/ping” for the readiness probe and “/health” for the liveness probe. Also, a health check should not report the health of dependent services; if one of them fails, your own health check fails too, which causes a domino effect in the best case and a crash and restart loop in the worst case. The worst case happens when two applications A and B check each other: one of them fails for some reason, causing the other to misbehave and fail, and so on.
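As a minimal sketch of this convention, the following uses only the Python standard library to answer on “/ping” and “/health” without calling any dependent service. The port, response bodies and probe periods are assumptions made for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def handle(path: str) -> Tuple[int, str]:
    """Answer probes from local state only; never call dependent services."""
    if path == "/ping":    # readiness: can this instance accept traffic?
        return 200, "pong"
    if path == "/health":  # liveness: is this process still working?
        return 200, "ok"
    return 404, "not found"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = handle(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Blocking helper: serve the probe routes until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()


# Matching probe section of a container spec (hypothetical periods).
PROBES = {
    "readinessProbe": {"httpGet": {"path": "/ping", "port": 8080}, "periodSeconds": 5},
    "livenessProbe": {"httpGet": {"path": "/health", "port": 8080}, "periodSeconds": 10},
}
```

Calling `serve()` starts the server. Because `handle` never contacts a database or another application, an outage in a dependency cannot make Kubernetes restart this pod.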
After years of preparation and tests, the first Space Station was about to be launched into space. The first component to be launched only contained supplies for the next components and inhabited capsules. However, the mission came very close to failure. The cause: the loss of the anti-radiation shield. Knowing that the resources and tools contained in the launched Space Station component were very precious and that it would be insanely expensive to restart from scratch, engineers chose to devise and enact a plan B as soon as possible.
The plan B was to orient the system in such a way that it would create a replacement shield. There was no way to check if the tools were still in a functioning state, though, so this was a bet. Fortunately, this plan B was a success and the contents of the system were still usable.
What can we learn from this story? First, that the scientists were reusing a system that had worked before but did not question what would happen if any of its components were defective. Never assume that everything will always be fine. Reintroduce doubt in your design process and embrace questions like “will that process hold up?”.
In other words: when an incident happens, start recovering as soon as possible while collecting data on the incident. Identify the root cause, issue a fix, validate it and deploy it. This requires a paradigm shift: accept that Production might be very different from what you think and be ready not to be ready.
To move forward in a system, you need to be in control of its risks. Without that, major incidents can become way more spectacular.
In December 2018, the network carrier O2 lost its whole network because of a certificate issue in a component of third-party equipment supplied by Ericsson! Resilience must become the top priority of a system, not the velocity of change. The reliability of a system must be measured in real time over the lifespan of the system. That is the purpose of Chaos Engineering.
The whole point of Chaos Engineering is not to break everything but rather to degrade systems progressively by eliminating infrastructure components, introducing latency, expiring certificates, and checking third parties, dependencies, fallbacks, configuration changes, etc. In practice, Chaos Engineers uncover and explore potential Points of Failure with a scientific, engineering-driven approach.
Hypothesis -> Exploration -> Observation -> Analysis -> Conclusion -> Learning -> (back to Hypothesis)
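One turn of this cycle can be sketched in plain Python against a toy system. Everything here is hypothetical (the `ReplicaPool` class, the fault and the steady-state check); the only point is the order of the steps: verify the steady state, inject a fault, observe again, roll back, then draw a conclusion.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Observation:
    hypothesis: str
    steady_before: bool
    steady_after: bool

    @property
    def hypothesis_holds(self) -> bool:
        return self.steady_before and self.steady_after


def run_experiment(hypothesis: str,
                   steady_state: Callable[[], bool],
                   inject_fault: Callable[[], None],
                   rollback: Callable[[], None]) -> Observation:
    """Check the steady state, degrade the system, then check it again."""
    before = steady_state()
    after = False
    if before:  # never inject faults into a system that is already unhealthy
        inject_fault()
        try:
            after = steady_state()
        finally:
            rollback()  # always restore the system, even if the check raises
    return Observation(hypothesis, before, after)


class ReplicaPool:
    """Toy stand-in for a replicated service."""

    def __init__(self, replicas: int) -> None:
        self.replicas = replicas
        self.healthy = replicas

    def kill_one(self) -> None:
        self.healthy = max(0, self.healthy - 1)

    def restore(self) -> None:
        self.healthy = self.replicas

    def serves_traffic(self) -> bool:
        return self.healthy >= 1


pool = ReplicaPool(replicas=2)
result = run_experiment("The service survives the loss of one replica",
                        pool.serves_traffic, pool.kill_one, pool.restore)
print(result.hypothesis, "->", result.hypothesis_holds)
```

Running the same experiment against `ReplicaPool(replicas=1)` refutes the hypothesis, which is exactly the kind of learning that feeds the next iteration.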
This approach is particularly well suited since new outages can be treated as unknowns caused by technical debt and dark debt. Dark debt is far more insidious than technical debt: it stems from the collective complexity of an ecosystem rather than from faulty logic or programming style.
When a hypothesis uncovers a regression, the hypothesis can be converted into a validation rule. Nevertheless, do not get it wrong: the final objective is not to eliminate outages and get as close as possible to 100% availability. Chaos Engineering eliminates the most obvious causes of outages. To reach such an objective, relevant business or technical metrics and indicators are a must to measure the impact of each hypothesis, as long as they represent a real-world value for end users or technicians.
Chaos Toolkit makes your hypotheses formal and allows you to test them against your systems thanks to the plethora of drivers it supports, while keeping tabs on logs and metrics. ChaosPlatform builds on Chaos Toolkit by enabling collaboration and increasing visibility.
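To give an idea of what a formal hypothesis looks like, here is a sketch of a Chaos Toolkit experiment generated from Python. The structure (steady-state hypothesis, method, rollbacks) follows the Chaos Toolkit experiment format as I understand it, so check the field names against its documentation; the URL, pod name and namespace are hypothetical.

```python
import json

experiment = {
    "version": "1.0.0",
    "title": "The service stays reachable when one pod is deleted",
    "description": "Delete a single pod and verify the health route still answers.",
    "steady-state-hypothesis": {
        "title": "The health route answers with HTTP 200",
        "probes": [
            {
                "type": "probe",
                "name": "health-route-responds",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://billing.example.org/health",  # hypothetical URL
                    "timeout": 3,
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "delete-one-pod",
            "provider": {
                "type": "process",
                "path": "kubectl",
                # hypothetical pod and namespace
                "arguments": "delete pod billing-0 --namespace billing",
            },
            "pauses": {"after": 10},  # seconds to wait before re-checking
        }
    ],
    "rollbacks": [],
}


def write_experiment(path: str) -> None:
    """Write the experiment to a JSON file that the `chaos` CLI can read."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(experiment, handle, indent=2)


print(json.dumps(experiment, indent=2))
```

After `write_experiment("experiment.json")`, the file would be executed with `chaos run experiment.json`: the steady-state hypothesis is checked, then the method is applied, then the hypothesis is checked again.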
Chaos Engineering is a very recent practice. At Netflix, the infrastructure is stable to the point that it does not slow down business and development!