If you want to learn a lot about the internals of Kubernetes, try kelseyhightower/kubernetes-the-hard-way when you have some time to spare and some leftover credit on your favorite cloud provider.
The OVH Managed Kubernetes platform was released in beta at the OVH Summit on October 18, 2018. Adoption has been fast: today the solution supports over 2000 active clusters, 10k pods on the administration clusters, 1500 CPUs and 22 terabytes of RAM! The offer is called managed because the Masters (the Control plane) and some Node components (the Compute) are handled by OVH.
In their architecture, the Masters contain etcd, the OpenStack interface for Persistent Volumes, health checks, a load balancer, a route controller, the scheduler and the API Server. This last, essential component centralizes the connections between the Control plane and the Compute. Each node contains a kubelet and a kube-proxy, both managed by OVH as well.
In all of this, the critical components are etcd, the API Server and kube-proxy. Since this project depends heavily on Kubernetes and has rewritten most of its core components, it is not recommended to try this at home.
The presented architecture actually contains itself! The Managed Kubernetes solution is a giant Kubernetes cluster whose nodes also run Kubernetes! Only the client-facing nodes and etcd are managed separately from the system. This allows OVH to deploy new clusters very quickly.
A side effect: internally, communication is handled by kube-proxy and routed through the OVH Load Balancer. This allows OVH to use its dedicated network to its full potential and to provide a solid, high-performance backbone for the Managed Kubernetes solution. Services are exposed through NodePort.
Admin nodes communicate with the API Server over TLS with SNI, meaning that the hostname appears in cleartext in the request. Client nodes communicate with the API Server through a TCP tunnel, using the kubelet and kube-proxy.
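To make the "hostname appears in cleartext" point concrete, here is a minimal sketch using Python's standard `ssl` module. It starts a TLS handshake in memory, without any network, and inspects the first bytes the client would send. The hostname `cluster-42.k8s.example` is purely hypothetical and does not come from the OVH platform.

```python
import ssl


def client_hello_bytes(hostname: str) -> bytes:
    """Return the first TLS flight (the ClientHello) a client would send for `hostname`."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    incoming = ssl.MemoryBIO()  # bytes received from the server (none in this sketch)
    outgoing = ssl.MemoryBIO()  # bytes the client wants to put on the wire
    tls = context.wrap_bio(incoming, outgoing, server_hostname=hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        # Expected: the ClientHello has been written and the client now waits for a reply.
        pass
    return outgoing.read()


# Hypothetical cluster hostname, used only for illustration.
hello = client_hello_bytes("cluster-42.k8s.example")
print(b"cluster-42.k8s.example" in hello)
```

Because the name travels unencrypted in the SNI extension, a front proxy can pick the right API Server for a connection without terminating TLS itself.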
etcd as a standalone Pod: Simple to deploy but impossible to scale, with non-persistent storage and a risk of losing quorum. This solution has been deemed impractical for the use case.
etcd as an Operator: Easy to allocate and to manage its lifecycle but still non-persistent storage.
StatefulSet / Persistent Volume: Crash-resilient but no lifecycle management, terrible performance linked to network calls and costly.
etcd on dedicated hardware: Multi-tenant, cost-effective, with dedicated resources. Clients are logically separated using tokens and namespaces. This solution has been chosen as it is both the most performant and the easiest to manage.
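The idea of separating clients with tokens and namespaces can be sketched with a toy in-memory store. This is not OVH's implementation and it does not use the etcd API: the class `TenantStore`, its methods and the key layout are assumptions made for illustration only.

```python
class TenantStore:
    """Toy multi-tenant key-value store: one shared dict, one key prefix per tenant."""

    def __init__(self):
        self._data = {}    # shared backing store, standing in for the etcd cluster
        self._tokens = {}  # token -> namespace of the tenant that owns it

    def register(self, token: str, namespace: str) -> None:
        self._tokens[token] = namespace

    def _prefix(self, token: str) -> str:
        if token not in self._tokens:
            raise PermissionError("unknown token")
        return f"/{self._tokens[token]}/"

    def put(self, token: str, key: str, value: str) -> None:
        # Every key is silently rewritten under the tenant's namespace.
        self._data[self._prefix(token) + key] = value

    def get(self, token: str, key: str):
        return self._data.get(self._prefix(token) + key)


store = TenantStore()
store.register("token-a", "cluster-a")
store.register("token-b", "cluster-b")
store.put("token-a", "registry/pods/web", "v1")
store.put("token-b", "registry/pods/web", "v2")
print(store.get("token-a", "registry/pods/web"))  # v1
print(store.get("token-b", "registry/pods/web"))  # v2
```

Both tenants write the same logical key, yet each only ever sees its own value, which is what lets many clusters share the same hardware.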
The Nodes need to interact using the Container Network Interface. Such an infrastructure is very sensitive to latency, and there are many implementations and many CVEs. OVH chose to rely on their vRack solution and route at Layer 2 without using CNI!
Unfortunately, Layer 2 routing implies that the heavy lifting must be handled by the infrastructure. Here, BREX (External Bridge) gives complete control over the network, which makes it a tempting solution. BREX requires ip_forward to be enabled. In addition, it rewrites the source MAC address during SYN but not during SYN/ACK, which causes a broadcast. This can be problematic when a Node calls the API Server: as the number of nodes increases, one broadcast can essentially become an internal DDoS.
One way to solve the broadcast issue for ARP requests is an ARP kernel module which intercepts connections and rewrites the MAC address exposed by the Pod to the one exposed by the Node.
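The rewrite itself is easy to picture on a raw ARP payload. The sketch below is a user-space illustration in Python, not the kernel module described in the talk: it only shows which field changes. The function name and the MAC and IP addresses are made up for the example; the 28-byte layout is the standard Ethernet/IPv4 ARP format.

```python
import struct

# Ethernet/IPv4 ARP payload: htype, ptype, hlen, plen, oper,
# sender MAC, sender IP, target MAC, target IP (28 bytes in total).
ARP_FORMAT = "!HHBBH6s4s6s4s"


def rewrite_sender_mac(arp_payload: bytes, node_mac: bytes) -> bytes:
    """Return a copy of the ARP payload whose sender MAC is replaced by node_mac."""
    if len(node_mac) != 6:
        raise ValueError("a MAC address is 6 bytes long")
    fields = list(struct.unpack(ARP_FORMAT, arp_payload))
    fields[5] = node_mac  # index 5 is the sender hardware address
    return struct.pack(ARP_FORMAT, *fields)


# Illustrative addresses only.
pod_mac = bytes.fromhex("0242ac110002")
node_mac = bytes.fromhex("fa163e000001")
request = struct.pack(
    ARP_FORMAT, 1, 0x0800, 6, 4, 1,          # Ethernet, IPv4, ARP request
    pod_mac, bytes([10, 2, 0, 5]),           # sender: the Pod
    b"\x00" * 6, bytes([10, 2, 0, 1]),       # target: MAC still unknown
)
rewritten = rewrite_sender_mac(request, node_mac)
print(rewritten[8:14].hex())  # fa163e000001
```

Everything except the sender hardware address is left untouched, so the rest of the network only ever learns the Node's MAC.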
Kube-proxy is the component that handles communication with the Internet. It works in three modes: userspace, iptables and IPVS.
IPVS has been chosen for its reload performance and its bandwidth limit. iptables would handle 5k services, while IPVS is able to scale to 25k before the loss of performance and bandwidth becomes noticeable.
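One commonly cited reason for this gap, not detailed in the talk, is that iptables rules are evaluated sequentially while IPVS looks services up in a hash table. The sketch below only models that difference with a list and a dict; the function names, IP addresses and backend names are invented for the example.

```python
def iptables_style_lookup(rules, service_ip):
    """Walk the rules in order, as a sequential rule chain does.

    Returns the matching backend and the number of rules that were checked.
    """
    for checked, (ip, backend) in enumerate(rules, start=1):
        if ip == service_ip:
            return backend, checked
    return None, len(rules)


def ipvs_style_lookup(table, service_ip):
    """Single hash-table lookup, independent of the number of services."""
    return table.get(service_ip)


# 5000 made-up services, the scale quoted above for iptables.
rules = [(f"10.3.{i // 256}.{i % 256}", f"backend-{i}") for i in range(5000)]
table = dict(rules)

last_ip = rules[-1][0]
print(iptables_style_lookup(rules, last_ip))  # ('backend-4999', 5000)
print(ipvs_style_lookup(table, last_ip))      # backend-4999
```

In the worst case the sequential lookup inspects every rule, so its cost grows with the number of services, whereas the hashed lookup stays roughly constant.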
Recommended read: OVH Blog
Traefik could be classified as a web proxy, like NGINX or HAProxy, but that would not suffice to describe its true purpose.
Since the 90s, engineers have trended towards decoupling their applications, moving from monoliths to service-oriented architecture to micro-services. Today, we write applications with the most appropriate technologies and can update only part of a system.
The micro-service revolution caused an explosion of the tools that orchestrate them and, as such, of the configuration that comes with them. One question always remained unanswered, though: how to locate a particular service in a micro-service architecture? It is from this question that Traefik was born.
Traefik listens to orchestrators and reloads routes dynamically. The project has exceeded 600M downloads, 350 contributors and 21K stars on GitHub, so it definitely responds to the needs of many users. It is written in Go to follow the path of the tools of its ecosystem, and because Go is low-level enough, OSS-friendly and can bundle dependencies into builds that target many architectures. In addition, what if all that configuration was not really needed?
There are a small number of key concepts to understand to get started with Traefik:
Traefik will use your Entrypoints to consult your Frontends in order to call the most appropriate Backend.
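The Entrypoint, Frontend and Backend chain, together with dynamic reloading, can be sketched as a small routing table. This is a simplified model and not Traefik's code: the `Router` class, its methods and the sample hosts and URLs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    # entrypoint name -> list of (host to match, backend name)
    frontends: dict = field(default_factory=dict)
    # backend name -> list of server URLs
    backends: dict = field(default_factory=dict)

    def reload(self, frontends: dict, backends: dict) -> None:
        """Swap in a new routing table, e.g. after the orchestrator reports a change."""
        self.frontends, self.backends = frontends, backends

    def route(self, entrypoint: str, host: str):
        """Return the servers of the first frontend on this entrypoint matching the host."""
        for rule_host, backend in self.frontends.get(entrypoint, []):
            if rule_host == host:
                return self.backends[backend]
        return None


router = Router()
print(router.route("https", "shop.example.com"))  # None: nothing deployed yet

# A new service appears in the orchestrator, so the table is rebuilt and swapped in.
router.reload(
    frontends={"https": [("shop.example.com", "shop")]},
    backends={"shop": ["http://10.0.0.7:8080", "http://10.0.0.8:8080"]},
)
print(router.route("https", "shop.example.com"))
```

The point of the design is that `reload` is driven by events from the orchestrator, so nobody edits a configuration file when a service is added or removed.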
In essence, Traefik wants your code to feel simple: it handles heavy lifting such as SSL certificates and reduces the configuration to what is necessary for your application. Routing is done using labels, and the solution supports both WebSockets and HTTP/2.
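Label-based routing boils down to reading a few key/value pairs attached to a container and turning them into a route. The sketch below assumes the Traefik 1.x Docker label names (`traefik.frontend.rule`, `traefik.backend`, `traefik.port`) and the `Matcher:value` rule syntax; check them against the documentation of the version you run. The function `frontend_from_labels` and the default port are hypothetical.

```python
def frontend_from_labels(service_name: str, labels: dict):
    """Build a route description from container labels, or None if no rule is set."""
    rule = labels.get("traefik.frontend.rule")
    if rule is None:
        return None
    # A rule such as "Host:app.example.com" is split into its matcher and its value.
    matcher, _, value = rule.partition(":")
    return {
        "backend": labels.get("traefik.backend", service_name),
        "matcher": matcher,
        "value": value,
        "port": int(labels.get("traefik.port", "80")),  # assumed default for this sketch
    }


labels = {
    "traefik.frontend.rule": "Host:app.example.com",
    "traefik.port": "8080",
}
print(frontend_from_labels("app", labels))
```

The application only declares what it needs next to its own deployment, and the proxy derives the routing table from that.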
Traefik can serve as an Ingress controller for Kubernetes, which is probably its main use today. It handles routing and tracing, and connects natively with Prometheus.
TraefikEE makes it possible to install Traefik in both the Data plane and the Control plane, each clustered. To keep it secure, scalable and highly available, the Control plane listens to the Providers and alerts the Data plane whenever a change has been detected, without adding much complexity. Another main difference with the Standard edition is that TraefikEE retains state, which allows for more complex usage such as rate limiting.
The Enterprise Edition is in early access as of the time of writing.
Traefik 2.0 will soon be generally available. This architecture refresh will push back the limits imposed by the current way of doing things and add some features along the way:
The Middleware feature is currently a challenge: the Go package system does not make it easy to plug in additional components, so Middlewares are included in the codebase for now.
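For readers unfamiliar with the term, a middleware is a small function wrapped around a request handler, and several of them can be chained. The sketch below illustrates the general pattern in Python; it is not Traefik's Go implementation, and the names `strip_prefix`, `add_header` and `chain` are made up for the example.

```python
def strip_prefix(prefix):
    """Middleware that removes a path prefix before the request reaches the handler."""
    def middleware(handler):
        def wrapped(request):
            path = request["path"]
            if path.startswith(prefix):
                request = {**request, "path": path[len(prefix):] or "/"}
            return handler(request)
        return wrapped
    return middleware


def add_header(name, value):
    """Middleware that adds a header to the response produced by the handler."""
    def middleware(handler):
        def wrapped(request):
            response = handler(request)
            response["headers"][name] = value
            return response
        return wrapped
    return middleware


def chain(handler, *middlewares):
    """Wrap the handler so that the first middleware listed runs first."""
    for middleware in reversed(middlewares):
        handler = middleware(handler)
    return handler


def backend(request):
    # Stand-in for the real service: echoes the path it received.
    return {"status": 200, "headers": {}, "body": request["path"]}


app = chain(backend, strip_prefix("/api"), add_header("X-Proxy", "sketch"))
print(app({"path": "/api/users"}))
```

In this sketch the chain is assembled at runtime from plain functions, which is precisely the kind of pluggability that is harder to offer when every component has to be compiled into the binary.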
Later, other features might get released, such as UDP, YAML syntax, a new WebUI, NewMetrics, TLS stores, Canary builds, more load balancers, etc. More info here. The release is planned in a few months.
If you would like to work at Containous, run the following docker image: