⏱ Estimated reading time: 9 minutes
Table of Contents
- CNCF time
- Traefik: A scalable and HA edge router
- CNCF Bordeaux is now an official CNCF group 🎉;
- Kubernetes 1.14 is now generally available. Among the features: support for Windows nodes, updates in Kubectl and addition of Persistent Local Volumes. Go read the announcement here;
- The French translation initiative for the Kubernetes documentation is still ongoing. If you are interested, visit the Kubernetes doc or fork the website repository!
Discovery of the session
If you want to learn a lot about the internals of Kubernetes, try kelseyhightower/kubernetes-the-hard-way when you have some time to spare and some leftover credit on your favorite cloud provider.
The OVH Managed Kubernetes platform has been released in beta during the OVH Summit, the 18th of October of 2018. The response has been fast; today the solution supports over 2000 active clusters, 10k pods on the administration clusters, 1500 CPUs and 22 Terabytes of RAM! The offer is managed because the Masters, a.k.a. Control plane, and some Node components, a.k.a. Compute, are taken in charge by OVH.
In their architecture, the Masters contain etcd, the OpenStack interface for Persistent Volumes, health checks, a load balancer, a route controller, the scheduler and API Server. This essential last component centralizes the connections between the Control plane and the Compute. Each node contains a kubelet and a kube-proxy, both managed by OVH as well.
In all of this, the critical components are etcd, API Server and kube-proxy. This project being very dependent of Kubernetes and having rewritten most of its core components, it is not recommended to try this at home.
Kubernetes in Kubernetes?
The presented architecture actually contains itself! The Managed Kubernetes solution is a giant Kubernetes cluster which nodes are also Kubernetes! Only the client-facing nodes and etcd are managed separately from the system. This allows OVH to very quickly deploy new clusters.
Side-effect: internally, the communication is handled by kube-proxy and routed through the OVH Load Balancer. This allows OVH to use their dedicated network to their full potential and to provide a solid and performant backbone for the Managed Kubernetes solution. The interface to access the services is NodePort.
Admin nodes can communicate with API Server with TLS + SNI, meaning that the hostname appears in clear in the request. Client nodes can communicate through a TCP tunnel with API Server, using the kubelet and kube-proxy.
Experience feedbacks: etcd
- etcd as a standalone Pod
Simple to deploy but impossible to scale, non-persistent storage and risk of loss of quorum. The solution as been deemed impractical for the use case.
- etcd as an Operator
Easy to allocate and to manage its lifecycle but still non-persistent storage.
- StatefulSet / Persistent Volume
Crash-resilient but no lifecycle management, terrible performance linked to network calls and costly.
- etcd on dedicated hardware
Used in multi-tenant, cost-effective and dedicated resources. Logically separates clients using tokens and namespaces. This solution has been chosen as it is both the most performant and the easiest to manage.
Th Nodes need to interact using the Container Network Interface. Such an infrastructure is very sensitive to latency and there are many implementations and many CVEs. OVH made the choice to rely on their vRack solution and route at Level 2 without using CNI!
Unfortunately, Level 2 routing implies that the heavy lifting must be handled by the infrastructure. Here, BREX (External Bridge) gives complete control over the network, which makes it a tempting solution. BREX requires ip_forward to be enabled. In addition, it rewrites the origin MAC address during SYN but not during SYN/ACK, which causes a broadcast. This can be problematic when a Node calls the API Server: when the number of nodes increases, one broadcast can essentially become an internal DDoS.
A solution that solves the broadcast issue for ARP requests resides in an ARP Kernel module which intrecepts connections and rewrites the MAC address exposed by the Pod to the one exposed by the Node.
Kube-proxy is the component that handles communication with the Internet. It works in three modes:
- Userspace: Now deprecated, acted as an actual proxy which forwarded everything;
- IPTables: Loads IPTables rules in a blocking way. Very fast for small clusters and full of features but not very scalable;
- IPVS: Hashes Destination ports and IPs, which eventually reduces routing to locating the right hash. Handles connection loss and load balancing but not much more.
For the reload performance and the bandwidth limit, IPVS has been chosen. IPTables would handle 5k services while IPVS is able to scale to 25k before performance and bandwidth loss becomes noticeable.
Questions & Answers session
Recommended read: OVH Blog
- OVH maintains a Staging and a Production cluster and runs the CNCF CI on them;
- OVH uses the OpenStack API for their cloud provider services with no reimplementation. That means that migrations are painless for OVH users;
- etcd is not replicated in multiple Availability Zones but is nonetheless replicated and backupped with the commits. Multi-AZ etcd would introduce latency issues and reduce the performance.
Traefik: A scalable and HA edge router
Traefik could be classified as a web proxy, like NginX or HAProxy but that would not suffice to describe its true purpose.
Since the 90s, engineers trended towards decoupling their applications, moving from monoliths to service oriented architecture to micro-services. Today, we write application with the most appropriate technologies and can update only part of a system.
The micro-service revolution caused an explosion of the tools that orchestrate them and as such of the configuration that comes with them. A question always remained unaswered, though: how to locate a particular service in a micro-service architecture? Thet is from this idea that Traefik was born.
Traefik will listen orchestrators and reload routes dynamically. The project exceeded 600M downloads, 350 contributors and 21K stars on GitHub. It defenitely responds to the needs of many users. It has been written in Go to follow the path of the tools of its ecosystem and because Go is low-level enough, OSS-friendly and can include dependencies in its builds that are compatible for many architectures. In addition, what if all that configuration was not really needed?
There are a small number of key concepts to understand to get started with Traefik:
- Providers: they deliver the configuration and routes. Usually, Docker and Kubernetes are considered Traefik providers;
- Entrypoints: Actual entry points for your system;
- Backends: sets of servers that eventually handle requests;
- Frontends: Rules that match and route each request;
Traefik will use your Entrypoints to consult your Frontends in order to call the most appropriate Backend.
In essence, Traefik wants your code to feel simple: it handles heavy lifting such as SSL certificates and reduces the configuration to what is necessary for your application. Routing is done using labels and the solutions supports both Websockets and HTTP/2.
Traefik can serve as an Ingress controller for Kubernetes and it is probably its main use today. It will handle routing, tracing and connects navitely with Prometheus.
Gives the possibility to install Traefik in the Data and in the Control plane, both clustered. For it to be secure, scalable and highly available, the Control plane listens the Providers and alert the Data place whenever some change has been detected, without increasing complexity much. Another one of the main differences with the Standard edition is that TraefikEE retains the state and allows for more complex usage such as rate limiting.
The Enterprise Edition is in early access as of the time of writing.
Traefik 2.0 will soon be generally available. This architecture refresh will push back the limits imposed by the current way of doing things and add some features along the way:
- Documentation overhaul;
- Clarification of the concepts, with only three remaining: Entrypoints, Routers and Services;
- A more expressive routing syntax;
- Middlewares and chains of middlewares: authentication, retries, cache etc.;
- Custom resource definition. This is not standard but will offer a more future-proof solution;
- TCP-level routing with SNI support, passthrough and port to HTTP forwarding;
The Middleware feature currently represents a challenge to solve as the Go package system does not simply allow to plug additional components so the Middlewares are currently included in the codebase.
Later, other features might get released such as UDP, YAML syntax, a new WebUI, NewMetrics, TLS stores, Canary buiilds, more load balancers, etc. More info here. Release planned in a few months.
If you would like to work at Containous, run the following docker image: