Cloud Native Computing Foundation Bordeaux #1

⏱ Estimated reading time: 4 minutes

The kickoff
Prometheus at scale

The kickoff

By Étienne Coutaud, DevOps consultant @ OCTO Technology - Slides

The CNCF, short for Cloud Native Computing Foundation, is the place to go for people that are interested in cloud solutions, IT automation and more broadly the future of IT. As such, the main goals of the meetup are to share ideas and initiatives around the DevOps culture.

Cloud Native represents a set of tools that can be made use of to develop and deploy cloud applications. The CNCF aims to lead the development of these tools and their practices in the right direction. In other words, their primary objectives are:

To avoid vendor lock-in;
To scale infinitely;
To ease maintainability and agility;
To be resilient;
To optimize resource usage.

The CNCF is also a certification organism for Kubernetes. Whenever possible, always choose Kubernetes certified products. There are also certifications for Kubernetes developers and administrators.

If you are interested in CI/CD, the CNCF has a dashboard displaying the build state of the graduated projects and more.

Prometheus at scale

By Adrien Fillon, Site Reliability Engineer @ CDiscount

For any infrastructure, the concept of Observability can be observed in the sum of the concepts of Logging, Monitoring, Tracing and Visualization. Why, though? Because these underlying concepts can tell if an application runs, if it runs well, to make informed decisions, go ahead or return to a safe state. Of course, the idea is to always push back the limits. Doing so meets teams with philosophical challenges:

Having a strong test culture: tests are an important component of an application and must never be neglected;
Not fearing the root cause of the issues you might face;
Finding and accepting the limits of our tools;
Accepting that alerts are to be tuned and cannot be perfectly set the first time.

Note that logging is not exactly the same as monitoring. Logging means that you will receive points of data when there is activity. Monitoring generates metrics at your desired pace, whether there is zero or millions of users on your platform.

Prometheus has been born in 2012 at SoundCloud. There are two main paradigms when it comes to monitoring and metrics: Pull and Push. With Push, the monitoring tool receives metrics from external applications like with ELK or Graylog. With Pull, the tool retrieves metrics and Prometheus is one of those.

The solution is compatible with the main cloud vendors, integrates with a lot of technologies or can be integrated thanks to its support of the OpenMetrics format, supports auto updates and naming conventions and even has a query language to navigate through the metrics! Nevertheless, Prometheus is not highly available by default, does not scale well, has a high storage cost and makes it hard to manage users and access rights.

Prometheus is one of the graduated projects of the CNCF. This basically means that it is currently running at scale in major companies and is being maintained by a large community of enthusiasts. The limitations described above can be avoided by writing to a distant time-scale database, a feature provided by Prometheus. The solution chosen at CDiscount is Thanos, which promises high availability, metrics merging and block storage.

At CDiscount, the metrics collection system is designed like so: first, you have metrics targets where crawlers extract raw data, the crawlers are federated by the Thanos instances which store their data in block storages and display it to the end users.

Further down the path of monitoring and visualization you will find Vizceral, a tool made by Netflix that will display the network traffic in real time! Netflix uses Chaos engineering to ensure the resilience and reliability of their infrastructure, which makes up for good Vizceral visualizations.

Table of Contents

The kickoff

Prometheus at scale