Uber's M3 vs Thanos

prometheus

#1

Hello Everyone,

We currently use Prometheus to monitor Kubernetes components and the services running on them. Because we run one Prometheus per Kubernetes cluster (we have around 30 K8s clusters), we are running into issues, mostly around global search, high availability of monitoring data, and data retention.

I was exploring Thanos and M3.

Thoughts from the community?


#2

Hi,

I would say it depends a bit on your requirements. If you are looking for a general-purpose time-series database, M3 looks like a great choice. It is also a distributed database, though, which comes with a maintenance cost. That said, it has only recently been open sourced, so really no one but the people at Uber can comment on long-term usage.

In terms of performance, Prometheus's tsdb and M3's embedded database have been benchmarked against each other, and their performance seems to be close to identical (unsurprising, as they are based on the same paper). They deviate a bit, but that is because they make different trade-offs: for example, Uber's tsdb always offloads time-series ID tracking to the application, and it has different atomicity semantics. [0]

What I appreciate about Thanos is that its setup and maintenance cost is incredibly low. You start by just adding the sidecar to your existing Prometheus server, which essentially acts as a backup mechanism to an S3-like object storage. Then you deploy the storage gateway and a querier, and you suddenly have all the features you asked for without a distributed storage system. Queriers can be deployed hierarchically, so you can also have one querier per cluster and one global querier that connects to all of the per-cluster queriers. That said, Thanos does somewhat require direct network access between its components, which may be an additional setup burden for multi-cluster setups.
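
To make the hierarchical querier idea a bit more concrete, here is a toy Python model of the fan-out. It is only a sketch and not Thanos code: the `Leaf` and `Querier` classes, the `labels` helper, and the sample series are all made up for illustration. The one thing borrowed from Thanos is the idea of deduplicating HA pairs by dropping a replica label (the real querier has a `--query.replica-label` flag for this); the merge logic here is a simplified assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple

# A series is identified by its label set; the value is a single sample.
Labels = FrozenSet[Tuple[str, str]]


def labels(**kv: str) -> Labels:
    """Hypothetical helper: build a hashable label set."""
    return frozenset(kv.items())


@dataclass
class Leaf:
    """Stands in for a sidecar or store gateway: returns its own series."""
    series: Dict[Labels, float]

    def query(self) -> Dict[Labels, float]:
        return dict(self.series)


@dataclass
class Querier:
    """Fans a query out to its children and merges the results.

    Children can be leaves or other queriers, which gives the hierarchy.
    """
    children: List[object] = field(default_factory=list)
    replica_label: str = "replica"  # assumed label name for HA replicas

    def query(self) -> Dict[Labels, float]:
        merged: Dict[Labels, float] = {}
        for child in self.children:
            for series_labels, value in child.query().items():
                # Drop the replica label so HA duplicates collapse into one series.
                key = frozenset(
                    (k, v) for k, v in series_labels if k != self.replica_label
                )
                merged.setdefault(key, value)  # first replica wins (simplification)
        return merged


# Two clusters; cluster "a" runs an HA pair of Prometheus servers.
cluster_a = Querier([
    Leaf({labels(job="api", cluster="a", replica="0"): 1.0}),
    Leaf({labels(job="api", cluster="a", replica="1"): 1.0}),
])
cluster_b = Querier([Leaf({labels(job="api", cluster="b", replica="0"): 2.0})])

# The global querier only talks to the per-cluster queriers.
global_querier = Querier([cluster_a, cluster_b])
result = global_querier.query()
```

In this model the global querier ends up with one series per cluster, even though cluster "a" is scraped twice. That is roughly the global-view plus HA behaviour described above, minus everything that makes the real thing hard (time ranges, gRPC, object storage).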

My personal preference would be to use Thanos, as it is operationally much simpler, but I also know the Prometheus and Thanos codebases quite well, so I might be biased :slight_smile:.

[0] https://github.com/prometheus/tsdb/pull/445


#3

Thanks Brancz.

We ended up using a central Cortex with Prometheus remote write configs, as our requirement is to build monitoring as a service internally.
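
For anyone curious what the per-cluster side of this can look like, below is a rough Python sketch that generates a Prometheus config fragment per cluster. `remote_write` and `external_labels` are real Prometheus config keys, but everything else is an assumption for illustration: the cluster names, the `remote_write_config` helper, and the push URL (the host is a placeholder, and the push path depends on your Cortex version and setup). The fragment is emitted as JSON, which YAML parsers generally accept, so it should be loadable as config, but treat that as something to verify.

```python
import json


def remote_write_config(cluster: str, push_url: str) -> dict:
    """Hypothetical helper: config fragment for one cluster's Prometheus."""
    return {
        # The external label lets the central store tell clusters apart.
        "global": {"external_labels": {"cluster": cluster}},
        # Every sample scraped by this Prometheus is also sent to push_url.
        "remote_write": [{"url": push_url}],
    }


# Placeholder endpoint; the real host and path are deployment-specific.
PUSH_URL = "https://cortex.example.internal/api/v1/push"

# Around 30 clusters, as in the original question; the names are invented.
clusters = [f"cluster-{i:02d}" for i in range(1, 31)]
configs = {name: remote_write_config(name, PUSH_URL) for name in clusters}

print(json.dumps(configs["cluster-01"], indent=2))
```

The important part is the distinct `cluster` external label per Prometheus, since without it series from different clusters would be indistinguishable once they land in the central store.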