Fluentd logs memory overflow

My box is Ubuntu 18.04 with microk8s.
I enabled the fluentd addon. It works correctly, but the log memory buffer grows indefinitely and the machine eventually dies from lack of memory. How can I make fluentd rotate its logs in memory, or otherwise remediate this issue?
Thanks for your advice.

I am not sure it is fluentd consuming the memory. It could very well be Elasticsearch, which comes with the fluentd addon.
Elasticsearch does use a lot of memory, though. I'm curious to know how much memory your system has.
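Elasticsearch's resident size is mostly its JVM heap. If it does turn out to be the culprit, one common mitigation is to cap the heap with the standard ES_JAVA_OPTS environment variable on the Elasticsearch container. A sketch of the relevant container-spec fragment; the 512m value is only illustrative, size it to your workload:

```yaml
# Fragment of the Elasticsearch container spec: cap the JVM heap.
# -Xms and -Xmx should be set to the same value.
env:
- name: ES_JAVA_OPTS
  value: "-Xms512m -Xmx512m"
```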

The machine is an Azure VM with 8 cores and 32 GB of RAM. I was also coming to the conclusion that it's an Elasticsearch issue. fluentd collects all kube-system logs and also some application logs. The consumption / leakage is approximately 100 MiB per hour. Since 50 pods are running (low workload, however), the cluster dies in a few days. I read several mailing lists on this topic, but found no actual clue about how to fix the issue. Any idea is welcome.

BTW, Elasticsearch consumes 1500 MiB just after startup.

This is strange. Both pods have memory resource limits defined.

If a pod uses more than that, ideally it will be OOMKilled.
Running top may show some sign. Or enabling metrics-server may also help track down which pod is misbehaving.
We’ve run the same fluentd image for several years and it never brought down a server.

Thanks, I looked at the config and yes, it looks good!
But I see that RBAC is invoked in this config. RBAC is not enabled on the cluster because it is my playground. Maybe I have to enable it before enabling the fluentd addon? Could that be the root cause of the issue?
Additionally, I added the two lines to the kubelet file to rotate Docker log files.
Otherwise, I use microk8s right out of the box.
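In case it helps others reading this: when the container runtime is Docker, log rotation can also be configured in /etc/docker/daemon.json instead of kubelet flags. A sketch, with illustrative values:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```

Docker has to be restarted for this to take effect, and it only applies to containers created afterwards.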

I installed the metrics server and, looking at the config, I discovered this error. It looks like this problem is linked to RBAC: https://github.com/ubuntu/microk8s/issues/729

Enabling RBAC will probably resolve that error you are seeing in the dashboard.
When you say the machine fails, do you mean it becomes unresponsive? What does the top command show as the process using the most memory?

I have not recorded the top output. The machine becomes extremely slow to respond; for instance, moving the cursor in the console takes 20 or 30 seconds, the CPU is at 400%, and total free RAM is a few MiB.
I have posted on GitHub as you've seen. I redid an install on a VM, but with insufficient memory :frowning: So I am doing a new one with 6 GiB on my local workstation. I'll keep you informed about whether I can reproduce the problem.
Best regards

It comes from the chatter of system logs, which fills the Elasticsearch index at high speed. I can't manage to stop this noise source.
It fills 1.5 GiB per day with no activity.
Here is Elasticsearch's report on yesterday's index:
yellow open logstash-2020.08.25 1aIWCccbTmSK-t2vqbUwXA 5 1 244222 0 1.5gb 1.5gb

The repeated log line is: ERROR: logging before flag.Parse: E0826 09:12:22.951717 1 reflector.go:205] k8s.io/autoscaler/addon-resizer/nanny/kubernetes_client.go:107: Failed to list *v1.Node: v1.NodeList: Items: v1.Node: v1.Node: ObjectMeta: v1.ObjectMeta: readObjectFieldAsBytes: expect : after object field, parsing 1550 …:{},"k:{"… at {"kind":"NodeList","apiVersion":"v1","metadata":{"selfLink":"/api/v1/nodes","resourceVersion":"22659126"},"items":[{"metadata":{"name":"copservices","selfLink":"/api/v1/nod

If somebody knows how to stop this noisy log, I would be grateful.
Best regards

I couldn't determine where this is coming from. It possibly comes from the Kubernetes control plane components, like the apiserver, kubelet, controller manager, etc.

If it is from the control plane you can opt to exclude these logs from being fed to elasticsearch.

In the kube-system namespace you will find the fluentd config. The ConfigMap name is something like fluentd-es-config-v0.2.0. Do a kubectl edit of that ConfigMap.

You will find the section system.input.conf:. From there you can delete the sections for kubelet, etcd, kube-proxy, …
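For example, the journald sources in system.input.conf look roughly like this in the upstream fluentd-elasticsearch addon (the exact @id, paths, and tags may differ in your version — treat this as a sketch); deleting such a block stops that component's logs from being shipped:

```
<source>
  @id journald-kubelet
  @type systemd
  matches [{ "_SYSTEMD_UNIT": "kubelet.service" }]
  <storage>
    @type local
    persistent true
    path /var/log/journald-kubelet.pos
  </storage>
  read_from_head true
  tag kubelet
</source>
```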

After saving it, you will have to bounce the fluentd DaemonSet. It should reduce your elastic index sizes.

Just a reminder that doing this will force you to check the system journal for logs related to the control plane components.

Thank you @balchua1, I'll look at the config and let you know. I saw on some mailing lists that our cluster is not the only one suffering from this issue. I'll keep you informed.
Best regards

I'm trying to get around this problem by using Elasticsearch's built-in lifecycle management tool with automatic snapshots and log rotation, following the docs.

I'm currently stuck appending to the elasticsearch.yml config file on startup with something like

         - /bin/bash
         - -c
         - |
           echo 'path.repo: ["/var/backups"]' >> /usr/share/elasticsearch/config/elasticsearch.yml

Has anyone else gone this route?

Seems like snapshot lifecycle management is not available in the open-source version, so it was a dead end.

Hello, yes, I think it is not in the OSS version. I intend to create a Kubernetes CronJob that makes the deletion call against the ES REST API. It should be possible with a very short script.
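A minimal sketch of such a CronJob, assuming the in-cluster Elasticsearch service is named elasticsearch-logging on port 9200 and a 7-day retention; the schedule, image, and date syntax are assumptions to adapt:

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: es-index-cleanup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"   # daily at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: cleanup
            image: curlimages/curl:7.72.0
            command:
            - /bin/sh
            - -c
            # Delete the logstash-YYYY.MM.DD index from 7 days ago.
            # GNU date syntax; adjust if the image ships busybox date.
            - curl -s -X DELETE "http://elasticsearch-logging:9200/logstash-$(date -d '7 days ago' +%Y.%m.%d)"
```

Deleting an index via DELETE /indexname is a plain Elasticsearch REST call, so no extra tooling is needed beyond curl.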