Aqua Blog

When Security Scans Break the Brain: Solving Trivy’s etcd Exhaustion Problem

This post comes from Pushkar Joglekar, Principal Security Engineer at Broadcom, where he focuses on VMware Kubernetes distributions. Pushkar is a Kubernetes security maintainer and co-author of the security chapters in Nigel Poulton’s “The Kubernetes Book.” He writes from experience securing infrastructure at scale.

It starts with a failing deployment. You attempt a quick fix—maybe a kubectl scale or a simple label update—but instead of success, you’re met with a cryptic, terrifying error: etcdserver: mvcc: database space exceeded.

Suddenly, your Kubernetes cluster is a ghost ship. You’ve fallen into a “Read-Only” state where the API server rejects every single write operation. You can’t scale, you can’t deploy, and crucially, you can’t even run kubectl delete pod to free up resources, because the deletion itself requires a write to the database. Your cluster’s “brain,” etcd, is paralyzed.

As an architect, here is the hard reality: at scale, improperly configured vulnerability reporting can create significant control plane pressure. Trivy and the Trivy Operator are gold-standard tools for visibility, but proper configuration is crucial for continued infrastructure availability.

1. Your Security Reports are “Too Much of a Good Thing”

The Trivy Operator follows the Kubernetes Operator Model, translating vulnerabilities into VulnerabilityReport custom resources (instances of a CustomResourceDefinition, or CRD). These reports live directly in etcd. While having this data available via kubectl is great for visibility, it is a primary driver of storage bloat.

Users often seek “full visibility” by enabling optional fields like “Links,” “CVSS,” and “Description.” For a container image with hundreds of vulnerabilities, these extra fields can push a single report past the roughly 1.5 MiB per-object size limit (etcd’s default maximum request size), at which point the write is rejected. This isn’t a limit you can “brute force” with better hardware; it’s a request-size constraint designed to protect the control plane.
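One rough way to spot reports that are approaching the 1.5 MiB mark is to measure their serialized size. The sketch below is illustrative only: report_size_check is a hypothetical helper, and the JSON printed by kubectl is an approximation of what is actually stored.

```shell
# Hypothetical helper: reads a serialized report on stdin and compares its
# size in bytes with 1.5 MiB (1572864 bytes).
LIMIT_BYTES=1572864

report_size_check() {
  size=$(wc -c | tr -d ' ')
  if [ "$size" -gt "$LIMIT_BYTES" ]; then
    echo "OVER $size"
  else
    echo "OK $size"
  fi
}

# Usage (needs cluster access; <name> and <namespace> are placeholders):
#   kubectl get vulnerabilityreport <name> -n <namespace> -o json | report_size_check
```

A report that prints OVER, or sits close to the threshold, is a candidate for trimming optional fields.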

2. The MVCC Revision Trap: Churn vs. Objects

A common misconception is that if you only have a few dozen images, your etcd usage should stay low. This ignores how etcd handles data using Multi-Version Concurrency Control (MVCC).

Etcd never actually overwrites an existing key. Every time the Trivy Operator updates a report, etcd creates a new revision—essentially a full copy of the object. If you are scanning images every hour and have high resource churn, you aren’t just storing ten reports; you are storing hundreds of historical versions of those reports.
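To make the multiplication concrete, here is a back-of-the-envelope estimate. The helper name and every number in it are made up for illustration; how many revisions are actually retained depends on how often your cluster compacts.

```shell
# Hypothetical estimator: bytes held by report revisions before compaction.
# Arguments: number of reports, average report size in KiB, and the number
# of revisions of each report that etcd still retains.
estimate_revision_bytes() {
  reports=$1
  avg_kib=$2
  revisions=$3
  echo $(( $reports * $avg_kib * 1024 * $revisions ))
}

# Example with made-up numbers: 10 reports of 500 KiB, 24 retained revisions.
estimate_revision_bytes 10 500 24   # prints 122880000 (about 117 MiB)
```

Ten reports that look like 5 MB of data can therefore occupy more than twenty times that until the old revisions are compacted away.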

While compaction marks that old space as “free” for etcd to reuse internally, only a defragmentation operation actually releases that physical disk space back to the underlying operating system. Without a maintenance strategy for compaction and defrag, your 8GB quota will vanish in a matter of days.

3. The “Off-Etcd” Escape Hatch

For large-scale environments where fine-tuning the reporting fields is not practical, the most robust architectural move is to stop using etcd for vulnerability data entirely. The Trivy Operator provides a high-impact “escape hatch” that moves report storage to a Persistent Volume. The downside is that the reports are no longer stored in etcd as custom resources, so kubectl cannot be used to query them.

When you enable the OPERATOR_ALTERNATE_REPORT_STORAGE_ENABLED feature, the operator installation will include a Persistent Volume Claim (PVC). You must ensure your cluster has a default StorageClass (or configure one specifically) to fulfill this claim. This bypasses the 1.5 MiB API limit and the etcd quota completely.

To implement this storage shift, configure your deployment with these variables:

  • Set OPERATOR_ALTERNATE_REPORT_STORAGE_ENABLED to true
  • Set OPERATOR_ALTERNATE_REPORT_STORAGE_DIR to a writable path like /var/reports
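As a minimal sketch, the two variables above could be set with kubectl set env. This assumes the operator runs as a Deployment named trivy-operator in the trivy-system namespace and reads its settings from container environment variables; both are assumptions, and your installation method (for example a Helm chart) may manage these values for you. The function name is hypothetical.

```shell
# Assumptions: Deployment "trivy-operator" in namespace "trivy-system";
# settings are read from container environment variables.
enable_alternate_storage() {
  report_dir=${1:-/var/reports}
  kubectl set env deployment/trivy-operator -n trivy-system \
    OPERATOR_ALTERNATE_REPORT_STORAGE_ENABLED=true \
    OPERATOR_ALTERNATE_REPORT_STORAGE_DIR="$report_dir"
}

# Usage (needs cluster access):
#   enable_alternate_storage /var/reports
```

Setting the variables alone does not provision storage: the directory has to be backed by the mounted Persistent Volume Claim.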

4. The One-Two Punch: TTL vs. etcd Maintenance

Maintaining a lean database requires a two-pronged approach: managing the Kubernetes objects, and maintaining the database itself.

First, you need to cap your future growth by implementing an aggressive Time-To-Live (TTL) strategy. By configuring the OPERATOR_SCANNER_REPORT_TTL variable (often 24 hours), you ensure the operator’s controller systematically deletes stale vulnerability reports rather than letting your active object count grow infinitely.

Note: in some older versions of the Trivy Operator, this variable is named OPERATOR_VULNERABILITY_SCANNER_REPORT_TTL.
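Once a TTL is in place, it is worth checking that stale reports really disappear. The sketch below counts reports older than a cutoff. It assumes GNU date is available and that report age is measured from metadata.creationTimestamp; count_stale_reports is a hypothetical helper.

```shell
# Counts VulnerabilityReports whose creationTimestamp is older than a cutoff.
# Assumptions: GNU date is available, and report age is measured from
# metadata.creationTimestamp.
count_stale_reports() {
  hours=${1:-24}
  cutoff=$(date -u -d "$hours hours ago" +%Y-%m-%dT%H:%M:%SZ)
  # UTC timestamps in this format sort lexicographically, so a plain string
  # comparison is enough.
  kubectl get vulnerabilityreports -A \
    -o jsonpath='{range .items[*]}{.metadata.creationTimestamp}{"\n"}{end}' |
    awk -v cutoff="$cutoff" '$1 != "" && $1 < cutoff { n++ } END { print n + 0 }'
}

# Usage (needs cluster access):
#   count_stale_reports 24
```

With a 24-hour TTL working as intended, the count should stay at or near zero.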

However, TTL configuration alone will not resolve accumulated etcd growth.

Because etcd uses a Multi-Version Concurrency Control (MVCC) model, a delete is just another write. When your TTL policy deletes a report, etcd records a new revision marking that object as deleted (a tombstone), and the earlier revisions stay on disk. To actually reclaim disk space and clear the accumulated history, you must pair your TTL strategy with routine etcd maintenance:

  • Compaction: This process scrubs the database of those historical revisions and deleted objects, freeing up logical space within the database.
  • Defragmentation: This is the mandatory final step. While compaction frees up internal space, only a defragmentation operation actually releases that physical disk space back to the underlying operating system.

In short: TTL puts a ceiling on your future report growth, while compaction and defragmentation clean up the historical mess left behind.
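The maintenance pass can be sketched with etcdctl as follows. This is a minimal example, not a hardened runbook: it assumes the v3 API and that endpoints and certificates are supplied through ETCDCTL_* environment variables. The function name is hypothetical.

```shell
# Sketch of a compaction + defragmentation pass with etcdctl (v3 API).
# Assumption: endpoints, certificates, and similar connection settings are
# supplied through ETCDCTL_* environment variables.
etcd_reclaim_space() {
  # Read the current revision from the endpoint status output.
  rev=$(etcdctl endpoint status --write-out=json |
    grep -o '"revision":[0-9]*' | grep -o '[0-9]*$' | head -n 1)
  [ -n "$rev" ] || { echo "could not read revision" >&2; return 1; }

  # Drop history older than the current revision.
  etcdctl compact "$rev" || return 1
  # Return the freed pages to the filesystem.
  etcdctl defrag || return 1
  # Clear a NOSPACE alarm if one was raised.
  etcdctl alarm disarm
}

# Usage (needs access to the etcd members):
#   etcd_reclaim_space
```

Defragmentation blocks the member it runs on, so on a multi-member cluster it is safer to defragment one member at a time.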

5. Efficiency Gains with Client-Server Mode

If you are running the operator in “Standalone” mode, you are paying a massive “infrastructure tax.” In Standalone mode, every single scan job creates a pod with an init container that must download the entire vulnerability database (several hundred megabytes) from a registry or GitHub releases.

This is the least efficient possible method. It adds significant network overhead, slows down pod startup times, and risks GitHub rate-limiting. By switching to ClientServer mode, the operator queries a centralized, shared Trivy Server. This reduces redundant data traffic and ensures the scanner pods themselves remain lightweight and fast.
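As a hedged sketch, the switch can be made by patching the same ConfigMap that holds the scanner settings. The ConfigMap name and namespace are taken from the patch example in this post; the trivy.mode and trivy.serverURL keys should be checked against the documentation for your operator version, and the server URL shown is only a placeholder for your own Trivy server. The function name is hypothetical.

```shell
# Assumptions: ConfigMap "trivy-operator-trivy-config" in namespace
# "trivy-system"; the server URL argument points at your own Trivy server.
enable_client_server() {
  server_url=$1
  kubectl patch cm trivy-operator-trivy-config -n trivy-system \
    --patch "{\"data\":{\"trivy.mode\":\"ClientServer\",\"trivy.serverURL\":\"$server_url\"}}"
}

# Usage (needs cluster access; the URL is a placeholder):
#   enable_client_server http://trivy-server.trivy-system:4954
```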

6. The “High-Severity Only” Filter

If your etcd is still under pressure, the simplest fix is to stop recording data you aren’t actually going to act on. Storing thousands of LOW or UNKNOWN vulnerabilities in persistent custom resources provides little value but creates massive key-value churn.

You can patch the trivy-operator-trivy-config ConfigMap to filter results at the source. By setting the trivy.severity key, you instruct the scanner to only generate reports for the issues that matter. This allows you to keep the vulnerability data in etcd, but prevents you from getting a full picture of vulnerabilities across severities.

Example Patch:

kubectl patch cm trivy-operator-trivy-config -n trivy-system \
--patch '{"data":{"trivy.severity":"HIGH,CRITICAL"}}'

This single change can reduce the number of objects (and their size) by up to 80% in many environments.
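To get a feel for what the filter would remove in your own cluster, you can total the per-severity counts that each report carries. This is a rough proxy only, since the share of findings is not the same as the share of bytes. It assumes the counts live under .report.summary in each VulnerabilityReport; low_severity_share is a hypothetical helper.

```shell
# Hypothetical helper: percentage of recorded findings below HIGH severity.
# Assumption: each report exposes per-severity counts under .report.summary.
low_severity_share() {
  kubectl get vulnerabilityreports -A \
    -o jsonpath='{range .items[*]}{.report.summary.criticalCount} {.report.summary.highCount} {.report.summary.mediumCount} {.report.summary.lowCount} {.report.summary.unknownCount}{"\n"}{end}' |
    awk '{ kept += $1 + $2; total += $1 + $2 + $3 + $4 + $5 }
         END { if (total > 0) printf "%d\n", 100 * (total - kept) / total; else print 0 }'
}

# Usage (needs cluster access):
#   low_severity_share
```

The printed number is the percentage of findings that a HIGH,CRITICAL filter would stop recording.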

Conclusion: A Healthier Security Posture

Visibility should never come at the cost of availability. A cluster that is “secure” but read-only is a cluster that isn’t doing its job. By moving to Client-Server mode, utilizing alternate storage via PVCs, and being intentional about your severity filters, you can maintain deep security insights without threatening your control plane.

Pro Tip

Don’t wait for the NOSPACE alarm. Set up an alert in Prometheus for apiserver_storage_size_bytes (or etcd_db_total_size_in_bytes on older clusters). If you see it trending toward 80% of your quota, it’s time to defrag.
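If Prometheus is not yet scraping these metrics, a rough spot check is possible with etcdctl. The sketch below assumes connection settings come from ETCDCTL_* environment variables and that you pass the quota in bytes yourself, matching your --quota-backend-bytes setting. The function name is hypothetical.

```shell
# Hypothetical helper: etcd database size as a percentage of the quota.
# Assumptions: etcdctl connection settings come from ETCDCTL_* variables and
# the quota argument (in bytes) matches your --quota-backend-bytes setting.
etcd_quota_percent() {
  quota_bytes=$1
  # With several endpoints, take the largest reported database size.
  db_size=$(etcdctl endpoint status --write-out=json |
    grep -o '"dbSize":[0-9]*' | grep -o '[0-9]*$' | sort -n | tail -n 1)
  [ -n "$db_size" ] || { echo "could not read dbSize" >&2; return 1; }
  echo $(( $db_size * 100 / $quota_bytes ))
}

# Usage (needs access to the etcd members; 8589934592 bytes = 8 GiB):
#   etcd_quota_percent 8589934592
```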

Pushkar Joglekar
Pushkar is a Principal Security Engineer @ Broadcom focusing on VMware Kubernetes Distribution and Releases. He is also a Kubernetes security maintainer and is based in San Francisco. Since 2019, he has felt incredibly fortunate to have written the security chapters in Nigel Poulton’s “The Kubernetes Book” which he looks forward to updating every year. Prior to his current role, he leveraged his expertise to secure massive planet-scale private and public cloud infrastructure built to run sensitive workloads at End User companies like Credit Karma and Visa.