Uncovering the Secret Ingredients of Enterprise-Scale Continuous Delivery with Argo CD

The original text was published on Akuity’s blog by Yuan Tang (Akuity), Hong Wang (Akuity) and Alexander Matyushentsev (Intuit).

An in-depth look at best practices for running Argo CD at enterprise scale

Here’s a recap of our talk at KubeCon China 2021. If you are interested in learning more about Argo or Akuity products and services, you can find all our past and upcoming conference talks on our website.

Unlike Kung Fu Panda’s noodle soup where there are no secret ingredients, a fair amount of effort has been put into the enterprise-grade Argo CD. Did you know that Argo CD can support thousands of applications? Have you tried connecting hundreds of Kubernetes clusters? What about thousands of objects in a single application? In this post, we’ll take a deep dive into Argo CD, answer those questions, and show best practices for using Argo CD at an enterprise scale.

What is Argo?

The Argo project is a suite of Kubernetes-native tools for deploying and running jobs and applications. It uses GitOps paradigms such as continuous and progressive delivery and supports MLOps on Kubernetes. It consists of four separate Kubernetes-native projects, and we see teams use different combinations of them to solve their unique challenges. If you would like to share your Argo journey, please contact us on the CNCF Argo Slack channel and we will invite you to speak at our community conferences, related conferences, and meetups.

We have a very strong community and the product has been recognized and used by many companies. It is being adopted as the de facto Kubernetes-native GitOps solution and data processing engine.

Argo was accepted as a CNCF incubating project. It has accumulated over 20,000 GitHub stars, 600 contributors, and 350 end-user companies. We are very proud of this progress, love being part of the open source community, and are actively working towards CNCF graduation.

GitOps operator

Before we dive into the scalability challenges, let’s talk about GitOps in general. First, what is GitOps? One common definition: GitOps is a set of practices for using Git to manage infrastructure and application configuration.

This means that any GitOps operator needs to automate the following steps in sequence.

  • Retrieve manifests from Git by cloning Git repositories (e.g., GitHub, GitLab)
  • Compare the manifests from Git with the live resources in the Kubernetes cluster using kubectl diff
  • Finally, push the changes to the Kubernetes cluster using kubectl apply

That’s exactly what Argo CD does. The GitOps workflow doesn’t seem difficult, but the devil is in the details. Let’s move on to see what can go wrong when implementing this GitOps workflow and what you can do about it.
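Before we do, here is a minimal sketch of an Argo CD Application manifest that declares this workflow end to end. The application name, repository URL, and paths are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook                  # placeholder application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deployment-configs.git  # placeholder repo
    targetRevision: HEAD
    path: guestbook                # directory that holds the manifests
  destination:
    server: https://kubernetes.default.svc   # deploy to the local cluster
    namespace: guestbook
  syncPolicy:
    automated: {}                  # push changes automatically (the kubectl apply step)
```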

Argo CD Architecture

First, let’s take a look at the architecture of Argo CD. It has three main components: one for each GitOps operator function.

  • The Argo CD repo server is responsible for cloning the Git repository and generating the Kubernetes resource manifests.
  • The Argo CD application controller maintains the state of the managed Kubernetes cluster resources and, for each application, compares the live resources with the manifests from Git.
  • The Argo CD API server presents the diff results (between the live resources and the manifests stored in Git) to the end user.

Now you may be wondering — why are there so many components? Why not package everything into a small program that performs all three GitOps functions?

Multi-Tenancy

The reason is that Argo CD provides GitOps functionality as a service to multiple teams. It is capable of managing multiple clusters, retrieving manifests from multiple Git repositories, and serving multiple independent teams.

In other words, you can enable GitOps for your company’s application engineers without requiring them to run and manage any other software.

This is extremely important, especially if your organization is adopting Kubernetes and the application developers are not Kubernetes experts yet. GitOps as a service not only enforces best practices, but also enables self-service by reducing the number of questions and issues that support teams receive from developers.

Scalability

It also means that Argo CD needs to manage potentially hundreds of Kubernetes clusters, retrieve manifests from thousands of Git repositories, and present the results to thousands of users. At this point things can get a little complicated.

The good news is that Argo CD scales very well out of the box. Argo CD is optimized to run on top of Kubernetes, enabling users to take full advantage of the scalability of Kubernetes.

The screenshot above shows the metrics exposed by an existing Argo CD instance. As you can see, it manages almost 2300 applications deployed on 26 clusters and the manifests are stored in 500 Git repositories. This means that hundreds of application development teams are using the instance and taking advantage of GitOps without much overhead.

Unfortunately, no application scales infinitely, and at some point you may need to tweak your configuration to save resources and get better performance in certain edge cases. Let’s walk through the Argo CD settings you may need to adjust.

Application Controller

Too Many Apps

Argo CD’s controller runs multiple workers that form a pipeline and reconcile applications one by one. The default number of workers is 20, which is usually enough to handle hundreds of applications. However, if you have a thousand or more applications, you may start seeing delays of a few hundred milliseconds, and latency increases as you add more applications.

One strategy to improve performance and reduce latency is to increase the number of workers in the controller by modifying the controller.status.processors setting in your Argo CD config map. More workers means Argo CD will process more applications at the same time. Note that this also requires more memory and CPU, so don’t forget to update the controller’s resource requests and limits accordingly.
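For example, assuming a recent Argo CD version where these settings live in the argocd-cmd-params-cm config map, the change might look like this (the values are illustrative only):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  controller.status.processors: "50"     # reconciliation workers (default 20)
  controller.operation.processors: "25"  # sync operation workers (default 10)
```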

Too Many Clusters

As the number of applications grows, the controller consumes more memory and CPU. At some point, it may make sense to run multiple instances of the controller, each using a smaller amount of computing resources. To do this, you can take advantage of the controller’s sharding capabilities.

Unlike a stateless web application, a Kubernetes controller cannot simply be scaled out by running multiple identical instances. The challenge is that the Argo CD controller needs to know the state of an entire managed Kubernetes cluster to properly reconcile application resources. However, you can run multiple controller instances, each responsible for a subset of the managed Kubernetes clusters.

Sharding can be enabled by increasing the number of replicas of the argocd-application-controller StatefulSet. Don’t forget to also update the ARGOCD_CONTROLLER_REPLICAS environment variable with the same value; each controller instance needs to know the total number of replicas, and the change triggers a restart that rebalances work based on the updated configuration. Each controller instance will then do less work and consume less memory and CPU.
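An abridged sketch of the relevant StatefulSet fields (three shards here, purely as an example):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  replicas: 3                                # one shard per replica
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: "3"                     # must match spec.replicas
```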

Repo Server

Manifest Generation

The next component that might need tweaking is the Argo CD repo server. As mentioned earlier, the repo server is responsible for retrieving the resource manifest from the Git repository. This means that Argo CD needs to clone the repository and retrieve the YAML files from the cloned repository.

Cloning a Git repository is not the most challenging task. One of the best practices of GitOps is to separate application source code from deployment manifests, so deployment repositories are usually small and don’t require a lot of disk space. So, if you have a bunch of plain YAML files in your repository, you should be fine, and you won’t need to make any changes to the repo server configuration.

The problem is that deployment repositories usually don’t contain plain YAML files. Instead, users prefer configuration management tools such as Kustomize, Helm, or Jsonnet. These tools help developers avoid duplicating YAML and make changes easier to introduce. Of course, you could ask users to store the generated YAML in the deployment repository, but Argo CD has a better solution: it can run manifest generation on the fly.

Argo CD supports several configuration management tools out of the box and allows any other configuration management tool to be configured. During manifest generation, the Argo CD repo server forks the configuration management tool’s binary and returns the generated manifests, which consumes memory and CPU. To keep manifest generation fast, it is recommended to increase the number of repo server replicas.
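For example, the relevant fields of the argocd-repo-server deployment (the replica count is just a starting point to tune):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
  namespace: argocd
spec:
  replicas: 4   # more replicas allow more manifests to be generated concurrently
```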

Mono Repositories

Typically, running 3 to 4 repo server instances is enough to handle hundreds or even thousands of Git repositories. Argo CD aggressively caches generated manifests, so it rarely needs to regenerate them.

However, if you store your deployment manifests in a so-called mono repository, you may run into performance issues. A mono repository is a single repository that holds the manifests for a large number of applications.

A real-world mono repository might contain hundreds of applications, including infrastructure components as well as multiple microservices. Typically, such a repository represents the desired state of an entire cluster.

This creates some performance challenges. Every commit to a mono repository invalidates the existing caches for all applications in that repository, so Argo CD suddenly needs to regenerate manifests for hundreds of applications, causing CPU and memory spikes. Moreover, some configuration management tools do not allow manifests to be generated concurrently; for example, multiple applications that rely on Helm charts with conditional dependencies must be processed sequentially.

Limit Parallelism

Generating a large number of manifests can cause CPU and memory spikes. Memory spikes are the bigger problem, since they can lead to out-of-memory (OOM) kills. To address this, you can use reposerver.parallelism.limit to limit the number of manifests generated concurrently by each repo server instance. The right number depends on how much memory you plan to give the repo server and how much memory your configuration management tools use.
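Assuming the argocd-cmd-params-cm config map shown earlier, the limit can be set like this (the value is illustrative and should be tuned against your memory budget):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  reposerver.parallelism.limit: "10"  # max concurrent manifest generations per repo server instance
```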

Caching with Webhooks

Next, we’d like to introduce another optimization technique that can help you avoid manifest generation spikes altogether. Argo CD invalidates the manifest cache for all applications on every commit because it cannot assume that the generated manifests depend only on files within each application’s own directory. In practice, however, they usually do.

To avoid unnecessarily invalidating the cache when unrelated files change, you can configure commit webhooks and annotate your Argo CD applications with the argocd.argoproj.io/manifest-generate-paths annotation. The value of the annotation is a list of directories the application depends on. Every time a webhook notifies Argo CD of a new commit, Argo CD checks the changed files listed in the webhook payload and, if the commit doesn’t touch any files related to the application, reuses the manifests generated for the previous commit.
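For example, an application that depends only on its own source directory and a shared components directory might be annotated like this (the paths are placeholders; relative paths are resolved against the application’s source path):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook
  namespace: argocd
  annotations:
    # semicolon-separated list; "." means the application's own source directory
    argocd.argoproj.io/manifest-generate-paths: .;../shared-components
```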

API Server

The API server is stateless, scales well, and doesn’t require much in the way of computing resources. However, it maintains an in-memory cache of all Argo CD applications, so if you are using a single Argo CD instance to manage more than 5,000 applications, you may want to increase its memory limits.

Monitoring and Alerting

Argo CD exposes many Prometheus metrics. Below are a few examples:

  • argocd_app_reconcile, which captures application reconciliation performance
  • workqueue_depth, which represents the depth of the controller’s work queues
  • argocd_app_sync_total, which counts the sync operations performed for each application

You can use the community-maintained Grafana dashboard and review the high availability documentation for more details on these metrics.
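As a starting point, here is a sketch of a Prometheus alerting rule built on these metrics; the queue name and threshold are assumptions you would tune for your own instance:

```yaml
groups:
  - name: argocd
    rules:
      - alert: ArgoCDReconciliationQueueBackedUp
        # fires when the application controller's work queue stays deep for 10 minutes
        expr: workqueue_depth{name="app_reconciliation_queue"} > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Argo CD reconciliation queue is backed up
```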

Community and Other Resources

We hope this article has given you a glimpse of what your Dev/Ops team can achieve when using Argo CD (and this is just one product in the Argo suite we offer).

For more information, visit Akuity’s website and check out the other community-related links below.

Thanks to Wojtek Cichon.
