Self-Healing Rollouts: Automating Production Fixes with Agentic AI and Argo Rollouts

Rolling out changes to all users at once in production is risky—we’ve all learned this lesson at some point. But what if we could combine progressive delivery techniques with AI agents to automatically detect, analyze, and fix deployment issues? In this article, I’ll show you how to implement self-healing rollouts using Argo Rollouts and agentic AI to create a fully automated feedback loop that can fix production issues while you grab a coffee.

The Case for Progressive Delivery

Progressive Delivery is a term that encompasses deployment strategies designed to avoid the pitfalls of all-or-nothing deployments. The concept gained significant attention after the CrowdStrike incident, where a faulty update crashed millions of Windows machines and took down services across a substantial portion of the internet. Their post-mortem revealed a crucial lesson: they should have deployed to progressive “rings” or “waves” of customers, with time between waves to gather metrics and telemetry.

The key principles of progressive delivery are:

  • Avoiding downtime: Deploy changes gradually with quick rollback capabilities
  • Limiting the blast radius: Only a small percentage of users are affected if something goes wrong
  • Shorter time to production: Safety nets enable faster, more confident deployments

As I like to say: “If you haven’t automatically destroyed something by mistake, you’re not automating enough.”

Progressive Delivery Techniques

Rolling Updates

Kubernetes provides rolling updates by default. As new pods come up, old pods are gradually deleted, automatically shifting traffic to the new version. If issues arise, you can roll back quickly, affecting only the percentage of traffic that hit the new pods during the update window.
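For reference, a plain Deployment controls this behavior through its rollingUpdate settings; a minimal sketch (names and numbers are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most 1 pod above the desired count while updating
      maxUnavailable: 1    # at most 1 pod below the desired count while updating
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:2.0   # changing the image triggers the rolling update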

Blue-Green Deployment

This technique involves deploying a complete copy of your application (the “green” version) alongside the existing production version (the “blue” version). After testing, you switch all traffic to the new version. While this provides quick rollbacks, it requires twice the resources and switches all traffic at once, potentially affecting all users before you can react.
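Without extra tooling, a common way to implement the switch is a Service whose selector picks the live version; a minimal sketch (labels are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: blue    # flip this to "green" to switch all traffic to the new version
  ports:
    - port: 80
      targetPort: 8080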

Canary Deployment

Canary deployments offer more granular control. You deploy a new version alongside the stable version and gradually increase the percentage of traffic going to the new version—perhaps starting with 5%, then 10%, and so on. You can route traffic based on various parameters: internal employees, IP ranges, or random percentages. This approach allows you to detect issues early while minimizing user impact.
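With a tool like Argo Rollouts (introduced below), that schedule becomes a declarative list of steps. A sketch of the steps section of a Rollout, with illustrative weights and pauses:

      steps:
        - setWeight: 5            # send 5% of traffic to the canary
        - pause: {duration: 1h}   # watch metrics before continuing
        - setWeight: 10
        - pause: {}               # pause indefinitely until promoted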

Feature Flags

Feature flags provide even more granular control at the application level. You can deploy code with new features disabled by default, then enable them selectively for specific user groups. This decouples deployment from feature activation, allowing you to:

  • Ship faster without immediate risk
  • Enable features for specific customers or user segments
  • Quickly disable problematic features without redeployment

You can implement feature flags using dedicated tooling and standards like OpenFeature, or with simpler approaches like environment variables.
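As a minimal sketch of the environment-variable approach (the flag name is hypothetical), this is the container section of a Deployment shipping the new feature disabled:

    spec:
      containers:
        - name: my-app
          image: my-app:2.0
          env:
            - name: FEATURE_NEW_CHECKOUT   # hypothetical flag, read by the application
              value: "false"               # enable later without building a new image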

Progressive Delivery in Kubernetes

Kubernetes provides two main architectures for traffic routing:

Service Architecture

The traditional approach uses load balancers directing traffic to services, which then route to pods based on labels. This works well for basic scenarios but lacks flexibility for advanced routing.

Ingress Architecture

The Ingress layer provides more sophisticated traffic management. You can route traffic based on domains, paths, headers, and other criteria, enabling the fine-grained control essential for canary deployments. Popular ingress controllers include NGINX Ingress, Traefik, and HAProxy.
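As an example of this fine-grained control, the NGINX ingress controller supports canary annotations on a second Ingress; a sketch with illustrative names:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"        # mark this ingress as a canary
    nginx.ingress.kubernetes.io/canary-weight: "10"   # send 10% of traffic to it
spec:
  rules:
    - host: my-app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app-canary
                port:
                  number: 80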

Enter Argo Rollouts

Argo Rollouts is a Kubernetes controller that provides advanced deployment capabilities including blue-green deployments, canary releases, analysis, and experimentation. It’s a powerful tool for implementing progressive delivery in Kubernetes environments.

How Argo Rollouts Works

The architecture includes:

  1. Rollout Controller: Manages the deployment process
  2. Rollout Object: Defines the deployment strategy and analysis configuration
  3. Analysis Templates: Specify metrics and success criteria
  4. Replica Sets: Manage the stable and canary versions, with automatic traffic shifting between them

When you update a Rollout, it creates separate replica sets for stable and canary versions, gradually increasing canary pods while decreasing stable pods based on your defined rules. If you’re using a service mesh or advanced ingress, you can implement fine-grained routing—sending specific headers, paths, or user segments to the canary version.
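A sketch of how that wiring looks with the NGINX ingress controller (Service and Ingress names are illustrative); Argo Rollouts adjusts the canary traffic split for you:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: canary-demo
spec:
  strategy:
    canary:
      stableService: canary-demo-stable    # Service selecting the stable pods
      canaryService: canary-demo-canary    # Service selecting the canary pods
      trafficRouting:
        nginx:
          stableIngress: canary-demo-ingress   # the controller creates and adjusts the canary ingress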

Analysis Options

Argo Rollouts supports various analysis methods:

  • Prometheus: Query metrics to determine rollout health (see the sketch after this list)
  • Datadog: Integration with Datadog monitoring
  • Kubernetes Jobs: Run custom analysis logic—check databases, call APIs, or perform any custom validation
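For the Prometheus option, a minimal AnalysisTemplate might look like the following; the address, query, and threshold are illustrative:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 30s
      successCondition: result[0] >= 0.95   # abort the rollout if the success rate drops below 95%
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.example.com:9090
          query: |
            sum(rate(http_requests_total{service="canary-demo",code!~"5.."}[2m])) /
            sum(rate(http_requests_total{service="canary-demo"}[2m]))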

The experimentation feature is particularly interesting. We considered using it to test Java upgrades: deploy a new Java version, run it for a few hours gathering metrics on response times and latency, then decide whether to proceed with the full rollout—all before affecting real users.
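A sketch of what such an experiment could look like, with hypothetical image tags; both versions run side by side for the configured duration:

apiVersion: argoproj.io/v1alpha1
kind: Experiment
metadata:
  name: java-upgrade
spec:
  duration: 4h    # run both versions side by side for a few hours
  templates:
    - name: current-java
      selector:
        matchLabels:
          app: my-app
          flavor: current
      template:
        metadata:
          labels:
            app: my-app
            flavor: current
        spec:
          containers:
            - name: my-app
              image: my-app:java17   # hypothetical tag
    - name: new-java
      selector:
        matchLabels:
          app: my-app
          flavor: new
      template:
        metadata:
          labels:
            app: my-app
            flavor: new
        spec:
          containers:
            - name: my-app
              image: my-app:java21   # hypothetical tag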

Adding AI to the Mix

Now, here’s where it gets interesting: what if we use AI to analyze logs and automatically make rollout decisions?

The AI-Powered Analysis Plugin

I developed a plugin for Argo Rollouts that uses Large Language Models (specifically Google’s Gemini) to analyze deployment logs and make intelligent decisions about whether to promote or rollback a deployment. The workflow is:

  1. Log Collection: Gather logs from stable and canary versions
  2. AI Analysis: Send logs to an LLM with a structured prompt
  3. Decision Making: The AI responds with a promote/rollback recommendation and confidence level
  4. Automated Action: Argo Rollouts automatically promotes or rolls back based on the AI’s decision

The prompt asks the LLM to:

  • Analyze canary behavior compared to the stable version
  • Respond in JSON format with a boolean promotion decision
  • Provide a confidence level (0-100%)

For example, if the confidence threshold is set to 50%, any recommendation with confidence above 50% is executed automatically.
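The exact response schema is defined by the prompt rather than the plugin; a hypothetical response might look like:

{
  "promote": false,
  "confidence": 95,
  "reason": "The canary logs contain repeated panic errors that are not present in the stable version."
}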

The Complete Self-Healing Loop

But we can go further. When a rollout fails and rolls back, the plugin automatically:

  1. Creates a GitHub Issue: The LLM generates an appropriate title and detailed description of the problem, including log analysis and recommended fixes
  2. Assigns a Coding Agent: Labels the issue to trigger agents like Jules, GitHub Copilot, or similar tools
  3. Automatic Fix: The coding agent analyzes the issue, creates a fix, and submits a pull request
  4. Continuous Loop: Once merged, the new version goes through the same rollout process

Live Demo Results

In my live demonstration, I showed this complete workflow in action:

Successful Deployment: When deploying a working version (changing from “blue” to “green”), the rollout progressed smoothly through the defined steps (20%, 40%, 60%, 80%, 100%) at 10-second intervals. The AI analyzed the logs and determined: “The stable version consistently returns 100 blue, the canary version returns 100 green, both versions return 200 status codes. Based on the logs, the canary version seems stable.”

Failed Deployment: When deploying a broken version that returned random colors and threw panic errors, the system:

  • Detected the issue during the canary phase
  • Automatically rolled back to the stable version
  • The AI analysis identified: “The canary version returns a mix of colors (purple, blue, green, orange, yellow) along with several panic errors due to runtime error index out of range with length zero”
  • Provided a confidence level of 95% that the deployment should not be promoted
  • Automatically created a GitHub issue with detailed analysis
  • Assigned the issue to Jules (coding agent)
  • Within 3-5 minutes, received a pull request with a fix

The coding agents (I demonstrated both Jules and GitHub Copilot) analyzed the code, identified the problem in the getColor() function, fixed the bug, added tests, and created well-documented pull requests with proper commit messages.

Technical Implementation

The Rollout Configuration

The Rollout references the AI analysis template, which runs in the background while the canary steps (the 20/40/60/80% progression seen in the demo) advance:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: canary-demo
spec:
  strategy:
    canary:
      analysis:
        templates:
          - templateName: canary-analysis-ai
      steps:
        - setWeight: 20
        - pause: {duration: 10s}
        - setWeight: 40
        - pause: {duration: 10s}
        - setWeight: 60
        - pause: {duration: 10s}
        - setWeight: 80
        - pause: {duration: 10s}

The Analysis Template

The template configures the AI plugin to check every 10 seconds and require a confidence level above 50% for promotion:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-analysis-ai
spec:
  metrics:
    - name: success-rate
      interval: 10s
      successCondition: result > 0.50
      provider:
        plugin:
          argoproj-labs/metric-ai:
            model: gemini-2.0-flash
            githubUrl: https://github.com/carlossg/rollouts-demo
            extraPrompt: |
              Ignore color changes.

Agent-to-Agent Communication

The plugin supports two modes:

  1. Inline Mode: The plugin directly calls the LLM, makes decisions, and creates GitHub issues
  2. Agent Mode: Uses agent-to-agent (A2A) communication to call specialized agents with domain-specific knowledge and tools

The agent mode is particularly powerful because you can build agents that understand your specific problem space, with access to internal databases, monitoring tools, or other specialized resources.

The Future of Self-Healing Systems

This approach demonstrates the practical application of AI agents in production environments. The key insight is creating a continuous feedback loop:

  1. Deploy changes progressively
  2. Automatically detect issues
  3. Roll back when necessary
  4. Generate detailed issue reports
  5. Let AI agents propose fixes
  6. Review and merge fixes
  7. Repeat

The beauty of this system is that it works continuously. You can have multiple issues being addressed simultaneously by different agents, working 24/7 to keep your systems healthy. As humans, we just need to review and ensure the proposed fixes align with our intentions.

Practical Considerations

While this technology is impressive, it’s important to note:

  • AI isn’t perfect: The agents don’t always get it right on the first try (as demonstrated when the AI ignored my instruction about color variations)
  • Human oversight is still crucial: Review pull requests before merging
  • Start simple: Begin with basic metrics before adding AI analysis
  • Tune your confidence thresholds: Adjust based on your risk tolerance
  • Monitor the monitors: Ensure your analysis systems are reliable

Getting Started

If you want to implement similar systems:

  1. Start with Argo Rollouts: Learn basic canary deployments without AI (install commands shown below)
  2. Implement analysis: Use Prometheus or custom jobs for analysis
  3. Add AI gradually: Experiment with AI analysis for non-critical deployments
  4. Build the feedback loop: Integrate issue creation and coding agents
  5. Iterate and improve: Refine your prompts and confidence thresholds
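For step 1, Argo Rollouts ships as a controller plus a kubectl plugin; at the time of writing, the documented install looks like this:

kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml

# then watch a rollout progress through its canary steps
kubectl argo rollouts get rollout canary-demo --watch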

Conclusion

Progressive delivery isn’t new, but combining it with agentic AI creates powerful new possibilities for self-healing systems. While we’re not at full autonomous production management yet, we’re getting closer. The technology exists today to automatically detect, analyze, and fix many production issues without human intervention.

As I showed in the demo, you can literally watch the system detect a problem, roll back automatically, create an issue, and have a fix ready for review—all while you’re having coffee. That’s the future I want to work toward: systems that heal themselves and learn from their mistakes.

Resources

Google Cloud Next Recap

Several interesting announcements from last week’s Google Next conference.

Knative, a new OSS project built by Google, Red Hat, IBM,… to build, deploy, and manage modern serverless workloads on Kubernetes. It is built upon Istio, with 1.0 coming soon and managed Istio on GCP. It includes a build primitive to manage source-to-Kubernetes flows, which can be used independently. Maybe it will become the new standard for defining sources and builds in Kubernetes. Read more from Mark Chmarny.

GKE on premise, a Google-configured version of Kubernetes with multi-cluster management, running on top of VMware’s vSphere.

Another Kubernetes-related mention was the gVisor pod sandbox, with experimental support for Kubernetes, to allow running sandboxed containers in a Kubernetes cluster. Very interesting for multi-tenant clusters and Docker image builds.

Cloud Functions are now Generally Available, and more serverless features were launched:

Serverless containers allow you to run container-based workloads in a fully managed environment and still only pay for what you use. Sign up for an early preview of serverless containers on Cloud Functions to run your own containerized functions on GCP with all the benefits of serverless.

A new GKE serverless add-on lets you run serverless workloads on Kubernetes Engine with a one-step deploy. You can go from source to containers instantaneously, auto-scale your stateless container-based workloads, and even scale down to zero.

Cloud Build, a fully-managed CI/CD platform that lets you build and test applications in the cloud. It takes an interesting approach where all the pipeline steps are containers themselves, so it is reasonably easy to extend. It integrates with GitHub for repos with a Dockerfile (let’s see if that lasts long after the Microsoft acquisition).

Other interesting announcements include:

  • Edge TPU, a tiny ASIC chip designed to run TensorFlow Lite ML models at the edge.
  • Shielded VMs, untampered virtual machines.
  • Titan Security Key, a FIDO security key with firmware developed by Google. Google was giving away both NFC and Bluetooth keys at the conference, a good replacement for YubiKeys, especially for mobile devices.

Introduction to Amazon Web Services Identity and Access Management

Using AWS Identity and Access Management you can create separate users and permissions for any AWS service, for instance EC2, and avoid giving other people your Amazon username, password, or private key.

You can set very granular permissions on users, groups, specific resources, and combinations of them. It can become really complex quickly! But there are several very common use cases that IAM is useful for, such as sharing an AWS account among a team of developers.

Getting started

You can go through the Getting Started Guide, but I’ll save you some time:

Download IAM command line tools

Store your AWS credentials in a file, e.g. ~/account-key

AWSAccessKeyId=AKIAIOSFODNN7EXAMPLE
AWSSecretKey=wJalrXUtnFEMI/K7MDENG/bPxRfiCYzEXAMPLEKEY

Configure environment variables

export AWS_IAM_HOME=<path_to_cli>
export PATH=$AWS_IAM_HOME/bin:$PATH
export AWS_CREDENTIAL_FILE=~/account-key

Creating an admin group

When you have IAM set up, the next step is to create an Admins group where you can add yourself

iam-groupcreate -g Admins

Create a policy in a file, e.g. MyPolicy.txt

{
   "Statement":[{
      "Effect":"Allow",
      "Action":"*",
      "Resource":"*"
      }
   ]
}

Upload the policy

iam-groupuploadpolicy -g Admins -p AdminsGroupPolicy -f MyPolicy.txt

Creating an admin user

Create an admin user with

iam-usercreate -u YOUR_NAME -g Admins -k -v

The response looks similar to this:

AKIAIOSFODNN7EXAMPLE
wJalrXUtnFEMI/K7MDENG/bPxRfiCYzEXAMPLEKEY
arn:aws:iam::123456789012:user/YOUR_NAME
AIDACKCEVSQ6C2EXAMPLE

The first line is your Access Key ID; the second line is your Secret Access Key. You need to save both values.

Save your Access Key ID and Secret Access Key to a file, for instance ~/YOUR_NAME_cred.txt. You can use those credentials from now on instead of the global AWS credentials for the whole account.
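For example, reusing the values returned above, ~/YOUR_NAME_cred.txt would contain:

AWSAccessKeyId=AKIAIOSFODNN7EXAMPLE
AWSSecretKey=wJalrXUtnFEMI/K7MDENG/bPxRfiCYzEXAMPLEKEY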

export AWS_CREDENTIAL_FILE=~/YOUR_NAME_cred.txt

Creating a dev group

Let’s create an example dev group where the users will have only read access to EC2 operations.

 iam-groupcreate -g dev

Now we need to set the group policy to allow all EC2 Describe* actions, which are the ones that allow users to see data, but not to change it. Create a file MyPolicy.txt with these contents

{
  "Statement": [
     {
       "Sid": "EC2AllowDescribe",
       "Action": [
         "ec2:Describe*"
       ],
       "Effect": "Allow",
       "Resource": "*"
     }
   ]
 }

Now upload the policy

iam-groupuploadpolicy -g dev -p devGroupPolicy -f MyPolicy.txt

Creating dev users

To create a new AWS user under the dev group

iam-usercreate -u username -g dev -k -v

Create a login profile for the user to log into the web console

iam-useraddloginprofile -u username -p password

The user can now access the AWS console at

https://your_AWS_Account_ID.signin.aws.amazon.com/console/ec2

Or you can make life easier by creating an alias

 iam-accountaliascreate -a maestrodev

and now the console is available at

https://maestrodev.signin.aws.amazon.com/console/ec2

About Policies

AWS policy files can get really complex. The AWS Policy Generator helps as a starting point to see what actions are available, but it won’t help you make policies easier to read (using wildcards) or apply them to specific resources. Amazon could have provided a better generator tool that lets you choose your own resources (users, groups, S3 buckets,…) from an easy-to-use interface instead of having to look up all sorts of cryptic AWS identifiers. Hopefully they will provide a more comprehensive tool as part of the AWS Console.

There is more information available at the IAM User Guide.

Update

Just after I wrote this post, Amazon made IAM available in the AWS Management Console, which makes using IAM much easier.

Javagruppen 2011: Build and test in the cloud slides

Last week I spent some good days in Denmark at the Javagruppen annual conference, as I mentioned in a previous post. It’s a small conference that lets you cover any question the attendees have and select what you talk about based on their specific interests.

I talked about creating an Apache Continuum + Selenium grid on EC2 for massively multi-environment and parallel build and test. You can find the slides below, although it’s mostly a talk/visual presentation.

The location was great, in a hotel with a spa in Jutland, and the people were very nice, the other speakers included. My advice: go to Denmark, but try to do it in summer 🙂 I’m sure it makes a difference, although it’s pretty cool to be in a hot tub outside at 0°C (32°F).

And you can find some trip pictures on Flickr.

Nyhavn panorama

Cloud Computing opportunities in the development (build-test-deploy) space

You hear the word “cloud” everywhere: running applications in the cloud, scaling with the cloud,… but not so often from the development lifecycle perspective: code, commit, test, deploy to QA, release, etc. Yet the cloud brings fundamental changes to this side of things too.

The scenario

If you belong to, or manage, a group of developers, you are doing at least some sort of automated builds with continuous integration. You have continuous integration servers building your code when developers commit changes, or on specific schedules, though not as often as you would like. The number of projects grows and you add more servers for the new projects, mixing and matching environments for different needs (Linux, Windows, OS X,…).

The problem and the solution

The architecture we use for our Maestro 3 product is composed of one server that handles all the development lifecycle assets. Behind the scenes we use proven open source projects: Apache Continuum for distributed builds, Apache Archiva for repository management, Sonar for reporting, Selenium for multi-environment integration and load testing. And we add Morph mCloud private cloud solution, which is also based on open source projects such as Eucalyptus or Puppet.

We have multiple Continuum build agents doing builds, and multiple Selenium agents for webapp integration testing, as well as several application servers for continuous deployment and testing.

  • limited capacity
    • problem: your hardware is limited, and provisioning and setting up new servers takes a considerable amount of time
    • solution: assets are dynamic; you can spin up new virtual machines when you need them and shuffle infrastructure around in a matter of minutes with just a few clicks from the web interface. The hybrid cloud approach means you can start new servers in a public cloud, such as Amazon EC2, if you really need to
  • capacity utilization
    • problem: you need to place less active projects on the same servers as more active ones to make sure servers are not under- or over-utilized
    • solution: infrastructure is shared across all projects. If one project needs it more often than another, it’s there to be used
  • scheduling conflicts
    • problem: at specific times, e.g. releases, you need to stop automatic jobs to ensure resources are available for those builds
    • solution: smart queue management can differentiate between build types (e.g. continuous builds, release builds) and prioritize
  • location dependence
    • problem: you need to manage the infrastructure, knowing where each server is and what it is building
    • solution: a central view of all the development assets for easier management: build agents, test agents, and application servers
  • continuous growth
    • problem: new projects keep being added while you are trying to manage and optimize your current setup
    • solution: because infrastructure is shared, adding new projects is just a matter of scaling the cloud out, without assigning infrastructure to specific projects
  • complexity in process
    • problem: multiply all of the above by the number of stages in your promotion process: development, QA, staging, production
    • solution: you can keep independent networks in the cloud while sharing resources, such as virtual machine templates for your stack
  • long time-to-market
    • problem: the transition from development to QA to production is a pain point because releases and promotion are not automated
    • solution: compositions (workflows) allow you to design and automate the steps from development to production, including manual approvals
  • complexity in organization
    • problem: in large organizations, multiply all of that by the number of separate entities, departments, or groups that have their own structure
    • solution: by enabling self-provisioning you can assign quotas so developers or groups can start servers from prebuilt templates as they need them, in a matter of minutes

Why a private cloud?

  • cost effectiveness: development infrastructure is running continuously. Global development teams make use of it 24×7
  • bandwidth usage: the traffic between your source control system and the cloud can be expensive, because it’s continuously getting the code for building
  • security restrictions: most companies don’t like their code being exported anywhere outside their firewall. Companies that need to comply with regulations (e.g. PCI) have strong requirements on external networks
  • performance: in a private cloud you can optimize the hardware for your specific scenario, reducing the number of VMs needed for the same job compared to public cloud providers
  • heterogeneous environments: if you need to develop in different environments then there are chances that the public cloud service won’t be able to provide them

The new challenges

  • parallelism: you need to know the dependencies between components to know what needs to be built before what
  • stickiness: how to take advantage of the state of the agents to start builds on the same ones if possible, e.g. agents that built a project before can do a source control update instead of a full checkout, or already have the dependencies on the filesystem
  • asset management: when you have an increasing number of services running, stopping, and starting as needed, you need to know what’s running and where, not only at the hardware level but at the service level: build agents, test agents, and deployment servers

The new vision

  • you can improve continuous integration as developers check in code, because the barrier to adding new infrastructure is minimal (given enough hardware in your cloud, or if you use external cloud services), which means reduced time to find problems
  • developers have access to the infrastructure they need to do their jobs; for instance, they can start an exact copy of the production environment to fix an issue from a cloud template, get it up and running in minutes, and tear it down at the end, without incurring high hardware costs
  • less friction and easier interaction with IT people, as developers can self-provision infrastructure, swapping virtual machines they no longer need for the ones they do

By leveraging the cloud you can solve existing problems in your development lifecycle and at the same time you will be able to do things that you would not even consider because the technology made it complicated or impossible to do. Definitely something worth checking out for large development teams.

Maestro 3 is going to be released this week at InterOp New York (come over and say hi if you are around) but we are already demoing the development snapshots to clients and at conferences like JavaOne.

JavaOne slides: Enterprise Build and Test in the Cloud

I have uploaded the slides from my talk Enterprise Build and Test in the Cloud at JavaOne in San Francisco.

You can also check the code, and an introduction in previous posts

Enterprise build and Test in the Cloud with Selenium I
and
Enterprise build and Test in the Cloud with Selenium II.

Follow me on twitter

JavaOne talk: Enterprise Build and Test in the Cloud

I’ll be presenting Enterprise Build and Test in the Cloud at JavaOne in San Francisco, Wednesday June 3rd 11:05am Esplanade 301 and will be around the whole week.

You can check the slides from the previous talk at ApacheCON, the code, and an introduction in previous posts

Enterprise build and Test in the Cloud with Selenium I
and
Enterprise build and Test in the Cloud with Selenium II.

Follow me on twitter

Enterprise Build and Test in the Cloud code available

The code accompanying the slides Enterprise Build and Test in the Cloud is available at the appfuse-selenium github page.

It provides a Selenium test environment for Maven projects, with AppFuse as an example. It allows you to run Selenium tests as part of the Maven build, either in a specific container and browser or by launching the tests in parallel in several browsers at the same time.

For more information check my slides on Enterprise Build and Test in the Cloud and the blog entries Enterprise Build and Test in the Cloud with Selenium I and Enterprise Build and Test in the Cloud with Selenium II.

By default it’s configured to launch three browsers in parallel: Internet Explorer, Firefox 2, and Firefox 3.

Check src/test/resources/testng.xml for the configuration.

In the single browser option you could do

  • Testing in Jetty 6 and Firefox

    • mvn install

  • Testing in Internet Explorer

    • mvn install -Pjetty6x,iexplore

  • Testing with any browser

    • mvn install -Pjetty6x,otherbrowser -DbrowserPath=path/to/browser/executable

  • Start the server (no tests running, good for recording tests)

    • mvn package cargo:start

ApacheCON slides: “Enterprise Build and Test in the Cloud” and “Eclipse IAM, Maven integration for Eclipse”

Here you have the slides from my talks at ApacheCON

Enterprise Build and Test in the Cloud

Building and testing software can be a time- and resource-consuming task. Cloud computing / on-demand services like Amazon EC2 allow a cost-effective way to scale applications, and applied to building and testing software they can reduce the time needed to find and correct problems, meaning a reduction in time and costs. Properly configuring your build tools (Maven, Ant,…), continuous integration servers (Continuum, CruiseControl,…), and testing tools (TestNG, Selenium,…) can allow you to run the whole build/testing process in a cloud environment, simulating high-load environments, distributing long-running tests to reduce their execution time, using different environments for client or server applications,… and, in the case of on-demand services like Amazon EC2, paying only for the time you use them.
In this presentation we will introduce a development process and architecture using popular open source tools for the build and test process, such as Apache Maven or Ant for building, Apache Continuum as the continuous integration server, and TestNG and Selenium for testing, and show how to configure them to achieve the best results and performance in several typical use cases (long-running testing processes, different client platforms,…) by using the Amazon Elastic Compute Cloud (EC2), therefore reducing time and costs compared to other solutions.

Download PDF

Eclipse IAM, Maven integration for Eclipse

Eclipse IAM (Eclipse Integration for Apache Maven), formerly “Q for Eclipse”, is an Open Source project that integrates Apache Maven and the Eclipse IDE for faster, more agile, and more productive development. The plugin allows you to run Maven from the IDE, import existing Maven projects without intermediate steps, create new projects using Maven archetypes, synchronize dependency management, search artifact repositories for dependencies that are automatically downloaded, view a graph of dependencies and more! Join us to discover how to take advantage of all these features, as well as how they can help you to improve your development process.

Download PDF

Conference season: JavaOne

Last week I mentioned the two conferences where I got talks accepted, ApacheCON and EclipseCON; now I just got confirmation that my talk Enterprise Build and Test in the Cloud was accepted for JavaOne, June 2-5 in San Francisco.

You can read a little bit about what I’m going to talk about in my posts

Enterprise build and Test in the Cloud with Selenium I
and
Enterprise build and Test in the Cloud with Selenium II, with probably a 3rd part coming after ApacheCON.