Self-Healing Rollouts: Automating Production Fixes with Agentic AI and Argo Rollouts

Rolling out changes to all users at once in production is risky—we’ve all learned this lesson at some point. But what if we could combine progressive delivery techniques with AI agents to automatically detect, analyze, and fix deployment issues? In this article, I’ll show you how to implement self-healing rollouts using Argo Rollouts and agentic AI to create a fully automated feedback loop that can fix production issues while you grab a coffee.

The Case for Progressive Delivery

Progressive Delivery is a term that encompasses deployment strategies designed to avoid the pitfalls of all-or-nothing deployments. The concept gained significant attention after the CrowdStrike incident, where a faulty update took down a substantial portion of the internet. Their post-mortem revealed a crucial lesson: they should have deployed to progressive “rings” or “waves” of customers, with time between deployments to gather metrics and telemetry.

The key principles of progressive delivery are:

  • Avoiding downtime: Deploy changes gradually with quick rollback capabilities
  • Limiting the blast radius: Only a small percentage of users are affected if something goes wrong
  • Shorter time to production: Safety nets enable faster, more confident deployments

As I like to say: “If you haven’t automatically destroyed something by mistake, you’re not automating enough.”

Progressive Delivery Techniques

Rolling Updates

Kubernetes provides rolling updates by default. As new pods come up, old pods are gradually deleted, automatically shifting traffic to the new version. If issues arise, you can roll back quickly, affecting only the percentage of traffic that hit the new pods during the update window.

Blue-Green Deployment

This technique involves deploying a complete copy of your application (the “blue” version) alongside the existing production version (the “green” version). After testing, you switch all traffic to the new version. While this provides quick rollbacks, it requires twice the resources and switches all traffic at once, potentially affecting all users before you can react.

Canary Deployment

Canary deployments offer more granular control. You deploy a new version alongside the stable version and gradually increase the percentage of traffic going to the new version—perhaps starting with 5%, then 10%, and so on. You can route traffic based on various parameters: internal employees, IP ranges, or random percentages. This approach allows you to detect issues early while minimizing user impact.

Feature Flags

Feature flags provide even more granular control at the application level. You can deploy code with new features disabled by default, then enable them selectively for specific user groups. This decouples deployment from feature activation, allowing you to:

  • Ship faster without immediate risk
  • Enable features for specific customers or user segments
  • Quickly disable problematic features without redeployment

You can implement feature flags using dedicated services like OpenFeature or simpler approaches like environment variables.

Progressive Delivery in Kubernetes

Kubernetes provides two main architectures for traffic routing:

Service Architecture

The traditional approach uses load balancers directing traffic to services, which then route to pods based on labels. This works well for basic scenarios but lacks flexibility for advanced routing.

Ingress Architecture

The Ingress layer provides more sophisticated traffic management. You can route traffic based on domains, paths, headers, and other criteria, enabling fine-grained control essential for canary deployments. Popular ingress controllers include:

Enter Argo Rollouts

Argo Rollouts is a Kubernetes controller that provides advanced deployment capabilities including blue-green deployments, canary releases, analysis, and experimentation. It’s a powerful tool for implementing progressive delivery in Kubernetes environments.

How Argo Rollouts Works

The architecture includes:

  1. Rollout Controller: Manages the deployment process
  2. Rollout Object: Defines the deployment strategy and analysis configuration
  3. Analysis Templates: Specify metrics and success criteria
  4. Replica Sets: Manages stable and canary versions with automatic traffic shifting

When you update a Rollout, it creates separate replica sets for stable and canary versions, gradually increasing canary pods while decreasing stable pods based on your defined rules. If you’re using a service mesh or advanced ingress, you can implement fine-grained routing—sending specific headers, paths, or user segments to the canary version.

Analysis Options

Argo Rollouts supports various analysis methods:

  • Prometheus: Query metrics to determine rollout health
  • Datadog: Integration with Datadog monitoring
  • Kubernetes Jobs: Run custom analysis logic—check databases, call APIs, or perform any custom validation

The experimentation feature is particularly interesting. We considered using it to test Java upgrades: deploy a new Java version, run it for a few hours gathering metrics on response times and latency, then decide whether to proceed with the full rollout—all before affecting real users.

Adding AI to the Mix

Now, here’s where it gets interesting: what if we use AI to analyze logs and automatically make rollout decisions?

The AI-Powered Analysis Plugin

I developed a plugin for Argo Rollouts that uses Large Language Models (specifically Google’s Gemini) to analyze deployment logs and make intelligent decisions about whether to promote or rollback a deployment. The workflow is:

  1. Log Collection: Gather logs from stable and canary versions
  2. AI Analysis: Send logs to an LLM with a structured prompt
  3. Decision Making: The AI responds with a promote/rollback recommendation and confidence level
  4. Automated Action: Argo Rollouts automatically promotes or rolls back based on the AI’s decision

The prompt asks the LLM to:

  • Analyze canary behavior compared to the stable version
  • Respond in JSON format with a boolean promotion decision
  • Provide a confidence level (0-100%)

For example, if the confidence threshold is set to 50%, any recommendation with confidence above 50% is executed automatically.

The Complete Self-Healing Loop

But we can go further. When a rollout fails and rolls back, the plugin automatically:

  1. Creates a GitHub Issue: The LLM generates an appropriate title and detailed description of the problem, including log analysis and recommended fixes
  2. Assigns a Coding Agent: Labels the issue to trigger agents like JulesGitHub Copilot, or similar tools
  3. Automatic Fix: The coding agent analyzes the issue, creates a fix, and submits a pull request
  4. Continuous Loop: Once merged, the new version goes through the same rollout process

Live Demo Results

In my live demonstration, I showed this complete workflow in action:

Successful Deployment: When deploying a working version (changing from “blue” to “green”), the rollout progressed smoothly through the defined steps (20%, 40%, 60%, 80%, 100%) at 10-second intervals. The AI analyzed the logs and determined: “The stable version consistently returns 100 blue, the canary version returns 100 green, both versions return 200 status codes. Based on the logs, the canary version seems stable.”

Failed Deployment: When deploying a broken version that returned random colors and threw panic errors, the system:

  • Detected the issue during the canary phase
  • Automatically rolled back to the stable version
  • The AI analysis identified: “The canary version returns a mix of colors (purple, blue, green, orange, yellow) along with several panic errors due to runtime error index out of range with length zero”
  • Provided a confidence level of 95% that the deployment should not be promoted
  • Automatically created a GitHub issue with detailed analysis
  • Assigned the issue to Jules (coding agent)
  • Within 3-5 minutes, received a pull request with a fix

The coding agents (I demonstrated both Jules and GitHub Copilot) analyzed the code, identified the problem in the getColor() function, fixed the bug, added tests, and created well-documented pull requests with proper commit messages.

Technical Implementation

The Rollout Configuration

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: canary-demo
spec:
  strategy:
    canary:
      analysis:
        templates:
          - templateName: canary-analysis-ai

The Analysis Template

The template configures the AI plugin to check every 10 seconds and require a confidence level above 50% for promotion:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-analysis-ai
spec:
  metrics:
    - name: success-rate
      interval: 10s
      successCondition: result > 0.50
      provider:
        plugin:
          argoproj-labs/metric-ai:
            model: gemini-2.0-flash
            githubUrl: https://github.com/carlossg/rollouts-demo
            extraPrompt: |
              Ignore color changes.

Agent-to-Agent Communication

The plugin supports two modes:

  1. Inline Mode: The plugin directly calls the LLM, makes decisions, and creates GitHub issues
  2. Agent Mode: Uses agent-to-agent (A2A) communication to call specialized agents with domain-specific knowledge and tools

The native mode is particularly powerful because you can build agents that understand your specific problem space, with access to internal databases, monitoring tools, or other specialized resources.

The Future of Self-Healing Systems

This approach demonstrates the practical application of AI agents in production environments. The key insight is creating a continuous feedback loop:

  1. Deploy changes progressively
  2. Automatically detect issues
  3. Roll back when necessary
  4. Generate detailed issue reports
  5. Let AI agents propose fixes
  6. Review and merge fixes
  7. Repeat

The beauty of this system is that it works continuously. You can have multiple issues being addressed simultaneously by different agents, working 24/7 to keep your systems healthy. As humans, we just need to review and ensure the proposed fixes align with our intentions.

Practical Considerations

While this technology is impressive, it’s important to note:

  • AI isn’t perfect: The agents don’t always get it right on the first try (as demonstrated when the AI ignored my instruction about color variations)
  • Human oversight is still crucial: Review pull requests before merging
  • Start simple: Begin with basic metrics before adding AI analysis
  • Tune your confidence thresholds: Adjust based on your risk tolerance
  • Monitor the monitors: Ensure your analysis systems are reliable

Getting Started

If you want to implement similar systems:

  1. Start with Argo Rollouts: Learn basic canary deployments without AI
  2. Implement analysis: Use Prometheus or custom jobs for analysis
  3. Add AI gradually: Experiment with AI analysis for non-critical deployments
  4. Build the feedback loop: Integrate issue creation and coding agents
  5. Iterate and improve: Refine your prompts and confidence thresholds

Conclusion

Progressive delivery isn’t new, but combining it with agentic AI creates powerful new possibilities for self-healing systems. While we’re not at full autonomous production management yet, we’re getting closer. The technology exists today to automatically detect, analyze, and fix many production issues without human intervention.

As I showed in the demo, you can literally watch the system detect a problem, roll back automatically, create an issue, and have a fix ready for review—all while you’re having coffee. That’s the future I want to work toward: systems that heal themselves and learn from their mistakes.

Resources

Serverless Jenkins Pipelines with Google Cloud Run

jenkins-google-cloud-run

Jenkinsfile-Runner-Google-Cloud-Run project is a Google Cloud Run (a container native, serverless platform) Docker image to run Jenkins pipelines. It will process a GitHub webhook, git clone the repository and execute the Jenkinsfile in that git repository. It allows high scalability and pay per use with zero cost if not used.

This image allows Jenkinsfile execution without needing a persistent Jenkins master running in the same way as Jenkins X Serverless, but using the Google Cloud Run platform instead of Kubernetes.

Google Cloud Run vs Project Fn vs AWS Lambda

I wrote three flavors of Jenkinsfile Runner

The image is similar to the other ones. The main difference between Lambda and Google Cloud Run is in the packaging, as Lambda layers are limited in size and are expanded in /opt while Google Cloud Run allows any custom Dockerfile where you can install whatever you want in a much easier way.

This image is extending the Jenkinsfile Runner image instead of doing a Maven build with it as a dependency as it simplifies classpath magement.

Limitations

Max build duration is 15 minutes but we can use a timeout value up tos 60 minutes by using gcloud beta.

Current implementation limitations:

  • checkout scm does not work, change it to sh 'git clone https://github.com/carlossg/jenkinsfile-runner-example.git'

Example

See the jenkinsfile-runner-example project for an example.

When the PRs are built Jenkins writes a comment back to the PR to show status, as defined in the Jenkinsfile, and totally customizable.

Check the PRs at carlossg/jenkinsfile-runner-example

Extending

You can add your plugins to plugins.txt. You could also add the Configuration as Code plugin for configuration, example at jenkins.yaml.

Other tools can be added to the Dockerfile.

Installation

GitHub webhooks execution will time out if the call takes too long, so we also create a nodejs Google function (index.js) that forwards the request to Google Cloud Run and returns the response to GitHub while the build runs.

Building

Build the package

mvn verify 
docker build -t jenkinsfile-runner-google-cloud-run .

Publishing

Both the function and the Google Cloud Run need to be deployed.

Set GITHUB_TOKEN_JENKINSFILE_RUNNER to a token that allows posting PR comments. A more secure way would be to use Google Cloud Secret Manager.

export GITHUB_TOKEN_JENKINSFILE_RUNNER=... 
PROJECT_ID=$(gcloud config get-value project 2> /dev/null) 
make deploy

Note the function url and use it to create a GitHub webhook of type json.

Execution

To test the Google Cloud Run execution

URL=$(gcloud run services describe jenkinsfile-runner \ 
  --platform managed \ 
  --region us-east1 \ 
  --format 'value(status.address.url)') 

curl -v -H "Content-Type: application/json" ${URL}/handle \
  -d @src/test/resources/github.json

Logging

gcloud logging read \
  "resource.type=cloud_run_revision AND resource.labels.service_name=jenkinsfile-runner" \ 
  --format "value(textPayload)" --limit 100

or

gcloud alpha logging tail \
  "resource.type=cloud_run_revision AND resource.labels.service_name=jenkinsfile-runner" \ 
  --format "value(textPayload)"

GitHub events

Add a GitHub json webhook to your git repo pointing to the Google Cloud Function url than you can get with

gcloud functions describe jenkinsfile-runner-function \
  --format 'value(httpsTrigger.url)'

Testing

The image can be run locally

docker run -ti --rm -p 8080:8080 \
  -e GITHUB_TOKEN=${GITHUB_TOKEN_JENKINSFILE_RUNNER} \
  jenkinsfile-runner-google-cloud-run
curl -v -H "Content-Type: application/json" \
  -X POST http://localhost:8080/handle \
  -d @src/test/resources/github.json

More information in the Jenkinsfile-Runner-Google-Cloud-Run GitHub page.

Google Container Registry Service Account Permissions

21046548While testing Jenkins X I hit an issue that puzzled me. I use Kaniko to build Docker images and push them into Google Container Registry. But the push to GCR was failing with

INFO[0000] Taking snapshot of files...
error pushing image: failed to push to destination gcr.io/myprojectid/croc-hunter:1: DENIED: Token exchange failed for project 'myprojectid'. Caller does not have permission 'storage.buckets.get'. To configure permissions, follow instructions at: https://cloud.google.com/container-registry/docs/access-control

During installation Jenkins X creates a GCP Service Account based on the name of the cluster (in my case jx-rocks) called jxkaniko-jx-rocks with roles:

  • roles/storage.admin
  • roles/storage.objectAdmin
  • roles/storage.objectCreator

More roles are added if you install Jenkins X with Vault enabled.

A key is created for the service account and added to Kubernetes as secrets/kaniko-secret containing the service account key json, which is later on mounted in the pods running Kaniko as described in their instructions.

After looking and looking the service account and roles they all seemed correct in the GCP console, but the Kaniko build was still failing. I found a stackoverflow post claiming that the permissions were cached if you had a previous service account with the same name (WAT?), so I tried with a new service account with same permissions and different name and that worked. Weird. So I created a script to replace the service account by another one and update the Kubernetes secret.

ACCOUNT=jxkaniko-jx-rocks
PROJECT_ID=myprojectid

# delete the existing service account and policy binding
gcloud -q iam service-accounts delete ${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com
gcloud -q projects remove-iam-policy-binding ${PROJECT_ID} --member=serviceAccount:${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com --role roles/storage.admin
gcloud -q projects remove-iam-policy-binding ${PROJECT_ID} --member=serviceAccount:${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com --role roles/storage.objectAdmin
gcloud -q projects remove-iam-policy-binding ${PROJECT_ID} --member=serviceAccount:${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com --role roles/storage.objectCreator

# create a new one
gcloud -q iam service-accounts create ${ACCOUNT} --display-name ${ACCOUNT}
gcloud -q projects add-iam-policy-binding ${PROJECT_ID} --member=serviceAccount:${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com --role roles/storage.admin
gcloud -q projects add-iam-policy-binding ${PROJECT_ID} --member=serviceAccount:${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com --role roles/storage.objectAdmin
gcloud -q projects add-iam-policy-binding ${PROJECT_ID} --member=serviceAccount:${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com --role roles/storage.objectCreator

# create a key for the service account and update the secret in Kubernetes
gcloud -q iam service-accounts keys create kaniko-secret --iam-account=${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com
kubectl create secret generic kaniko-secret --from-file=kaniko-secret

And it did also work, so no idea why it was failing, but at least I’ll remember now how to manually cleanup and recreate the service account.

Serverless Jenkins Pipelines with Fn Project

jenkins-lambdaThe Jenkinsfile-Runner-Fn project is a Fn Project (a container native, cloud agnostic serverless platform) function to run Jenkins pipelines. It will process a GitHub webhook, git clone the repository and execute the Jenkinsfile in that git repository. It allows scalability and pay per use with zero cost if not used.

This function allows Jenkinsfile execution without needing a persistent Jenkins master running in the same way as Jenkins X Serverless, but using the Fn Project platform (and supported providers like Oracle Functions) instead of Kubernetes.

Fn Project vs AWS Lambda

The function is very similar to the one in jenkinsfile-runner-lambda with just a small change in the signature. The main difference between Lambda and Fn is in the packaging, as Lambda layers are limited in size and are expanded in /optwhile Fn allows a custom Dockerfile where you can install whatever you want in a much easier way, just need to include the function code and entrypoint from fnproject/fn-java-fdk.

Oracle Functions

Oracle Functions is a cloud service providing Project Fn function execution (currently in limited availability). jenkinsfile-runner-fn function runs in Oracle Functions, with the caveat that it needs a syslog server running somewhere to get the logs (see below).

Limitations

Current implementation limitations:

  • checkout scm does not work, change it to sh 'git clone https://github.com/carlossg/jenkinsfile-runner-fn-example.git'
  • Jenkinsfile must use /tmp for any tool that needs writing files, see the example

Example

See the jenkinsfile-runner-fn-example project for an example that is tested and works.

Extending

You can add your plugins to plugins.txt. You could also add the Configuration as Code plugin for configuration.

Other tools can be added to the Dockerfile.

Installation

Install Fn

Building

Build the function

mvn clean package

Publishing

Create and deploy the function locally

fn create app jenkinsfile-runner
fn --verbose deploy --app jenkinsfile-runner --local

Execution

Invoke the function

cat src/test/resources/github.json | fn invoke jenkinsfile-runner jenkinsfile-runner

Logging

Get the logs for the last execution

fn get logs jenkinsfile-runner jenkinsfile-runner \
$(fn ls calls jenkinsfile-runner jenkinsfile-runner | grep 'ID:' | head -n 1 | sed -e 's/ID: //')

Syslog

Alternatively, start a syslog server to see the logs

docker run -d --rm -it -p 5140:514 --name syslog-ng balabit/syslog-ng:latest
docker exec -ti syslog-ng tail -f /var/log/messages-kv.log

Update the function to send logs to the syslog server

fn update app jenkinsfile-runner --syslog-url tcp://logs-01.loggly.com:514

GitHub events

Add a GitHub json webhook to your git repo pointing to the function url.

More information in the Jenkinsfile-Runner-Fn GitHub page.

Running Jenkins Pipelines in AWS Lambda

jenkins-lambdaThe Jenkinsfile-Runner-Lambda project is a AWS Lambda function to run Jenkins pipelines. It will process a GitHub webhook, git clone the repository and execute the Jenkinsfile in that git repository. It allows huge scalability with 1000+ concurrent builds and pay per use with zero cost if not used.

This function allows Jenkinsfile execution without needing a persistent Jenkins master running in the same way as Jenkins X Serverless, but using AWS Lambda instead of Kubernetes. All the logs are stored in AWS CloudWatch and are easily accessible.

Why???

Why not?

I mean, it could make sense to run Jenkinsfiles in Lambda when you are building AWS related stuff, like creating an artifact and uploading it to S3.

Limitations

Lambda limitations:

  • 15 minutes execution time
  • 3008MB of memory
  • git clone and generated artifacts must fit in the 500MB provided

Current implementation limitations:

  • checkout scm does not work, change it to sh 'git clone https://github.com/carlossg/jenkinsfile-runner-lambda-example.git'
  • Jenkinsfile must add /usr/local/bin to PATH and use /tmp for any tool that needs writing files, see the example

Extending

Three lambda layers are created:

  • jenkinsfile-runner: the main library
  • plugins: minimal set of plugins to build a Jenkinsfile
  • tools: git, openjdk, maven

You can add your plugins in a new layer as a zip file inside a plugins dir to be expanded in /opt/plugins. You could also add the Configuration as Code plugin and configure the Artifact Manager S3 to store all your artifacts in S3.

Other tools can be added as new layers, and they will be expanded in /opt. You can find a list of scripts for inspiration in the lambci project (gcc,go,java,php,python,ruby,rust) and bash, git and zip (git is already included in the tools layer here)

The layers are built with Docker, installing jenkinsfile-runner, tools and plugins under /opt which is where Lambda layers are expanded. These files are then zipped for upload to Lambda.

Installation

Create a lambda function jenkinsfile-runner using Java 8 runtime. Use the layers built in target/layer-* and target/jenkinsfile-runner-lambda-*.jar as function. Could use make publish to create them.

Set

  • handler: org.csanchez.jenkins.lambda.Handler::handleRequest
  • memory: 1024MB
  • timeout: 15 minutes
aws lambda create-function \
    --function-name jenkinsfile-runner \
    --handler org.csanchez.jenkins.lambda.Handler::handleRequest \
    --zip-file fileb://target/jenkinsfile-runner-lambda-1.0-SNAPSHOT.jar \
    --runtime java8 \
    --region us-east-1 \
    --timeout 900 \
    --memory-size 1024 \
    --layers output/layers.json

Exposing the Lambda Function

From the lambda function configuration page add a API Gateway trigger. Select Create a new API and choose the security level. Save the function and you will get a http API endpoint.

Note that to achieve asynchronous execution (GitHub webhooks execution will time out if your webhook takes too long) you would need to configure API Gateway to send the payload to SNS and then lambda to listen to SNS events. See an example.

GitHub events

Add a GitHub json webhook to your git repo pointing to the lambda api gateway url.

 

More information in the Jenkinsfile-Runner-Lambda GitHub page.

Google Cloud Next Recap

google-next-logoSeveral interesting announcements from last week Google Next conference.

Knative, a new OSS project built by Google, Red Hat, IBM,… to build, deploy, and manage modern serverless workloads on Kubernetes. Built upon Istio, with 1.0 coming soon and managed Istio on GCP. It includes a build primitive to manage source to kubernetes flows, that can be used independently. Maybe it is the new standard to define sources and builds in Kubernetes. Read more from Mark Chmarny.

GKE on premise, a Google-configured version of Kubernetes with multi-cluster management, running on top of VMware’s vSphere.

Another Kubernetes related mention was the gVisor pod sandbox, with experimental support for Kubernetes, to allow running sandboxed containers in a Kubernetes cluster. Very interesting for multi-tenant clusters and docker image builds.

Cloud Functions are now Generally Available, and more serverless features are launched:

Serverless containers allow you to run container-based workloads in a fully managed environment and still only pay for what you use. Sign up for an early preview of serverless containers on Cloud Functions to run your own containerized functions on GCP with all the benefits of serverless.

A new GKE serverless add-on lets you run serverless workloads on Kubernetes Engine with a one-step deploy. You can go from source to containers instantaneously, auto-scale your stateless container-based workloads, and even scale down to zero.

Cloud Build, a fully-managed CI/CD platform that lets you build and test applications in the cloud. With an interesting approach where all the pipeline steps are containers themselves so it is reasonably easy to extend. It integrates with GitHub for repos with a Dockerfile (let’s see if it lasts long after Microsoft acquisition).

Other interesting announcements include:

  • Edge TPU, a tiny ASIC chip designed to run TensorFlow Lite ML models at the edge.
  • Shielded VMs – untampered virtual machines

  • Titan Security Key, a FIDO security key with firmware developed by Google. Google security was giving away at the conference both NFC and bluetooth keys, a good replacement for the yubikeys specially for mobile devices.

Serverless CI/CD with AWS ECS Fargate

Amazon AWS has recently launched ECS Fargate to “run containers without having to manage servers or clusters”.

So this got me interested enough to patch the Jenkins ECS plugin to run Jenkins agents as containers using Fargate model instead of the previous model where you would still need to create and manage VM instances to run the containers.

How does it work?

With the Jenkins ECS plugin you can configure a “Cloud” item that will launch all your agents on ECS Fargate, matching jobs to different container templates using labels. This means you can have unlimited agents with no machines to manage and just pay for what you use.

Some tips on the configuration:

  • Some options need to be configured, like subnet, security group and assign a public ip to the container in order to launch in Fargate.
  • Agents need to adhere to some predefined cpu and memory configurations. For instance for 1 vCPU you can only use 2GB to 8GB in 1GB increments.

Pricing

Price per vCPU is $0.00001406 per second ($0.0506 per hour) and per GB memory is $0.00000353 per second ($0.0127 per hour).

If you compare the price with a m5.large instance (4 vCPU, 16 GB) that costs $0.192 per hour, it would cost you $0,4056 in Fargate, more than twice, ouch! You could build something similar and cheaper with Kubernetes using the cluster autoscaler given you can achieve a high utilization of the machines.

While I was writing this post someone already beat me to submit a PR to the ECS plugin to add the Fargate support.

Cheap backups with Amazon Glacier

Last week Amazon announced Amazon Glacier, where you can have files stored at $0.01 per GB / month, quite a good deal, considering that S3 goes for $0.093 GB/month with reduced redundancy, or Dropbox at its best is 0.825/GB committing to 100GB for a full year, although obviously they fill very different use cases.

To get that pricing there are some drawbacks that make it only useful for storing files that don’t need to be retrieved often, ie. backups for disaster recovery. Downloading or listing files in Glacier take more than 4 hours, so that gives you an idea. Behind the scenes it uses Amazon SQS (Simple Queue Service) and SNS (Simple Notification Service) to handle the download and inventory requests, so you can do extra things like getting emails when your requests are ready.

I have created glacier-cli using the Java API to upload, download, delete and list files stored in Glacier from the command line, as Amazon only provides the APIs for now and some examples. Make sure you save the output when uploading the files, as you will need the ids of the files later on when you need to download them.

Get the code from GitHub.

Glacier-CLI

Building

mvn clean package

Configuration

Create $HOME/AwsCredentials.properties with your AWS keys

secretKey=…
accessKey=…

Commands

  • upload vault_name file1 file2 …
  • download vault_name archiveId output_file
  • delete vault_name archiveId
  • inventory vault_name

Command line options

 -output <file_name>   File to save the inventory to. Defaults to 'glacier.json'
 -queue <queue_name>   SQS queue to use for inventory retrieval. Defaults to 'glacier'
 -region <region>      Specify URL as the web service URL to use. Defaults to 'us-east-1'
 -topic <topic_name>   SNS topic to use for inventory retrieval. Defaults to 'glacier'

Examples

Upload file1 and file2 to vault pictures

java -jar glacier-1.0-jar-with-dependencies.jar upload pictures file1 file2

Download archive with id xxx from vault pictures to file pic.tar (takes >4 hours)

java -jar glacier-1.0-jar-with-dependencies.jar download pictures xxx pic.tar

Delete archive with id xxx from vault pictures

java -jar glacier-1.0-jar-with-dependencies.jar delete pictures xxx

Get the inventory for vault pictures (takes >4 hours)

java -jar glacier-1.0-jar-with-dependencies.jar inventory pictures

Upload file1 and file2 to vault pictures in Europe region

java -jar glacier-1.0-jar-with-dependencies.jar -region eu-west-1 upload pictures file1 file2

Introduction to Amazon Web Services Identity and Access Management

Using AWS Identity and Access Management you can create separate users and permissions to use any AWS service, for instance EC2, and avoid giving other people your Amazon username, password or private key.

You can set very granular permissions, on users, groups, specific resources, and a combination of them. It will become really complex soon! But there are several very common use cases, that IAM is useful for. For instance having a AWS account for a team of developers.

Getting started

You can go through the Getting Started Guide, but I’ll save you some time:

Download IAM command line tools

Store your AWS credentials in a file, ie. ~/account-key

AWSAccessKeyId=AKIAIOSFODNN7EXAMPLE
AWSSecretKey=wJalrXUtnFEMI/K7MDENG/bPxRfiCYzEXAMPLEKEY

Configure environment variables

export AWS_IAM_HOME=<path_to_cli>
export PATH=$AWS_IAM_HOME/bin:$PATH
export AWS_CREDENTIAL_FILE=~/account-key

Creating an admin group

When you have IAM setup, the next step is to create an Admins group where you can add yourself

iam-groupcreate -g Admins

Create a policy in a file, ie. MyPolicy.txt

{
   "Statement":[{
      "Effect":"Allow",
      "Action":"*",
      "Resource":"*"
      }
   ]
}

Upload the policy

iam-groupuploadpolicy -g Admins -p AdminsGroupPolicy -f MyPolicy.txt

Creating an admin user

Create an admin user with

iam-usercreate -u YOUR_NAME -g Admins -k -v

The response looks similar to this:

AKIAIOSFODNN7EXAMPLE
wJalrXUtnFEMI/K7MDENG/bPxRfiCYzEXAMPLEKEY
arn:aws:iam::123456789012:user/YOUR_NAME
AIDACKCEVSQ6C2EXAMPLE

The first line is your Access Key ID; the second line is your Secret Access Key. You need to save these IDs.

Save your Access Key ID and your Secret Access Key to a file called for instance ~/YOUR_NAME_cred.txt. You can use those credentials from now on instead of the global AWS credentials for the whole account.

export AWS_CREDENTIAL_FILE=~/YOUR_NAME_cred.txt

Creating a dev group

Let’s create an example dev group where the users will have only read access to EC2 operations.

 iam-groupcreate -g dev

Now we need to set the group policy to allow all EC2 Describe* actions, which are the ones that allow users to see data, but not to change it. Create a file MyPolicy.txt with these contents

{
  "Statement": [
     {
       "Sid": "EC2AllowDescribe",
       "Action": [
         "ec2:Describe*"
       ],
       "Effect": "Allow",
       "Resource": "*"
     }
   ]
 }

Now upload the policy

iam-groupuploadpolicy -g dev -p devGroupPolicy -f MyPolicy.txt

Creating dev users

To create a new AWS user under the dev group

iam-usercreate -u username -g dev -k -v

Create a login profile for the user to log into the web console

iam-useraddloginprofile -u username -p password

The user can now access the AWS console at

https://your_AWS_Account_ID.signin.aws.amazon.com/console/ec2

Or you can make life easier by creating an alias

 iam-accountaliascreate -a maestrodev

and now the console is available at

https://maestrodev.signin.aws.amazon.com/console/ec2

About Policies

AWS policy files can be really complex. The AWS Policy Generator will help as a start point and see what actions can be used, but it won’t help you making them easier to read (using wildcards) or applying them to specific resources. Amazon could have provided a better generator tool allowing you to choose your own resources (users, groups, S3 buckets,…) from a easy to use interface and not having to lookup all sorts of crazy AWS identifiers. Hopefully they will be provide a comprehensive tool as part of the AWS Console.

There is more information available at the IAM User Guide.

Update

Just after I wrote this post Amazon has made IAM available in the AWS management console, which makes using IAM way easier.

Javagruppen 2011: Build and test in the cloud slides

Last week spent some good days in Denmark for Javagruppen annual conference as I mentioned in a previous post. It’s a small conference that allows you to cover any question that the attendees have and be able to select what you talk about based on their specific interests.

I talked about creating an Apache Continuum + Selenium grid on EC2 for massively multi-environment and parallel build and test. You can find the slides below, although it’s mostly a talk/visual presentation.

The location was great, in a hotel with spa in Jutland and very nice people and the other speakers too. My advice, go to Denmark, but try to do it in summer 🙂 I’m sure it makes a difference – although it’s pretty cool to be on a hot tub outside at 0C (32F)

And you can find some trip pictures in flickr.

Nyhavn panorama

Nyhavn panorama