<h1>Kustomizing A Helm Chart With Argo CD (2021-12-01)</h1>
<p>I often find myself stuck in a situation with a third-party Helm Chart where a particular manifest manipulation has not been exposed by the developers to the values.yaml, which in turn prevents me from installing the application in the way I’d like. Out of the box, <a href="https://argoproj.github.io/cd/">Argo CD</a> allows you to deploy from a Helm Chart, flat yaml manifests, or from a Kustomization manifest. However, what you can’t do is take the resulting manifest from a Helm Chart and post-render it through Kustomize to overcome this problem.</p>
<p>Historically, in situations like this, I end up either creating two Argo CD applications to deploy my full set of required manifests, or manually running <code class="language-plaintext highlighter-rouge">helm template</code>, and then passing the resulting file through a kustomization.yaml to add my patches as needed. The second option here is definitely not gitops-friendly. The first option is messy and creates an Argo CD application interdependency that I don’t like.</p>
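<p>For reference, the manual post-render approach from the second option looks roughly like this: render the chart with <code class="language-plaintext highlighter-rouge">helm template</code> into a file, then consume that file from a kustomization.yaml. This is only a sketch; the patch target and field are hypothetical:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># kustomization.yaml consuming the output of `helm template mychart > rendered.yaml`
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - rendered.yaml
patches:
  - target:
      kind: Deployment
      name: my-chart-app   # hypothetical name taken from the rendered chart
    patch: |-
      - op: add
        path: /spec/template/spec/securityContext
        value:
          runAsNonRoot: true
</code></pre></div></div>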
<p>The <a href="https://github.com/crumbhole/argocd-lovely-plugin">argocd-lovely-plugin</a> aims to resolve this issue, and provide other lovely additions to your day-to-day gitops workflow.</p>
<p>In <a href="https://github.com/crumbhole/argocd-lovely-plugin/tree/main/test/helm_plus_additions">this example</a>, we deploy a third-party Helm chart (as defined in the <code class="language-plaintext highlighter-rouge">chart</code> directory), a secret (defined in <code class="language-plaintext highlighter-rouge">secret</code>) and finally a kustomized configmap (defined in <code class="language-plaintext highlighter-rouge">configmap</code>). All of this is controlled through one Argo CD application, with just two extra lines added to the manifest to tell Argo CD to render the application manifests through the argocd-lovely-plugin:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">argoproj.io/v1alpha1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Application</span>
<span class="nn">...</span>
<span class="na">spec</span><span class="pi">:</span>
<span class="nn">...</span>
<span class="na">source</span><span class="pi">:</span>
<span class="na">plugin</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">argocd-lovely-plugin</span>
</code></pre></div></div>
<p>The argocd-lovely-plugin is not just limited to post-rendering Helm through Kustomize. You can also:</p>
<ul>
<li>Merge in text file snippets to your <a href="https://github.com/crumbhole/argocd-lovely-plugin/tree/main/test/helm_merge">Helm</a> or <a href="https://github.com/crumbhole/argocd-lovely-plugin/tree/main/test/kustomize_merge">Kustomization output</a>.</li>
<li>Apply Kustomizations and modify Helm’s values.yaml per application, making it trivial to apply minor differences to your applications when using <a href="https://github.com/crumbhole/argocd-lovely-plugin/tree/main/examples/applicationsets">Argo CD Application Sets</a>.</li>
<li>The argocd-lovely-plugin can have its own plugins. <a href="https://github.com/crumbhole/argocd-lovely-plugin/tree/main/examples/argocd-vault-replacer">In this example</a>, we use the argocd-lovely-plugin to deploy some kustomizations alongside a Helm chart, but also to use the argocd-vault-replacer plugin to pull secrets from Hashicorp Vault and inject them into the manifests at deploy time. argocd-lovely-plugin acts as a master plugin (the only plugin Argo CD sees), and then runs other Argo CD compatible plugins in a chain. This acts a bit like a Unix pipe, so you can run helm | kustomize | argocd-vault-replacer.</li>
</ul>
<p>Give the <a href="https://github.com/crumbhole/argocd-lovely-plugin">argocd-lovely-plugin</a> a go today. Pull requests, issues and feedback are very welcome.</p>
<h1>The CI GitHub Notifier (2021-11-30)</h1>
<p>I wrote a lightweight container to post the status of a CI task to GitHub, allowing GitHub users to see the status of a PR or branch. It is designed for cloud-native workflows (e.g. <a href="https://argoproj.github.io/argo-workflows/">Argo Workflows</a> or <a href="https://tekton.dev/">Tekton</a>), but will run wherever a container can be run.</p>
<p>It annotates a GitHub Pull Request with the status of a particular CI task, along with a link back to the CI tool performing the task, so users can dive deeper into failures when they occur.</p>
<p><img src="/images/2021-11-30/github_notifier.png" alt="Github Pull Requests annotated by the ci-github-notifier" /></p>
<p>The container clocks in at just 7MB, so it should fit nicely into any CI process. You can <a href="https://github.com/sendible-labs/ci-github-notifier">find out more on the GitHub page</a>.</p>
<h1>Migrating From Jenkins To Argo At Sendible (2021-05-17)</h1>
<p><em>This was originally written by me and published on <a href="https://blog.argoproj.io/migrating-from-jenkins-to-argo-at-sendible-2ad4268837e9">the Argo blog</a> in May 2021.</em></p>
<p>Here at <a href="https://sendible.com">Sendible</a>, we are embarking on a program to make our application and development stacks more cloud-native, and we soon found that our existing CI solution wasn’t up to the job. We set about finding an alternative and thought that documenting our process might help others in a similar situation.</p>
<h2 id="why">Why?</h2>
<p>Jenkins is arguably still the de facto CI tool. It’s mature, and there is a wealth of knowledgeable people out there on the Internet who can help you get the best out of it. However, with maturity can come challenges.</p>
<p>The main pinch points were…</p>
<h2 id="plugin-spaghetti">Plugin spaghetti</h2>
<p>Jenkins has an abundance of plugins. The downside is, Jenkins has an abundance of plugins! Finding the right one to suit your needs, assessing the security impact of the plugin, and then keeping on top of updates/maintenance can start to become a real headache.</p>
<h2 id="not-cloud-native">Not cloud-native</h2>
<p>It is, of course, possible to run Jenkins in Kubernetes, and equally possible to spin up dynamic pods as jobs are triggered. However, Jenkins wasn’t originally designed to work this way and after using it, it starts to become clear that it doesn’t interoperate fully with Kubernetes. An obvious example is that the main installation of Jenkins can only run in one pod, so there is no HA deployment in case it is evicted or crashes.</p>
<p>Similarly, Jenkins’ natural approach to running a job is to deploy all required containers into one pod. This means starting up all required containers at the beginning of the run, and not releasing them until the end. As everything is in one pod, and pods cannot span multiple nodes, there is a limitation on how nodes can be used to accommodate the workload.</p>
<p>There are of course ways around this — for a while, we had cascading Jenkins jobs to trick it into providing us with dynamically provisioned pods… but after a while, we realized that we were just fighting a tool into doing something it wasn’t designed to do. The pipeline code soon became hard to maintain as a result, and debugging jobs became complex.</p>
<h2 id="cost-efficiency">Cost efficiency</h2>
<p>At Sendible, we found ourselves putting more and more workarounds in place to try and balance running our CI in a tool we knew, using Kubernetes, and keeping costs down. In reality, we were losing more time and money in maintenance costs than we would ever save.</p>
<p>There are other cost considerations. A well-used Jenkins controller can consume a large number of system resources, and the aforementioned single pod-per-job concern means you may need to provision large servers. If you’re running Jenkins outside of Kubernetes, and you don’t have an auto-scaling system in place, you might have agent nodes running all the time, which can increase your costs.</p>
<h2 id="so-why-argo">So why Argo?</h2>
<p>We were already using Argo CD for GitOps, and had completed a POC on Argo Rollouts to manage future releases. So it made sense to at least investigate their brothers, Workflows and Events.</p>
<p>It was immediately apparent how much faster Argo Workflows was when compared to our existing CI solution, and due to the retry option, we could make use of the Cluster Autoscaler and AWS Spot Instances, which immediately brought our CI/CD costs down by up to 90%! More cost savings were found when we saw that pods are only created when needed, resulting in the ability to provision smaller servers for the same job.</p>
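<p>For context, the retry behaviour that makes Spot Instances viable is just a retryStrategy on a workflow template. A minimal sketch (the limit and policy here are assumptions to tune for your own jobs):</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: spot-friendly-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: 3             # retry a few times if the Spot node disappears mid-run
        retryPolicy: OnError # illustrative choice; pick what suits your jobs
      container:
        image: alpine:3.15
        command: [sh, -c, "echo running a CI step"]
</code></pre></div></div>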
<p>We also wanted something that had the potential to expand beyond CI. Ultimately we were after a flexible “thing-doer” that we could use in multiple situations. As well as regular CI jobs, we already use Argo Workflows and Argo Events for:</p>
<ul>
<li>Alert Remediation (receive an alert from Alertmanager and trigger a workflow to remediate the issue).</li>
<li>Test environment provisioning from Slack.</li>
<li>Automatically testing our backup restores and alerting when there’s an issue.</li>
</ul>
<h2 id="how-long-did-it-take">How long did it take?</h2>
<p>As with all things DevOps, the process is ongoing, but with just one person on the initial project, armed with some Kubernetes knowledge but no Argo Workflows or Events knowledge, we had a basic proof of concept up and running within a day. This was then refined over two weeks, during which we productionised Workflows (making it HA, adding SSO etc.) enough that we were comfortable for it to be adopted by the wider team.</p>
<h2 id="some-things-we-learned-along-the-way">Some things we learned along the way</h2>
<p>As with all tool implementations, the process wasn’t without its challenges. Hopefully, this short list below might help others when they embark on a similar journey:</p>
<h2 id="un-learn-the-jenkins-way">Un-learn “The Jenkins Way”</h2>
<p>If you have spent years using Jenkins Pipelines, a cloud-native pipeline solution probably won’t come to your mind naturally. Try to avoid just re-writing a Jenkins pipeline in a different tool. Instead, take the time to understand what the pipeline is designed to achieve, and improve on it.</p>
<p>The dynamic pod provisioning of Argo Workflows means you will have to re-approach how you persist data during your job. The official approach is to use an artifact repository in an external storage solution such as S3, but for more transient data, you could consider setting up a ReadWriteMany (RWX) PVC to share a volume between a few pods, as sketched below.</p>
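<p>As a rough sketch of that second option, a workflow-scoped ReadWriteMany volume can be declared once and mounted by each step. The storage class, size and image below are assumptions:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: shared-workspace-
spec:
  entrypoint: main
  volumeClaimTemplates:
    - metadata:
        name: workdir
      spec:
        accessModes: ["ReadWriteMany"]   # needs an RWX-capable storage class, e.g. EFS or NFS
        resources:
          requests:
            storage: 1Gi
  templates:
    - name: main
      container:
        image: alpine:3.15
        command: [sh, -c, "echo transient-data > /work/state.txt"]
        volumeMounts:
          - name: workdir
            mountPath: /work
</code></pre></div></div>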
<p>Equally, you can use this migration as an opportunity to re-think parallelism and task ordering. Jenkins pipelines of course offer parallel running of steps, but it’s something one has to consciously choose. Argo Workflows’ approach is to run steps in parallel by default, allowing you to simply define dependencies between tasks. You can write your workflow in any order and just tweak the dependencies afterward. We recommend you keep refining these dependencies to find the best fit for you.</p>
<h2 id="make-use-of-workflow-templates">Make use of workflow templates</h2>
<p>Where possible, try to treat each step in a workflow as its own function. You’ll likely find that your various CI jobs have a lot of common functions. For example:</p>
<ul>
<li>Cloning from Git</li>
<li>Building container(s)</li>
<li>Updating a ticket management system or Slack with a status</li>
</ul>
<p>Write each of these process steps as an individual workflow template. This allows you to build a new CI process relatively quickly by just piecing these templates together in a DAG and passing the appropriate parameters to them, as sketched below. With time, writing a new CI process becomes primarily an exercise in putting the building blocks together.</p>
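<p>A hedged sketch of what that composition can look like, with hypothetical WorkflowTemplate names and parameters:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: app-ci-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: clone
            templateRef:
              name: git-clone            # hypothetical shared WorkflowTemplate
              template: main
            arguments:
              parameters:
                - name: repo
                  value: https://github.com/example/app.git
          - name: build
            dependencies: [clone]
            templateRef:
              name: build-container      # hypothetical shared WorkflowTemplate
              template: main
          - name: notify
            dependencies: [build]
            templateRef:
              name: slack-status         # hypothetical shared WorkflowTemplate
              template: main
</code></pre></div></div>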
<h2 id="you-dont-have-to-big-bang-it">You don’t have to ‘Big Bang’ it</h2>
<p>The word “Migration” is scary, and has the potential to be filled with dollar signs. It doesn’t have to be.</p>
<p>If you have Jenkins in place already, resist the urge to just rip it out or think you have to replace everything in one go. You can slowly run Workflows alongside Jenkins — you can even get Jenkins to trigger Workflows. When we started, we moved our automated integration tests across, before then moving on to the more complex CI jobs.</p>
<h2 id="make-use-of-the-argo-slack-channels-and-the-github-discussions-pages">Make use of the Argo Slack channels and the Github Discussions pages</h2>
<p>The Argo docs are good, as is the GitHub repo itself (especially the GitHub Discussions pages), but there is also a good group of knowledgeable people using Argo in weird and wonderful ways, and they mostly hang out in the <a href="https://slack.cncf.io/">Slack channels</a>.</p>
<h2 id="what-next">What next?</h2>
<p>We are still learning new things about Argo Workflows with each day we use it, and we are still in the phase of continually refactoring our Workflows to get the absolute best out of them.</p>
<p>Version 3.1 of Argo Workflows isn’t too far away and we are looking forward to the upcoming features. Of particular note, <a href="https://argoproj.github.io/argo-workflows/conditional-artifacts-parameters/">Conditional Parameters</a> will enable us to remove a number of script steps and <a href="https://argoproj.github.io/argo-workflows/container-set-template/">Container Sets</a> will allow us to speed up certain steps in our CI.</p>
<p>If you have any questions about my experience with Argo Workflows and Argo Events, you’ll probably find me in the CNCF Slack workspace, or you can contact me through the <a href="https://www.sendible.com">Sendible website</a>.</p>
<h1>The ArgoCD Vault Replacer Plugin (2021-02-22)</h1>
<p>I recently collaborated on an <a href="https://argoproj.github.io/argo-cd/">Argo CD plugin</a> called ArgoCD-Vault-Replacer. It allows you to merge your code in Git with your secrets in <a href="https://www.vaultproject.io/">Hashicorp Vault</a> to deploy into your Kubernetes cluster(s). It supports ‘normal’ Kubernetes yaml (or yml) manifests (of any type) as well as <a href="https://argoproj.github.io/argo-cd/user-guide/kustomize/">argocd-managed Kustomize</a> and <a href="https://argoproj.github.io/argo-cd/user-guide/helm/">Helm charts</a>.</p>
<h1 id="tldr">TL;DR</h1>
<p>You can <a href="https://github.com/crumbhole/argocd-vault-replacer">find the plugin here</a>, complete with instructions and some samples so you can test the installation.</p>
<p><img src="/images/2021-02-22/argocd-vault-replacer-diagram.png" alt="argocd-vault-replacer-diagram.png" /></p>
<h1 id="why">Why?</h1>
<p>GitOps is a great thing to have (or at least be aiming for)… ensuring all your infrastructure and configuration is stored as code, and ensuring that your Git repo is the source of truth for your environment. Argo CD happily sits there and ensures that your environment(s) match the configuration in Git.</p>
<p>The painful reality of GitOps is that it becomes all too easy to store secrets in your Git repo alongside your code. Combined with the distributed nature of Git, it can then become quite easy to lose those secrets to a bad actor, or simply to expose them to more people in your organisation than really should have them. One solution is to store your secrets in a secrets manager, but extracting those secrets in a programmatic way can become tricky.</p>
<h1 id="so-how-does-this-work">So how does this work?</h1>
<p>I’m not going to go into super detail here, because the readme is relatively comprehensive already (and if it isn’t, PRs are more than welcome!). Firstly you have to create a Kubernetes serviceAccount, and then give the serviceAccount permission to look at (some of your) secrets in Vault. <a href="https://www.vaultproject.io/docs/auth/kubernetes">Hashicorp’s own documentation</a> covers this well.</p>
<p>Then you append that serviceAccount to your installation of ArgoCD and install the plugin (using an init container).</p>
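<p>For reference, installing a plugin at the time meant registering it as a config management plugin in the argocd-cm ConfigMap, roughly as below. The binary name is an assumption; the plugin’s readme has the exact command:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  configManagementPlugins: |
    - name: argocd-vault-replacer
      generate:
        command: ["argocd-vault-replacer"]   # assumed binary name installed by the init container
</code></pre></div></div>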
<p>Lastly, you need to modify your yaml (or yml, or Helm, or Kustomize scripts) to point it at the relevant path(s) and key(s) in vault that you wish to add to your code.</p>
<p>In the following example, we will populate a Kubernetes Secret with the key <code class="language-plaintext highlighter-rouge">secretkey</code> on the path <code class="language-plaintext highlighter-rouge">path/to/your/secret</code>. As we are using a Vault kv2 store, we must include <code class="language-plaintext highlighter-rouge">/data/</code> in our path. Kubernetes secrets are base64 encoded, so we add the modifier <code class="language-plaintext highlighter-rouge">|base64</code> and the plugin handles the rest.</p>
<div class="language-yml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Secret</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">argocd-vault-replacer-secret</span>
<span class="na">data</span><span class="pi">:</span>
<span class="na">sample-secret</span><span class="pi">:</span> <span class="s"><vault:path/data/to/your/secret~secretkey|base64></span>
<span class="na">type</span><span class="pi">:</span> <span class="s">Opaque</span>
</code></pre></div></div>
<p>When Argo CD runs, it will pull your yaml from Git, find the secret at the given path and will merge the two together inside your cluster. The result is exactly what you’d expect, a nicely populated Kubernetes Secret.</p>
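<p>Once rendered, what lands in the cluster is just an ordinary Secret. As a made-up illustration, where the value is simply the base64 encoding of ‘supersecret’:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: v1
kind: Secret
metadata:
  name: argocd-vault-replacer-secret
data:
  sample-secret: c3VwZXJzZWNyZXQ=   # base64 of "supersecret", purely illustrative
type: Opaque
</code></pre></div></div>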
<h1 id="try-it-out">Try it out</h1>
<p>If you’re already using Argo CD and Vault, then this is really simple to set up and start using. Please do try it out; issues, comments and PRs are more than welcome: <a href="https://github.com/crumbhole/argocd-vault-replacer">github.com/crumbhole/argocd-vault-replacer</a></p>
<h1>I Accidentally Proof-Read A Book (2021-02-17)</h1>
<p>I’ve been studying for my CKA exam and the focus is on churning out the right commands at a fast pace, as well as modifying yaml on the command line. I’m not a terribly slow typist, but I do rely on graphical IDEs in my day-to-day work and I’m usually a single-shell-window kinda guy.</p>
<p>I knew about tmux and started searching around for some guides, and I stumbled across a work-in-progress book called <a href="https://themouseless.dev/">The Mouseless Development Environment</a>. It just so happens that I stumbled across the book in the last couple of months of its development, so when the author, <a href="https://www.linkedin.com/in/matthieu-cneude-28038182">Matthieu Cneude</a> asked for volunteer proof-readers I jumped at the chance.</p>
<p>I’m still working my way through the book itself (aside from the pages I proof-read of course), learning the deeper arts of vim before I dive into tmux.</p>
<p>If you’re interested, do help Matthieu out by buying a copy. There’s a sample of the book <a href="https://themouseless.dev/">on the website too</a>.</p>
<h1>Graphing Honeywell’s Evohome (2021-02-11)</h1>
<p>Ever wanted to see the temperature of each room in your house plotted on a graph against the current outside temperature and the temperature you’re requesting from your boiler? No, me neither… but I’ve written a thing to do just that.</p>
<p><img src="/images/2021-02-11/evohome_grafana.png" alt="evohome_grafana.png" /></p>
<p>This was mostly an experiment to see if a) I could copy/paste enough Python from the internet to make A Thing that works, and b) what it is possible to extract from the Honeywell Evohome API. The mini project combines a Grafana, an InfluxDB and a Python container in a small docker-compose stack. It calls the <a href="https://developer.honeywellhome.com/">Evohome API</a> and <a href="https://openweathermap.org/api">Openweathermap’s</a> API, and integrates with <a href="https://healthchecks.io">healthchecks.io</a> so you know if/when data stops being collected. The project is designed to be run on a Raspberry Pi, but will build and run on an x86/64 machine too.</p>
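<p>To give a rough idea of the shape of the stack, a docker-compose file for it might look something like this. The image tags, service names and environment variables are illustrative rather than copied from the repo:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># docker-compose.yml sketch, not the exact file from the project
version: "3"
services:
  influxdb:
    image: influxdb:1.8
    volumes:
      - influxdb-data:/var/lib/influxdb
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
    depends_on:
      - influxdb
  collector:
    build: ./collector          # hypothetical path to the Python container that polls the Evohome API
    environment:
      - HEALTHCHECKS_URL=https://hc-ping.com/your-uuid   # optional liveness ping
    depends_on:
      - influxdb
volumes:
  influxdb-data:
</code></pre></div></div>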
<p>Prior to this I’d never had a need for Python, some other scripting language always filled whatever automation gap I was trying to fill. I’m very much a learn-by-doing person, preferring to just break stuff until it works rather than to watch hours of YouTube videos.</p>
<p>If you’re still bothering to read this far, be aware that you need a Honeywell Evohome system, and you need to be in Europe. Apparently American Evohome systems use a different API.</p>
<p>I was already <a href="https://github.com/watchforstock/evohome-client">aware of a project</a> to write a Python module that does a lot of the heavy lifting for this project. They haven’t done an official release in a long time, but they are still maintaining the project, so to start with, I clone their repo into my container and deploy it.</p>
<p>The Python to query the Evohome API was pretty simple, largely a copy/paste job from their readme.</p>
<p>Next in the stack was InfluxDB… something I’ve been aware of but never used in anger. It’s a time series database, i.e. a database that’s great for storing data values against a timestamp… perfect for this project. All I did was fire up their docker container and start poking about (with a bit of help from Duck Duck Go) to understand how to put data into it. From there it was a case of a bit more searching to work out how to translate that into Python.</p>
<p>Lastly Grafana. I’ve used this for years, and I love that you can configure pretty much every aspect of it through environment variables (or a config file if you prefer). A few taps later and I had a Grafana container that connected itself to my Influxdb data source and had a dashboard all provisioned for it.</p>
<p>This is wrapped up in a docker-compose stack, and there’s an optional line to ping healthchecks.io every time data is collected. If it stops collecting data, then healthchecks.io will let you know (via email, Slack, Pushover… whatever you configure). Healthchecks.io is free for a decent number of checks.</p>
<p>Anyway, I have no real reason to have written all this. If you’ve got this far, well done and thanks for indulging me. If you’d like to look at the project, feel free: <a href="https://github.com/tico24/evohome-grafana">you can find it here</a>.</p>
<h1>Introducing The Sorry Cypress Helm Chart (2021-01-06)</h1>
<p>Yes, it’s another Sorry Cypress post, sorry!</p>
<p>Ever since <a href="https://crumbhole.com/playing-with-sorry-cypress-and-kubernetes/">I posted information on how to deploy Sorry Cypress to Kubernetes</a>, there’s been quite a lot of noise from people asking for a helm chart version. So I finally found some time to make one.</p>
<h1 id="tldr">tl;dr</h1>
<p>The chart <a href="https://github.com/sorry-cypress/charts">can be found here</a></p>
<h2 id="installing">Installing</h2>
<p>Install the chart using:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>helm repo add sorry-cypress https://sorry-cypress.github.io/charts
<span class="nv">$ </span>helm <span class="nb">install </span>my-release sorry-cypress/sorry-cypress
</code></pre></div></div>
<h2 id="upgrading">Upgrading</h2>
<p>Upgrade the chart deployment using:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>helm upgrade my-release sorry-cypress/sorry-cypress
</code></pre></div></div>
<h2 id="uninstalling">Uninstalling</h2>
<p>Uninstall the my-release deployment using:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>helm uninstall my-release
</code></pre></div></div>
<h1 id="more-words">More words</h1>
<p>By default, the chart deploys everything to Kubernetes (as you’d expect), but there’s no persistence; this is so that you can just get up and running as quickly as possible.</p>
<p>By default, the director uses the in-memory executionDriver and the dummy screenshotsDriver. However, you can choose to enable a persistent MongoDB database for execution, and you can use S3 for screenshots.</p>
<p>Do have a play, and if you have any feedback, it’s probably best to <a href="https://github.com/sorry-cypress/charts/issues">open an issue</a>.</p>
<h1>A Less Sucky Cypress Docker Image (2020-11-16)</h1>
<p>For some reason, the official Cypress Docker images are a bit odd. They seem to be programmatically created, but there’s never any consistency between npm versions and browser versions. They also contain repetitions across the layers, which cause them to be bulkier than they need to be. To make things worse, their images contain <a href="https://github.com/cypress-io/cypress-docker-images/issues/370">critical vulnerabilities that they have no interest in resolving</a>.</p>
<p>Frustrated by their images, I have knocked together my own. I don’t think it is perfect, but it’s easier to configure to suit your needs, and it’s less vulnerability-ridden!</p>
<p>I have run my image through a Trivy scanner and there are 138 medium (and lower) vulnerabilities, none of which are fixable. This is down from over 2000 fixable critical vulnerabilities found on the official Cypress Docker images.</p>
<p>It is built on Ubuntu 20.04. You can <a href="https://github.com/tico24/cypress-images">modify the Dockerfile</a> to specify your particular browser and Node versions and build it locally (which I’d recommend).</p>
<p>Alternatively, there’s an example containing Firefox 81 and Chrome 86 on <a href="https://hub.docker.com/r/tico24/cypress-images">Docker Hub</a>.</p>
<h1>Automatically Verifying Velero Kubernetes Backups (2020-10-26)</h1>
<p>So you’ve gone and set up Kubernetes backups… Nice one. You’ve probably used Velero if you haven’t used something native to your cloud provider. Velero is a solid tool, admittedly with a few quirks, but it gets the job done and can be relied upon. But how can you be certain the backups worked? You could manually restore every now and then, but that’s not very DevOpsy… plus it’s really boring. Here’s how I did it.</p>
<p>First, I created a namespace called “backup-canary” and deployed a simple deployment containing an nginx pod with a service. This pod has a 1GB Persistent Volume mounted at /usr/share/nginx/html.</p>
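<p>The canary itself is nothing special. Roughly the following manifests, where the names and label match the description above, and the image tag and storage size are assumptions:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: backup-canary
  namespace: backup-canary
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backup-canary
  namespace: backup-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: backup-canary
  template:
    metadata:
      labels:
        app: backup-canary
    spec:
      containers:
        - name: nginx
          image: nginx:1.19
          volumeMounts:
            - name: html
              mountPath: /usr/share/nginx/html
      volumes:
        - name: html
          persistentVolumeClaim:
            claimName: backup-canary
---
apiVersion: v1
kind: Service
metadata:
  name: backup-canary
  namespace: backup-canary
spec:
  selector:
    app: backup-canary
  ports:
    - port: 80
      targetPort: 80
</code></pre></div></div>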
<p>I then manually wrote a small index.html file inside the volume, containing a known phrase:</p>
<div class="language-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Get the name of the pod
</span><span class="go">kubectl -n backup-canary get pods -l=app=backup-canary -o jsonpath='{.items[-1:].metadata.name}'
</span><span class="c"># Write to the index.html file
</span><span class="gp">kubectl exec -n backup-canary -it $</span><span class="o">{</span>podName<span class="o">}</span> <span class="nt">--</span> bash <span class="nt">-c</span> <span class="s2">"echo hello-world > /usr/share/nginx/html/index.html"</span> y
</code></pre></div></div>
<p>If we were to put an ingress on the service (or did some magic kubectl port forwarding), we would see a web page with “hello-world” emblazoned on it.</p>
<p>I then allowed Velero to back this all up.</p>
<p>I chose to create a Jenkins job to then perform the restore test. Any CI tool should be up for the job, or you could do this natively in Kubernetes, but I wanted an outside tool to do the hard work.</p>
<p>Firstly I created a simple kubectl + velero CLI container:</p>
<div class="language-docker highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> dtzar/helm-kubectl:3.3.4</span>
<span class="k">RUN </span>wget <span class="nt">-nv</span> <span class="nt">-O</span> /tmp/velero.tar.gz https://github.com/vmware-tanzu/velero/releases/download/v1.5.2/velero-v1.5.2-linux-amd64.tar.gz <span class="se">\
</span> <span class="o">&&</span> <span class="nb">tar</span> <span class="nt">-xvf</span> /tmp/velero.tar.gz <span class="nt">--directory</span> /tmp/ <span class="se">\
</span> <span class="o">&&</span> <span class="nb">mv</span> /tmp/velero-v1.5.2-linux-amd64/velero/ /usr/local/bin/ <span class="se">\
</span> <span class="o">&&</span> <span class="nb">rm</span> <span class="nt">-rf</span> /tmp/velero.<span class="k">*</span>
</code></pre></div></div>
<p>Then I wrote a simple Jenkins pipeline to perform the following steps:</p>
<ol>
<li>Read the contents of the current index.html file and store as a variable.</li>
<li>Delete the “backup-canary” namespace.*</li>
<li>Find out the name of the latest successful backup.</li>
<li>Restore this backup to a new namespace (“restore-canary”).</li>
<li>Test that the restored file matches what we expected.**</li>
<li>Assuming we are successful, I then delete the restored namespace.</li>
<li>Restore the backup-canary namespace and deployment.</li>
<li>Write a new phrase (I use a phrase containing the jenkins job build ID) to the index.html to be backed up next time.</li>
</ol>
<p>* Velero will not restore a file (even to a different namespace) if the original remains.</p>
<p>** I test this by using kubectl to connect to the restored pod, and then perform a curl against the service to get the value. This has the benefit of not just testing that the PV restored successfully, but also the pod and service restored well enough that the webserver is accessible.</p>
<p><img src="/images/Velero-Restore-Pod.svg" alt="Velero-Restore-Pod.svg" /></p>
<p>The Jenkins job is then set to run regularly. Velero can produce Prometheus metrics if you set it up to do so, so restores (successes and failures) are recorded on a Grafana dashboard. If restores fail, a Prometheus alert is triggered and I can resolve the problem before I really need the backup.</p>
<p><img src="/images/grafana-restores.png" alt="grafana-restores.png" /></p>So you’ve gone and set up Kubernetes backups… Nice one. You’ve probably used Velero if you haven’t used something native to your cloud provider. Velero is a solid tool, admittedly with a few quirks, but it gets the job done and can be relied upon. But how can you be certain the backups worked? You could manually restore every now and then, but that’s not very DevOpsy.. plus it’s really boring. Here’s how I did it.Kubernetes Cron To Prometheus2020-10-12T00:00:00+00:002020-10-12T00:00:00+00:00https://crumbhole.com/kubernetes-cron-to-prometheus<p>I recently had a requirement to execute a small python script inside a cron job. This script produced a series of metric values that we ultimately wanted to be able to represent on a graph and alert on in the future.</p>
<p>I didn’t want to waste a whole EC2 instance on running a small script every hour, and I’m not particularly a fan of vendor lock-in, so I wanted to avoid the serverless route. So, as you’ve probably guessed from the title, I decided to use Kubernetes to manage the cron jobs.</p>
<p>We have Prometheus and Grafana already set up in our cluster, so all I had to do was to work out how to push data from our ephemeral cron pod into Prometheus.</p>
<hr />
<h2 id="if-you-dont-already-have-prometheus-and-grafana-installed-a-very-brief-interlude">If you don’t already have Prometheus and Grafana installed, a very brief interlude:</h2>
<ul>
<li>
<p>I installed Prometheus using <a href="https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack">this particular helm chart</a>. There are a lot of ‘competing’ charts out there and some seem to have been deprecated, so it was a bit of a minefield to try and source one that I liked.</p>
</li>
<li>
<p>The Prometheus Helm chart does come with a Grafana instance, but I didn’t like certain aspects of inflexibility that came with it. So <a href="https://github.com/grafana/helm-charts/tree/main/charts/grafana">I installed Grafana</a> in its own namespace using this Helm chart and configured it to use Prometheus as a data source.</p>
</li>
</ul>
<hr />
<p>Prometheus uses a pull method of collecting data. It scrapes data from defined service endpoints (or pods if you set it up to do so). This isn’t any use to us. At its slowest, our cron job takes 3 seconds to spin up and execute, and then it destroys itself… this would mean we’d have to turn the Prometheus scrape interval up to 11 to be able to catch the pod while it’s running, and even then we can’t guarantee that Prometheus will collect the generated data in time. Luckily, I’m not the first to encounter this problem.</p>
<h2 id="enter-prometheus-pushgateway">Enter Prometheus Pushgateway</h2>
<p>The Pushgateway is an additional deployment that acts as a middle man between ephemeral pods (such as our cron job) and Prometheus. Jobs can push data to the Pushgateway which holds it for Prometheus to scrape in the way it knows how.</p>
<p>The Helm Chart for the Pushgateway <a href="https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-pushgateway">can be found here</a>. Installation is pretty straightforward and the values.yml file is well documented.</p>
<p>For testing purposes, I would recommend you set up the ingress (or be prepared to forward the service locally so you can access the UI).</p>
<p>After installing the Pushgateway, we will need to tell Prometheus that it exists. This is as simple as adding an additionalServiceMonitor to the Prometheus helm chart and performing an update. Here’s a small example with a bit of annotation:</p>
<div class="language-yml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">additionalServiceMonitors</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">pushgateway</span> <span class="c1"># Generic name so you can identify it.</span>
<span class="na">selector</span><span class="pi">:</span>
<span class="na">matchLabels</span><span class="pi">:</span>
<span class="na">app</span><span class="pi">:</span> <span class="s">prometheus-pushgateway</span> <span class="c1"># Match with any service where the label key is 'app' and the value is 'prometheus-pushgateway'.</span>
<span class="na">namespaceSelector</span><span class="pi">:</span>
<span class="na">matchNames</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">prometheus</span> <span class="c1"># Only match with services in this namespace.</span>
<span class="na">endpoints</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">port</span><span class="pi">:</span> <span class="s">http</span> <span class="c1"># Port name in the service.</span>
<span class="na">interval</span><span class="pi">:</span> <span class="s">10s</span> <span class="c1"># How often to scrape.</span>
</code></pre></div></div>
<h2 id="the-python-bit">The Python bit</h2>
<p>We need to tweak our cron job slightly so it knows how to push the data to the push gateway. I’m not a Pythonmonger, so this bit probably took me longer than it should have done.</p>
<p>The official docs are <a href="https://github.com/prometheus/client_python">here</a>, but I found that the example didn’t actually work (YMMV). This didn’t matter too much, as I already had a Python script that generated values; I just needed to punt them somewhere.</p>
<p>The key takeaway was that I needed to use a Gauge in my instance as my values may go up or down. So I just had to add a couple of import lines:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">prometheus_client</span> <span class="k">as</span> <span class="n">prom</span>
<span class="kn">from</span> <span class="nn">prometheus_client</span> <span class="kn">import</span> <span class="n">CollectorRegistry</span><span class="p">,</span> <span class="n">Gauge</span><span class="p">,</span> <span class="n">push_to_gateway</span>
</code></pre></div></div>
<p>… and then push the values (taken from a dictionary) to the gateway:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Push to Prometheus Gateway
</span><span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">'__main__'</span><span class="p">:</span>
<span class="n">registry</span> <span class="o">=</span> <span class="n">CollectorRegistry</span><span class="p">()</span>
<span class="n">g</span> <span class="o">=</span> <span class="n">Gauge</span><span class="p">(</span><span class="s">'identifying_name_of_my_metric'</span><span class="p">,</span> <span class="s">'Metric description to help humans'</span><span class="p">,</span> <span class="n">registry</span><span class="o">=</span><span class="n">registry</span><span class="p">)</span>
<span class="n">g</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">a_dictionary</span><span class="p">[</span><span class="s">"metric"</span><span class="p">])</span>
<span class="n">push_to_gateway</span><span class="p">(</span><span class="s">'servicename.namespace.svc.cluster.local:9091'</span><span class="p">,</span> <span class="n">job</span><span class="o">=</span><span class="s">'Demo-job'</span><span class="p">,</span> <span class="n">registry</span><span class="o">=</span><span class="n">registry</span><span class="p">)</span>
</code></pre></div></div>
<p>‘servicename.namespace.svc.cluster.local’ is made up of the name of the Pushgateway service, then the name of the namespace the service is in, followed by ‘svc.cluster.local’. This will use Kubernetes’ internal DNS to route to the correct service, so nothing needs to be exposed.</p>
<p>Once complete, you should be able to see your data by looking first at your Pushgateway URL through a browser, and then at your Prometheus instance.</p>
<h2 id="making-the-container-reusable">Making the container reusable</h2>
<p>I envisage that this won’t be our only Python cron job, so I wanted to make something that could be reused with ease.</p>
<p>So the Python container is extremely lightweight: just Python plus dependencies. Here’s the Dockerfile:</p>
<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> python:3-alpine</span>
<span class="k">RUN </span>pip <span class="nb">install </span>requests prometheus_client
</code></pre></div></div>
<p>From there, my Kubernetes YAML looks a little like this:</p>
<div class="language-yml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">batch/v1beta1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">CronJob</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">demo-cronjob</span>
<span class="na">spec</span><span class="pi">:</span>
<span class="na">concurrencyPolicy</span><span class="pi">:</span> <span class="s">Forbid</span>
<span class="na">jobTemplate</span><span class="pi">:</span>
<span class="na">spec</span><span class="pi">:</span>
<span class="na">template</span><span class="pi">:</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">labels</span><span class="pi">:</span>
<span class="na">app</span><span class="pi">:</span> <span class="s">demo-cronjob</span>
<span class="na">spec</span><span class="pi">:</span>
<span class="na">containers</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">demo-cronjob</span>
<span class="na">image</span><span class="pi">:</span> <span class="s">location-of-our-container</span>
<span class="na">imagePullPolicy</span><span class="pi">:</span> <span class="s">Always</span>
<span class="na">command</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">python</span>
<span class="na">args</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">/tmp/demo.py</span>
<span class="na">volumeMounts</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">scripts-mount</span>
<span class="na">mountPath</span><span class="pi">:</span> <span class="s">/tmp</span>
<span class="na">volumes</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">scripts-mount</span>
<span class="na">configMap</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">scripts-mount</span>
<span class="na">restartPolicy</span><span class="pi">:</span> <span class="s">Never</span>
<span class="na">schedule</span><span class="pi">:</span> <span class="s1">'</span><span class="s">*/5</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*'</span>
<span class="na">successfulJobsHistoryLimit</span><span class="pi">:</span> <span class="m">3</span>
<span class="nn">---</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">ConfigMap</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">scripts-mount</span>
<span class="na">data</span><span class="pi">:</span>
<span class="s">demo.py</span><span class="pi">:</span> <span class="pi">|-</span>
<span class="s">import requests</span>
<span class="s">import prometheus_client as prom</span>
<span class="s">from prometheus_client import CollectorRegistry, Gauge, push_to_gateway</span>
<span class="s">...</span>
<span class="s"># Push to Prometheus Gateway</span>
<span class="s">if __name__ == '__main__':</span>
<span class="s">registry = CollectorRegistry()</span>
<span class="s">g = Gauge('identifying_name_of_my_metric', 'Metric description to help humans', registry=registry)</span>
<span class="s">g.set(a_dictionary["metric"])</span>
<span class="s">push_to_gateway('servicename.namespace.svc.cluster.local:9091', job='Demo-job', registry=registry)</span>
</code></pre></div></div>
<p>This allows me to add additional Python scripts easily by appending to the ConfigMap and adding additional cron jobs as required.</p>