Test-Driven Infrastructure


Most teams ship infrastructure without tests. That’s like writing application code with no CI and hoping for the best. Infrastructure is critical, complex, and fragile—but too often it’s left unchecked.

With Test-Driven Development (TDD), we can flip the script. Instead of praying our Terraform and IAM rules “just work,” we define what good looks like, write tests, and let automation keep us safe.

Why Test-Driven Development for Infrastructure?

In 15 years of building systems, I’ve never seen a project with comprehensive automated infrastructure tests. That gap is dangerous. Infrastructure touches everything: networking, IAM, deployments, storage. When something breaks, it often breaks catastrophically.

TDD forces us to ask “what does good look like?” before we change anything. The payoff:

  • Clear outcomes – we know what success means
  • Fast feedback – catch issues in seconds, not hours
  • Safe changes – refactor without fear
  • Living documentation – tests show how the system works
  • Built-in troubleshooting – validation suite ready when things go wrong

We test changes manually anyway. Why not automate those checks?

The Lightweight TDD Pattern

We don’t need heavyweight test frameworks. With Terragrunt hooks and bats, we can build lightweight, shell-native, and adaptable infrastructure tests.

A diagram showing the flow of Terragrunt, Bats, and GitHub Actions

Key idea: Assert behavior at the boundaries. For example, don’t test whether an IAM role is attached—test whether the service account can actually upload to a bucket.

Tool Stack

Terragrunt orchestrates Terraform and provides execution hooks. We run tests immediately after infrastructure changes.

Bats is Bash-native testing. With bats-detik, we get natural-language assertions for Kubernetes. Call kubectl, helm, flux, gcloud, or aws directly—no abstraction layers.

GitHub Actions runs everything consistently. dorny/test-reporter turns JUnit XML into clean GitHub reports.

Test Layout Convention and Hooking Up Test Execution

Keep it conventional:

  • Place tests in a tests directory next to the module’s terragrunt.hcl.
  • Terragrunt’s root.hcl defines a hook that runs all tests of a module after apply.
  • If no tests exist, it simply warns.
root.hcl
hcl
terraform {
    # Run the module's Bats tests after every successful apply.
    after_hook "tests" {
        commands = ["apply"]
        execute = [
            "bash", "-c", <<-EOF
            if [ -d tests ]; then
              mkdir -p test-results
              bats --report-formatter junit --output test-results tests/
            else
              echo '⚠️ No tests found'
            fi
            EOF
        ]
    }
}
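The same hook logic can be written as a small shell function, which makes it easy to try outside of Terragrunt. This is only a sketch: the function name run_module_tests and the TEST_RUNNER override are hypothetical and not part of Terragrunt or Bats. Only the bats flags are taken from the hook above.

```shell
#!/usr/bin/env bash
# Sketch of the hook logic as a standalone function.
# TEST_RUNNER is a hypothetical override; it defaults to the real bats binary.
run_module_tests() {
  local module_dir="${1:-.}"
  local runner="${TEST_RUNNER:-bats}"

  # No tests directory: warn and succeed, like the hook does.
  if [ ! -d "${module_dir}/tests" ]; then
    echo "No tests found in ${module_dir}" >&2
    return 0
  fi

  mkdir -p "${module_dir}/test-results"
  # Run from inside the module so relative paths match the hook.
  (
    cd "${module_dir}" &&
      "${runner}" --report-formatter junit --output test-results tests/
  )
}
```

Calling run_module_tests path/to/module returns the runner's exit code, so a failing test fails the caller in the same way it fails the hook.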

Writing Tests

Use Bats’ setup_suite to fetch cluster credentials once before running tests.

cluster/tests/setup_suite.bash
bash
#!/usr/bin/env bash
set -euo pipefail

function setup_suite() {
  # Read all Terraform outputs of this module once.
  tf_output_json=$(terragrunt output -json)

  PROJECT_ID=$(echo "${tf_output_json}" | jq -r .platform_project.value.id)
  CLUSTER_NAME=$(echo "${tf_output_json}" | jq -r .platform_cluster.value.name)
  CLUSTER_REGION=$(echo "${tf_output_json}" | jq -r .platform_cluster.value.location)
  KUBECONFIG=~/.kube/config

  # Fetch cluster credentials so kubectl and flux can reach the cluster.
  gcloud container clusters get-credentials "${CLUSTER_NAME}" \
    --region "${CLUSTER_REGION}" --project "${PROJECT_ID}"
  kubectl version
  export KUBECONFIG PROJECT_ID CLUSTER_NAME CLUSTER_REGION
}
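The suite setup depends on several CLIs being installed (terragrunt, jq, gcloud, kubectl). A small guard can fail fast with a readable message instead of a cryptic error in the middle of the setup. The helper name require_tools is hypothetical; it is a sketch that relies only on the shell builtin command -v.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report every missing CLI, then fail if any is missing.
require_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "${tool}" >/dev/null 2>&1; then
      echo "Missing required tool: ${tool}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Possible call as the first line of setup_suite:
#   require_tools terragrunt jq gcloud kubectl
```

Reporting all missing tools at once, rather than stopping at the first one, saves a round trip when a fresh CI runner lacks several of them.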

Next, an example test that validates the Flux installation on the cluster:

cluster/tests/flux.bats
bash
#!/usr/bin/env bats

bats_load_library bats-support
bats_load_library bats-assert
bats_load_library bats-detik/detik.bash

DETIK_CLIENT_NAME="kubectl"
DETIK_CLIENT_NAMESPACE="flux-system"

@test "Flux controllers are healthy" {
  flux check
}

@test "Flux Kustomization reconciled successfully" {
  verify "'status.conditions[*].reason' matches 'ReconciliationSucceeded' for kustomization named 'flux-system'"
}

@test "Given image automation is enabled, Then its CRDs are installed" {
  for crd in \
    imagerepositories.image.toolkit.fluxcd.io \
    imagepolicies.image.toolkit.fluxcd.io \
    imageupdateautomations.image.toolkit.fluxcd.io
  do
    verify "there is 1 crd named '$crd'"
  done
}

@test "When a managed label on flux-system namespace is tampered, Then Flux reconciles it back" {
  kubectl label namespace flux-system drift-test=temporary --overwrite
  flux reconcile kustomization flux-system
  try "at most 3 times every 1s \
      to get namespace named 'flux-system' \
      and verify that 'metadata.labels.drift-test' is '<none>'"
}

Assume the cluster directory is a Terragrunt-enabled Terraform module that provisions a Kubernetes cluster and also bootstraps Flux via Terraform. These tests verify that the Flux controllers are healthy, that the Flux Kustomization reconciled successfully, that the image automation CRDs are installed, and that Flux reverts manual drift in its own configuration.
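The try assertion in the last test polls a Kubernetes resource until the expected state appears. For checks that call gcloud or aws directly, bats-detik does not apply, so a plain retry helper can serve the same purpose. The function below is a hypothetical sketch, not part of Bats or bats-detik.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to <attempts> times,
# sleeping <delay> seconds between failed attempts.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if ((i < attempts)); then
      sleep "${delay}"
    fi
  done
  return 1
}

# Possible use inside a test (the check function is illustrative):
#   retry 3 1 some_check_function
```

The helper runs the command in the current shell, so it works with shell functions as well as external binaries.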

It’s strongly recommended to keep tests at this level as high-level as possible: focus on the desired behavior, not on concrete properties or the state of the infrastructure.

For example, if you need to attach an IAM Role to a principal, don’t validate the exact role presence. Instead, verify that the principal can perform its intended actions—for example, uploading to a bucket. This avoids brittle tests coupled to implementation details that break on minor changes like role composition.
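A behavior check of this kind could look like the following sketch. It assumes Google Cloud, that the identity running the tests is allowed to impersonate the service account, and that the bucket already exists. The function name can_upload_as, the probe object name, and the example arguments are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the service account can really
# upload an object to the bucket. Assumes the caller may impersonate it.
can_upload_as() {
  local service_account="$1" bucket="$2"
  local probe_file object status=0
  probe_file="$(mktemp)"
  object="gs://${bucket}/tdd-probe-$$-${RANDOM}"
  echo "probe" >"${probe_file}"

  # The actual behavior under test: an upload performed as the principal.
  gcloud storage cp "${probe_file}" "${object}" \
    --impersonate-service-account="${service_account}" || status=$?

  if [ "${status}" -eq 0 ]; then
    # Clean up with the caller's own credentials; ignore cleanup failures.
    gcloud storage rm "${object}" >/dev/null 2>&1 || true
  fi
  rm -f "${probe_file}"
  return "${status}"
}

# Possible use inside a Bats test (values are illustrative):
#   run can_upload_as "uploader@my-project.iam.gserviceaccount.com" "my-bucket"
#   assert_success
```

The test stays green when roles are regrouped or renamed, and turns red only when the upload itself stops working.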

CI Integration and Reporting

As a cherry on top, bats (like almost all other test runners) can report test results as JUnit XML, one of the formats supported by dorny/test-reporter@v2. Feeding these reports into the CI pipeline gives a quick overview when failures or unexpected behavior occur. A trimmed-down example of how to integrate reporting:

name: 'Infrastructure Apply'
on:
    push:
        branches:
            - 'main'
    workflow_dispatch:

env:
    TF_IN_AUTOMATION: 'true'
    TG_NON_INTERACTIVE: 'true'

permissions:
    # Minimal permissions required for reports
    contents: read
    actions: read
    checks: write

jobs:
    apply:
        name: 'Apply'
        environment: 'infrastructure'
        runs-on: ubuntu-latest
        concurrency:
            group: infrastructure
        steps:
            - uses: 'actions/checkout@v5'

            # Install bats and bats-detik

            # Add validation and plan if needed

            - id: apply-all
              name: '🚀 Apply All'
              run: terragrunt apply --all

            - name: Test Report
              uses: dorny/test-reporter@v2
              if: ${{ !cancelled() }}
              with:
                  name: Foundation Apply Tests
                  path: '**/test-results/report.xml'
                  reporter: java-junit

Result: a clean pass/fail report embedded in your GitHub Actions workflow.

A screenshot of a bats report in GitHub action workflow
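For a quick look at the same reports on a local machine, the JUnit files can be scanned with standard shell tools. This is a rough sketch: the function name count_junit_failures is hypothetical, and it assumes that each failed test appears as a failure element in test-results/report.xml, as is usual for JUnit XML.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number of <failure> elements found in all
# test-results/report.xml files below the given directory (default: .).
count_junit_failures() {
  local root="${1:-.}"
  find "${root}" -type f -path '*/test-results/report.xml' -exec cat {} + |
    { grep -o '<failure' || true; } | wc -l | tr -d ' '
}

# Possible use after a local apply:
#   echo "Failed tests: $(count_junit_failures .)"
```

This is a text match, not an XML parser, so it is meant as a convenience check rather than a replacement for the CI report.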

Pattern Summary

  • Place tests in a tests directory at the root of the module, next to its terragrunt.hcl.
  • Hook Terragrunt to run them automatically after apply.
  • Write high-level behavior tests, not brittle state checks.
  • Integrate results into CI for instant visibility.

This pattern is lightweight, shell-native, and extends to any test runner. As a bonus, you build a validation suite that pinpoints infrastructure issues instantly. Every production incident becomes a new test case.

Ready to make your infrastructure iterations safer and faster?

I help teams implement test-driven infrastructure—through architecture reviews, training, and hands-on implementation.

Get in touch!

Jan-Philip Loos
