Test-Driven Infrastructure


Most teams ship infrastructure without tests. That’s like writing application code with no CI and hoping for the best. Infrastructure is critical, complex, and fragile—but too often it’s left unchecked.

With Test-Driven Development (TDD), we can flip the script. Instead of praying our Terraform and IAM rules “just work,” we define what good looks like, write tests, and let automation keep us safe.

Why Test-Driven Development for Infrastructure?

In 15 years of building systems, I’ve never seen a project with comprehensive automated infrastructure tests. That gap is dangerous. Infrastructure touches everything: networking, IAM, deployments, storage. When something breaks, it often breaks catastrophically.

By asking ourselves “what does good look like?” before making changes, TDD gives us:

  1. Defined outcomes — clarity on what “good” infrastructure means.
  2. Automated checks — fast feedback loops that catch issues early.
  3. Confidence in change — safer refactors and upgrades.
  4. A living safety net — tests that evolve with the system.
  5. Bonus: Quick Validation Suite — Troubleshooting built-in.

We already test changes manually. Automating those tests turns fragile rituals into repeatable, trustworthy pipelines.

The Lightweight TDD Pattern

We don’t need heavyweight test frameworks. With Terragrunt hooks and Bats, we can build lightweight, shell-native, and adaptable infrastructure tests.

[Diagram: the flow between Terragrunt, Bats, and GitHub Actions]

Key idea: Assert behavior at the boundaries. For example, don’t test whether an IAM role is attached—test whether the service account can actually upload to a bucket.

Tool Stack

Terragrunt & Terraform

Terragrunt orchestrates Terraform and provides execution hooks. We use these hooks to run tests right after infrastructure changes.

Bats & Bats-Detik

Bats is a Bash-native testing framework. Combined with bats-detik, it gives us natural-language assertions against Kubernetes resources. The power is simplicity: call kubectl, helm, flux, gcloud, or aws directly in tests.

GitHub Actions with Test Reporting

CI pipelines ensure consistency. With dorny/test-reporter, we turn JUnit XML output into clean GitHub reports.

Test Layout Convention and Hooking Up Test Execution

Keep it conventional:

  • Place tests in a tests directory next to the module’s terragrunt.hcl.
  • Terragrunt’s root.hcl defines a hook that runs all tests of a module after apply.
  • If no tests exist, it simply warns.
root.hcl
hcl
terraform {
  # Run the module's Bats tests after apply.
  after_hook "tests" {
    commands = ["apply"]
    execute = [
      "bash", "-c", <<-EOF
        if [ -d tests ]; then
          mkdir -p test-results
          bats --report-formatter junit --output test-results tests/
        else
          echo '⚠️ No tests found'
        fi
      EOF
    ]
  }
}
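To run the same check by hand, without going through an apply, the hook's inline script can be pulled into a small function. This is only a sketch: `run_module_tests` and the `BATS_BIN` override are hypothetical names introduced here, not part of Terragrunt or Bats.

```shell
#!/usr/bin/env bash
# Sketch: the hook's inline script as a reusable function.
# run_module_tests and BATS_BIN are hypothetical names (assumptions for illustration).

run_module_tests() {
  local module_dir="${1:-.}"
  local bats_bin="${BATS_BIN:-bats}"

  if [ -d "${module_dir}/tests" ]; then
    mkdir -p "${module_dir}/test-results"
    # Same flags as the hook: the JUnit report is written into test-results/.
    (cd "${module_dir}" && "${bats_bin}" --report-formatter junit --output test-results tests/)
  else
    # Mirror the hook: a missing tests directory is a warning, not an error.
    echo "No tests found in ${module_dir}" >&2
  fi
}
```

Calling `run_module_tests cluster` from the repository root would then behave like the hook does for the cluster module.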

Writing Tests

Use Bats’ setup_suite to fetch cluster credentials once before running tests.

cluster/tests/setup_suite.bash
bash
#!/usr/bin/env bash
set -euo pipefail

# Runs once before all test files in this directory.
function setup_suite() {
  local tf_output_json
  tf_output_json=$(terragrunt output -json)

  # Read the cluster coordinates from the module's Terraform outputs.
  PROJECT_ID=$(jq -r '.platform_project.value.id' <<<"${tf_output_json}")
  CLUSTER_NAME=$(jq -r '.platform_cluster.value.name' <<<"${tf_output_json}")
  CLUSTER_REGION=$(jq -r '.platform_cluster.value.location' <<<"${tf_output_json}")
  KUBECONFIG=~/.kube/config
  # Export before calling gcloud so the credentials land in this kubeconfig.
  export KUBECONFIG PROJECT_ID CLUSTER_NAME CLUSTER_REGION

  gcloud container clusters get-credentials "${CLUSTER_NAME}" \
    --region "${CLUSTER_REGION}" --project "${PROJECT_ID}"
  kubectl version
}

Next, an example test that validates the Flux installation on the cluster:

cluster/tests/flux.bats
bash
#!/usr/bin/env bats

bats_load_library bats-support
bats_load_library bats-assert
bats_load_library bats-detik/detik.bash

DETIK_CLIENT_NAME="kubectl"
DETIK_CLIENT_NAMESPACE="flux-system"

@test "Flux controllers are healthy" {
  flux check
}

@test "Flux Kustomization reconciled successfully" {
  verify "'status.conditions[*].reason' matches 'ReconciliationSucceeded' for kustomization named 'flux-system'"
}

@test "Given image automation is enabled, Then its CRDs are installed" {
  for crd in \
    imagerepositories.image.toolkit.fluxcd.io \
    imagepolicies.image.toolkit.fluxcd.io \
    imageupdateautomations.image.toolkit.fluxcd.io
  do
    verify "there is 1 crd named '$crd'"
  done
}

@test "When a managed label on flux-system namespace is tampered, Then Flux reconciles it back" {
  kubectl label namespace flux-system drift-test=temporary --overwrite
  flux reconcile kustomization flux-system
  try "at most 3 times every 1s \
      to get namespace named 'flux-system' \
      and verify that 'metadata.labels.drift-test' is '<none>'"
}

Assume the cluster directory is a Terragrunt-enabled Terraform module that provisions a Kubernetes cluster and also bootstraps Flux via Terraform. These tests verify that the Flux controllers are healthy, that the Flux Kustomization reconciled successfully, that the image automation CRDs are installed, and that Flux reverts drift in its own configuration.
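The `try "at most 3 times every 1s ..."` assertion comes from bats-detik and targets Kubernetes resources. For checks that call gcloud or aws directly, a similar retry can be sketched in plain shell. The `retry` helper below is hypothetical and not part of Bats or bats-detik.

```shell
#!/usr/bin/env bash
# Sketch: a detik-style "try at most N times every S seconds" for arbitrary commands.
# retry is a hypothetical helper (an assumption for illustration).

retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "${i}" -le "${attempts}" ]; do
    # Succeed as soon as the wrapped command succeeds.
    if "$@"; then
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "${i}" -lt "${attempts}" ]; then
      sleep "${delay}"
    fi
    i=$((i + 1))
  done
  return 1
}
```

Inside a test, something like `retry 10 3 gcloud storage ls "gs://${BUCKET}"` would wait up to roughly half a minute for a bucket to become listable, where `BUCKET` is a placeholder variable.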

It’s strongly recommended to keep tests at this level as high-level as possible: focus on the desired behavior, not on concrete properties or the state of the infrastructure.

For example, if you need to attach an IAM Role to a principal, don’t validate the exact role presence. Instead, test that the principal can perform the intended actions like uploading to a bucket. This avoids brittle tests coupled to implementation details that break on minor changes like role composition.
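A behavior-level check of this kind might look like the sketch below. It assumes Google Cloud, that the caller is allowed to impersonate the service account under test (which requires the Service Account Token Creator role on that account), and that a throwaway probe object in the bucket is acceptable. `can_upload` and the `GCLOUD_BIN` override are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: assert the behavior (upload works), not the IAM binding behind it.
# can_upload and GCLOUD_BIN are hypothetical names (assumptions for illustration).

can_upload() {
  local service_account="$1" bucket="$2"
  local gcloud_bin="${GCLOUD_BIN:-gcloud}"
  local probe object
  probe="$(mktemp)"
  object="gs://${bucket}/tdd-probe-$$.txt"
  echo "probe" > "${probe}"

  # Upload while impersonating the principal under test.
  if "${gcloud_bin}" storage cp "${probe}" "${object}" \
      --impersonate-service-account="${service_account}" >/dev/null 2>&1; then
    # Best-effort cleanup with the caller's own credentials; not part of the assertion.
    "${gcloud_bin}" storage rm "${object}" >/dev/null 2>&1 || true
    rm -f "${probe}"
    return 0
  fi
  rm -f "${probe}"
  return 1
}
```

A Bats test body can then consist of a single call such as `can_upload "${UPLOADER_SA}" "${ARTIFACT_BUCKET}"`, with both variables being placeholders. The test keeps passing if the role composition changes, as long as the upload still works.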

CI Integration and Reporting

As a cherry on top, Bats (like most other test runners) can report test results as JUnit XML or in another format supported by dorny/test-reporter@v2. Feeding these results into the CI pipeline gives a quick overview when failures or unexpected behavior occur. A trimmed-down example of how to integrate reporting:

name: 'Infrastructure Apply'
on:
    push:
        branches:
            - 'main'
    workflow_dispatch:

env:
    TF_IN_AUTOMATION: 'true'
    TG_NON_INTERACTIVE: 'true'

permissions:
    # Minimal permissions required for reports
    contents: read
    actions: read
    checks: write

jobs:
    apply:
        name: 'Apply'
        environment: 'infrastructure'
        runs-on: ubuntu-latest
        concurrency:
            group: infrastructure
        steps:
            - uses: 'actions/checkout@v5'

            # Install bats and bats-detik

            # Add validation and plan if needed

            - id: apply-all
              name: '🚀 Apply All'
              run: terragrunt apply --all

            - name: Test Report
              uses: dorny/test-reporter@v2
              if: ${{ !cancelled() }}
              with:
                  name: Foundation Apply Tests
                  path: '**/test-results/report.xml'
                  reporter: java-junit

Result: a clean pass/fail report embedded in your GitHub Actions workflow.

[Screenshot: a Bats test report in a GitHub Actions workflow]
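The same report.xml can also be inspected locally before CI renders it. The helper below is a rough sketch that assumes the report uses the standard JUnit `<testcase>` and `<failure>` elements. `junit_summary` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: summarize a JUnit XML report without any XML tooling.
# junit_summary is a hypothetical helper; it assumes standard <testcase>/<failure> elements.

junit_summary() {
  local report="$1" total failed
  # grep -o prints one match per line, so wc -l counts occurrences even in a single-line file.
  # "|| true" keeps a zero-match grep from aborting scripts that run with set -e and pipefail.
  total=$( { grep -o '<testcase[ >]' "${report}" || true; } | wc -l | tr -d ' ')
  failed=$( { grep -o '<failure[ >/]' "${report}" || true; } | wc -l | tr -d ' ')
  echo "tests=${total} failed=${failed}"
  # Exit status reflects whether any failure was recorded.
  [ "${failed}" -eq 0 ]
}
```

Running `junit_summary cluster/test-results/report.xml` prints a one-line summary and returns non-zero if any test failed.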

Pattern Summary

  • Place tests in a tests directory at the root of the Terraform module, next to its terragrunt.hcl.
  • Hook Terragrunt to run them automatically after apply.
  • Write high-level behavior tests, not brittle state checks.
  • Integrate results into CI for instant visibility.

This pattern is lightweight, shell-native, and adaptable. You can extend it to any test runner.

As a bonus, the pattern allows combining all infrastructure tests into a single validation suite, which can help pinpoint infrastructure issues and rule out many possible causes. A recommended discipline is to update and extend the tests after every issue you encounter, continuously improving the validation power of the suite.
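One way to sketch such a combined suite is to walk the repository and run every module's tests directory without applying anything. The function below assumes the tests/ convention described earlier. `validate_all` and the `BATS_BIN` override are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: run every module's tests as one validation suite, without applying anything.
# validate_all and BATS_BIN are hypothetical names (assumptions for illustration).

validate_all() {
  local root="${1:-.}"
  local bats_bin="${BATS_BIN:-bats}"
  local failed=0 tests_dir module_dir all_dirs

  # Skip Terragrunt's working copies so each suite runs only once.
  all_dirs="$(find "${root}" -type d -name tests ! -path '*/.terragrunt-cache/*' | sort)"

  while IFS= read -r tests_dir; do
    [ -n "${tests_dir}" ] || continue
    module_dir="$(dirname "${tests_dir}")"
    echo "==> ${module_dir}"
    # Run from the module directory so setup_suite can call `terragrunt output`.
    if ! (cd "${module_dir}" && "${bats_bin}" tests/ </dev/null); then
      failed=$((failed + 1))
    fi
  done <<EOF
${all_dirs}
EOF

  echo "modules with failures: ${failed}"
  [ "${failed}" -eq 0 ]
}
```

During an incident, `validate_all .` gives a per-module pass/fail overview and a non-zero exit status if any module's suite fails.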

Ready to make your infrastructure iterations safer and faster?

I help teams build test-driven infrastructure practices—through consultations, architecture reviews, trainings and hands-on implementation.

Jan-Philip Loos