Test-Driven Infrastructure

Most teams ship infrastructure without tests. That’s like writing application code with no CI and hoping for the best. Infrastructure is critical, complex, and fragile—but too often it’s left unchecked.
With Test-Driven Development (TDD), we can flip the script. Instead of praying our Terraform and IAM rules “just work,” we define what good looks like, write tests, and let automation keep us safe.
Why Test-Driven Development for Infrastructure?
In 15 years of building systems, I’ve never seen a project with comprehensive automated infrastructure tests. That gap is dangerous. Infrastructure touches everything: networking, IAM, deployments, storage. When something breaks, it often breaks catastrophically.
TDD forces us to ask “what does good look like?” before we change anything. The payoff:
- Clear outcomes – we know what success means
- Fast feedback – catch issues in seconds, not hours
- Safe changes – refactor without fear
- Living documentation – tests show how the system works
- Built-in troubleshooting – validation suite ready when things go wrong
We test changes manually anyway. Why not automate them?
The Lightweight TDD Pattern
We don’t need heavyweight test frameworks. With Terragrunt hooks and bats, we can build lightweight, shell-native, and adaptable infrastructure tests.
Key idea: Assert behavior at the boundaries. For example, don’t test whether an IAM role is attached—test whether the service account can actually upload to a bucket.
Tool Stack
Terragrunt orchestrates Terraform and provides execution hooks. We run tests immediately after infrastructure changes.
Bats is Bash-native testing. With bats-detik, we get natural-language assertions for Kubernetes. Call kubectl, helm, flux, gcloud, or aws directly—no abstraction layers.
GitHub Actions runs everything consistently. dorny/test-reporter turns JUnit XML into clean GitHub reports.
Test Layout Convention and Hooking Up Test Execution
Keep it conventional:
- Place tests in a tests directory next to the module’s terragrunt.hcl.
- Terragrunt’s root.hcl defines a hook that runs all tests of a module after apply.
- If no tests exist, it simply warns.
terraform {
  # Run the module's Bats suite after every successful apply.
  after_hook "tests" {
    commands = ["apply"]
    execute = [
      "bash", "-c", <<-EOF
        if [ -d tests ]; then
          mkdir -p test-results
          bats --report-formatter junit --output test-results tests/
        else
          echo '⚠️ No tests found'
        fi
      EOF
    ]
  }
}
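Because the layout is purely conventional, the same suites can also be run outside of an apply, for instance while troubleshooting. The following is a minimal sketch and not part of the setup described here: the function names find_test_dirs and run_all_tests are hypothetical, and it assumes bats is on the PATH and that each module keeps its tests directory next to its terragrunt.hcl.

```shell
#!/usr/bin/env bash
# Sketch: run every module's Bats suite on demand (function names are hypothetical).

# Print each module directory that holds both a terragrunt.hcl and a tests/ directory.
find_test_dirs() {
  local root="${1:-.}"
  local dir
  find "${root}" -type d -name tests ! -path '*/.terragrunt-cache/*' | sort |
    while IFS= read -r dir; do
      if [ -f "$(dirname "${dir}")/terragrunt.hcl" ]; then
        dirname "${dir}"
      fi
    done
}

# Run the suite of every discovered module; return non-zero if any suite failed.
run_all_tests() {
  local root="${1:-.}"
  local failed=0
  local module
  while IFS= read -r module; do
    [ -n "${module}" ] || continue
    echo "==> ${module}"
    # The subshell keeps the cd local; stdin is detached so bats cannot consume the list.
    (cd "${module}" && bats tests/ </dev/null) || failed=1
  done <<EOF
$(find_test_dirs "${root}")
EOF
  return "${failed}"
}
```

Calling run_all_tests with the repository root as its argument would then walk all modules and report a single exit code, which is handy for an ad-hoc health check.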
Test Writing Style
Use Bats’ setup_suite, defined in a setup_suite.bash file next to the tests, to fetch cluster credentials once before running tests.
#!/usr/bin/env bash
set -euo pipefail

# Runs once before all test files: read Terragrunt outputs, then fetch cluster credentials.
function setup_suite() {
  tf_output_json=$(terragrunt output -json)
  PROJECT_ID=$(echo "${tf_output_json}" | jq -r .platform_project.value.id)
  CLUSTER_NAME=$(echo "${tf_output_json}" | jq -r .platform_cluster.value.name)
  CLUSTER_REGION=$(echo "${tf_output_json}" | jq -r .platform_cluster.value.location)
  KUBECONFIG=~/.kube/config
  gcloud container clusters get-credentials "${CLUSTER_NAME}" \
    --region "${CLUSTER_REGION}" --project "${PROJECT_ID}"
  # Fail early if the API server is unreachable.
  kubectl version
  export KUBECONFIG PROJECT_ID CLUSTER_NAME CLUSTER_REGION
}
Next, an example test that validates the Flux installation on the cluster:
#!/usr/bin/env bats
bats_load_library bats-support
bats_load_library bats-assert
bats_load_library bats-detik/detik.bash

DETIK_CLIENT_NAME="kubectl"
DETIK_CLIENT_NAMESPACE="flux-system"

@test "Flux controllers are healthy" {
  flux check
}

@test "Flux Kustomization reconciled successfully" {
  verify "'status.conditions[*].reason' matches 'ReconciliationSucceeded' for kustomization named 'flux-system'"
}

@test "Given image automation is enabled, Then its CRDs are installed" {
  for crd in \
    imagerepositories.image.toolkit.fluxcd.io \
    imagepolicies.image.toolkit.fluxcd.io \
    imageupdateautomations.image.toolkit.fluxcd.io
  do
    verify "there is 1 crd named '$crd'"
  done
}

@test "When a managed label on flux-system namespace is tampered, Then Flux reconciles it back" {
  kubectl label namespace flux-system drift-test=temporary --overwrite
  flux reconcile kustomization flux-system
  try "at most 3 times every 1s to get namespace named 'flux-system' and verify that 'metadata.labels.drift-test' is '<none>'"
}
Assume the cluster directory is a Terragrunt-enabled Terraform module that provisions a Kubernetes cluster and also bootstraps Flux via Terraform. These tests verify that the Flux controllers are healthy, that the Flux Kustomization reconciled successfully, that the image automation CRDs are installed, and that Flux can reconcile its own configuration.
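detik’s try polls resources through a kubectl-style client. For checks that call gcloud or aws directly, a similar retry can be written in plain Bash. This is a minimal sketch: the helper name retry and its argument order are assumptions for illustration, not part of Bats or detik.

```shell
#!/usr/bin/env bash
# retry <attempts> <delay_seconds> <command...>
# Re-run the command until it succeeds or the attempts are used up.
retry() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local i=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "${i}" -ge "${attempts}" ]; then
      echo "retry: gave up after ${attempts} attempts: $*" >&2
      return 1
    fi
    i=$((i + 1))
    sleep "${delay}"
  done
}
```

Inside a test this could be used as retry 3 1 followed by the command under test, which mirrors the "at most 3 times every 1s" phrasing for resources that are only eventually consistent.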
It’s strongly recommended to keep such tests as high-level as possible: focus on the desired behavior, not on concrete properties or the state of the infrastructure.
For example, if you need to attach an IAM Role to a principal, don’t validate the exact role presence. Instead, verify that the principal can perform its intended actions—for example, uploading to a bucket. This avoids brittle tests coupled to implementation details that break on minor changes like role composition.
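As a sketch of such a behavior check on Google Cloud, the helper below uploads a small probe object as the service account and removes it again. The function name, bucket, and service account are hypothetical, and it assumes the caller is allowed to impersonate the service account through gcloud’s --impersonate-service-account flag and may delete objects in the bucket.

```shell
#!/usr/bin/env bash
# can_upload_to_bucket <bucket> <service_account_email>
# Succeeds only if the service account can write an object to the bucket.
can_upload_to_bucket() {
  local bucket="$1"
  local service_account="$2"
  local object="tdd-probe-$$.txt"
  local probe
  probe="$(mktemp)"
  echo "probe" > "${probe}"

  local status=0
  # Upload as the service account so the real permission path is exercised.
  gcloud storage cp "${probe}" "gs://${bucket}/${object}" \
    --impersonate-service-account="${service_account}" || status=$?

  # Best-effort cleanup with the caller's own credentials (an assumption).
  if [ "${status}" -eq 0 ]; then
    gcloud storage rm "gs://${bucket}/${object}" || true
  fi
  rm -f "${probe}"
  return "${status}"
}
```

In a Bats test, this could be wrapped as run can_upload_to_bucket with the bucket and service account taken from the Terragrunt outputs, followed by assert_success from bats-assert. The test keeps passing when the role composition changes, as long as the upload still works.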
CI Integration and Reporting
A cherry on top: Bats (like almost all other test runners) can report test results as JUnit XML or in another format supported by dorny/test-reporter@v2. This lets us surface the test results in the CI pipeline and get a quick overview when failures or unexpected behavior occur. A short, trimmed-down example of how to integrate reporting:
name: 'Infrastructure Apply'
on:
  push:
    branches:
      - 'main'
  workflow_dispatch:
env:
  TF_IN_AUTOMATION: 'true'
  TG_NON_INTERACTIVE: 'true'
permissions:
  # Minimal permissions required for reports
  contents: read
  actions: read
  checks: write
jobs:
  apply:
    name: 'Apply'
    environment: 'infrastructure'
    runs-on: ubuntu-latest
    concurrency:
      group: infrastructure
    steps:
      - uses: 'actions/checkout@v5'
      # Install bats and bats-detik
      # Add validation and plan if needed
      - id: apply-all
        name: '🚀 Apply All'
        run: terragrunt apply --all
      - name: Test Report
        uses: dorny/test-reporter@v2
        if: ${{ !cancelled() }}
        with:
          name: Foundation Apply Tests
          path: '**/test-results/report.xml'
          reporter: java-junit
Result: a clean pass/fail report embedded in your GitHub Actions workflow.

Pattern Summary
- Place tests in a tests directory at the root of the Terraform module.
- Hook Terragrunt to run them automatically after apply.
- Write high-level behavior tests, not brittle state checks.
- Integrate results into CI for instant visibility.
This pattern is lightweight, shell-native, and extends to any test runner. As a bonus, you build a validation suite that pinpoints infrastructure issues instantly. Every production incident becomes a new test case.