I build tools that make infrastructure easier to run and safer to scale. Day-to-day I’m in the Terraform and Go trenches, maintaining widely used modules, authoring providers, and contributing to Terraform Best Practices and Terraform Proverbs. Beyond infrastructure, I architect data engineering pipelines and specialize in building scalable ML/AI platforms for companies in the cloud. I live in the CLI, am the creator of Red-Teaming TTPs, and am a MITRE, OWASP, and Debian contributor! If you enjoy my community code, blogs, or tools, feel free to reach out and connect!

Gang of Three: Pragmatic Operations Design Patterns

This blog is dedicated to arcaven, who first made me aware of this observation and opened my eyes to the wild world of infrastructure and system operations patterns at scale.

I Can’t Unsee It

A few weeks ago, something clicked. Maybe the shorter, winter-approaching days slowed me down enough to notice, but suddenly threes were everywhere. Why do we split environments into development, staging, and production? Why do we stage upgrades across three clusters? Why do we run hot, warm, and cold storage tiers? Why does our CI/CD pipeline have build and test, staging deployment, and production deployment gates?

The number three keeps showing up in systems work, and surprisingly few people talk about it explicitly. As it turns out, this pattern is no coincidence. It represents the intersection of distributed systems theory and practical operations experience. Once you start looking for it, you’ll find the rule of three embedded in nearly every mature infrastructure decision.

Where Consensus Algorithms Meet Change Management

Distributed systems run on quorum-based decision making: a majority of nodes must agree before committing state changes (see Paxos and Raft). These consensus algorithms are designed to handle node failures, communication delays, and network partitions while ensuring the system can continue making progress even when failures occur. With three nodes, you can lose one and still have two nodes available to form a majority. This gives you fault tolerance and forward progress in the same architectural package.

Two nodes cannot lose anything without risking deadlock or split-brain scenarios. Five nodes give you headroom for two failures, while four cost more than three without tolerating a single additional failure. Three is the minimum viable number that actually delivers reliable consensus, and it is practical from a cost and complexity perspective. This is why you see three-node clusters everywhere across the industry. This is not cargo culting or blind imitation; this is mathematics driving architecture.
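If you want to check that arithmetic yourself, here is a minimal Go sketch (not tied to any particular consensus library) that computes the quorum size and tolerated failures for cluster sizes two through five:

```go
package main

import "fmt"

// quorum returns the minimum number of nodes that must agree
// for a majority decision in a cluster of n nodes.
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	for n := 2; n <= 5; n++ {
		q := quorum(n)
		tolerated := n - q // failures survivable while still forming a majority
		fmt.Printf("%d nodes: quorum=%d, tolerates %d failure(s)\n", n, q, tolerated)
	}
}
```

The output makes the trade-off plain: two nodes tolerate nothing, four tolerate no more failures than three, and five buy you a second failure at the cost of two extra nodes.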

The same logic drives traditional thinking around redundancy planning. Three instances means one for baseline capacity, one available during maintenance windows, and one ready for the surprise failure at 3am. Load balancers, database replicas, and availability zones all follow this pattern because it maps cleanly to how systems actually fail in production environments.

This pattern also extends to monitoring and alerting systems. Three data points allow you to establish a trend and distinguish between noise and signal. A single metric spike might be nothing, two consecutive spikes suggest investigation, but three consecutive anomalies typically trigger automated responses or pages. The threshold of three provides enough confidence to act without creating alert fatigue from false positives.
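As a sketch of that thresholding logic, here is a small Go example; AnomalyGate is a made-up type, and the threshold of three mirrors the paragraph above rather than any particular monitoring product:

```go
package main

import "fmt"

// AnomalyGate pages only after `threshold` consecutive anomalies,
// filtering one-off spikes out of the alert stream.
type AnomalyGate struct {
	threshold int
	streak    int
}

// Observe records whether the latest data point was anomalous and
// reports whether the streak has reached the paging threshold.
func (g *AnomalyGate) Observe(anomalous bool) bool {
	if !anomalous {
		g.streak = 0 // a single healthy reading resets the streak
		return false
	}
	g.streak++
	return g.streak >= g.threshold
}

func main() {
	gate := AnomalyGate{threshold: 3}
	// One spike, a recovery, then three consecutive anomalies.
	readings := []bool{true, false, true, true, true}
	for i, anomalous := range readings {
		if gate.Observe(anomalous) {
			fmt.Printf("reading %d: three in a row, page the on-call\n", i)
		}
	}
}
```

The reset on any healthy reading is what keeps the isolated spike from paging anyone.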

AWS Best Practices and Chaos Engineering

AWS regions typically ship with three or more availability zones, and the Well-Architected Framework encourages spreading workloads across them. This is not just resilience theater or checkbox compliance. It embodies that same quorum mathematics we discussed earlier. Lose one availability zone and your system continues running with consensus intact. Your application remains available, your data stays consistent, and your customers notice nothing.

Chaos engineering practices naturally gravitate toward threes as well. Kill one instance and observe what happens. You are testing real failure modes while keeping two healthy nodes as a safety net. This allows destructive testing that does not actually destroy your service. You gain confidence in your resilience mechanisms without risking a full outage. Tools like Chaos Monkey and Gremlin are built around this philosophy of controlled, incremental failure injection.
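In code, the experiment is almost embarrassingly simple. Here is a Go sketch of the idea; Instance and killOne are illustrative stand-ins, not any real chaos tool’s API:

```go
package main

import (
	"fmt"
	"math/rand"
)

// Instance is a stand-in for a real VM or container.
type Instance struct {
	Name    string
	Healthy bool
}

// killOne injects a single failure: one victim, with the rest of
// the fleet left running as a safety net.
func killOne(fleet []*Instance) *Instance {
	victim := fleet[rand.Intn(len(fleet))]
	victim.Healthy = false
	return victim
}

func main() {
	fleet := []*Instance{{"a", true}, {"b", true}, {"c", true}}
	victim := killOne(fleet)

	healthy := 0
	for _, i := range fleet {
		if i.Healthy {
			healthy++
		}
	}
	fmt.Printf("killed %s; %d of %d still healthy, quorum intact: %v\n",
		victim.Name, healthy, len(fleet), healthy >= len(fleet)/2+1)
}
```

With three instances, the quorum check at the end always passes after a single kill, which is exactly why the experiment is safe to run against production-shaped systems.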

Rolling deployments across three clusters provide a built-in verification pattern that works remarkably well in practice. Deploy to the first cluster, verify correct behavior, then proceed to the second. Verify again, then move to the third. These two checkpoints before full rollout give you opportunities to catch unusual issues before they propagate everywhere. Your first cluster serves as your canary, detecting problems early. Your second cluster provides a confidence check that the issue was not environment-specific. Your third cluster represents your validated rollout to the remainder of your infrastructure.
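The control flow is worth spelling out, because the checkpoints are the whole point. Here is a minimal Go sketch; the deploy and verify functions are placeholders for your actual deployment mechanism and health checks:

```go
package main

import "fmt"

// rollout deploys to each cluster in order and verifies behavior before
// moving on, so a bad build is caught at the first checkpoint instead of
// propagating everywhere.
func rollout(clusters []string, deploy, verify func(cluster string) error) error {
	for _, c := range clusters {
		if err := deploy(c); err != nil {
			return fmt.Errorf("deploy to %s failed: %w", c, err)
		}
		if err := verify(c); err != nil {
			return fmt.Errorf("verification failed on %s, halting rollout: %w", c, err)
		}
	}
	return nil
}

func main() {
	clusters := []string{"cluster-1", "cluster-2", "cluster-3"}
	deploy := func(c string) error { fmt.Println("deploying to", c); return nil }
	verify := func(c string) error { fmt.Println("verifying", c); return nil } // plug in real health checks
	if err := rollout(clusters, deploy, verify); err != nil {
		fmt.Println(err)
	}
}
```

Because the loop halts on the first failed verification, the canary and confidence checkpoints do their job without any extra orchestration machinery.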

Storage Hierarchies and Performance Tiers

Storage systems provide another compelling example of the rule of three in action. Hot storage serves frequently accessed data with low latency. Warm storage holds less frequently accessed data at moderate cost and performance. Cold storage archives rarely accessed data at minimal cost. This three-tier architecture balances performance requirements against budget constraints while providing clear migration paths as data ages.

Cloud providers have built entire product lines around this model. Amazon S3 offers Standard, Infrequent Access, and Glacier tiers. Azure provides Hot, Cool, and Archive tiers. Google Cloud offers Standard, Nearline, and Coldline storage classes. The consistency across providers suggests this is not arbitrary product segmentation but rather a natural reflection of how organizations actually use data over time.
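Tier selection usually reduces to an age policy. The Go sketch below shows the shape of it; the 30- and 90-day cutoffs are illustrative rather than provider defaults, and the S3 class names in the comments are just one possible mapping:

```go
package main

import (
	"fmt"
	"time"
)

// tierFor maps how recently data was accessed onto a storage tier.
func tierFor(lastAccess, now time.Time) string {
	age := now.Sub(lastAccess)
	switch {
	case age < 30*24*time.Hour:
		return "hot" // e.g. S3 Standard
	case age < 90*24*time.Hour:
		return "warm" // e.g. S3 Infrequent Access
	default:
		return "cold" // e.g. S3 Glacier
	}
}

func main() {
	now := time.Now()
	for _, days := range []int{5, 45, 200} {
		last := now.AddDate(0, 0, -days)
		fmt.Printf("last accessed %d days ago -> %s\n", days, tierFor(last, now))
	}
}
```

In practice you would encode this as lifecycle rules on the bucket rather than application code, but the decision itself is this small.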

Database systems follow similar patterns. Many databases implement a three-level caching strategy with L1 cache in memory, L2 cache on fast local storage, and L3 representing the authoritative data on persistent storage. Each level trades off speed for capacity and durability. This hierarchy allows databases to serve most queries from fast cache while maintaining data integrity through persistent storage.
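A minimal read path for such a hierarchy looks like the Go sketch below; mapStore is a toy stand-in for a real cache or database layer, and the promotion-on-hit behavior is one common policy, not the only one:

```go
package main

import "fmt"

// Store is any layer that may or may not hold a value for a key.
type Store interface {
	Get(key string) (string, bool)
	Put(key, value string)
}

// lookup walks the tiers fastest-first and backfills the faster
// tiers on a hit, so the next read is served closer to memory.
func lookup(key string, tiers []Store) (string, bool) {
	for i, t := range tiers {
		if v, ok := t.Get(key); ok {
			for j := 0; j < i; j++ {
				tiers[j].Put(key, v) // promote into faster tiers
			}
			return v, true
		}
	}
	return "", false
}

// mapStore is a toy in-memory layer for demonstration.
type mapStore map[string]string

func (m mapStore) Get(k string) (string, bool) { v, ok := m[k]; return v, ok }
func (m mapStore) Put(k, v string)             { m[k] = v }

func main() {
	l1 := mapStore{}                           // in-memory cache
	l2 := mapStore{}                           // fast local storage
	l3 := mapStore{"user:42": "authoritative"} // persistent store
	v, _ := lookup("user:42", []Store{l1, l2, l3})
	fmt.Println(v, "| now in L1:", l1["user:42"] != "")
}
```

The backfill loop is what keeps hot keys migrating toward memory: a key served from L3 today is served from L1 tomorrow.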

The Practical Value of Three

Understanding why three works so well helps us make better infrastructure decisions. When designing a new system, starting with three of anything gives you a resilient foundation without over-engineering. Three availability zones, three environment tiers, three deployment stages, three monitoring thresholds. Each application of the pattern provides fault tolerance, verification opportunities, and practical operability.

This does not mean three is always the right answer. Some systems genuinely need more redundancy or more granular staging. However, three serves as an excellent default that you should consciously decide to deviate from rather than accidentally under-provision. If you find yourself choosing two of something, ask whether you are accepting unnecessary fragility. If you are choosing five, ask whether the additional complexity provides proportional value. Thanks for reading, and if you like this blog, you might like the code and tools on my GitHub.