Netflix proved that infrastructure could be disposable if definitions were resilient. Could the same principle apply to code itself?
In 2011, Netflix did something that seemed reckless. They built a tool that randomly terminated their own production instances. They called it Chaos Monkey.[1]
The logic was counterintuitive but sound: if your infrastructure cannot survive the loss of an instance, you will discover that fact eventually. The question is whether you discover it during a controlled test or during a critical outage at 3am. Netflix chose to find out on their own terms.
Chaos Monkey worked because Netflix had invested heavily in declarative infrastructure management. Their server configurations, network topologies, and deployment pipelines were all defined in code. When Chaos Monkey killed an instance, the infrastructure could rebuild itself from those definitions. The definitions were the source of truth, not the running servers.
This approach transformed how Netflix engineers thought about resilience. Resilience became a first-class engineering concern, not an afterthought. Systems were designed from the start to handle failure gracefully.
Disposable code: Production code that can be safely deleted and regenerated from specifications.
Just as Netflix made infrastructure instances disposable by investing in their definitions, AI enabled development points toward a world where application code becomes equally disposable.
I think the same philosophical shift that Infrastructure as Code brought to operations will soon arrive in software development itself. This is becoming possible because AI can now generate working code from detailed specifications. The gap between intent and implementation has narrowed. When that gap is small enough, the economics change: investing in specifications yields returns that investing in code cannot.
This raises a challenging question: are your specifications resilient enough to be the single source of truth?
Consider a simple test: what if you deleted your codebase tomorrow?
Not your specifications. Not your acceptance criteria. Not your architectural decision records. Just the implementation code.
Could you recreate the system from what remains?
For most organisations, this thought experiment reveals uncomfortable truths. The specifications describe a system that diverged from reality months or years ago. Critical business logic exists only in code that nobody fully understands. Decisions that shaped the architecture were never recorded.
Think about what your specifications do not capture. The validation logic that rejects certain edge cases because of an incident three years ago. The retry mechanism tuned to specific timing because of how a third-party API behaves under load. The field that must never be null, not because the schema says so, but because a downstream system crashes if it is. The ordering of operations that looks arbitrary but exists because of a race condition someone debugged over a painful weekend.
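To make that concrete, here is a hypothetical snippet. The endpoint, the constants, and the incident referenced in the comment are all invented for illustration; the point is that the reasoning lives only in a code comment, not in any specification.

```python
import time

import requests

# These numbers look arbitrary, but they are load-bearing. The 3.5-second
# backoff exists because the partner API sheds connections under load and
# anything faster trips its rate limiter (a lesson from an incident that
# was never written into the spec). The cap of four attempts avoids
# triggering the downstream circuit breaker.
RETRY_DELAY_SECONDS = 3.5
MAX_ATTEMPTS = 4


def fetch_partner_quote(order_id: str) -> dict:
    """Fetch a pricing quote, retrying with the tuned backoff."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        response = requests.get(f"https://api.example-partner.com/quotes/{order_id}")
        if response.status_code == 200:
            return response.json()
        if attempt < MAX_ATTEMPTS:
            time.sleep(RETRY_DELAY_SECONDS)
    raise RuntimeError(f"quote for {order_id} failed after {MAX_ATTEMPTS} attempts")
```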
None of this is in the specifications. It lives in code comments that half the team has never read. It lives in commit messages that scroll past in a blur. It lives in the memory of the engineer who fixed the original bug, who may have since left the company. It lives in chat logs that are archived and unsearchable.
The code has become the specification, which defeats the entire purpose of having specifications.
Now imagine you have embraced AI enabled development. You are using agents to generate code from specifications. The question sharpens: do your specifications actually hold all of these decisions? Or were they made in prompts, in conversations with agents, in context windows that have long since scrolled away?
The tribal knowledge problem does not disappear with AI. It simply moves.
Instead of living in engineer memory and chat logs, it lives in prompt history and session context. The code may be generated, but if the reasoning behind it was never captured in the specification, you are no better off than before.
This is the equivalent of Netflix discovering that their infrastructure definitions were out of sync with their running servers. When that happens, Chaos Monkey does not prove resilience. It proves fragility.
There is an important difference between Infrastructure as Code and specification driven development that deserves acknowledgement.
Infrastructure as Code is (almost) deterministic. Run the same Terraform configuration twice and you get identical infrastructure. The same inputs produce the same outputs, every time. This is what makes the Chaos Monkey test so clean: kill an instance, rebuild it, and verify that the result matches what was there before.
Specification driven development with AI is not deterministic. Run the same specification through a large language model twice and you will get functionally similar but structurally different code. Variable names may differ. Control flow may vary. The organisation of functions and modules may change entirely. The same inputs do not produce the same outputs.
This distinction matters because it changes what resilience means.
For Infrastructure as Code, resilience means reproducibility. You can verify resilience by comparing the rebuilt infrastructure to the original. Differences indicate a problem.
For specification driven development, resilience means behavioural equivalence. You cannot compare two generated codebases line by line and expect them to match. You can only verify that both implementations satisfy the specification. Acceptance criteria become the arbiter of correctness, not structural similarity.
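A minimal sketch of what that looks like in practice, assuming the acceptance criteria are expressed as tests. The two `apply_discount` functions below stand in for two independently generated implementations: structurally different, behaviourally equivalent, and judged only against the criteria rather than against each other.

```python
import pytest


# Two hypothetical implementations of the same specification clause.
# Their internal structure differs; the acceptance criteria do not care.
def apply_discount_v1(total: float, discount: float) -> float:
    return max(total - discount, 0.0)


def apply_discount_v2(total: float, discount: float) -> float:
    remaining = total - discount
    if remaining < 0.0:
        return 0.0
    return remaining


@pytest.mark.parametrize("apply_discount", [apply_discount_v1, apply_discount_v2])
def test_discount_never_drives_total_negative(apply_discount):
    # Acceptance criterion from the spec, not a structural comparison.
    assert apply_discount(total=100.0, discount=150.0) == 0.0


@pytest.mark.parametrize("apply_discount", [apply_discount_v1, apply_discount_v2])
def test_discount_is_applied_exactly_once(apply_discount):
    assert apply_discount(total=100.0, discount=25.0) == 75.0
```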
This makes the challenge harder, not easier. When the output is non-deterministic, specifications must constrain behaviour with precision. Ambiguity that would be harmless in a deterministic system becomes a source of divergence in a non-deterministic one. Every underspecified edge case is an opportunity for two valid implementations to behave differently.
The degree of variance is not fixed. Tighter specifications produce more consistent outputs. A specification that mandates technology choices, naming conventions, and architectural patterns constrains the model's decisions. Specification precision becomes a lever: the more opinionated the spec, the more predictable the regeneration.
The non-determinism also explains why acceptance criteria are not optional. Without a clear definition of correct behaviour, you have no way to confirm that a regenerated codebase actually implements what you need. The specification must include verifiable criteria that any implementation, generated or otherwise, must satisfy.
In this light, the analogy to Chaos Monkey holds, but with a crucial refinement. Netflix could prove resilience by rebuilding and comparing. We can only prove resilience by rebuilding and verifying against behavioural contracts. The bar is higher.
If specifications become the source of truth and code becomes disposable, the implications are significant.
Investment shifts upstream. The economics change when code can be regenerated. Spending time on specification quality would yield compounding returns because those specifications can produce multiple implementations. Spending time on code polish would yield diminishing returns because that code may be regenerated and your refinements lost. This does not mean code quality ceases to matter. It means that the effort spent making code elegant could instead be spent making specifications precise. The elegance of generated code becomes a property of the generator, not something to handcraft in each output.
Verification validates the process, not just the product. Traditional development asks one question: does this code work correctly? Specification driven development would add a second: can this specification produce correct code? The latter requires verifying the generation process itself. In practice, this could mean periodically regenerating code from specifications and checking the output against your acceptance criteria. If the criteria are met, the specifications are sufficient. If not, the gap between specification and required behaviour has been exposed. This is the Chaos Monkey test applied to code: deliberate destruction to verify that reconstruction works.
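As a sketch only, that check might look something like the script below. `regenerate_from_spec` is a placeholder for whatever generation tooling a team actually uses, and the directory layout and test command are assumptions rather than recommendations.

```python
import shutil
import subprocess
from pathlib import Path

SPEC_DIR = Path("specs")                # source of truth, under version control
GENERATED_DIR = Path("src/generated")   # disposable build artefact


def regenerate_from_spec(spec_dir: Path, output_dir: Path) -> None:
    """Placeholder for the team's generation pipeline (an agent, a code
    generator, or a human working from the specification)."""
    raise NotImplementedError("wire this to your generation tooling")


def chaos_test_specifications() -> bool:
    # Step 1: deliberately destroy the implementation.
    if GENERATED_DIR.exists():
        shutil.rmtree(GENERATED_DIR)

    # Step 2: rebuild it from the specifications alone.
    regenerate_from_spec(SPEC_DIR, GENERATED_DIR)

    # Step 3: verify behaviour against the acceptance criteria,
    # not against the previous code.
    result = subprocess.run(["pytest", "tests/acceptance"], check=False)
    return result.returncode == 0


if __name__ == "__main__":
    if chaos_test_specifications():
        print("Specifications were sufficient to reconstruct the system.")
    else:
        print("Gap found: the specifications cannot reproduce required behaviour.")
```

Run on a schedule rather than on every commit, this is the closest analogue to letting the monkey loose: the destruction is deliberate, and the acceptance suite decides whether reconstruction succeeded.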
Version control changes meaning. When code is generated from specifications, the specifications become the primary artefact under version control. Code may still be versioned, but it becomes a build artefact rather than a source artefact. This is analogous to how compiled binaries are treated today. You would not review a pull request by examining the machine code. Similarly, in a specification driven workflow, the review would focus on the specification changes, not the generated implementation. The generated code is a consequence, not a creation.
The definition of done evolves. A feature is not complete when the code works. It is complete when the specification is sufficient to recreate the code. This is a higher bar, but it is the bar that enables genuine resilience. It means that code review could become specification review. It means that onboarding new team members might focus on understanding specifications rather than navigating legacy code. It means that technical debt could live in ambiguous specifications, not in tangled implementations.
None of this is established practice yet. But the direction seems clear to me.
Netflix did not build Chaos Monkey because they enjoyed watching servers die. They built it because they understood that untested resilience is not resilience at all. It is hope.
I think the same principle will apply to specification driven development. If you have never deleted your code and regenerated it from specifications, you do not know whether your specifications are sufficient. You are hoping they are.
The teams that thrive in an AI enabled development world will likely be those that treat their specifications with the same rigour that Netflix treats their infrastructure definitions. They will test their specifications by using them. They will discover gaps before those gaps become problems. They will embrace Disposable Code.
The question is not whether to consider this shift. It is whether you will discover your specification gaps on your own terms or on someone else's.
This is the first in a series of thought experiments on specification resilience. Next: The Code Simian Army, exploring how the full range of Netflix's resilience tools might translate to specification driven development.
[1] Netflix Technology Blog, "The Netflix Simian Army", July 2011. https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116