
This is not a metaphor
Docker layer caching and LLM prompt caching are different mechanisms.
They are the same optimization problem.
Docker reuses deterministic build outputs. LLM systems reuse computation over stable token prefixes. In both cases, an early change forces recomputation downstream. That is enough for the mental model to hold.
If you understand why COPY . . in the wrong place makes Docker slow, you already understand a large part of context engineering.
Current chat models behave like layered builds
The practical rule is simple:
change something early -> pay for everything after it again
That is how modern chat systems behave at the level that matters for cost and latency.
The prompt is not just text. It is an execution prefix. The model processes that prefix and builds state on top of it. If you change token 50, token 5000 is no longer sitting on the same foundation.
That has three direct consequences:
- stable prefixes are valuable
- early changes are expensive
- bloated contexts reduce focus as well as cache efficiency
This is why giant chats degrade. The system is not only carrying more information. It is carrying more recomputation and more noise.
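A toy sketch of the invalidation rule (the helper name is hypothetical, not a real API): a prefix cache can only reuse computation up to the first changed token, so one edit at position 50 throws away everything after it.

```python
def cached_prefix_len(old_tokens, new_tokens):
    # Computation is reusable only up to the first token that differs.
    n = 0
    for a, b in zip(old_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

prompt_v1 = list(range(5000))
prompt_v2 = list(prompt_v1)
prompt_v2[50] = -1  # one early edit

reusable = cached_prefix_len(prompt_v1, prompt_v2)      # 50 tokens kept
recomputed = len(prompt_v2) - reusable                  # 4950 tokens redone
```

One changed token near the front costs almost the entire prompt, exactly like editing a line above COPY in a Dockerfile.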
The context window problem is mostly a build-design problem
Most teams deal with long chats by summarizing them with another model.
That works, but it is still a patch.
You are using one model to compress the consequences of a badly structured prompt pipeline. The better default is programmatic decomposition:
- keep long-lived instructions stable
- move volatile task input to the end
- isolate stages
- pass typed outputs instead of whole transcripts
This is exactly what good Dockerfiles do.
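As a sketch (all names hypothetical), the decomposition is mostly an ordering rule for prompt assembly: long-lived parts first so they form a shared, cacheable prefix; volatile task input last so only the tail differs between calls.

```python
SYSTEM_RULES = "You are a code-repair agent. Follow the house style."    # stable
TOOL_SCHEMAS = "tools: read_file(path), run_tests(), apply_patch(diff)"  # stable

def build_prompt(task_input: str) -> str:
    # Stable instructions first -> cacheable prefix shared across requests.
    # Volatile per-request input last -> only the tail changes.
    return "\n\n".join([SYSTEM_RULES, TOOL_SCHEMAS, task_input])

a = build_prompt("fix the failing login test")
b = build_prompt("add pagination to /users")

stable_prefix = SYSTEM_RULES + "\n\n" + TOOL_SCHEMAS + "\n\n"
# both prompts share the entire stable prefix; only the task differs
```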
Bad Dockerfiles and bad agent systems fail in the same way
A bad Dockerfile looks like this:
FROM node:20
WORKDIR /app
COPY . .
RUN npm install
RUN npm test
RUN npm run build
Every small source change invalidates everything below it.
A bad agent workflow does the same thing:
- full repo context
- full conversation history
- every prior error
- every prior review
- every deployment log
- one giant prompt that plans, writes, debugs, reviews, and deploys
Then one new error arrives and the whole stack has to be reconsidered again.
This is the COPY . . anti-pattern in AI form.
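For contrast, the cache-friendly version of the same Dockerfile copies the dependency manifest before the source tree, so a source edit no longer invalidates the install layer:

```dockerfile
FROM node:20
WORKDIR /app
# Dependency manifest first: this layer is rebuilt only when dependencies change.
COPY package*.json ./
RUN npm install
# Volatile source last: an edit here leaves the install layer cached.
COPY . .
RUN npm test && npm run build
```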
The right pattern is multi-stage
The strongest version of the analogy is not cache. It is multi-stage build design.
A good agent system should look like this:
- classify the task
- plan the change
- generate the code
- transpile and test it
- repair failures
- review the diff
- deploy the result
Each stage should:
- receive the smallest valid context
- reuse a stable prefix where possible
- emit a typed artifact for the next stage
- rerun independently when its own inputs change
This is the same logic as a multi-stage Docker build. Each stage exists for one purpose. Each stage consumes only what it needs. Each stage produces a smaller, cleaner artifact for the next one.
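A minimal sketch of that staged shape, with toy functions standing in for model calls (all names hypothetical): each stage takes one typed artifact and emits the next, so no stage ever sees the full transcript.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    files: tuple  # which files the change touches

@dataclass(frozen=True)
class Diff:
    patch: str

@dataclass(frozen=True)
class TestReport:
    passed: bool
    error: str = ""

# Toy stand-ins for model calls. Each stage receives only its typed
# input artifact, never the upstream conversation.
def plan(task: str) -> Plan:
    return Plan(files=("auth/login.py",))

def generate(p: Plan) -> Diff:
    return Diff(patch=f"--- a/{p.files[0]}\n+++ b/{p.files[0]}")

def run_tests(d: Diff) -> TestReport:
    return TestReport(passed=d.patch.startswith("---"))

report = run_tests(generate(plan("fix the login bug")))
```

If the tests fail, only the generate stage needs to rerun with the error attached; the plan stays cached.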
That design has an important side effect: independent stages can run in parallel.
A review pass does not need the full planning transcript. A test runner does not need the entire design discussion. A repair agent needs the failing file, the error, and the constraints. Nothing else.
Parallelism becomes possible once the stages are real.
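That parallelism can be sketched with the same toy setup: review and tests both depend only on the diff, so they can run concurrently instead of taking turns inside one conversation.

```python
from concurrent.futures import ThreadPoolExecutor

def review(diff: str) -> str:
    # Needs only the diff, not the planning transcript.
    return "approved" if diff.startswith("---") else "rejected"

def run_tests(diff: str) -> str:
    # Needs only the diff as well.
    return "pass" if "+++" in diff else "fail"

diff = "--- a/auth/login.py\n+++ b/auth/login.py"
with ThreadPoolExecutor() as pool:
    review_future = pool.submit(review, diff)
    test_future = pool.submit(run_tests, diff)

verdict = (review_future.result(), test_future.result())
```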
Good tool design removes the context problem by design
This is the part most coding agents still miss.
If the system is built around giant chat transcripts, the context window stays the bottleneck. The agent has to remember everything because the system gave it nowhere else to put state.
But if the workflow is built around small tools and typed artifacts, the problem changes.
State moves out of the prompt and into the system:
- the plan is a file
- the diff is a tool result
- the failing test is a tool result
- the repo is queryable on demand
- the deployment status is fetched when needed
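A sketch of what externalizing one piece of state looks like (paths and helper names hypothetical): the plan lives in a file, and the prompt carries only a reference to it, so the conversation never has to hold the plan's contents.

```python
import json
import tempfile

def save_plan(plan: dict) -> str:
    # Persist the artifact outside the prompt; return an address for it.
    f = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False)
    json.dump(plan, f)
    f.close()
    return f.name

def load_plan(path: str) -> dict:
    # Any later stage reloads the plan on demand.
    with open(path) as f:
        return json.load(f)

path = save_plan({"files": ["auth/login.py"], "goal": "fix login bug"})
# the prompt now only needs to carry `path`, not the plan itself
plan = load_plan(path)
```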
Now the model does not need to carry the whole story forward in one ever-growing conversation.
That is a huge design shift.
You are not “managing” the context window anymore in the way current coding agents do. You are designing the system so the important state is addressable, externalized, and reloadable. Context becomes a narrow working set, not a permanent memory dump.
That is why tool design matters so much. A good tool surface does not just help the model act. It prevents context collapse by construction.
Smaller context is not only cheaper. It is better
This matters for quality, not just cost.
A smaller, narrower prompt gives the model a more focused search space. The model has fewer irrelevant paths to consider. The instruction hierarchy is clearer. The expected output is easier to verify.
That is why a focused cheap model can outperform a more expensive model trapped inside a noisy monolithic chat.
The gain comes from three places at once:
- better cache reuse
- less token waste
- tighter reasoning scope
This is also why context compression should be treated carefully. Compression can reduce token count, but it also introduces another generative step and another chance to lose signal. If you can replace transcript compression with typed intermediate artifacts and explicit stage boundaries, that is usually the better system.
The practical rules are the same as Docker
If you want a fast, cheap, reliable agent pipeline:
- put stable instructions first
- keep tool schemas stable
- move user-specific volatility to the end
- split the workflow into stages
- pass small typed artifacts between stages
- rerun only the invalidated stage
- parallelize what does not depend on shared state
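The "rerun only the invalidated stage" rule can be sketched as Docker-style cache keys: hash each stage's own inputs and skip the stage when the hash is unchanged (helper names hypothetical).

```python
import hashlib
import json

_cache: dict = {}

def run_stage(name: str, fn, inputs: dict):
    # Cache key = stage name + hash of its own inputs, like a Docker layer key.
    digest = hashlib.sha256(
        json.dumps(inputs, sort_keys=True).encode()
    ).hexdigest()
    key = (name, digest)
    if key not in _cache:
        _cache[key] = fn(inputs)  # rerun only when inputs actually changed
    return _cache[key]

runs = []
def build(inputs):
    runs.append(inputs["src"])
    return inputs["src"].upper()

run_stage("build", build, {"src": "hello"})
run_stage("build", build, {"src": "hello"})   # cache hit: stage not rerun
run_stage("build", build, {"src": "hello!"})  # input changed: stage reruns
```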
This is not prompt poetry. It is build engineering.
The real point
Context engineering is often discussed like a writing skill.
It is closer to systems design.
Large chat transcripts are badly structured build graphs. Good agent systems are cache-aware execution pipelines.
That is the useful mental model:
- the stable prefix is your base layer
- the task-specific request is your top layer
- intermediate outputs are build artifacts
- summarization is a lossy rebuild step
- stage isolation is the real optimization
If you already know Docker, this should feel familiar.
Stable layers first. Volatile layers last. Small artifacts between stages. Rebuild only what changed.
That is how you make Docker fast.
That is also how you make LLM systems fast.