{"id":985,"date":"2025-12-23T01:03:13","date_gmt":"2025-12-23T09:03:13","guid":{"rendered":"https:\/\/identia.digital\/lmcache\/?p=985"},"modified":"2026-03-30T13:20:48","modified_gmt":"2026-03-30T20:20:48","slug":"context-engineering-reuse-pattern-under-the-hood-of-claude-code","status":"publish","type":"post","link":"https:\/\/identia.digital\/lmcache\/en\/2025\/12\/23\/context-engineering-reuse-pattern-under-the-hood-of-claude-code\/","title":{"rendered":"Context Engineering &amp; Reuse Pattern Under the Hood of Claude Code"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"535\" src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-1024x535.png\" alt=\"\" class=\"wp-image-989\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-1024x535.png 1024w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-300x157.png 300w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-768x401.png 768w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-1536x803.png 1536w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-2048x1070.png 2048w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-1200x627.png 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Over the last few months, <a href=\"https:\/\/www.claude.com\/product\/claude-code\">Claude Code<\/a> has quietly become one of the most interesting &amp; widely-adopted real-world agentic systems available to normal developers.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Unlike <strong><em>cloud-only agents<\/em><\/strong> whose internals remain hidden behind API gateways like <a href=\"https:\/\/www.perplexity.ai\/api-platform\">Perplexity<\/a>, <a href=\"https:\/\/devin.ai\/\">Devin<\/a>, or <a 
href=\"https:\/\/manus.im\/\">Manus<\/a>, and unlike fully <strong><em>open source agents<\/em><\/strong> like <a href=\"https:\/\/github.com\/SWE-agent\/mini-swe-agent\">Mini SWE Agent<\/a> or <a href=\"https:\/\/github.com\/laude-institute\/harbor\/blob\/main\/src\/harbor\/agents\/terminus_2\/terminus_2.py\">Terminus 2<\/a>, which you can deploy locally from source, Claude Code runs <strong><em>partially locally<\/em><\/strong>: it has an open-source <a href=\"https:\/\/github.com\/anthropics\/claude-code\">client repo<\/a> running on the local machine, which gives us a rare opportunity to intercept the traffic it sends and reverse-engineer it, <strong>to see every single LLM call<\/strong>, every intermediate <strong>tool invocation<\/strong>, and every tiny decision the agent makes.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Recently, we ran a tiny one-shot experiment (one random task from the <a href=\"https:\/\/huggingface.co\/datasets\/princeton-nlp\/SWE-bench_Verified\">SWE-bench_Verified<\/a> dataset) with Claude Code and captured everything into a <strong>raw<\/strong> log file containing only the LLM inputs and outputs: <a href=\"https:\/\/github.com\/kobe0938\/blog\/blob\/master\/claude-code\/claude_code_trace.jsonl\"><strong><code>claude_code_trace.jsonl<\/code><\/strong><\/a>. 
If you paste <a href=\"https:\/\/github.com\/kobe0938\/blog\/blob\/master\/claude-code\/claude_code_trace.jsonl\">this trace<\/a> into the <a href=\"https:\/\/v0-llm-agent-dashboard.vercel.app\/\">visualizer<\/a>, you can see the trace details.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Key metrics:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>92 LLM calls<\/strong> (<code>#1-#92<\/code>)<\/li>\n\n\n\n<li><strong>~2M input tokens<\/strong> consumed<\/li>\n\n\n\n<li><strong>13 minutes<\/strong> total duration<\/li>\n\n\n\n<li><strong>92% prefix reuse rate<\/strong><\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1024\" height=\"518\" src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-1-1024x518.png\" alt=\"\" class=\"wp-image-990\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-1-1024x518.png 1024w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-1-300x152.png 300w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-1-768x388.png 768w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-1-1536x776.png 1536w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-1-2048x1035.png 2048w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-1-1200x607.png 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The goal was simple:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">If you give Claude Code one small task, what exactly happens behind the scenes?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Which LLM calls get made? In what order?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Where does context get reused? 
And how much of the prompt is stable prefix (seen) vs. incremental content (new)?<\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">This is our walk-through of that trace.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\"><strong>1. What \u201cActually Happens\u201d When Claude Code Runs a Simple Task<\/strong><\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">Claude Code feels straightforward as a product: you type a request in your editor, and it edits files or runs some bash commands. But under the hood, even a simple one-step request decomposes into a surprisingly structured internal loop.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We randomly selected one <a href=\"https:\/\/huggingface.co\/datasets\/princeton-nlp\/SWE-bench_Verified\/viewer\/default\/test?views%5B%5D=test&amp;row=80\">task (#80)<\/a> from the <a href=\"https:\/\/huggingface.co\/datasets\/princeton-nlp\/SWE-bench_Verified\">SWE-bench_Verified<\/a> dataset. The task is to fix an issue in the <code>django\/django<\/code> repo at commit <code>2e0f04507b17362239ba49830d26fec504d46978<\/code>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Problem statement:<\/strong><\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em>&#8220;JSONField are not properly displayed in admin when they are readonly.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Description: JSONField values are displayed as dict when readonly in the admin. 
For example, <code>{\"foo\": \"bar\"}<\/code> would be displayed as <code>{'foo': 'bar'}<\/code>, which is not valid JSON.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>I believe the fix would be to add a special case in <code>django.contrib.admin.utils.display_for_field<\/code> to call the <code>prepare_value<\/code> of the JSONField (not calling <code>json.dumps<\/code> directly to take care of the <code>InvalidJSONInput<\/code> case).&#8221;<\/em><\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">And this is exactly the prompt that Claude Code received.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1024\" height=\"207\" src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-2-1024x207.png\" alt=\"\" class=\"wp-image-991\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-2-1024x207.png 1024w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-2-300x61.png 300w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-2-768x155.png 768w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-2-1536x311.png 1536w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-2-2048x414.png 2048w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-2-1200x243.png 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Surprisingly, before any fancy reasoning, Claude Code ran a couple of <strong>&#8220;warm-up&#8221; steps<\/strong> (trace ID <code>#2<\/code>, <code>#3<\/code>, <code>#4<\/code>) before the actual task. 
Warm-up steps do nothing but send the prompt for:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Tool list (<code>#2<\/code>)<\/li>\n\n\n\n<li>Explore subagent (<code>#3<\/code>)<\/li>\n\n\n\n<li>Plan subagent (<code>#4<\/code>)<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Warm-up steps exist for caching: later, when those tools and subagents are called, the request hits the cache, resulting in faster response time. The summarization agent (<code>#1<\/code>) and new topic agent (<code>#5<\/code>) summarize the context and generate a new title for display, much like the ChatGPT sidebar.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The main agent (<code>#6<\/code>) comes with a huge system prompt, including git history, status, tool list, etc. The <strong>18 tools<\/strong> in the tool list cover not only ordinary tool calls like <code>Bash<\/code>, <code>Grep<\/code>, <code>Read<\/code>, <code>WebFetch<\/code>, <code>AskUserQuestion<\/code>, etc., but also tools that delegate certain tasks to subagents like:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Explore subagent (<code>#7<\/code>)<\/li>\n\n\n\n<li>Plan subagent (<code>#47<\/code>)<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">These subagents will invoke tool calls from their own tool lists.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Immediately after its first call (<code>#6<\/code>), the main agent invokes the <strong>Explore<\/strong> subagent (also called the file search agent) (<code>#7<\/code>), which invokes tool calls from its tool list to explore the codebase. It starts with a different system prompt whose main goal is to explore the codebase:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">You are Claude Code, Anthropic&#8217;s official CLI for Claude. You are a file search specialist for Claude Code, Anthropic&#8217;s official CLI for Claude. 
You excel at thoroughly navigating and exploring codebases.<\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">Interestingly, the Explore subagent (<code>#7<\/code>) is not the only subagent that Claude Code can invoke. Instead, it invokes <strong>3 Explore subagents in parallel<\/strong> to explore the codebase, each with a different goal:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Explore JSONField implementation<\/strong> (lifespan: <code>#7-#26<\/code>)<\/li>\n\n\n\n<li><strong>Explore admin display_for_field<\/strong> (lifespan: <code>#8-#37<\/code>)<\/li>\n\n\n\n<li><strong>Explore readonly field rendering<\/strong> (lifespan: <code>#9-#45<\/code>)<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">The context of the main agent (<code>#6<\/code>) is <strong>not<\/strong> carried to the subagents, which is beneficial for the subagents to have a fresh start. Each Explore subagent can invoke <strong>1-3 tools in parallel<\/strong>, where the tools are from the tool list of the Explore subagent\u2014a subset (<strong>10\/18<\/strong>) of the main agent&#8217;s tool list.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The <a href=\"https:\/\/arxiv.org\/pdf\/2210.03629\">ReAct<\/a> mechanism is used here: the Explore subagent will invoke a tool call, then based on the tool output, it will observe and invoke another tool call to explore the codebase further until it deems it has explored enough.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, after the slowest Explore subagent finishes its exploration at step <code>#45<\/code>, at step <code>#46<\/code>, the main agent appends the findings (summarizations) from all 3 Explore subagents to the context, and then invokes the Plan subagent (<code>#47<\/code>) to plan the fix.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"147\" 
src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-3-1024x147.png\" alt=\"\" class=\"wp-image-992\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-3-1024x147.png 1024w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-3-300x43.png 300w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-3-768x110.png 768w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-3-1536x220.png 1536w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-3-2048x294.png 2048w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-3-1200x172.png 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Similar to the Explore Agent, the Plan Agent (<code>#47<\/code>) also has a different system prompt, where its main goal is to plan the fix:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">You are Claude Code, Anthropic&#8217;s official CLI for Claude. You are a software architect and planning specialist for Claude Code. Your role is to explore the codebase and design implementation plans.<\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">The Plan Agent did not carry all the context from the main agent nor the Explore subagents, which is beneficial for the Plan Agent to have a fresh start. Instead, it only contains the <strong>summarization<\/strong> of the Explore subagents&#8217; findings. The toolbox is a subset (<strong>10\/18<\/strong>) of the main agent&#8217;s tool list. 
The goal for the Plan Agent is to design an implementation plan that:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Please design an implementation plan that:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Identifies the exact changes needed to display_for_field<\/li>\n\n\n\n<li>Considers whether we need to instantiate a form field from the model field or if there&#8217;s a better approach<\/li>\n\n\n\n<li>Identifies any edge cases or potential issues<\/li>\n\n\n\n<li>Recommends the best approach given Django&#8217;s architecture<\/li>\n<\/ol>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"239\" src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-4-1024x239.png\" alt=\"\" class=\"wp-image-993\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-4-1024x239.png 1024w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-4-300x70.png 300w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-4-768x179.png 768w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-4-1536x358.png 1536w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-4-2048x477.png 2048w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-4-1200x280.png 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Similarly, the Plan Agent also follows the ReAct pattern and loops through tool calling from <code>#47<\/code> to <code>#72<\/code>, where the context accumulates from <strong>11,552 tokens<\/strong> to <strong>38,819 tokens<\/strong>. 
After having a good plan (see details in <code>#72<\/code>), the Plan Agent will return to the main agent (<code>#73<\/code>) with the plan.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The main agent will then invoke a series of tool calls to:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Review the plan (<code>#73<\/code>)<\/li>\n\n\n\n<li>Ask user for clarification (<code>#74<\/code>)<\/li>\n\n\n\n<li>Write the plan into a markdown file (<code>#75<\/code>)<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, the main agent will exit the plan mode (<code>#76<\/code>) and enter the execute mode (<code>#77<\/code>) to execute the plan after interactively asking the user for plan approval (<code>#76-#77<\/code>).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The <strong>execution phase<\/strong> (<code>#77-#91<\/code>) still follows the ReAct pattern. The main agent will use the plan markdown file as a todo list:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<ol class=\"wp-block-list\">\n<li>Add json import to <code>utils.py<\/code><\/li>\n\n\n\n<li>Add JSONField handling to <code>display_for_field()<\/code><\/li>\n\n\n\n<li>Add tests to <code>test_admin_utils.py<\/code><\/li>\n\n\n\n<li>Run the tests to verify<\/li>\n<\/ol>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">After executing some tool calls to read or edit files, it will cross out the todo items in the plan markdown file. 
Once all the todo items are crossed out, the main agent will end with a conclusion message (<code>#92<\/code>).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">During this phase, other subagents are invoked as well, e.g., the <strong>Extract Bash Command<\/strong> subagent (<code>#93<\/code>), which uses only a one-shot prompt template to extract the bash command so that dangerous commands like <code>rm<\/code> are not run by accident without user confirmation.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is the full diagram of the Claude Code trace:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"535\" src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-5-1024x535.png\" alt=\"\" class=\"wp-image-994\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-5-1024x535.png 1024w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-5-300x157.png 300w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-5-768x401.png 768w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-5-1536x803.png 1536w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-5-2048x1070.png 2048w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-5-1200x627.png 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\"><strong>2. 
The Secret Pattern: Claude Code Is a Prefix Reuse Machine<\/strong><\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">During our trace analysis, one phenomenon was so consistent it deserves its own section:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><strong>Claude Code\u2019s prompts are extremely prefix-heavy.<\/strong><\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">Prefix reuse means that the leading portion of a prompt has already appeared as the prefix of an earlier prompt. Across all phases, the prefix reuse rate is extremely high: <strong>92%<\/strong>. For ReAct-based subagent loops, it&#8217;s even higher. Running the prefix-length analysis on each phase gives:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Trace ID<\/th><th>Total Tokens<\/th><th>Shared Prefix %<\/th><th>Notes<\/th><\/tr><\/thead><tbody><tr><td><code>#1-#6<\/code><\/td><td>47,177<\/td><td>0.22%<\/td><td>Warm-up and initial phase<\/td><\/tr><tr><td><code>#7-#45<\/code><\/td><td>546,104<\/td><td>92.06%<\/td><td>Explore subagent phase<\/td><\/tr><tr><td><code>#47-#72<\/code><\/td><td>528,286<\/td><td>93.23%<\/td><td>Plan subagent phase<\/td><\/tr><tr><td><code>#73-#92<\/code><\/td><td>827,411<\/td><td>97.83%<\/td><td>Main agent execution phase<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">What does this mean? Claude Code\u2019s architecture practically <strong>optimizes itself for KV cache reuse<\/strong>, even without explicitly trying.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\"><strong>3. 
What is prefix caching and why should I care?<\/strong><\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">At the heart of Large Language Model inference lies the <strong>KV cache<\/strong> (key-value cache) \u2014 a mechanism that stores intermediate attention computation results for previously processed tokens. During autoregressive generation, each new token needs to attend to all previous tokens, requiring expensive matrix multiplications. The KV cache stores the key and value matrices computed for earlier tokens, so they don&#8217;t need to be recomputed with each new token.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Prefix caching<\/strong> leverages this by recognizing that when multiple requests share the same prompt prefix (like system instructions or document context), their KV cache computations are identical and can be reused across requests.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Major LLM providers have turned this into significant cost savings:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong><a href=\"https:\/\/platform.openai.com\/docs\/guides\/prompt-caching\">OpenAI&#8217;s Prompt Caching<\/a><\/strong> handles prefix caching <strong>automatically<\/strong> \u2014 it detects common prefixes longer than 1,024 tokens and caches them transparently, offering a <strong>90% discount<\/strong> on cached input tokens (e.g., GPT-5.2 drops from $1.75 to $0.175 per million cached tokens) <img decoding=\"async\" src=\"https:\/\/raw.githubusercontent.com\/kobe0938\/blog\/master\/claude-code\/assets\/openai_cache_pricing.png\" alt=\"OpenAI Prompt Caching\"><\/li>\n\n\n\n<li><strong><a href=\"https:\/\/platform.claude.com\/docs\/en\/build-with-claude\/prompt-caching\">Anthropic&#8217;s cache hit pricing<\/a><\/strong> gives developers <strong>explicit control<\/strong> over which prompt blocks to cache using special <code>cache_control<\/code> markers, charging a slightly higher cache write cost (1.25x base price for 5-minute cache, 2x for 1-hour 
cache) but delivering the same <strong>90% discount<\/strong> on cache reads (Claude Sonnet 4.5: $0.30 per million tokens for cache reads versus $3.00 for base input), allowing fine-grained optimization for complex multi-turn conversations or document-heavy workflows <img decoding=\"async\" src=\"https:\/\/raw.githubusercontent.com\/kobe0938\/blog\/master\/claude-code\/assets\/anthropic_cache_pricing.png\" alt=\"Anthropic Prompt Caching\"><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">To put this in perspective with Claude Code&#8217;s 92% prefix reuse pattern: processing 2M input tokens (our consumption for the experiment) <strong>without caching<\/strong> would cost <strong>$6.00<\/strong> (2M \u00d7 $3\/MTok), but <strong>with prefix caching<\/strong>, the cost drops to just <strong>$1.152<\/strong> (1.84M cache hits \u00d7 $0.30\/MTok + 0.16M cache writes \u00d7 $3.75\/MTok) \u2014 a savings of <strong>$4.85 (81% reduction)<\/strong> over one simple task.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Open-source inference engines have also embraced this paradigm:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong><a href=\"https:\/\/docs.vllm.ai\/en\/latest\/features\/automatic_prefix_caching\/\">vLLM&#8217;s automatic prefix caching<\/a><\/strong> transparently caches shared prefixes using its PagedAttention mechanism<\/li>\n\n\n\n<li><strong><a href=\"https:\/\/docs.sglang.io\/advanced_features\/hicache_best_practices.html\">SGLang&#8217;s RadixAttention<\/a><\/strong> employs a radix tree data structure to efficiently match and reuse the longest common prefixes across requests<\/li>\n\n\n\n<li><strong><a href=\"https:\/\/github.com\/LMCache\/lmcache\">LMCache<\/a><\/strong> takes distributed KV caching even further by pooling cache storage across multiple nodes to maximize reuse at scale<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Beyond cost savings, prefix cache hits dramatically reduce <strong>TTFT (time to first token)<\/strong> \u2014 
since the model can skip recomputing the entire prefix and only process the unique suffix, latency for subsequent requests with shared context can drop by 5-10x, making conversational agents and document-grounded applications far more responsive.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\"><strong>4. What We Learned from This Tiny Trace<\/strong><\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">Even though the task was trivial, the trace reveals a lot about Claude Code as a system:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>The main system prompt is huge<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>It contains: Complete git repository state and history + full tool specifications (18 tools for main agent) + finally, execution phase instructions<\/li>\n\n\n\n<li>The prompt alone is <strong>20,000+ tokens<\/strong> without conversation history<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Claude Code is built around specialized subagents<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Subagents receive only role-specific context, reducing bloat<\/li>\n\n\n\n<li>Separation of context allows the main agent to only run on the summarized subagent responses<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Parallel execution is used to maximize exploration efficiency<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Subagents are spawned in parallel with different search goals under their own ReAct loop<\/li>\n\n\n\n<li>This separation allows clean context and focused subtasks, distributing context evenly<\/li>\n\n\n\n<li>Tool calls are also run in parallel for the same benefits<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>&#8220;Warm-up&#8221; calls prime the cache before real work begins<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>They load tool specifications into cache, prime subagent system prompts, and establish 
stable prefix baselines<\/li>\n\n\n\n<li>These calls drastically accelerate subsequent subagent invocations<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Claude works well with KV cache reuse<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Claude reaches up to <strong>92% overall prefix reuse<\/strong>, perfect for KV cache reuse optimization<\/li>\n\n\n\n<li>Results in a significant cost savings of <strong>$4.85 (81% reduction)<\/strong> over one simple task<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Interactive planning improves transparency<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Gives users control over what changes will be made<\/li>\n\n\n\n<li>Creates a natural breakpoint prompting the user for approval<\/li>\n\n\n\n<li>Responses allow the system to create a more refined executable todo list, improving workflow<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\"><strong>5. 
Beyond Prefix Caching: Can We Do Better?<\/strong><\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">Recently, there are some interesting research papers that try to improve non-prefix caching efficiency, such as <a href=\"https:\/\/arxiv.org\/abs\/2405.16444\">CacheBlend<\/a>, where optimizations can be made even on non-prefix (substring) caching.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"893\" src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-6-1024x893.png\" alt=\"\" class=\"wp-image-995\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-6-1024x893.png 1024w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-6-300x262.png 300w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-6-768x670.png 768w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-6-1200x1047.png 1200w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2025\/12\/image-6.png 1204w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">In our trace, we can see that the subagents have a tool list that is a subset of the main agent&#8217;s tool list, which means that the subagents can reuse the main agent&#8217;s tool list descriptions. This is a good example of how to improve non-prefix caching efficiency.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Another scenario in our trace is that if the same file was read multiple times, the file content can be cached and reused, even though the file content is not a prefix. This can be extremely helpful when the file content is large and the file is read multiple times.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Over the last few months, Claude Code has quietly become one of the most interesting &amp; widely-adopted real-world agentic systems available to normal developers. 
Unlike cloud-only agents whose internals remain hidden behind API gateways like Perplexity, Devin, or Manus, nor as fully open source agents like Mini SWE Agent or Terminus 2 where you can [&hellip;]<\/p>\n","protected":false},"author":271290518,"featured_media":987,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[31419,36160],"tags":[36099,36185,35881],"class_list":["post-985","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-benchmark","category-lmcache","tag-cacheblend-en","tag-claude-code","tag-lmcache"],"_links":{"self":[{"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/posts\/985","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/users\/271290518"}],"replies":[{"embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/comments?post=985"}],"version-history":[{"count":0,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/posts\/985\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/media\/987"}],"wp:attachment":[{"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/media?parent=985"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/categories?post=985"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/tags?post=985"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}