{"id":725,"date":"2025-09-18T09:30:00","date_gmt":"2025-09-18T16:30:00","guid":{"rendered":"https:\/\/identia.digital\/lmcache\/en\/?p=725"},"modified":"2025-10-30T14:18:10","modified_gmt":"2025-10-30T21:18:10","slug":"nvidia-dynamo-integrates-lmcache-accelerating-llm-inference","status":"publish","type":"post","link":"https:\/\/identia.digital\/lmcache\/en\/2025\/09\/18\/nvidia-dynamo-integrates-lmcache-accelerating-llm-inference\/","title":{"rendered":"NVIDIA Dynamo integrates LMCache, Accelerating LLM Inference"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">We&#8217;re thrilled to announce that <a href=\"https:\/\/github.com\/ai-dynamo\/dynamo\"><strong>NVIDIA Dynamo<\/strong><\/a> <strong>has integrated <a href=\"https:\/\/github.com\/LMCache\/LMCache\">LMCache<\/a> as a <a href=\"https:\/\/docs.nvidia.com\/dynamo\/latest\/components\/backends\/vllm\/LMCache_Integration.html\">KV caching layer solution<\/a><\/strong>. This is a big milestone: Dynamo gets a battle-tested caching solution, and LMCache becomes part of a data center-scale inference platform used by many developers worldwide to deploy AI at scale.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/github.com\/user-attachments\/assets\/84aa0337-6292-4f2c-aa3a-de12a0b61c22\" alt=\"image\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">For comprehensive details about Dynamo&#8217;s KV cache optimization capabilities, see the <strong><a href=\"https:\/\/developer.nvidia.com\/blog\/how-to-reduce-kv-cache-bottlenecks-with-nvidia-dynamo\/\">NVIDIA Developer Blog post on reducing KV cache bottlenecks<\/a><\/strong>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"why-kv-caching-matters\"><strong>Why KV Caching Matters<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">KV caching is a foundational optimization for modern LLM inference. 
Instead of recomputing the expensive prefill phase for every new query, KV caching allows reuse of previously computed key\/value pairs. This reuse <strong>skips large portions of prefill computation<\/strong>, dramatically reducing end-to-end latency while increasing throughput and efficiency.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We\u2019ve explored this in detail in earlier posts, such as our <a href=\"https:\/\/identia.digital\/lmcache\/2025-05-16-release\/\">Turbocharging LLM Inference blog<\/a>, where we showed how KV cache reuse not only reduces single-query latency but also enables more efficient multi-turn interactions and higher cluster utilization.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">With Dynamo now supporting LMCache as a caching layer, these benefits become <strong>first-class citizens<\/strong> in the Dynamo platform.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-this-collaboration-delivers\"><strong>What This Collaboration Delivers<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This collaboration focuses on two technical fronts:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"1-kv-cache-offloading-and-reuse\"><strong>1. KV Cache Offloading and Reuse<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">By default, KV cache is stored in GPU memory, which limits scale and context persistence. With this integration, Dynamo can now <strong>offload KV cache to external storage layers<\/strong> using LMCache while maintaining efficient reuse across queries. This integration is available in the Dynamo repository: <a href=\"https:\/\/github.com\/ai-dynamo\/dynamo\/pull\/2079\">ai-dynamo\/dynamo#2079<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This combination enables scenarios like:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Reusing KV cache across multiple sessions or even inference engines.  <\/li>\n\n\n\n<li>Freeing up GPU memory for active compute while keeping context cached externally.  
<\/li>\n\n\n\n<li>Reducing prefill costs for long-context models by persisting and reloading KV segments.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"2-kv-cache-storage-backends\"><strong>2. KV Cache Storage Backends<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Beyond offloading KV cache, Dynamo and LMCache now support flexible storage backends. For example, the <a href=\"https:\/\/github.com\/LMCache\/LMCache\/blob\/dev\/lmcache\/v1\/storage_backend\/nixl_storage_backend.py\">NIXL storage backend<\/a> offers high-throughput, low-latency access optimized for LLM workloads. NIXL support is now available in the LMCache repository: <a href=\"https:\/\/github.com\/LMCache\/LMCache\/pull\/1223\">LMCache\/LMCache#1223<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This unlocks more advanced workflows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Persistent caches across application restarts.  <\/li>\n\n\n\n<li>Hybrid caching strategies (GPU memory + CPU memory + SSD) for balancing speed and cost.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"technical-reference\"><strong>Technical Reference<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">For a deeper dive into the motivation, design scope, and integration details, see the official <a href=\"https:\/\/docs.nvidia.com\/dynamo\/latest\/components\/backends\/vllm\/LMCache_Integration.html\">NVIDIA Dynamo documentation on LMCache integration<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For more technical details about how Dynamo reduces KV cache bottlenecks and the broader context of this integration, check out the <strong><a href=\"https:\/\/developer.nvidia.com\/blog\/how-to-reduce-kv-cache-bottlenecks-with-nvidia-dynamo\/\">NVIDIA Developer Blog post on KV Cache optimization with Dynamo<\/a><\/strong>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"looking-ahead\"><strong>Looking 
Ahead<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">We\u2019re excited to see how developers and enterprises adopt this integration in production. With KV caching becoming a standard practice across the industry, the LMCache and Dynamo integration ensures that the ecosystem can move faster, serve more users, and deliver lower-latency AI applications.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Together with the Dynamo team, we\u2019re laying the foundation for a <strong>more efficient, flexible, and cost-effective KV caching layer for LLM inference at scale<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Acknowledgements<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Special thanks to Vikram Mailthody, Harry Kim, Ashutosh Malegaonkar, Suman Taitraju, Richard Huo, Omri Kahalon, Vishwanath Venkatesan, Adit Ranadive, Pen Chung Li, John Kim, and David Edelsohn, in close collaboration with LMCache contributors from TensorMesh. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>We&#8217;re thrilled to announce that NVIDIA Dynamo has integrated LMCache as a KV caching layer solution. This is a big milestone: Dynamo gets a battle-tested caching solution, and LMCache becomes part of a data center-scale inference platform used by many developers worldwide to deploy AI at scale. 
For comprehensive details about Dynamo&#8217;s KV cache optimization [&hellip;]<\/p>\n","protected":false},"author":271290516,"featured_media":727,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[35979],"tags":[35981,35881,35983,35985],"class_list":["post-725","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-news-en","tag-dynamo-en","tag-lmcache","tag-nvidia-en","tag-vllm-en"],"_links":{"self":[{"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/posts\/725","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/users\/271290516"}],"replies":[{"embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/comments?post=725"}],"version-history":[{"count":0,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/posts\/725\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/media\/727"}],"wp:attachment":[{"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/media?parent=725"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/categories?post=725"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/tags?post=725"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}