{"id":1232,"date":"2026-04-03T11:17:22","date_gmt":"2026-04-03T18:17:22","guid":{"rendered":"https:\/\/identia.digital\/lmcache\/?p=1232"},"modified":"2026-04-03T12:29:53","modified_gmt":"2026-04-03T19:29:53","slug":"lmcaches-new-architecture-boosts-moe-inference-performance-by-10x","status":"publish","type":"post","link":"https:\/\/identia.digital\/lmcache\/en\/2026\/04\/03\/lmcaches-new-architecture-boosts-moe-inference-performance-by-10x\/","title":{"rendered":"LMCache\u2019s New Architecture Boosts MoE Inference Performance by 10\u00d7"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior context, leading to substantial reuse of previously computed states. In production, systems must minimize time-to-first-token (TTFT) while maintaining stable decoding throughput under heavy concurrent load. 
<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"886\" height=\"734\" src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/04\/dp-model-1.png\" alt=\"Diagram illustrating a distributed processing system with multiple data paths (DP 0, DP 1, DP N) featuring attention modules and load balancer.\" class=\"wp-image-1352\" style=\"aspect-ratio:1.2070952772996033;width:594px;height:auto\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/04\/dp-model-1.png 886w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/04\/dp-model-1-300x249.png 300w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/04\/dp-model-1-768x636.png 768w\" sizes=\"(max-width: 886px) 100vw, 886px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">To meet these demands, vLLM commonly combines Data Parallelism (DP) with automatic Expert Parallelism (EP). Compared to pure tensor parallelism (TP) at a similar scale, this configuration consistently delivers better throughput (TPS) and lower TTFT in real-world deployments. In this design, attention layers are replicated across GPUs while expert layers are sharded, ensuring that latency-critical attention computation and KV-cache access remain local. This reduces per-GPU memory pressure and enables more efficient utilization of compute resources.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Each request is routed to a specific DP rank, where attention is executed locally. During MoE layers, tokens are dynamically dispatched to experts across GPUs and then aggregated. This architecture provides strong scalability, efficient memory usage, and minimal communication overhead\u2014making it well suited for serving large MoE models.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">However, a key limitation remains: each DP rank maintains its own isolated KV-cache buffer within its local process. 
Even with CPU offloading enabled, these caches are not shared across workers. As a result, identical or overlapping contexts processed by different ranks cannot be reused, leading to redundant prefill computation and inefficient memory utilization.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>LMCache\u2019s Multi-Process (MP) mode fundamentally addresses this limitation by introducing a unified KV-cache layer.<\/strong> Instead of maintaining fragmented, per-process KV buffers, MP mode centralizes KV-cache management into a shared memory layer that is accessible across all serving processes. This unified design enables true cross-process cache reuse, eliminating redundant computation and significantly improving memory efficiency. The benefits are especially pronounced in multi-turn workloads, where context grows incrementally and naturally creates high cache reuse opportunities across requests and processes.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img decoding=\"async\" width=\"1024\" height=\"401\" src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/04\/offload-3-1024x401.png\" alt=\"Diagram illustrating the In-Process offload and Unified KV modes in a memory architecture, showing data pathways between DRAM and HBMs.\" class=\"wp-image-1350\" style=\"aspect-ratio:2.553666906072939;width:712px;height:auto\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/04\/offload-3-1024x401.png 1024w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/04\/offload-3-300x117.png 300w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/04\/offload-3-768x301.png 768w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/04\/offload-3.png 1078w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Deployment and 
Benchmarking<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Below, we present the deployment configurations for both in-process offloading and LMCache MP mode. We evaluate their effectiveness using a conversation-based benchmark that reflects realistic multi-turn workloads on the Qwen3-235B-A22B-Instruct-2507-FP8 model. All experiments are conducted on a server with 8\u00d7 NVIDIA H100 80GB GPUs using vLLM 0.18.1 and LMCache 0.4.3-dev (with the official 0.4.3 release forthcoming).<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>In-Process offload<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Step 1: Launch vLLM Serving<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We first launch a baseline vLLM instance with 8-way data parallelism and in-process KV-cache offloading:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"bash\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"false\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Disable the FlashInfer FP8 MoE path and enable DeepGEMM kernels.\nexport VLLM_USE_FLASHINFER_MOE_FP8=0\nexport VLLM_USE_DEEP_GEMM=1\n\n# 8-way DP with expert parallelism; 50 GB of host memory per rank for KV offloading.\nvllm serve Qwen\/Qwen3-235B-A22B-Instruct-2507-FP8 \\\n  --data-parallel-size 8 --enable-expert-parallel --gpu-memory-utilization 0.8 \\\n  --max-num-batched-tokens 1024 --kv-offloading-size 50 \\\n  --disable-hybrid-kv-cache-manager \\\n  --max-model-len auto\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">In this setup, each data-parallel rank is allocated 50 GB of host memory for KV-cache offloading, adding up to 400 GB of CPU memory in aggregate (8 ranks \u00d7 50 GB). 
While this allows the system to support long-context workloads and maintain high GPU utilization, the KV cache remains <strong>process-local and fragmented<\/strong>, preventing reuse across workers.<br>(Note: --disable-hybrid-kv-cache-manager is required for HMA models with the native OffloadingConnector.)<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>LMCache MP offload<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Unlike in-process offloading, LMCache MP mode runs as an independent service. It dynamically detects and registers serving engines at runtime, enabling seamless integration across multiple serving processes.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Step 1: Start LMCache MP Server<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We first launch the LMCache multi-process server with 400 GB of host memory, matching the aggregate host memory of the in-process baseline:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"bash\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"false\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">lmcache server --l1-size-gb 400 --eviction-policy LRU<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Step 2: Launch vLLM with LMCache MP<\/strong><\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"false\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Same kernel settings as the in-process baseline.\nexport VLLM_USE_FLASHINFER_MOE_FP8=0\nexport VLLM_USE_DEEP_GEMM=1\n\n# Same DP\/EP layout; KV offloading now goes through the LMCache MP connector.\nvllm serve Qwen\/Qwen3-235B-A22B-Instruct-2507-FP8 \\\n  --data-parallel-size 8 --enable-expert-parallel --gpu-memory-utilization 0.8 \\\n  --max-num-batched-tokens 1024 --max-model-len auto \\\n  --kv-transfer-config '{\"kv_connector\":\"LMCacheMPConnector\",\"kv_role\":\"kv_both\"}'<\/pre>\n\n\n\n<p 
class=\"wp-block-paragraph\">In this configuration, all GPU workers dynamically register with the LMCache server and access a <strong>shared KV-cache pool backed by host memory<\/strong>. This transforms KV-cache management from isolated per-process buffers into a <strong>unified, system-wide caching layer<\/strong>, enabling efficient reuse across DP ranks and even across independent serving instances.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Multi-Turn Benchmark<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">To capture realistic behavior, we evaluate using a multi-round conversation benchmark:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"bash\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"false\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">lmcache bench engine --engine-url http:\/\/localhost:8000 --workload multi-round-chat \\\n  --mrc-qps 2.0 --mrc-duration 120<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This workload naturally exposes repeated prefixes and growing context, making KV-cache reuse critical for performance.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Key Difference<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>In-process offload:<\/strong> KV cache is fragmented across DP ranks \u2192 no cross-process reuse<\/li>\n\n\n\n<li><strong>LMCache MP mode:<\/strong> KV cache is unified at the host layer \u2192 
shared across all processes<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">This architectural shift\u2014from <strong>isolated buffers to a unified KV-cache layer<\/strong>\u2014is the core reason behind the performance gains.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img decoding=\"async\" width=\"785\" height=\"297\" src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/04\/output.png\" alt=\"Bar chart comparing average TTFT in seconds for In-Process Offload and LMCache MP, highlighting a 13.6x reduction for LMCache MP.\" class=\"wp-image-1363\" style=\"width:1000px;height:auto\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/04\/output.png 785w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/04\/output-300x114.png 300w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/04\/output-768x291.png 768w\" sizes=\"(max-width: 785px) 100vw, 785px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><th class=\"has-text-align-center\" data-align=\"center\"><strong>Metric<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>Statistic<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>LMCache MP Mode<\/strong><\/th><th class=\"has-text-align-center\" data-align=\"center\"><strong>In-Process Offload<\/strong><\/th><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\"><strong>TTFT (s)<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Mean<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\">0.29<\/td><td class=\"has-text-align-center\" data-align=\"center\">3.98<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\"><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>p99<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\">1.30<\/td><td 
class=\"has-text-align-center\" data-align=\"center\">13.55<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\"><strong>Decoding Speed<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Mean<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\">37.47 tok\/s<\/td><td class=\"has-text-align-center\" data-align=\"center\">9.81 tok\/s<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\"><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>p99<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\">45.14 tok\/s<\/td><td class=\"has-text-align-center\" data-align=\"center\">34.27 tok\/s<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">LMCache MP mode delivers substantial improvements in both latency and system efficiency. It reduces TTFT by approximately <strong>13\u00d7 on average<\/strong> (0.29 s vs. 3.98 s) and improves tail latency by over <strong>10\u00d7 at p99<\/strong> (1.30 s vs. 13.55 s). Mean decoding speed also increases by nearly <strong>4\u00d7<\/strong> (37.5 vs. 9.8 tok\/s).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">These gains stem directly from the <strong>unified host-side KV-cache layer<\/strong>, which increases cache hit rates and eliminates redundant prefill computation. By avoiding repeated work and reducing memory fragmentation, MP mode frees up GPU resources for decoding and delivers more stable, predictable performance under high concurrency.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>LMCache MP Mode Roadmap<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Currently, LMCache MP mode operates at the node level, where multiple serving processes (across DP ranks and independent instances) share a unified KV-cache layer backed by host memory (L1). 
This design centralizes cache management, reduces per-process overhead, and integrates seamlessly with L2 storage backends, including local storage and remote connectors.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Looking ahead, MP mode is being extended beyond a single node. Upcoming features such as peer-to-peer (P2P) cache sharing and prefill\u2013decode (PD) disaggregation will enable cross-node KV reuse and distributed cache orchestration. These advancements will evolve the unified KV-cache layer from a node-local optimization into a cluster-wide caching system, unlocking even greater scalability and efficiency. Enhanced observability is also planned to provide deeper visibility into cache behavior and system performance.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For more details, refer to the documentation: https:\/\/docs.lmcache.ai\/mp\/index.html<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Modern LLM serving workloads are defined by strict latency requirements, high concurrency, and rapidly growing context lengths. Applications such as multi-turn chat, AI agents, and retrieval-augmented generation continuously build on prior context, leading to substantial reuse of previously computed states. 
In production, systems must minimize time-to-first-token (TTFT) while maintaining stable decoding throughput under heavy concurrent [&hellip;]<\/p>\n","protected":false},"author":271290526,"featured_media":1363,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[31419,36160,35872,35987,35876],"tags":[35881,35985],"class_list":["post-1232","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-benchmark","category-lmcache","category-new-features","category-performance-en","category-tutorial","tag-lmcache","tag-vllm-en"],"_links":{"self":[{"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/posts\/1232","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/users\/271290526"}],"replies":[{"embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/comments?post=1232"}],"version-history":[{"count":0,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/posts\/1232\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/media\/1363"}],"wp:attachment":[{"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/media?parent=1232"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/categories?post=1232"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/tags?post=1232"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}