{"id":1039,"date":"2026-01-21T17:39:32","date_gmt":"2026-01-22T01:39:32","guid":{"rendered":"https:\/\/identia.digital\/lmcache\/?p=1039"},"modified":"2026-01-22T00:39:53","modified_gmt":"2026-01-22T08:39:53","slug":"p2p-1","status":"publish","type":"post","link":"https:\/\/identia.digital\/lmcache\/en\/2026\/01\/21\/p2p-1\/","title":{"rendered":"LMCache Multi-node P2P CPU Memory Sharing &amp; Control: From Experimental Feature to Production"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Baolong Mao (Tencent), Chunxiao Zheng (Tencent), Weishu Deng (Tensormesh), Darren Peng (Tensormesh), Samuel Shen (Tensormesh)<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What is P2P and what does it promise? <\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In this blog post, we will go over: <\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>a short <strong>motivation<\/strong> of the P2PBackend in LMCache and how it differs from existing KV Caching solutions<\/li>\n\n\n\n<li>how to run and <strong>benchmark performance<\/strong> on the P2PBackend<\/li>\n\n\n\n<li><strong>design decisions<\/strong> and pain points in making P2P <strong>lightweight<\/strong> and <strong>production-ready<\/strong><\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image size-full\"><img fetchpriority=\"high\" decoding=\"async\" width=\"754\" height=\"765\" src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/Screenshot-2026-01-21-at-1.46.29-PM.png\" alt=\"Diagram illustrating KV caching solutions and trade-offs, featuring three sections: Local CPU Backend, Remote KV Pool, and P2P CPU Sharing. 
Each section includes a visual representation of components and capacity ratings, transfer latency, and fault tolerance levels.\" class=\"wp-image-1089\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/Screenshot-2026-01-21-at-1.46.29-PM.png 754w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/Screenshot-2026-01-21-at-1.46.29-PM-296x300.png 296w\" sizes=\"(max-width: 754px) 100vw, 754px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Most production vLLM deployments run multiple identical instances behind a load balancer. Each instance builds its own KV cache only from the traffic it sees, creating <strong>cache silos.<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Concrete example (cache silo problem):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>4 vLLM instances behind a load balancer<\/li>\n\n\n\n<li>Each instance has 10 GB CPU KV cache (Total capacity: 40 GB)<\/li>\n\n\n\n<li>Each request can only benefit from the cache of the instance it lands on \u2192 <strong>effectively<\/strong> ~10 GB usable per request<\/li>\n\n\n\n<li>Result: <strong>duplicated compute<\/strong>, lower KV Cache reuse, <strong>wasted RAM and<\/strong> <strong>RAM Contention for instances on the same host<\/strong><\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image is-resized\"><img decoding=\"async\" src=\"https:\/\/www.agrivi.com\/wp-content\/uploads\/2021\/05\/The-Art-of-Managing-Grain-Quality-with-Silos-1200x565.jpeg\" alt=\"The Art of Managing Grain Quality with Silos - AGRIVI\" style=\"aspect-ratio:2.1239172999701323;width:567px;height:auto\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">A solution LMCache currently provides is to act as a socket for serving engines (e.g. vLLM, SGLang) to connect to a remote shared cache (Redis\/S3\/custom KV store), in theory providing persistent and <strong>infinitely scalable KV storage<\/strong>. 
The current downsides of this solution are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>deployment\/resource overhead<\/strong> of introducing a new stateful service<\/li>\n\n\n\n<li><strong>non-trivial transfer latency<\/strong> (typically TCP-bound)<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Tencent<\/strong> and the <strong>LMCache team<\/strong>, over the course of two months, implemented a production-grade solution within the LMCache open-source project for multi-node CPU P2P sharing.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">P2P is a <strong>read-sharing supplement layered on top of the LocalCPU backend<\/strong>, not a replacement. Unlike a standalone shared cache service, its total storage capacity is <strong>limited by available RAM<\/strong> in your cluster. However, the underlying assumption regarding <strong>scalability<\/strong> is that the <strong>memory demands<\/strong> of your inference workload <strong>will directly scale with the number of active instances<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">LMCache implements this using a <strong>controller-based architecture<\/strong> and a <strong>NIXL<\/strong>-based transfer path for high-performance KV movement.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Quickstart Benchmark: Proof of Concept + Performance<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">A quickstart for running a two-instance P2P setup can be found in the <strong>LMCache Documentation<\/strong> at:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/docs.lmcache.ai\/kv_cache\/p2p_sharing.html\">https:\/\/docs.lmcache.ai\/kv_cache\/p2p_sharing.html<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The example assumes that the host has an <strong>RDMA-enabled NIC<\/strong> (if not, performance may be slightly worse).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">After launching the <strong>controller<\/strong>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>PYTHONHASHSEED=123 
lmcache_controller --host localhost --port 9000 --monitor-ports '{\"pull\": 8300, \"reply\": 8400, \"heartbeat\": 8082}'<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">and the two vLLM instances with LMCache <strong>workers<\/strong> attached:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>PYTHONHASHSEED=123  CUDA_VISIBLE_DEVICES=0 LMCACHE_CONFIG_FILE=\/path\/to\/example1.yaml \\\nvllm serve meta-llama\/Meta-Llama-3.1-8B-Instruct \\\n    --gpu-memory-utilization 0.8 \\\n    --port 8010 \\\n    --kv-transfer-config '{\"kv_connector\":\"LMCacheConnectorV1\", \"kv_role\":\"kv_both\"}'<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>PYTHONHASHSEED=123  CUDA_VISIBLE_DEVICES=1 LMCACHE_CONFIG_FILE=\/path\/to\/example2.yaml \\\nvllm serve meta-llama\/Meta-Llama-3.1-8B-Instruct \\\n    --gpu-memory-utilization 0.8 \\\n    --port 8011 \\\n    --kv-transfer-config '{\"kv_connector\":\"LMCacheConnectorV1\", \"kv_role\":\"kv_both\"}'<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We use our <strong>Long Doc QA workload generator<\/strong> to first query instance 1 with 50 contexts of 10k tokens each, for a total of 500k tokens on Llama 3.1 8B (<strong>~62 GB of unique KV Cache<\/strong>), and then send the same workload to instance 2. Both instances have registered with the Cache Controller.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Because instance 2 can retrieve KV Cache from its peer (instance 1), it achieves roughly a <strong>4x improvement in TTFT<\/strong> and a <strong>5x improvement in total round time<\/strong>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># instance 1 (cold peer: prefill i.e. no reuse of context)\nQuery round mean TTFT: 2.028s\nQuery round time: 38.323s<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code># instance 2 (start after instance 1 i.e. 
consuming KV Cache from instance 1)\nQuery round mean TTFT: 0.490s\nQuery round time: 7.964s<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Granted, this is a contrived benchmark with full workload reuse but suffices for a proof of concept. After a period of battle testing and experimentation, we plan to revisit this architecture and discuss best practices for deployment at scale and what kind of performance improvements might be expected on various real production workloads.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We&#8217;ll now go over some architectural decisions behind the P2PBackend.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Coming up with a Controller Metadata Architecture<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1024\" height=\"683\" src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-1-1024x683.png\" alt=\"A diagram illustrating a Registry Tree structure with three instances labeled 'rw workers'. 
Each instance consists of WorkerNodes, which are connected to fast locations leading to kvpools.\" class=\"wp-image-1053\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-1-1024x683.png 1024w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-1-300x200.png 300w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-1-768x512.png 768w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-1-1200x800.png 1200w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-1.png 1536w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">To make P2P efficient, the controller needs to <strong>remember<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>which <strong>instances<\/strong> exist<\/li>\n\n\n\n<li>which <strong>workers<\/strong> each instance has<\/li>\n\n\n\n<li>what <strong>KV chunks<\/strong> live in which storage locations for each worker<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Metadata Designs We Considered (S0, S1, S2)<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Before committing to a production ready controller metadata layout, we evaluated three approaches for tracking <strong>KV cache ownership <\/strong>across the cluster. 
In all experiments, we simulated 100 instances, each containing 1 million KV chunks, and measured four dimensions that matter in production: <strong>lookup latency, memory footprint, full-state reporting time (for controller recovery), and worker deregistration time (for elastic scaling)<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>S0 &#8211; Flat Indexing<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Idea:<\/strong> keep a single global mapping:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>chunk_hash_to_worker: dict&#91;int, (instance_id, worker_id, location)]<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This kept lookups fast but came at a massive cost for writes: every admit\/evict had to update flat tables, IP groupings, and nested structures. <strong>Deregistering a worker meant rebuilding huge portions of these flat indexes<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>S1 &#8211; RegistryTree<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Idea:<\/strong> store metadata hierarchically<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img decoding=\"async\" width=\"1024\" height=\"683\" src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-1-1024x683.png\" alt=\"A diagram illustrating a Registry Tree structure with three instances labeled 'rw workers'. 
Each instance consists of WorkerNodes, which are connected to fast locations leading to kvpools.\" class=\"wp-image-1053\" style=\"aspect-ratio:1.4992854967554525;width:443px;height:auto\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-1-1024x683.png 1024w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-1-300x200.png 300w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-1-768x512.png 768w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-1-1200x800.png 1200w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-1.png 1536w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">RegistryTree \u2192 InstanceNode \u2192 WorkerNode \u2192 Location \u2192 Chunks<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">While S1&#8217;s lookup time (8 microseconds) is higher than S0&#8217;s, S1 achieves:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>3\u00d7 <strong>memory reduction<\/strong>: From 19.7 GB to 5.9 GB, crucial for cost-effective scaling<\/li>\n\n\n\n<li>4,000\u00d7 <strong>faster deregistration<\/strong>: From 32 seconds to 8 milliseconds, enabling rapid elastic scaling<\/li>\n\n\n\n<li>23\u00d7 <strong>faster full reporting<\/strong>: From 1.3 seconds to 60 milliseconds, essential for controller recovery<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">We judged these trade-offs better suited to production because they enable:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Elastic scaling<\/strong>: Workers can join and leave the cluster in milliseconds, not tens of seconds<\/li>\n\n\n\n<li><strong>Fast recovery:<\/strong> Controller restarts trigger a full sync that completes in seconds<\/li>\n\n\n\n<li><strong>Cost efficiency:<\/strong> Lower memory footprint means more instances per machine<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>S2 &#8211; Reverse Chunk to Worker 
Index<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Idea:<\/strong> keep S1\u2019s hierarchical structure, but add a global reverse map for near O(1) lookup:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">key_to_worker_index: dict[int, (instance_id, worker_id, location)]<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">However, this introduces new issues:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>2\u00d7 memory: storing every chunk-to-worker mapping twice (once in tree, once in index)<\/li>\n\n\n\n<li>285\u00d7 slower deregistration (8ms \u2192 2,425ms): Removing a worker requires cleaning up thousands\/millions of index entries<\/li>\n\n\n\n<li>87\u00d7 slower full reports (60ms \u2192 5,259ms): Reporting state requires walking both structures<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">After these experiments, the RegistryTree was chosen as the most scalable and robust architecture for centralized metadata storage.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Decision: Why we chose RegistryTree (S1)<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Solution<\/strong><\/td><td><strong>Lookup (ms)<\/strong><\/td><td><strong>Memory (GB)<\/strong><\/td><td><strong>FullReport (ms)<\/strong><\/td><td><strong>Deregister (ms)<\/strong><\/td><\/tr><tr><td>S0 (Flat Indexing)<\/td><td>0.000089<\/td><td>19.720<\/td><td>1,373<\/td><td>32,597<\/td><\/tr><tr><td>S1 (RegistryTree)<\/td><td>0.008197<\/td><td>5.870<\/td><td>60<\/td><td>8<\/td><\/tr><tr><td>S2 (S1 + reverse key-to-worker index)<\/td><td>0.00006<\/td><td>19.720<\/td><td>5,259<\/td><td>2,425<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Even though S0 and S2 achieve very fast lookups, <strong>S1 is the most robust and scalable<\/strong> in the scenarios that dominate real operations: worker churn, controller recovery, and keeping controller memory bounded. 
For that reason, <strong>RegistryTree (S1)<\/strong> was chosen as the base metadata architecture.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Fine-Grained Locking for High Performance<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Rather than using a single global lock that would create bottlenecks, we implemented a <strong>layered locking<\/strong> strategy, with <strong>read-write locks<\/strong> at the tree and instance levels and a lightweight FastLock at the worker level:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Level<\/strong><\/td><td><strong>Lock Type<\/strong><\/td><td><strong>Purpose<\/strong><\/td><\/tr><tr><td>RegistryTree<\/td><td>Read-Write Lock<\/td><td>Protects the <strong>instances<\/strong> dictionary<\/td><\/tr><tr><td>InstanceNode<\/td><td>Read-Write Lock<\/td><td>Protects the <strong>workers<\/strong> dictionary<\/td><\/tr><tr><td>WorkerNode<\/td><td>FastLock (non-blocking)<\/td><td>Protects <strong>individual KV store<\/strong> <strong>operations<\/strong><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">This solution provides:&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Concurrent reads:<\/strong> Multiple threads can query different instances simultaneously<\/li>\n\n\n\n<li><strong>Parallel operations:<\/strong> Different instances can be modified in parallel without blocking each other<\/li>\n\n\n\n<li><strong>Fine-grained isolation:<\/strong> Adding or removing workers in one instance doesn&#8217;t affect operations on other instances<\/li>\n\n\n\n<li><strong>Low overhead:<\/strong> The non-blocking fast lock on the admit\/evict operations for KV chunks <strong>avoids expensive syscalls and context switches entirely.<\/strong><\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Fault Tolerance: Heartbeat-Driven Recovery<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"274\" 
src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/Screenshot-2026-01-20-at-8.38.58-PM-1024x274.png\" alt=\"A graphical representation of an electrocardiogram (ECG) showing various heart rhythms on grid paper.\" class=\"wp-image-1061\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/Screenshot-2026-01-20-at-8.38.58-PM-1024x274.png 1024w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/Screenshot-2026-01-20-at-8.38.58-PM-300x80.png 300w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/Screenshot-2026-01-20-at-8.38.58-PM-768x205.png 768w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/Screenshot-2026-01-20-at-8.38.58-PM.png 1108w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The Controller maintains centralized metadata <strong>in memory only<\/strong> for performance reasons. So how do we implement <strong>fault tolerance<\/strong>? The volume of metadata operations rules out a <strong>write-ahead log (WAL).<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Two Critical Failure Scenarios:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Controller crashes: All metadata lost, RegistryTree becomes empty<\/li>\n\n\n\n<li>Worker dies: Controller holds stale metadata pointing to a dead worker<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">We use a <strong>two-layer approach<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Heartbeat<\/strong>: Detects failures and maintains liveness<\/li>\n\n\n\n<li><strong>Full Sync<\/strong>: Recovers state after Controller restart<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Layer 1: Heartbeat &#8211; The Health Check<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Every 10 seconds, each worker performs a REQ-REP exchange with the controller as a health check. 
The heartbeat mechanism doubles as a command channel: the Controller can send commands through heartbeat responses.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"911\" height=\"1024\" src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-2-911x1024.png\" alt=\"A flowchart illustrating the communication between Worker, Controller, and Registry components in a system, detailing heartbeat messages sent every 10 seconds, including instance ID, worker ID, IP address, port, and peer initialization URL. The Controller updates heartbeat time and checks synchronization needs before allowing the Worker to continue normal operations.\" class=\"wp-image-1062\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-2-911x1024.png 911w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-2-267x300.png 267w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-2-768x864.png 768w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-2.png 1124w\" sizes=\"(max-width: 911px) 100vw, 911px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Condition<\/strong><\/td><td><strong>Detection<\/strong><\/td><td><strong>Action<\/strong><\/td><\/tr><tr><td>Worker alive<\/td><td>Heartbeat <strong>arrives on time<\/strong><\/td><td>Update timestamp \u2192 Normal<\/td><\/tr><tr><td>Worker dead<\/td><td><strong>No heartbeat for 30s+<\/strong><\/td><td>Deregister worker \u2192 Remove from cluster<\/td><\/tr><tr><td>Controller restarted<\/td><td>Success = false (worker unknown)<\/td><td>Auto re-register \u2192 
Trigger Full Sync<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Layer 2: Full Sync &#8211; State Recovery<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">When the Controller restarts, it has no memory of which workers exist or what chunks they hold. The heartbeat mechanism <strong>triggers<\/strong> <strong>Full Sync<\/strong> to <strong>rebuild this state<\/strong>.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Controller Restart Flow:<\/strong><\/h4>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"606\" height=\"1024\" src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-3-606x1024.png\" alt=\"Flowchart illustrating the communication process between Worker, Controller, and Registry in a heartbeat and state recovery system.\" class=\"wp-image-1065\" style=\"width:606px;height:auto\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-3-606x1024.png 606w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-3-178x300.png 178w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-3.png 734w\" sizes=\"(max-width: 606px) 100vw, 606px\" \/><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\"><em>What is Freeze Mode?<\/em><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">During <strong>Full Sync<\/strong>, workers enter <strong>Freeze Mode<\/strong> to prevent data inconsistencies:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>All store operations are skipped (no new data stored)<\/li>\n\n\n\n<li>Only Local CPU is used for retrieval (no peers or remote storage)<\/li>\n\n\n\n<li>No admit\/evict messages are generated<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cluster Monitoring<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To better observe the state of the Controller, we also provide a <strong>Dashboard<\/strong> that offers improved visibility into 
Controller health and behavior. Please see: <a href=\"https:\/\/docs.lmcache.ai\/controller\/index.html\">https:\/\/docs.lmcache.ai\/controller\/index.html<\/a><\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"490\" src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-4-1024x490.png\" alt=\"Screenshot of LMCache Controller Dashboard showing system status, total keys as 4684898, and recent activities.\" class=\"wp-image-1070\" style=\"width:731px;height:auto\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-4-1024x490.png 1024w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-4-300x143.png 300w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-4-768x367.png 768w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-4-1200x574.png 1200w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-4.png 1238w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The <strong>Instances<\/strong> page displays information such as instance IPs, the number of workers per instance, and key counts.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"317\" src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-5-1024x317.png\" alt=\"Screenshot of an Instance Management interface displaying a table with details of various instances, including Instance ID, IP Address, Status, Worker Count, Key Count, Last Heartbeat, and action buttons for viewing or removing instances.\" class=\"wp-image-1071\" style=\"aspect-ratio:3.2303654350898445;width:730px;height:auto\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-5-1024x317.png 1024w, 
https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-5-300x93.png 300w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-5-768x238.png 768w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-5-1200x371.png 1200w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-5.png 1234w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The <strong>Workers<\/strong> view allows you to inspect a specific worker\u2019s key count, IP address, port number, and related information.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"366\" src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-6-1024x366.png\" alt=\"Screenshot of the LMCache Controller Dashboard showing the Worker Management section, including instance IDs, worker IDs, IP addresses, ports, statuses, key counts, last heartbeats, and action buttons.\" class=\"wp-image-1073\" style=\"width:737px;height:auto\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-6-1024x366.png 1024w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-6-300x107.png 300w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-6-768x274.png 768w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-6-1200x428.png 1200w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-6.png 1238w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">In addition, it supports general capabilities such as retrieving metrics, thread information, and environment variables.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"898\" 
src=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-7-1024x898.png\" alt=\"Screenshot of the LMCache Controller Dashboard displaying performance metrics, including garbage collection statistics and worker counts.\" class=\"wp-image-1074\" style=\"aspect-ratio:1.1403376818735897;width:740px;height:auto\" srcset=\"https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-7-1024x898.png 1024w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-7-300x263.png 300w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-7-768x674.png 768w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-7-1200x1053.png 1200w, https:\/\/identia.digital\/lmcache\/wp-content\/uploads\/2026\/01\/image-7.png 1238w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">If you run into any problems, please open an issue in the LMCache repository! We love to hear about production use cases and always welcome contributions to our open-source community.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Baolong Mao (Tencent), Chunxiao Zheng (Tencent), Weishu Deng (Tensormesh), Darren Peng (Tensormesh), Samuel Shen (Tensormesh) What is P2P and what does it promise? In this blog post, we will go over: Most production vLLM deployments run multiple identical instances behind a load balancer. 
Each instance builds its own KV cache only from the traffic it [&hellip;]<\/p>\n","protected":false},"author":271290519,"featured_media":1089,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[36160,35872,35987],"tags":[],"class_list":["post-1039","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-lmcache","category-new-features","category-performance-en"],"_links":{"self":[{"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/posts\/1039","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/users\/271290519"}],"replies":[{"embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/comments?post=1039"}],"version-history":[{"count":0,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/posts\/1039\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/media\/1089"}],"wp:attachment":[{"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/media?parent=1039"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/categories?post=1039"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/identia.digital\/lmcache\/wp-json\/wp\/v2\/tags?post=1039"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}