LLMs have a particularly gnarly resource utilization shape. The "prefill" stage, when processing a prompt, is [[Graphics Processing Unit|GPU]] compute heavy, while the "decode" stage (generating tokens) is memory bandwidth heavy. Mixing these workloads in an unbalanced way can cause stalls, particularly during token generation.

To make request handling efficient and economical, it matters that requests whose blocks are already resident in the KV-cache of a GPU, such as one that has recently processed a prompt with the same prefix, are dramatically less resource intensive to serve. From this standpoint, routing requests becomes incredibly important, as most inference providers will have large fleets of GPU clusters able to serve inference requests.

Fortunately, the routing part is generally pretty easy to handle. You establish a "block size" in number of tokens as the unit of cached values. (I'm not entirely certain if this is informed by the GPU/inference server or if it's arbitrary.) As requests come in, you tokenize the "context" or input of the request, splitting the resulting token IDs into block-sized groups. You can then hash each block's token IDs together with some metadata about the model and, for any block that isn't the "root" block, the previous block's hash, and store the result for future routing (see the first sketch at the end of this note). The smart routing algorithm then becomes a standard load/capacity-based one, biased by any matching cached block hashes. A matched hash implies, with relatively high certainty, that the block is resident in that server's KV cache.

# Cache Eviction

Cache eviction becomes a distributed data synchronization problem in this model. As new requests are processed, the KV-cache must evict unused records. (I'm unclear on what the data in these records actually is, or whether this is even the correct mental model for the KV-cache.) The routing system needs to learn about these evictions as quickly as possible; otherwise it can incorrectly "score" a request's cache factor and potentially overload an inference server. To mitigate this, it should be sufficient to use LRU-style timestamps stored in the cache residency records and expire records older than some threshold. The inference server can also explicitly notify the router when records get evicted, keeping the residency map more accurate, with the tradeoff of some delay (see the second sketch below). It is not yet clear if this will be sufficient for real-world workloads.
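To make the block hashing concrete, here's a minimal sketch. It assumes a SHA-256 chain over (model metadata, previous block hash, token IDs); the block size, function name, and payload encoding are my own assumptions, not taken from any real router.

```python
import hashlib


def block_hashes(token_ids: list[int], model_id: str, block_size: int = 16) -> list[str]:
    """Chain-hash token IDs into block-sized groups.

    Each block's hash covers the model identifier, the previous block's
    hash (empty for the root block), and the block's token IDs, so a
    single hash identifies an entire prefix. Only full blocks are
    hashed; a trailing partial block can't be prefix-matched anyway.
    """
    hashes: list[str] = []
    prev_hash = ""  # the root block has no parent hash
    for start in range(0, len(token_ids), block_size):
        block = token_ids[start:start + block_size]
        if len(block) < block_size:
            break  # skip the partial trailing block
        payload = f"{model_id}|{prev_hash}|{','.join(map(str, block))}"
        prev_hash = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev_hash)
    return hashes
```

Because the hashes chain through the previous block's hash, a match on block *n* implies matches on blocks 0 through *n - 1*, which is what makes a single hash lookup stand in for a whole prefix.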
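And a rough sketch of the routing side: a load/capacity score biased by matched block hashes, with LRU-style timestamp expiry plus an explicit eviction hook. All of the names, the weighting, and the capacity model here are assumptions for illustration, not a real system's API.

```python
import time


class Router:
    """Hypothetical cache-aware router (all names/weights are assumptions).

    `residency` maps server id -> {block_hash: last_seen_timestamp}.
    Entries older than `max_age_s` are treated as evicted, per the
    LRU-style expiry described above.
    """

    def __init__(self, capacity: dict[str, float], max_age_s: float = 60.0,
                 cache_weight: float = 0.5):
        self.capacity = capacity  # server -> normalized load headroom, 0..1
        self.residency: dict[str, dict[str, float]] = {s: {} for s in capacity}
        self.max_age_s = max_age_s
        self.cache_weight = cache_weight

    def record_blocks(self, server: str, hashes: list[str]) -> None:
        """Note that these blocks are (probably) now resident on `server`."""
        now = time.monotonic()
        for h in hashes:
            self.residency[server][h] = now

    def evict(self, server: str, hashes: list[str]) -> None:
        """Explicit eviction notification from the inference server."""
        for h in hashes:
            self.residency[server].pop(h, None)

    def _cached_prefix_len(self, server: str, hashes: list[str]) -> int:
        now = time.monotonic()
        resident = self.residency[server]
        n = 0
        for h in hashes:  # hashes chain, so stop at the first miss
            ts = resident.get(h)
            if ts is None or now - ts > self.max_age_s:
                break
            n += 1
        return n

    def pick_server(self, hashes: list[str]) -> str:
        """Blend load headroom with the fraction of the prefix cached."""
        def score(server: str) -> float:
            hit = self._cached_prefix_len(server, hashes) / max(len(hashes), 1)
            return (1 - self.cache_weight) * self.capacity[server] \
                + self.cache_weight * hit
        return max(self.capacity, key=score)
```

The timestamp expiry makes the router safe by default, since stale residency entries decay into cache misses rather than bad routing decisions, while explicit eviction notifications tighten accuracy whenever they arrive; the two mechanisms compose rather than compete.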