The problem with this approach is that even recomputing a "draft" of the KV cache is still quadratic in context length. Maybe you can get some constant savings by always recomputing the earliest tokens, but it's not a good tradeoff as context sizes grow.
But text generation is already quadratic overall even with the KV cache optimization. If every decode step now has to recompute the KV cache, including its latest and most expensive tokens (even with a quick "draft" model), that's even worse.
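To put rough numbers on that, here is a back-of-the-envelope sketch. It only counts attention reads, and the function names and `draft_factor` are my own made-up stand-ins, not anything from the proposal.

```python
def cached_decode_cost(n_tokens: int) -> int:
    """Attention reads over a whole generation when past K/V is kept.

    Step t attends over t cached entries, so the total is 1 + 2 + ... + n.
    """
    return sum(range(1, n_tokens + 1))


def recompute_decode_cost(n_tokens: int, draft_factor: float = 1.0) -> float:
    """Same generation, but every step first rebuilds K/V for all t tokens.

    Rebuilding t tokens with causal attention costs about t*(t+1)/2 reads;
    draft_factor < 1 models a cheaper (hypothetical) draft model.
    """
    return sum(draft_factor * t * (t + 1) / 2 for t in range(1, n_tokens + 1))


for n in (1_000, 10_000, 100_000):
    ratio = recompute_decode_cost(n, draft_factor=0.01) / cached_decode_cost(n)
    print(f"n={n:>7}: recompute / cached = {ratio:.1f}")
```

In this toy model the ratio grows linearly with context length, so a constant-factor cheaper draft only delays the point where recomputing loses.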
Would you say it is homoiconic, similar to LISP where the syntax of the language is the AST; so, data can become code (Macros) and code can be data (the S-Expression)?
You can use the original model to compress the KV cache and get ∞x compression, since the prediction is perfect. The cost is time, and I don't see how this could be worth it.
The tradeoff gets better the bigger your primary model, and probably with bigger batch sizes. The KV cache can consume a lot of expensive VRAM, and the VRAM and compute costs of the predictor model become a small fraction of the cost of the primary model.
For serving a 1T model with 16 concurrent requests this could make a lot of sense. For an 8B model with a single request, far less so.
A tiny deterministic model predicts the K/V cache, the prediction is compared with reality, and the delta is stored in VRAM.
Reading goes the other way round: predict the values again, apply the delta, and you have the full correct value while only storing the delta.
And this works because you're never looking at the whole K/V cache but always just a slice. So you just need a memory buffer the size of the slice.
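A minimal sketch of that predict / delta / reconstruct loop, as I understand it. Everything here is a toy assumption: `predict_row` is a hypothetical stand-in for the tiny model (a real predictor would look at context, not just the token id), and the K/V values are plain integers so the round trip is exact.

```python
from typing import List

Row = List[int]  # one token's quantized K/V values (toy stand-in)


def predict_row(token_id: int, width: int) -> Row:
    """Hypothetical tiny predictor: deterministic, same input gives same output."""
    return [(token_id * 31 + d * 7) % 256 for d in range(width)]


def encode(tokens: List[int], true_kv: List[Row]) -> List[Row]:
    """Compare prediction with reality and keep only the delta."""
    width = len(true_kv[0])
    return [
        [t - p for t, p in zip(row, predict_row(tok, width))]
        for tok, row in zip(tokens, true_kv)
    ]


def decode_slice(tokens: List[int], deltas: List[Row], start: int, stop: int) -> List[Row]:
    """Rebuild only rows [start, stop): predict again, then apply the delta."""
    width = len(deltas[0])
    return [
        [p + d for p, d in zip(predict_row(tok, width), delta)]
        for tok, delta in zip(tokens[start:stop], deltas[start:stop])
    ]


tokens = [5, 9, 2, 7]
# Pretend the real K/V is the prediction plus a small error.
true_kv = [[v + (i % 3) - 1 for i, v in enumerate(predict_row(t, 8))] for t in tokens]
deltas = encode(tokens, true_kv)
assert decode_slice(tokens, deltas, 1, 3) == true_kv[1:3]
```

Note the deltas have the same shape as the cache itself, so any savings depend on them being small enough to store in fewer bits than the original values.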
___
If this works out and I've understood correctly, _I think_ that would mean a 24GB RTX 4090 could fit 256k q8 context next to Qwen3.6-27B at IQ4_NL.
Or, alternatively, something like 208k context (matching Claude API limits of 200k in some plans) with a slightly larger quant like UD-Q4_K_XL.
That would be massive. Especially since the thing has so much compute to spare.
Though, all depending on the size of that predictor model I guess?
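For anyone who wants to check the arithmetic, the usual uncompressed K/V cache size estimate looks like the sketch below. The layer count, KV head count and head dimension are made-up placeholders, NOT the real config of the model mentioned above, and q8 is approximated as 1 byte per value, ignoring per-block scale overhead.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: float) -> float:
    # Factor 2: one K and one V vector per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


GIB = 1024 ** 3

# Hypothetical config, for illustration only.
size = kv_cache_bytes(
    n_tokens=256 * 1024,
    n_layers=64,
    n_kv_heads=8,
    head_dim=128,
    bytes_per_value=1,  # rough stand-in for q8
)
print(f"{size / GIB:.1f} GiB")
```

With these placeholder numbers the plain cache alone comes to 32 GiB, so plug in the real model config before drawing conclusions about what fits in 24GB.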
The cache can be backed by hardware/lookup, or by a cheap computation. The line between functions and data is really blurry.