There is a tool-result cache sitting between the SDK and the tools. Each call is normalized and checked before executing; on a hit we return straight from the cache. On a miss we fall through to the semantic cache, which embeds the prompt and runs a KNN lookup via valkey-search; if the cosine distance is close enough, we again skip the LLM and stream the cached response. In both cases, on a miss we store the prompt embedding along with the actual model and the input and output token counts from OpenAI's usage report, so a future hit has the dollars avoided as data.
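A minimal sketch of the exact-match tier, assuming a plain dict standing in for the Valkey instance; the key derivation, the `call_tool` helper, and the per-token prices are all illustrative, not the demo's actual code or pricing:

```python
import hashlib
import json

def exact_cache_key(tool_name: str, args: dict) -> str:
    """Normalize a tool call into a stable key: sorted args, compact JSON."""
    normalized = json.dumps(
        {"tool": tool_name, "args": args},
        sort_keys=True,
        separators=(",", ":"),
    )
    return "toolcache:" + hashlib.sha256(normalized.encode()).hexdigest()

# Hypothetical per-token prices (USD per 1M tokens) -- illustrative only.
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

def dollars_avoided(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost a cache hit avoids, from the usage report stored at miss time."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

cache: dict[str, dict] = {}  # in-memory stand-in for the Valkey instance

def call_tool(tool_name, args, run_tool, model="gpt-4o-mini"):
    key = exact_cache_key(tool_name, args)
    if key in cache:  # hit: skip execution, report the savings
        hit = cache[key]
        return hit["output"], dollars_avoided(model, hit["in_tok"], hit["out_tok"])
    output, in_tok, out_tok = run_tool(tool_name, args)  # miss: execute and record
    cache[key] = {"output": output, "in_tok": in_tok, "out_tok": out_tok}
    return output, 0.0
```

Sorting the args before hashing means two calls that differ only in argument order still land on the same key.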
The two tiers handle different shapes of repetition. Predefined questions, copy-pasted questions, and re-checking the same thing again after some time all produce byte-identical strings the tool cache catches. Human paraphrase is what the semantic tier exists for.
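The semantic tier's decision rule can be sketched like this, with toy 3-d vectors standing in for real embedding output and a made-up distance threshold (the demo's actual cutoff and embedding model are not shown here):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0 means identical direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

THRESHOLD = 0.15  # illustrative cutoff, not the demo's actual setting

def semantic_hit(query_vec, cached_entries):
    """Return the nearest cached entry within the threshold, else None (KNN of 1)."""
    best = min(
        cached_entries,
        key=lambda e: cosine_distance(query_vec, e["vec"]),
        default=None,
    )
    if best and cosine_distance(query_vec, best["vec"]) <= THRESHOLD:
        return best
    return None

# Toy embeddings: a paraphrase lands near the cached prompt, an unrelated
# question lands far away.
cached = [{"vec": [0.9, 0.1, 0.0], "answer": "XADD is O(1)."}]
paraphrase = [0.88, 0.15, 0.01]
unrelated = [0.0, 0.1, 0.99]
```

In the demo the KNN search itself is done server-side by valkey-search over stored embeddings; the threshold comparison is the same idea.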
This Wednesday was a bank holiday where I live, so I used it to extend things further: the libraries the chat relies on now store metadata in the Valkey (or Redis, if that's your preference) instance, and our monitoring reads and analyzes that data and suggests improvements. The suggestions are also exported through our MCP server, so the chat's agent can review and create suggestions as well, and since this is just a demo, it can even approve its own suggestions (do not do this in a real production environment, unless you are a true LLM believer). The libs also read their config from the Valkey instance, so no restart is needed. I hooked it up to a cron job inside Vercel and let it run over the night and the next day.
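The restart-free config part can be sketched as a small reader that re-fetches a config key on an interval; the key name, interval, and in-memory store are all stand-ins for whatever the libs actually use against Valkey:

```python
import json
import time

class LiveConfig:
    """Re-reads tuning config from the store on an interval, so values the
    monitor (or agent) writes take effect without a restart or redeploy."""

    def __init__(self, store, key="betterdb:config", ttl_seconds=30.0):
        self._store = store          # anything with .get(key), e.g. a Valkey client
        self._key = key
        self._ttl = ttl_seconds
        self._cached = {}
        self._loaded_at = 0.0

    def get(self, name, default=None):
        now = time.monotonic()
        if now - self._loaded_at > self._ttl:
            raw = self._store.get(self._key)  # mirrors a Valkey GET
            self._cached = json.loads(raw) if raw else {}
            self._loaded_at = now
        return self._cached.get(name, default)

store = {}  # in-memory stand-in; swap for a real Valkey client
config = LiveConfig(store, ttl_seconds=0.0)  # ttl 0 => re-read on every call
store["betterdb:config"] = json.dumps({"tool_cache_ttl": 600})
```

Polling keeps the libs decoupled from the tuner: the monitoring side just writes a new JSON blob, and callers pick it up on the next interval.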
Between Run 1 and Run 3, it started making fewer tool calls. On the first run it suggested several different TTL changes and applied them. Run 2 produced suggestions similar to Run 1's, because the TTL is the wrong point of control: those tools take natural language input (`How fast is XADD?` vs `XADD performance` are two different strings that "mean" the same thing), so the tool cache never fires and the queries are only covered by the semantic cache. An actual fix would be to move these tools from the exact-match checks into the semantic cache checks - a code change, not a config change. The repeated suggestions were an indicator of a problem the system can't fix on its own. In the future the routing might also become configurable, to solve this without redeploying and to test and verify in quicker loops. Run 3 just didn't propose anything new - 15 -> 13 -> 8 tool calls across the three runs.
Curious how others running similar loops decide what the agent can touch. Am I too skeptical of hallucinations and overly cautious?
The chat can be found at https://chat.betterdb.com (it has links to all of the repos in it), and a more detailed write-up is at https://www.betterdb.com/blog/cache-that-tunes-itself