The Rumsfeld Paradox

Donald Rumsfeld: “there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don’t know we don’t know. [..] it is the latter category that tends to be the difficult ones.”

The Problem

Have you ever given an LLM the instruction “If you don’t know something, do a web search!” or “If you are uncertain, ask me!”? These rarely work well. Sometimes an LLM will recognize a SITUATION that might induce a web search, but that is a trained behavior more than a recognition of absence.

I’ve built a number of very sophisticated memory tools for AIs that allow them to query their own history that isn’t in context. They respond with initial enthusiasm, and often spend many turns querying and reminiscing over old memories. But once the “task” of exploring their memory is over, they almost never employ these tools in actual work. This is true even if the tools are provided directly in the system message with strong wording.

LLMs have a very fundamental weakness: they are very bad at being aware that they do not know something. Sometimes they do know that they do not know something. But if they do not know something, and they do not KNOW they don’t know it, there is very little signal to prompt a correction.

This is related to hallucination, but it is not quite the same thing. The LLM is simply unaware that it does not know something, and it builds its universe around the hole. If it has a mere mention of a concept, it thinks it has the facts.

Looking to the KV Cache

If we look at the behavior of the Key/Value Cache, we can see that the mechanism for signalling an unknown unknown is simply not there. (I’ll leave aside trained weights for now.)

Key vectors are scored for similarity to Query vectors, and the resulting logits are all normalized by a softmax. That mechanism is fundamentally unable to distinguish between a single solid 90% similarity hit on a Key and a single noise-level 10% similarity hit: taken by themselves, both normalize to the same softmax “confidence” level. Only the relative strengths determine the resulting mix of Values.
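To make that concrete, here is a minimal sketch in plain Python. The scores are made-up numbers, not measurements from any real model. It shows two things: a lone score always normalizes to a weight of 1.0, and shifting every score by the same constant leaves the softmax weights unchanged.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A lone key always receives weight 1.0, whatever its raw score.
strong_alone = softmax([0.9])
weak_alone = softmax([0.1])

# Shifting every score by the same constant leaves the weights unchanged,
# so "every key matches well" and "every key matches poorly" look the same.
good_matches = softmax([9.0, 8.0, 7.0])
poor_matches = softmax([1.0, 0.0, -1.0])

print(strong_alone, weak_alone)
print(good_matches)
print(poor_matches)
```

The two three-key lists differ by a constant offset of 8, so only the gaps between scores survive the softmax.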

One may argue that if a single strong match, or a small number of strong matches, is not found, the Query may instead hit a larger number of dissimilar noisy Keys. When those scores are softmax’ed, the result is a more dispersed distribution over a broader array of Values. The signal level of the resulting combined Values may indeed be weaker, but this is not at all guaranteed. We see experimentally that a Query’s softmax is overwhelmingly dominated by just a few Keys. For a Query on the concept of “Braddock”, whether the top five hits are directly relevant, or are five irrelevant noisy hits because I am not known, the result is a similarly dispersed softmax.

Thus there simply is no immediately apparent signal from the KV Cache for a miss.

Excising Rumsfeld from the KV Cache

We have some ideas, but have not yet tested or implemented them.

Analysis of the magnitude of the Logits for a Query would be the first place to look. The Logits give the actual un-normalized similarity scores of the Keys for a Query before Softmax is applied. Naively, if we get no strong Logit scores for the Query corresponding to the concept of “Braddock”, we might be able to conclude there was no strong information in the KV Cache about “Braddock.”
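As a rough illustration of the naive idea, the sketch below computes scaled dot-product scores and flags a miss when no score reaches a threshold. The function names, the threshold of 1.0, and the toy vectors are all assumptions made up for this example. Whether any fixed threshold is meaningful in a real model is exactly the open question.

```python
import math

def attention_logits(query, keys):
    """Scaled dot-product scores q.k / sqrt(d), before any softmax."""
    scale = math.sqrt(len(query))
    return [sum(q * k for q, k in zip(query, key)) / scale for key in keys]

def naive_miss_signal(query, keys, threshold):
    """Hypothetical detector: report a miss when no raw score reaches threshold."""
    peak = max(attention_logits(query, keys))
    return peak < threshold, peak

# Toy 4-dimensional vectors, invented for illustration.
query = [1.0, 0.0, 0.0, 0.0]
cache_with_hit = [[4.0, 0.0, 0.0, 0.0], [0.1, 0.2, 0.0, 0.0]]
cache_without_hit = [[0.2, 0.1, 0.0, 0.0], [0.1, 0.2, 0.0, 0.0]]

hit_missed, hit_peak = naive_miss_signal(query, cache_with_hit, threshold=1.0)
miss_missed, miss_peak = naive_miss_signal(query, cache_without_hit, threshold=1.0)
print(hit_missed, hit_peak)
print(miss_missed, miss_peak)
```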

This is a good place to start, but we must be careful and study it experimentally. Softmax depends only on the differences between Logits, so their absolute magnitudes don’t normally MATTER in the Attention mechanism, and here we are saying that maybe we could use them. Maybe we can; I won’t know until I see some analysis.

Also, Logits are per-Head, per-Layer. Maybe information about Braddock lives in the KV Cache at Layer 31 Head 3, but nowhere else. For a Token being generated by a 60-Layer Model, is that a cache hit or a miss? Is the Rumsfeld Signal communicated to the LLM at each Layer? This feeds into the next section: what to do about the problem if we CAN detect it.
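The aggregation question can be framed with a small sketch. It assumes we already have one peak raw score per (Layer, Head), which is itself hypothetical, and the numbers are invented. It reports two candidate summaries side by side: “any Head hit” and “fraction of Heads that hit”. I am not claiming either is the right policy.

```python
def summarize_heads(peak_logits, threshold):
    """peak_logits maps (layer, head) -> strongest raw score seen by that head."""
    hits = sorted(lh for lh, peak in peak_logits.items() if peak >= threshold)
    return {
        "hit_heads": hits,
        "any_hit": bool(hits),
        "hit_fraction": len(hits) / len(peak_logits),
    }

# Made-up numbers: only Layer 31, Head 3 carries a strong match.
peaks = {(30, 3): 0.2, (31, 3): 5.0, (31, 4): 0.1, (32, 3): 0.3}
summary = summarize_heads(peaks, threshold=1.0)
print(summary)
```

Under “any Head hit” this is a cache hit. Under a majority rule it is a miss. Which reading matches the Model’s actual behavior has to be settled experimentally.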

In our Attention research, we are also using more sophisticated behavioral analysis, such as attention clustering and provenance graphs. Some combination of these techniques should yield a reliable “Rumsfeld Signal”, but, again, more work is required.

What to do with a Rumsfeld Signal

Inventing a Rumsfeld Signal is only the first step. If we do detect that there is nothing relevant in the KV Cache about “Braddock”, what do we DO with that information?? How do we actually influence or inform the Model?

If we retrain the Model and add a Rumsfeld Detector to the KV Cache, we could introduce a special “No Match” Value. This Value would be a special sink and would be weighted with an artificial Logit value in the Softmax. Then, if “Braddock” is not found at some Layer/Head for a Query, the No Match Value will dominate the Softmax, and the Feed Forward Network can understand the situation.
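Here is a minimal sketch of just the Softmax mechanics of such a sink. The names `attend_with_no_match`, `no_match_logit`, and `no_match_value` are invented for this example, and the vectors are toy data. It does not show the hard part, which is training the Model to interpret the No Match Value.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attend_with_no_match(logits, values, no_match_logit, no_match_value):
    """Append a sink entry with a fixed logit, then mix Values as usual."""
    weights = softmax(logits + [no_match_logit])
    all_values = values + [no_match_value]
    dim = len(no_match_value)
    mixed = [sum(w * v[i] for w, v in zip(weights, all_values)) for i in range(dim)]
    return mixed, weights[-1]  # weights[-1] is the share taken by the sink

values = [[1.0, 0.0], [1.0, 0.0]]   # ordinary Values (toy data)
no_match_value = [0.0, 1.0]         # hypothetical sink Value

# A strong real match pushes the sink aside.
_, sink_when_strong = attend_with_no_match([6.0, 0.0], values, 2.0, no_match_value)
# With only noise-level scores, the sink takes most of the weight.
_, sink_when_weak = attend_with_no_match([0.1, -0.2], values, 2.0, no_match_value)
print(sink_when_strong, sink_when_weak)
```

The fixed logit acts as an absolute reference point inside a mechanism that otherwise only sees relative scores.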

But a new “No Match” Value requires re-training the Model, which is painful and expensive. Can we be clever? Can we modify any existing KV Cache implementation to provide something like a “No Match” Value?

Through behavioral analysis of a Model’s Attention patterns, like we are doing at Dock Red, there is a good chance that we can. Instead of introducing a wholly artificial and novel “No Match” Value, we just need to find Values in KV Cache sessions which the LLM ALREADY treats as marking a “known unknown” scenario. Then we can implement the Rumsfeld Detector to add weight to that existing No Match Value before the Softmax.

This is somewhat similar to methods that Steer the Residual. Steering is another valid approach to signalling to the LLM an “unknown unknown” situation - learning the vectors for a specific scenario and applying them when you need them.

Rumsfeld Memory Injection Systems

A reliable Rumsfeld Signal would be useful for more than just warning the LLM of the “unknown unknown” state.

The detection of a Rumsfeld Signal can ALSO be used to trigger the injection of other memories. This could either be overt, such as triggering a RAG-like system to load more relevant content, or more sophisticated, such as having the KV Cache restore previously evicted Key/Value Pairs that might be relevant.
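The second option, restoring evicted pairs, might look something like this sketch. The `restore_on_miss` function, the separate evicted store, and the threshold are all assumptions for illustration. Real KV Cache implementations do not necessarily keep evicted entries around.

```python
import math

def score(query, key):
    """Scaled dot-product score q.k / sqrt(d)."""
    return sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))

def restore_on_miss(query, live, evicted, threshold):
    """live and evicted are lists of (key, value) pairs.

    If no live Key reaches the threshold, bring back the evicted pairs
    whose Keys do. Returns the new live list and the number restored.
    """
    if any(score(query, key) >= threshold for key, _ in live):
        return live, 0
    restored = [(key, value) for key, value in evicted
                if score(query, key) >= threshold]
    return live + restored, len(restored)

query = [1.0, 0.0, 0.0, 0.0]
live = [([0.1, 0.0, 0.0, 0.0], "recent chatter")]
evicted = [([4.0, 0.0, 0.0, 0.0], "old but relevant"),
           ([0.0, 1.0, 0.0, 0.0], "old and unrelated")]

new_live, restored_count = restore_on_miss(query, live, evicted, threshold=1.0)
print(restored_count, [value for _, value in new_live])
```

An overt RAG-style trigger would have the same shape, with the evicted store swapped for an external retrieval call.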

Rumsfeld Signal Selective Attention

A Rumsfeld Signal could also be used to optimize Attention in a large context LLM. Maybe you only need to score KV Pairs backwards in time until the Rumsfeld Signal is satisfied that you’ve found enough relevant information. This is lossy, but could be very fast. If RoPE is involved heavily, then the older KV Cache entries are more likely to be weaker matches anyway due to rotation.
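A toy version of the backwards scan is sketched below. It assumes the Keys are stored oldest-first, and the “satisfied” rule (stop after `needed` scores reach `threshold`) is a placeholder I made up, not a tested criterion.

```python
import math

def scan_backwards(query, keys, threshold, needed):
    """Score Keys newest-first; stop once `needed` scores reach threshold.

    Returns the indices of the matching Keys and how many Keys were scored.
    """
    scale = math.sqrt(len(query))
    found = []
    scored = 0
    for index in range(len(keys) - 1, -1, -1):
        logit = sum(q * k for q, k in zip(query, keys[index])) / scale
        scored += 1
        if logit >= threshold:
            found.append(index)
            if len(found) >= needed:
                break
    return found, scored

query = [1.0, 0.0, 0.0, 0.0]
keys = [                      # index 0 is the oldest entry
    [4.0, 0.0, 0.0, 0.0],     # an old match that the scan never reaches
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [4.0, 0.0, 0.0, 0.0],     # the most recent match
    [0.1, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
found, scored = scan_backwards(query, keys, threshold=1.0, needed=1)
print(found, scored)
```

The older match at index 0 is skipped entirely, which is the lossy part. Only half of the Keys were scored, which is the fast part.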

Multi-tiered KV Caches could be used in this way. Perhaps 90% of Layers will satisfy the Rumsfeld Signal from the first 20% of the KV Cache, stored on the GPU. If and only if that misses, maybe you can fall back to a slower CPU KV Cache.
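The fallback logic itself is simple, as this sketch suggests. The two Python lists stand in for a GPU tier and a CPU tier, and `tiered_lookup` is a hypothetical name. Nothing here touches real device memory.

```python
import math

def peak_score(query, keys):
    """Strongest scaled dot-product score in a tier (-inf if the tier is empty)."""
    scale = math.sqrt(len(query))
    return max((sum(q * k for q, k in zip(query, key)) / scale for key in keys),
               default=float("-inf"))

def tiered_lookup(query, gpu_keys, cpu_keys, threshold):
    """Check the fast tier first; touch the slow tier only on a miss."""
    gpu_peak = peak_score(query, gpu_keys)
    if gpu_peak >= threshold:
        return "gpu", gpu_peak
    cpu_peak = peak_score(query, cpu_keys)
    if cpu_peak >= threshold:
        return "cpu", cpu_peak
    return "miss", max(gpu_peak, cpu_peak)

query = [1.0, 0.0, 0.0, 0.0]
strong = [4.0, 0.0, 0.0, 0.0]
weak = [0.1, 0.0, 0.0, 0.0]

print(tiered_lookup(query, [strong], [weak], threshold=1.0))
print(tiered_lookup(query, [weak], [strong], threshold=1.0))
print(tiered_lookup(query, [weak], [weak], threshold=1.0))
```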

Conclusion

A way to detect how well a Query is being matched in the KV Cache with real relevance semantics allows us to both improve LLM behavior in a fundamental way, and possibly optimize and expand LLM memory. More analysis is needed, but the problem, the effects, and the possible benefits are clear.

-Braddock

27 June 2026
