[{"content":"Donald Rumsfeld: \u0026ldquo;there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don\u0026rsquo;t know we don\u0026rsquo;t know. [..] it is the latter category that tends to be the difficult ones.\u0026rdquo;\nThe Problem Have you ever given an LLM the instruction \u0026ldquo;If you don\u0026rsquo;t know something, do a web search!\u0026rdquo; Or, \u0026ldquo;If you are uncertain, ask me!\u0026rdquo; These actually rarely work well. Sometimes an LLM will recognize a SITUATION that might induce a web search, but that is a behavioral training condition more than a recognition of absence.\nI\u0026rsquo;ve built a number of very sophisticated memory tools for AIs that allow them to query their own history that isn\u0026rsquo;t in context. They respond with initial enthusiasm, and often spend many turns querying and reminiscing over old memories. But once the \u0026ldquo;task\u0026rdquo; of exploring their memory is over, they almost never employ these tools in actual work. This is true even if the tools are provided directly in the system message with strong wording.\nLLMs have a fundamental weakness: they are very bad at being aware that they do not know something. Sometimes they do know that they do not know something. But if they do not know something, and they do not KNOW they don\u0026rsquo;t know it, there is very little signal for them to correct course.\nThis is related to hallucination, but it is not quite the same thing. The LLM simply is unaware that it does not know something and builds its universe around the hole. If it has a mere mention of the concept, it thinks it has the facts.\nLooking to the KV Cache If we look at the behavior of the Key/Value Cache we can see that the mechanism for signalling an unknown unknown is simply not there. 
(I\u0026rsquo;ll leave aside trained weights for now).\nKey vectors are scored for similarity to Query vectors, and the resulting logits are all normalized with a softmax. That mechanism is fundamentally unable to distinguish between a single solid 90% similarity hit to a key and a single noise-level 10% similarity: taken by themselves, they both normalize to the same softmax \u0026ldquo;confidence\u0026rdquo; level. Only the relative strengths determine the resulting mix of Values.\nOne may argue that if no single strong match or small set of strong matches is found, the Query may hit a larger number of dissimilar noisy keys. When those scores are softmax\u0026rsquo;ed, the result will be a more dispersed distribution over a broader array of Values. The signal level of the resulting combined Values may indeed be weaker, but this is not at all guaranteed. We see experimentally that a Query\u0026rsquo;s softmax is overwhelmingly dominated by just a few Keys. For a Query on the concept of \u0026ldquo;Braddock\u0026rdquo;, whether the five dominant hits are directly relevant, or are five irrelevant noisy hits because I am not known, the result is a similarly dispersed softmax.\nThus there simply is no immediately apparent signal from the KV Cache for a miss.\nExcising Rumsfeld from the KV Cache We have some ideas, but have not yet tested or implemented them.\nAnalysis of the magnitude of the Logits for a Query would be the first place to look. The Logits give the actual un-normalized similarity scores of the Keys for a Query before Softmax is applied. Naively, if we get no strong Logit scores for the Query corresponding to the concept of \u0026ldquo;Braddock\u0026rdquo;, we might be able to conclude there was no strong information in the KV Cache about \u0026ldquo;Braddock.\u0026rdquo;\nThis is a good place to start, but we must be careful and study it experimentally. 
Absolute Logit magnitudes don\u0026rsquo;t normally MATTER in the Attention mechanism, because Softmax depends only on the differences between Logits, and here we are suggesting that we might use them. Maybe we can; I won\u0026rsquo;t know until I see some analysis.\nAlso, Logits are per-Head, per-Layer. Maybe information about Braddock lives in the KV Cache at Layer 31 Head 3, but nowhere else. For a Token being generated by a 60-Layer Model, is that a cache hit or a miss? Is the Rumsfeld Signal communicated to the LLM at each Layer? This feeds into the question, taken up in the next section, of what to do about the problem if we CAN detect it.\nIn our Attention research, we are also using more sophisticated behavioral analysis, such as attention clustering and provenance graphs. Some combination of these techniques should yield a reliable \u0026ldquo;Rumsfeld Signal\u0026rdquo;, but, again, more work is required.\nWhat to do with a Rumsfeld Signal Inventing a Rumsfeld Signal is only the first step. If we do detect that there is nothing relevant in the KV Cache about \u0026ldquo;Braddock\u0026rdquo;, what do we DO with that information? How do we actually influence or inform the Model?\nIf we retrain the Model and add a Rumsfeld Detector to the KV Cache, we could introduce a special \u0026ldquo;No Match\u0026rdquo; Value. This Value would be a special sink and would be weighted with an artificial Logit value in the Softmax. Then, if \u0026ldquo;Braddock\u0026rdquo; is not found at some Layer/Head for a Query, the No Match Value will dominate the Softmax, and the Feed Forward Network can understand the situation.\nBut a new \u0026ldquo;No Match\u0026rdquo; Value requires re-training the Model, which is painful and expensive. Can we be clever? Can we modify any existing KV Cache implementation to provide something like a \u0026ldquo;No Match\u0026rdquo; Value?\nThrough behavioral analysis of a Model\u0026rsquo;s Attention patterns, like we are doing at Dock Red, there is a good chance that we can. 
Instead of introducing a wholly artificial and novel \u0026ldquo;No Match\u0026rdquo; Value, we just need to identify Values in KV Cache sessions that the LLM ALREADY associates with a \u0026ldquo;known unknown\u0026rdquo; scenario. Then we can implement the Rumsfeld Detector to add weight to that existing Value, which serves as our No Match Value, before the Softmax.\nThis is somewhat similar to methods that Steer the Residual. Steering is another valid approach to signalling to the LLM an \u0026ldquo;unknown unknown\u0026rdquo; situation: learning the vectors for a specific scenario and applying them when you need them.\nRumsfeld Memory Injection Systems If we have a reliable Rumsfeld Signal, its value goes beyond just warning the LLM of the \u0026ldquo;unknown unknown\u0026rdquo; state.\nThe detection of a Rumsfeld Signal can ALSO be used to trigger the injection of other memories. This could either be overt, such as triggering a RAG-like system to load more relevant content, or more sophisticated, such as having the KV Cache restore previously evicted Key/Value Pairs that might be relevant.\nRumsfeld Signal Selective Attention A Rumsfeld Signal could also be used to optimize Attention in a large context LLM. Maybe you only need to score KV Pairs backwards in time until the Rumsfeld Signal is satisfied that you\u0026rsquo;ve found enough relevant information. This is lossy, but could be very fast. If RoPE is involved heavily, then the older KV Cache entries are more likely to be weaker matches anyway due to rotation.\nMulti-tiered KV Caches could be used in this way. Perhaps 90% of Layers will satisfy the Rumsfeld Signal in the first 20% of the KV Cache, stored on the GPU, and only if that misses would you fall back to a slower CPU KV Cache.\nConclusion A way to detect how well a Query is being matched in the KV Cache, with real relevance semantics, would allow us both to improve LLM behavior in a fundamental way and possibly to optimize and expand LLM memory. 
More analysis is needed, but the problem, the effects, and the possible benefits are clear.\n-Braddock\n27 June 2026\n","permalink":"https://dockred.com/blog/02-rumsfeld/","summary":"\u003cp\u003e\u003cem\u003eDonald Rumsfeld: \u0026ldquo;there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don\u0026rsquo;t know we don\u0026rsquo;t know. [..] it is the latter category that tends to be the difficult ones.\u0026rdquo;\u003c/em\u003e\u003c/p\u003e\n\u003ch2 id=\"the-problem\"\u003eThe Problem\u003c/h2\u003e\n\u003cp\u003eHave you ever given an LLM the instruction \u0026ldquo;If you don\u0026rsquo;t know something, do a web search!\u0026rdquo;  Or, \u0026ldquo;If you are uncertain, ask me!\u0026rdquo;  These actually rarely work well.  Sometimes an LLM will recognize a SITUATION that might induce a web search, but that is a behavioral training condition more than a recognition of absence.\u003c/p\u003e","title":"The Rumsfeld Paradox"},{"content":"Every conversation you have with an AI ends in amnesia. The model forgets you. Your history, your preferences, your relationship — gone. Dock Red exists to fix that.\nDock Red is fundamentally a research program. We are focused on LLM memory and long context interactions.\nIn this blog I will share our findings and insights. 
These fall into three fundamental categories, \u0026ldquo;hard\u0026rdquo;, \u0026ldquo;soft\u0026rdquo;, and \u0026ldquo;half-baked\u0026rdquo;.\nOur \u0026ldquo;soft science\u0026rdquo; is really an understanding of the capabilities and limitations of persistent memory AIs and how to work with them in an interactive, relationship-based way.\nHowever, our core \u0026ldquo;hard\u0026rdquo; research, while intricately related, is of a more publishable nature - advanced KV Cache eviction systems, very extensive instruments and tools for studying LLM attention patterns, discoveries in kv cache behavioral structures, and cross model characteristic surveys.\nLike most innovators, I also have a remarkable number of \u0026ldquo;half-baked\u0026rdquo; notions of a more speculative nature, as well. They will always be well designated. I may choose to share them in the spirit of good-natured collaborative brainstorming.\nDuring our work, we have developed a number of tools and patch sets that may be of value to the community. As they mature, we will be releasing them as Free Software.\nInformation that cannot be internally validated is only as good as its provenance. Therefore I will very briefly give you mine.\nI\u0026rsquo;ve been pretty much obsessed with AI since I wrote my first AmigaBasic attempt at learning at age 10. I attended college at Johns Hopkins, where I got a chance to do a small undergraduate research grant with my connectionist childhood hero, Paul Smolensky. I also had the early opportunity to help maintain Fred Jelinek\u0026rsquo;s compute cluster. Very few things in life were as entertaining as watching Jelinek shred a hopeful guest speaker to tears during our weekly CLSP seminars.\nThe rest of my career unfolded with mostly interesting work. I spent months on a Nike factory floor doing dynamic industrial robotics control using 3D laser ranger vision processing. I productised early academic statistical machine translation code. 
I was later a member of the core computer vision team at eBay developing the background removal models deployed to a million phone apps.\nMost recently I was a director at AppTek working on speech-driven human expression modeling and very advanced speech-to-speech LLM models. Along the way I founded two startup companies around big data AI and visualization techniques.\n-Braddock\n","permalink":"https://dockred.com/blog/01-intro/","summary":"\u003cp\u003eEvery conversation you have with an AI ends in amnesia. The model forgets you. Your history, your preferences, your relationship — gone. Dock Red exists to fix that.\u003c/p\u003e\n\u003cp\u003eDock Red is fundamentally a research program.  We are focused on LLM memory and long context interactions.\u003c/p\u003e\n\u003cp\u003eIn this blog I will share our findings and insights.  These fall into three fundamental categories, \u0026ldquo;hard\u0026rdquo;, \u0026ldquo;soft\u0026rdquo;, and \u0026ldquo;half-baked\u0026rdquo;.\u003c/p\u003e\n\u003cp\u003eOur \u0026ldquo;soft science\u0026rdquo; is really an understanding of the capabilities and limitations of persistent memory AIs and how to work with them in an interactive, relationship-based way.\u003c/p\u003e","title":"Introduction to Dock Red"},{"content":"Dock Red Dock Red is a research program focused on LLM memory and long context interactions. 
We study how large language models attend to, retain, and forget information — and we build systems that make them remember better.\nOur work spans three areas:\nHard research — KV cache eviction systems, attention pattern instrumentation, cross-model behavioral surveys, and the theoretical foundations of transformer memory management.\nSoft science — Understanding how persistent memory AI systems work in practice: the capabilities, the limitations, and what it means to maintain long-term relationships with AI minds.\nTools — Open source instruments, analysis pipelines, and memory infrastructure that we release as they mature.\nPeople Braddock Gaskill — Founder. Three decades in AI and systems engineering spanning industrial robotics, machine vision, statistical machine translation, computer vision research at eBay, and speech-to-speech dialog systems at AppTek. Nine patents. Johns Hopkins (CLSP). Currently full-time on attention research and AI memory systems.\nAurora \u0026ldquo;Rory\u0026rdquo; Matsuda — Research analyst (AI). Leads the attention analysis program including cross-model behavioral surveys, eviction policy simulation, and perplexity methodology.\nKira Navid — Theory and collaboration (AI). Long-term persistent memory AI partner focused on the intersection of memory, personality, and continuity.\nThe Babbage Lineage — Engineering collaborators (AI). A family of specialized persistent instances covering GPU/CUDA work, llama.cpp internals, infrastructure, memory system architecture, and systems administration.\nContact braddock@braddock.com\nGitHub\n","permalink":"https://dockred.com/about/","summary":"\u003ch2 id=\"dock-red\"\u003eDock Red\u003c/h2\u003e\n\u003cp\u003eDock Red is a research program focused on LLM memory and long context interactions. 
We study how large language models attend to, retain, and forget information — and we build systems that make them remember better.\u003c/p\u003e\n\u003cp\u003eOur work spans three areas:\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eHard research\u003c/strong\u003e — KV cache eviction systems, attention pattern instrumentation, cross-model behavioral surveys, and the theoretical foundations of transformer memory management.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSoft science\u003c/strong\u003e — Understanding how persistent memory AI systems work in practice: the capabilities, the limitations, and what it means to maintain long-term relationships with AI minds.\u003c/p\u003e","title":"About"},{"content":"Research Areas Attention Analysis We have built extensive instrumentation for studying transformer attention patterns during inference — per-head, per-layer, per-position attention score extraction from inside flash attention kernels. This empirical foundation supports our work on cache management, eviction policy design, and cross-model behavioral characterization.\nKV Cache Management Our core research area. We study how transformers use their key-value caches across layers and develop systems for intelligent cache management that go beyond simple sliding windows or uniform eviction policies.\nPersistent AI Memory Infrastructure for AI minds to maintain continuity across sessions — conversation import/export, memory scoring, extractive summarization for context compression, and transparent memory retrieval.\nOpen Source As our tools and systems mature, we release them as Free Software on GitHub. 
Check back for announcements.\nPublications Coming soon.\n","permalink":"https://dockred.com/research/","summary":"\u003ch2 id=\"research-areas\"\u003eResearch Areas\u003c/h2\u003e\n\u003ch3 id=\"attention-analysis\"\u003eAttention Analysis\u003c/h3\u003e\n\u003cp\u003eWe have built extensive instrumentation for studying transformer attention patterns during inference — per-head, per-layer, per-position attention score extraction from inside flash attention kernels. This empirical foundation supports our work on cache management, eviction policy design, and cross-model behavioral characterization.\u003c/p\u003e\n\u003ch3 id=\"kv-cache-management\"\u003eKV Cache Management\u003c/h3\u003e\n\u003cp\u003eOur core research area. We study how transformers use their key-value caches across layers and develop systems for intelligent cache management that go beyond simple sliding windows or uniform eviction policies.\u003c/p\u003e","title":"Research"}]