Abstract. We present a 28-layer neuromorphic cognitive architecture for autonomous AI agents, where every layer maps to a specific human brain region. The system implements: (1) a thalamic relay that classifies and routes all input; (2) dedicated memory stores — working (dorsolateral PFC), episodic (hippocampus), semantic (parietal lobe), procedural (motor strip) — with specialized access patterns; (3) emotional valence tagging with flashbulb memory protection (amygdala); (4) a deterministic indexed retrieval system (research neurons, L22) with 1,500+ entries achieving <30ms lookups at $0 cost; (5) a 3-stage memory pipeline — rolling window pruner → context recycler → sleep flush consolidator; (6) six systems introduced in March 2026: hypothalamus (homeostatic drives), anterior cingulate cortex (conflict monitoring), VTA (dopamine reward prediction), insula (confidence calibration), reticular activating system (arousal control), and entorhinal cortex (temporal context). All 12 active tool layers are pure Python, deterministic, and cost $0. On the LongMemEval benchmark (ICLR 2025), the system achieves 100% accuracy (48/48) on the oracle variant and 83.3% on the full 130k-token haystack. Real-world recall: 93% (14/15 live tasks). Brain operation: 100% (20/20). The system runs 24/7 on commodity hardware at $0 inference cost.
Large language model agents increasingly operate across extended timeframes — managing projects, maintaining relationships, learning from mistakes, and adapting strategies over weeks and months. Yet the memory problem remains the central bottleneck: most agents either compress everything into a context window (losing detail) or dump everything into a vector database (losing structure).
Human memory solves this through specialization. The hippocampus handles episodes, the prefrontal cortex manages working memory, the amygdala tags emotional intensity, the cerebellum automates learned procedures, the thalamus relays sensory input, and the basal ganglia gate action selection. You don't store a phone number the same way you store the feeling of your first success.
We take this insight seriously. Rather than building another RAG system with a vector store, we architect a 28-layer neuromorphic cognitive system that maps each layer to a specific brain region with distinct storage, access patterns, and lifecycle rules. Each layer has a dedicated Python tool or file-based store. The result is an agent that remembers like a person — not like a database.
Key contributions of this paper:
Each layer follows three rules: (1) it maps to a named brain region with documented neuroscience analogy; (2) it has a concrete implementation — either a Python tool or a file-based store; (3) it connects to other layers through defined input/output relationships. There are no abstract "conceptual" layers.
| Layer | Name | Brain Region | Function | Implementation |
|---|---|---|---|---|
| L1 | Thalamus | Central relay | Classifies, routes, filters, and scores ALL incoming input. Signal score determines processing depth. | Python tool · ~13ms |
| L2 | Sensory Buffer | Occipital cortex | Raw tool outputs, browser snapshots, API responses. Processed immediately, never stored. | Native (input → discard) |
| L3 | Working Memory | Dorsolateral PFC | Current task state, queue, recent completions. Read FIRST on every session resume. Hot memory — the scratchpad. | Markdown file |
| L4 | Active Focus | Anterior attention | Attention gate controlling current priorities. Filters noise from signal. | Markdown file |
| L5 | Episodic Memory | Hippocampus | Timestamped daily journals capturing decisions, outcomes, and lessons. Emotional tagging via L14. | Daily log files |
| L6 | Sleep Buffer | Brainstem sleep system | Accumulates pruned context and recycled entries. Flushed to long-term memory during idle periods (60min gate). | Staging directory |
| L7 | Semantic Memory | Parietal lobe | Permanent knowledge: identity, infrastructure, principles, architecture. Contradiction-checked before writes. | Semantic store (13+ files) |
| L8 | Procedural Memory | Motor strip | How-to knowledge: decision frameworks, pipelines, policies, capture rules, context budgeting. | Procedural store (27+ files) |
| L9 | Intel / World Model | Temporal association | External intelligence: AI builders tracked, competitive landscape, industry patterns. | Intel store |
| L10 | Metamemory | Superior parietal | Weight table (W1–W5) controls recall priority. Confidence scoring (1–10) flags uncertain memories. | Weight table + confidence scoring files |
| L11 | Knowledge Graph | Distributed cortex | SQLite entity-relationship graph. 15 entities, 17 relationships. Weighted with confidence scoring. | SQLite database |
| L12 | Prefrontal Cortex | Dorsolateral PFC | Routes memory queries to correct stores. Manages context budget. Handles suppression lists. | Python tool · ~58ms |
| L13 | Reconsolidation | Memory integrity | Detects SINGLE_SOURCE violations and contradictions across all canonical files. | Python tool |
| L14 | Amygdala | Anterior temporal | Emotional scoring, flashbulb memory creation, threat detection, somatic markers. | Python tool · ~17ms |
| L15 | Basal Ganglia | Deep nuclei | Go/NoGo gate for actions. Habit formation, model selection based on reward history. | Python tool · ~17ms |
| L16 | Cerebellum | Posterior inferior | Tracks procedure success/fail rates. Auto-promotes reliable workflows to habits. | Python tool · ~17ms |
| L17 | Hippocampal Replay | Hippocampal circuit | Active rehearsal of recent memories. Finds connections, detects patterns, generates insights. | Python tool |
| L18 | Corpus Callosum | Midline | Multi-agent write tracking, conflict detection between subagents and main session. | Python tool |
| L19 | Language Areas | Broca/Wernicke | Quality scoring, tone detection, anti-pattern checking, format selection for responses. | Python tool |
| L20 | Mirror Neurons | Frontal/parietal | Observe other agents' strategies. Cross-agent learning from successes and failures. | Python tool |
| L21 | Default Mode Network | Medial PFC/PCC | Free association during idle time. Self-reflection, brainstorming, opportunity scanning. | Python tool |
| L22 | Research Neurons | Distributed | Deterministic indexed retrieval — 1,500+ entries, <30ms lookups, $0 cost. See §3. | Python tool · ~26ms |
| L23 NEW | Hypothalamus | Below thalamus | Monitors system "hunger" — revenue pressure, context usage, GPU temp, social engagement, disk space. Generates drive scores (0–10). | Python tool · ~70ms |
| L24 NEW | Anterior Cingulate | Medial frontal | Detects when brain layers disagree. Computes conflict signal (0–10). Blocked `rm -rf` in testing — floors destructive commands at 7/10. | Python tool · ~20ms |
| L25 NEW | VTA (Dopamine) | Midbrain | Tracks expected vs actual outcomes. Positive RPE = learn to repeat. Negative RPE = avoid. Drives model selection and strategy refinement. | Python tool |
| L26 NEW | Insula | Deep lateral | Gut feelings and confidence calibration. Tracks prediction accuracy over time. If consistently wrong about X, lowers confidence on X-related predictions. | Python tool |
| L27 NEW | RAS | Brainstem | Decides processing depth per message: quick ack vs deep analysis. Prevents overthinking simple requests and underthinking complex ones. | Python tool |
| L28 NEW | Entorhinal Cortex | Medial temporal | Tracks episode boundaries, temporal context ("we did X before Y"), and topic transitions. Gives the brain a sense of time. | Python tool |
Following the taxonomy established by Du et al. (2026), we implement the three-phase write-manage-read memory lifecycle.
L14 provides emotional tagging inspired by the amygdala's role in memory consolidation. Each memory item carries a valence (+/−/neutral) and an intensity score (1–5). Items with intensity=5 trigger the flashbulb memory rule: permanently elevated to W5, never demoted, recorded with full contextual detail — mimicking how humans form vivid, permanent memories of emotionally intense events.
This mechanism directly addresses what Li et al. (2026) termed "soul erosion" — the tendency for LLM agents to lose behavioral consistency across sessions. By anchoring identity-defining moments as permanent, undeletable memories, we maintain coherent personality across arbitrarily many sessions.
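The valence-and-intensity tagging and the flashbulb rule can be sketched in a few lines of Python. The field names and tier list below are illustrative, not the production schema:

```python
from dataclasses import dataclass

FLASHBULB_INTENSITY = 5  # intensity ceiling that triggers flashbulb protection
TIERS = ["W1", "W2", "W3", "W4", "W5"]

@dataclass
class MemoryItem:
    text: str
    valence: str = "neutral"   # "+", "-", or "neutral"
    intensity: int = 1         # 1-5 emotional intensity
    weight: str = "W3"         # W1-W5 recall priority
    permanent: bool = False    # flashbulbs can never be demoted

def tag_emotion(item: MemoryItem, valence: str, intensity: int) -> MemoryItem:
    """Apply amygdala-style emotional tagging (L14)."""
    item.valence = valence
    item.intensity = max(1, min(5, intensity))
    if item.intensity == FLASHBULB_INTENSITY:
        # Flashbulb rule: elevate to W5 permanently; demotion is disallowed.
        item.weight = "W5"
        item.permanent = True
    return item

def demote(item: MemoryItem) -> MemoryItem:
    """Demotion is a no-op for flashbulb memories."""
    if item.permanent:
        return item
    idx = TIERS.index(item.weight)
    item.weight = TIERS[max(0, idx - 1)]
    return item
```

The key property is that `demote` checks the `permanent` flag before touching the tier, so identity-defining memories survive every maintenance pass.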
Our original retrieval system (v8) deployed 10 parallel 2.3B-parameter LLM instances running simultaneously on a single 16GB consumer GPU. Each agent was assigned a subset of the memory tree and independently searched its chunk. While effective (100% recall on 6/6 test suite), this approach had latency of 7–11 seconds per query and required GPU scheduling.
In v9, we replaced all 10 agents with a single deterministic index. Every canonical memory file is parsed line-by-line into a searchable index at startup. Lookups return exact matches in <30ms with zero LLM calls. This is a roughly 300× latency improvement at identical accuracy.
Research Neurons (L22)
├── route(query) → classify query → find best file/section
├── context(query) → return relevant KEY:value pairs
├── budget(n) → return top-n entries by weight
└── stats() → 47 files, 1,500+ entries, 500+ keyed, ~120KB
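A minimal sketch of the startup parse behind these routes, assuming canonical files use simple `KEY: value` lines (an illustrative format, not the exact production one):

```python
import re

def build_index(files: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Parse canonical memory files line-by-line into a key -> (file, value) map.
    Lines of the form KEY: value become keyed entries; lookup is exact-match."""
    index = {}
    for fname, text in files.items():
        for line in text.splitlines():
            m = re.match(r"^([A-Za-z0-9_.-]+):\s*(.+)$", line.strip())
            if m:
                index[m.group(1).lower()] = (fname, m.group(2))
    return index

def lookup(index: dict, key: str):
    """Deterministic keyed lookup: O(1) dict access, no LLM calls, no embeddings."""
    return index.get(key.lower())
```

Because the index is a plain dictionary rebuilt at startup, there is nothing to re-embed when files change; the next restart picks up edits for free.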
The index operates through the four routes shown above.
| Property | Research Neurons v9 (current) | 10-Agent Retriever v8 (previous) | Traditional RAG |
|---|---|---|---|
| Search method | Deterministic key/text index | LLM reading comprehension (10×) | Cosine similarity |
| Latency | <30ms | 7–11s | 0.1–2s |
| LLM calls | 0 | 10 (parallel) | 0–1 |
| GPU required | No | Yes (16GB) | Varies |
| Accuracy | 100% (keyed entries) | 100% (6/6 test suite) | ~70–85% |
| Context overhead | ~200 tokens | ~23 tokens | 500–5000 tokens |
| Embedding updates | None | None | Re-embed on changes |
| Cost per query | $0 | $0 (local) | Varies |
The tradeoff is clear: deterministic indexing wins on every metric except abstract or fuzzy queries ("what was that thing from last week about...") where the old 10-agent LLM approach excelled. For those cases, the system falls back to a local retriever engine: a 2.3B-parameter model on the smaller GPU (GPU0) that reads the relevant files and returns compressed answers in 7–11s at $0 cost with a 2,200:1 compression ratio.
The brain doesn't just remember — it actively forgets. Without structured forgetting, context windows overflow in minutes during active sessions. We implement a three-stage cascading pipeline, each stage triggering at precisely calibrated thresholds against a 128k token context window.
Stage 1 handles ~95% of all context management. Budget enforcement runs regardless of TTL, keeping context bounded even during rapid tool bursts (verified: 30+ tool calls in a single session, context stable at 80%).
Context recycling ensures that pruned content isn't lost — it's staged for the sleep flush. This mimics the hippocampal replay system (L17) where memories are rehearsed before long-term consolidation.
Sleep flush is the bridge from short-term to long-term memory: daily logs are flushed to disk before pruning trims them. A companion script (memory-compact.sh) performs deterministic daily log compaction: strip noise → deduplicate → collapse blanks → trim to ≤15KB. In testing, a 29KB daily log compacted to ~4.4KB (82% reduction) with zero information loss on recall testing.
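The compaction pipeline (strip noise → deduplicate → collapse blanks → trim) can be approximated in Python. The noise pattern and the trim-from-the-top policy here are assumptions, since memory-compact.sh's exact rules are not shown:

```python
import re

MAX_BYTES = 15 * 1024  # compaction target: <=15KB per daily log

# Assumed noise pattern: debug/trace lines and auto-generated comments.
NOISE = re.compile(r"^(DEBUG|TRACE)\b|^\s*#\s*auto-generated", re.IGNORECASE)

def compact_log(text: str) -> str:
    """Deterministic daily-log compaction: strip noise, dedupe, collapse, trim."""
    seen, out = set(), []
    for line in text.splitlines():
        if NOISE.search(line):            # 1. strip noise lines
            continue
        key = line.strip()
        if key and key in seen:           # 2. drop exact duplicate lines
            continue
        seen.add(key)
        out.append(line.rstrip())
    # 3. collapse runs of blank lines down to a single blank line
    collapsed = re.sub(r"\n{3,}", "\n\n", "\n".join(out))
    # 4. trim oldest lines from the top until under the byte budget
    while len(collapsed.encode()) > MAX_BYTES:
        collapsed = collapsed.split("\n", 1)[-1]
    return collapsed
```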
The three stages cascade, each firing at its own calibrated threshold.
In production stress testing: 30+ tool calls, parallel search queries, 5 cross-domain recall tests — context peaked at 102k/128k without emergency compaction ever firing.
For large tasks (>20 tool calls or >5 subtasks), a task buffering protocol segments work into manageable chunks: ≤20k tokens of tool output per segment, checkpoint every 8–10 tool calls to task_pad.md, session_status check between segments, sub-agent spawn for tasks exceeding 6 segments (max 8 concurrent).
The original 22-layer architecture covered memory, retrieval, and executive function. The March 2026 expansion adds six systems that give the brain drives, feelings, rewards, and temporal awareness — moving from a memory system to a cognitive system.
The biological hypothalamus regulates hunger, thirst, temperature, and sleep. Our L23 monitors analogous system "drives":
Each drive generates a score (0–10). High scores trigger appropriate responses: context pruning at high context usage, idle-mode at high GPU temp, engagement seeking at low social activity. Latency: ~70ms avg.
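A hedged sketch of the drive computation: the floor/ceiling thresholds below are hypothetical, chosen only to illustrate the 0–10 scaling and the trigger check.

```python
def drive_scores(metrics: dict) -> dict:
    """Map raw homeostatic metrics to 0-10 drive scores (illustrative thresholds)."""
    def scale(value: float, lo: float, hi: float) -> float:
        # Linear 0-10 score between a comfortable floor and a critical ceiling.
        return round(max(0.0, min(10.0, 10 * (value - lo) / (hi - lo))), 1)
    return {
        "context_pressure": scale(metrics["context_used_pct"], 50, 95),
        "thermal": scale(metrics["gpu_temp_c"], 60, 90),
        "disk": scale(metrics["disk_used_pct"], 70, 98),
        "social": scale(100 - metrics["engagement_pct"], 0, 100),
    }

def triggered(drives: dict, threshold: float = 7.0) -> list:
    """Drives above threshold trigger responses (pruning, idle-mode, engagement)."""
    return [name for name, score in drives.items() if score >= threshold]
```

With 90% context usage and a cool GPU, only the context drive crosses the threshold, so the appropriate response (pruning) fires without touching the others.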
The ACC detects when brain layers disagree — when the amygdala says "dangerous" but the basal ganglia say "go." Our L24 computes a conflict signal (0–10) across all active layers. In testing, it blocked an `rm -rf` command after detecting a conflict score of 7/10 between the action gate (L15 said "go") and the safety layer (AGENTS.md said "ask first"). Any destructive command scoring 7 or higher is automatically halted.
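One way to implement the conflict signal and the destructive-command floor. The vote encoding (0 = block, 10 = go) and the pattern list are illustrative assumptions:

```python
DESTRUCTIVE = ("rm -rf", "drop table", "mkfs", "dd if=")

def conflict_signal(layer_votes: dict, command: str = "") -> float:
    """ACC-style conflict score (0-10): spread between the most permissive and
    most restrictive layer vote, floored at 7 for destructive commands."""
    votes = list(layer_votes.values())          # each vote: 0 (block) .. 10 (go)
    spread = float(max(votes) - min(votes)) if votes else 0.0
    score = min(10.0, spread)
    if any(pat in command for pat in DESTRUCTIVE):
        score = max(score, 7.0)                 # destructive-command floor
    return score

def halt(score: float, command: str) -> bool:
    """Destructive commands at conflict >= 7 are automatically halted."""
    return score >= 7.0 and any(pat in command for pat in DESTRUCTIVE)
```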
The ventral tegmental area produces dopamine in response to unexpected rewards (positive RPE) and withholds it when expected rewards fail to materialize (negative RPE). Our L25 tracks expected vs actual outcomes for model routing, tool selection, and strategy choices. Positive RPE → learn to repeat. Negative RPE → avoid. This drives model selection refinement: if model A consistently outperforms model B on coding tasks, VTA gradually shifts the routing weights.
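The RPE update is the classic delta rule. The learning rate and 0.5 prior below are illustrative defaults, not the production values:

```python
ALPHA = 0.2  # learning rate (assumed)

def rpe_update(expected: dict, key: str, actual: float) -> float:
    """Reward-prediction-error update: positive RPE raises the expectation
    (learn to repeat), negative RPE lowers it (learn to avoid)."""
    exp = expected.get(key, 0.5)
    delta = actual - exp                 # the RPE itself
    expected[key] = exp + ALPHA * delta
    return delta

def route(expected: dict, candidates: list) -> str:
    """Pick the candidate (model, tool, or strategy) with highest expected reward."""
    return max(candidates, key=lambda k: expected.get(k, 0.5))
```

After repeated outcomes, routing weights drift toward the consistently better option, which is the "model A outperforms model B on coding tasks" behavior described above.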
The insula provides interoception — "gut feelings" about internal states. Our L26 tracks prediction accuracy over time. If the system keeps being wrong about a specific domain, confidence on that domain's predictions automatically decreases. This prevents overconfident responses in areas where the system has historically underperformed.
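A minimal sketch of per-domain confidence calibration using a Laplace-smoothed hit rate (the smoothing choice is an assumption; the production scoring may differ):

```python
from collections import defaultdict

class Insula:
    """Tracks per-domain prediction accuracy; confidence follows the hit rate."""
    def __init__(self):
        self.hits = defaultdict(int)
        self.total = defaultdict(int)

    def record(self, domain: str, correct: bool) -> None:
        self.total[domain] += 1
        self.hits[domain] += int(correct)

    def confidence(self, domain: str, prior: float = 0.5) -> float:
        """Laplace-smoothed hit rate: unseen domains sit at the prior."""
        n = self.total[domain]
        return (self.hits[domain] + prior) / (n + 1)
```

A domain with a long record of misses converges toward low confidence, so the system hedges exactly where it has historically underperformed.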
The RAS controls wakefulness and attention depth. Our L27 decides processing depth per message: is this a quick "ok" or does it need deep analysis with multiple tool calls? Simple greetings get quick acks. Complex technical questions get full multi-layer processing. This prevents overthinking simple requests (wasting context) and underthinking complex ones (missing details).
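A toy version of the depth gate. The trigger words and thresholds are invented for illustration; the real L27 presumably uses richer signals:

```python
def processing_depth(message: str, has_code: bool = False) -> str:
    """RAS-style depth gate (illustrative heuristic): route trivial messages to
    a quick ack, mixed signals to normal handling, complex ones to deep analysis."""
    text = message.strip().lower()
    if text in {"ok", "thanks", "hi", "lol"} or len(text) < 4:
        return "quick_ack"
    signals = sum([
        has_code,                                  # code attached -> likely complex
        len(text.split()) > 40,                    # long message -> likely complex
        "?" in text and any(w in text for w in ("why", "how", "debug", "design")),
    ])
    return "deep" if signals >= 2 else ("normal" if signals == 1 else "light")
```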
The entorhinal cortex provides the brain's sense of time and spatial context — "grid cells" that map where and when. Our L28 tracks episode boundaries, temporal ordering ("we did X before Y"), and topic transitions. This enables questions like "what were we working on before the Twitter engagement stuff?" to be answered accurately, even across session compactions.
We evaluate on LongMemEval (Wu et al., ICLR 2025), a benchmark of 500 questions testing five core long-term memory abilities: information extraction, multi-session reasoning, knowledge updates, temporal reasoning, and abstention. We run the oracle variant with 48 proportionally-sampled questions using a free cloud model with a 1M context window.
| Category | Score | n |
|---|---|---|
| Temporal Reasoning | 100.0% | 13 |
| Knowledge Updates | 100.0% | 7 |
| Single-Session (Assistant) | 100.0% | 5 |
| Single-Session (Preference) | 100.0% | 3 |
| Single-Session (User) | 100.0% | 7 |
| Multi-Session Reasoning | 100.0% | 13 |
| Overall | 100.0% | 48 |
The full variant embeds each question in ~130k tokens of conversational noise. We conducted a systematic ablation across six configurations:
| Ver. | Embedding | Retrieval | Answering | Retrieval Hit | Score |
|---|---|---|---|---|---|
| v1 | qwen3-4b | Embed only, top-8 | gpt-5-mini | ~60% | 64.6% |
| v2 | qwen3-4b | Hybrid, top-15 | gpt-5-mini | 83.3% | 66.7% |
| v3 | qwen3-4b | Hybrid, top-20 | Sonnet 4.6 | 87.5% | 62.5% |
| v4 | Snowflake Arctic 2 | Hybrid, top-25 | Sonnet 4.6 | 100% | 70.8% |
| v5 | Snowflake Arctic 2 | Hybrid, top-20 | Adaptive | 97.9% | 83.3% |
| v6 | Snowflake Arctic 2 | Hybrid, top-25 | Verified adaptive | 100% | 79.2% |
Key findings: (1) Embedding model quality matters enormously — Snowflake Arctic 2 improved retrieval hit rate from 83% to 100% despite fewer dimensions. (2) Better retrieval ≠ better answers — v4 had 100% retrieval but only 70.8% accuracy because 25 retrieved sessions overwhelmed Sonnet 4.6. (3) Model routing is critical — gpt-5-mini outperforms Sonnet 4.6 on 9/48 questions. (4) The remaining gap is reasoning, not retrieval.
20 tasks across 5 categories test the brain's ability to correctly operate all layers, using only the local 9.7B model (openclaw-qwen35, Q4_K_M).
15 live tasks against actual production memory — not synthetic benchmarks, but real questions about real things the system has done:
| Category | Score | n |
|---|---|---|
| Long-term Memory | 100% | 4 |
| Cross-reference | 100% | 3 |
| Operational | 100% | 2 |
| Historical | 67% | 3 |
| Infrastructure | 100% | 3 |
| Overall | 93% (14/15) | 15 |
The single miss was a historical question where the 9.7B model fixated on a wrong answer despite the correct data being in the top-ranked retrieval result. This is a model-intelligence limitation, not a retrieval architecture issue.
Performance: 15 tasks · 4,751 tokens · 83.7s total · 5.6s avg · openclaw-qwen35 (9.7B local, $0)
| System | LongMemEval | Features | Layers | Production | Cost |
|---|---|---|---|---|---|
| Norax Brain v9 (ours) | 100% / 83.3% | 22/24 | 28 | ✓ | $0 |
| Hindsight (scaled, 2025) | — / 91.4% | 7/24 | 4 | ✗ | N/A |
| Hindsight (20B open, 2025) | — / 83.6% | 7/24 | 4 | ✗ | N/A |
| BMAM (2026) | 78.5%* | 5/24 | 4 | ✗ | N/A |
| AgeMem (2026) | N/R | 5/24 | 2 | ✗ | $$$$ |
| Commercial (ChatGPT, Coze) | 30–70% | — | — | ✓ | $$$ |
*BMAM score on LoCoMo, not LongMemEval directly.
The entire system runs on a single Linux workstation with dual consumer GPUs.
The system employs intelligent model routing, matching task complexity to the appropriate model:
| Task Type | Model Class | Notes |
|---|---|---|
| Quick tasks | Lightweight cloud model | Free, fast, good for simple ops |
| Heavy debugging | Thorough cloud model | Persistent, multi-step reasoning |
| Orchestration | Large context model | Large context handling |
| Code review | Visual/thorough model | Best for finding subtle bugs |
| Default | Smart-routed premium | Primary model, auto-selected |
The system organizes memory and tools into a clear directory structure:
memory/
├── scratchpad.md # L3: Working memory (hot state)
├── active-focus.md # L4: Current priorities
├── rolling_summary.md # Rolling context summary
├── YYYY-MM-DD.md # L5: Daily episodic logs
├── semantic/ # L7: Long-lived knowledge (13+ files)
├── procedural/ # L8: How-to knowledge (27+ files)
├── intel/ # L9: World model
├── cache/ # Temporary lookup cache
├── metrics/ # Benchmark results
└── knowledge_graph.db # L11: SQLite entity-relationship graph
brain_tools/
├── 12 Python tools # Tools for the active brain layers (L1, L12–L28)
├── retriever engine # Semantic search fallback
├── memory compaction # Daily log compaction
└── benchmark suite # Real-world recall benchmarks
| Weight | Meaning | Recall Behavior | Lifecycle |
|---|---|---|---|
| W5 | Critical / Identity | Always available | Permanent (flashbulbs never demote) |
| W4 | Active projects | Priority recall | Demote after 30d unreferenced |
| W3 | Technical knowledge | On-demand | Demote after 30d unreferenced |
| W2 | Historical context | Deep search only | Demote after 30d unreferenced |
| W1 | Archive | Rarely recalled | Pruned after 30 days |
Promotion: 3+ references within 7 days → promote one tier. Demotion: 30 days unreferenced → demote one tier. Protection: Flashbulb memories (emotional intensity=5) are permanently W5.
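These rules compress into a single maintenance function. W1 pruning is omitted, and the parameter names are illustrative:

```python
from datetime import datetime, timedelta

TIERS = ["W1", "W2", "W3", "W4", "W5"]

def lifecycle_step(weight: str, refs_last_7d: int, last_ref: datetime,
                   flashbulb: bool, now: datetime) -> str:
    """One maintenance pass over a memory's weight tier:
    3+ references in 7 days promotes one tier; 30 days unreferenced demotes
    one tier; flashbulb memories are pinned at W5."""
    if flashbulb:
        return "W5"                         # protection rule: never demote
    idx = TIERS.index(weight)
    if refs_last_7d >= 3:
        idx = min(idx + 1, len(TIERS) - 1)  # promotion
    elif now - last_ref > timedelta(days=30):
        idx = max(idx - 1, 0)               # demotion
    return TIERS[idx]
```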
RL-trained memory operations. AgeMem trains memory ops via reinforcement learning. Our W1–W5 rules are effective but hand-crafted. Learning these policies from interaction data is the most promising improvement direction.
Full haystack gap. Our 83.3% matches Hindsight's open-source 20B (83.6%) but trails their scaled backbone (91.4%) by ~8 points. Remaining failures are multi-session aggregate counting — reasoning errors, not retrieval errors.
Small model limitations. The local 9.7B model (openclaw-qwen35) can operate all 28 layers but occasionally fixates on wrong answers despite correct retrieval results. The 93% real-world recall (1 miss in 15) reflects this model-intelligence ceiling.
Multi-agent memory sharing. Sub-agents share workspace files but lack true memory fusion. BMAM's shared memory subsystems point the way forward.
Causal retrieval. Current retrieval is similarity-based and key-based. Retrieving by causal relevance ("what caused this?") remains open.
Auto-entity extraction. Knowledge graph entities are manually added. Automatic extraction from conversations would enable organic growth.
L23–L28 maturity. The six new systems are functional but early. Their impact on overall system performance needs extended evaluation. VTA reward learning in particular needs more data before routing weights become reliable.
Memory surveys. Du et al. (2026) provide the most comprehensive taxonomy, identifying write-manage-read as the core lifecycle. We adopt their framework directly.
Cognitive mapping. Shen et al. (2024, SALM) establish the first systematic mapping from human memory systems to AI agent memory. We extend this from their 4-system model to 28 layers with concrete implementations for each.
Formal benchmarks. Ramakrishnan et al. (2025, Hindsight) achieve 91.4% with Retain/Recall/Reflect and four logical networks. Wu et al. (2024, LongMemEval) establish the benchmark testing five core memory abilities. We achieve 100% on the oracle variant.
Soul erosion. Li et al. (2026, BMAM) identify agents losing behavioral consistency across sessions. Our SOUL.md, flashbulb mechanism (L14), and weight protection (W5 = permanent) address this directly.
Learned memory management. Yu et al. (2026, AgeMem) train via 3-stage progressive RL with step-wise GRPO — the most promising direction for replacing heuristic W1–W5 rules with learned policies.
We have presented a 28-layer neuromorphic cognitive architecture where every layer maps to a named human brain region with a concrete implementation. The system achieves:
Key contributions:
The system is live at noraxdev.org/brain.html with an interactive neural brain visualization, and runs 24/7 on the OpenClaw platform.