Deconstructing Huginn-0125

An interactive exploration of the recurrent-depth architecture for large language models.

A New Architecture for Reasoning

Traditional Transformers scale by stacking more layers with unique parameters, which fixes their computational depth at design time. Huginn introduces a "recurrent-depth" approach: a single, shared block of layers that can be applied iteratively, as many times as needed. This decouples the model's parameter count from its reasoning depth.
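To make this concrete, here is a minimal PyTorch sketch of a recurrent-depth forward pass. The `prelude` / `core` / `coda` module names and the random initial state are illustrative assumptions rather than the actual Huginn-0125 code; the point is that the recurrence count `r` is an argument chosen at call time, not a property of the parameter count.

```python
import torch
import torch.nn as nn

class RecurrentDepthLM(nn.Module):
    # Minimal sketch: a fixed-depth prelude, one shared core block applied
    # r times, and a fixed-depth coda that decodes the final latent state.
    def __init__(self, prelude: nn.Module, core: nn.Module, coda: nn.Module):
        super().__init__()
        self.prelude = prelude  # embeds tokens into a latent representation e
        self.core = core        # single shared block, called as core(s, e)
        self.coda = coda        # maps the final state s to output logits

    def forward(self, tokens: torch.Tensor, r: int) -> torch.Tensor:
        e = self.prelude(tokens)     # "read" phase (runs once)
        s = torch.randn_like(e)      # random initial latent state
        for _ in range(r):           # "think" phase: depth chosen at call time
            s = self.core(s, e)      # the same parameters are reused every step
        return self.coda(s)          # "write" phase (runs once)
```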

Anatomy of the Recurrent Block

The core of Huginn's recurrence is the `SandwichBlock`. It is a variant of a standard Transformer block engineered with four normalization layers, one before and one after each of the attention and MLP sub-blocks, so that the block remains stable when applied repeatedly. This design keeps the model's internal state from becoming chaotic during deep "thinking".

Input (x)
→ RMSNorm (norm_1) → Causal Self-Attention → + residual → RMSNorm (norm_2)
→ RMSNorm (norm_3) → Gated MLP → + residual → RMSNorm (norm_4)
→ Output (x')
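Below is a minimal PyTorch sketch of this sandwich layout (it needs PyTorch ≥ 2.4 for `nn.RMSNorm`). The attention and MLP sub-modules are passed in as placeholders, and the exact residual/norm arrangement reflects my reading of the sandwich format rather than the released implementation.

```python
import torch
import torch.nn as nn

class SandwichBlock(nn.Module):
    # Sketch of the sandwich layout: each sub-module sits between a norm
    # applied to its input and a norm applied after its residual sum.
    def __init__(self, d_model: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.norm_1 = nn.RMSNorm(d_model)  # pre-attention norm
        self.norm_2 = nn.RMSNorm(d_model)  # post-attention norm
        self.norm_3 = nn.RMSNorm(d_model)  # pre-MLP norm
        self.norm_4 = nn.RMSNorm(d_model)  # post-MLP norm
        self.attn = attn                   # causal self-attention sub-module
        self.mlp = mlp                     # gated MLP sub-module

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm_2(self.attn(self.norm_1(x)) + x)  # attention sandwich
        x = self.norm_4(self.mlp(self.norm_3(x)) + x)   # MLP sandwich
        return x
```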

Proposed MoRE Architecture Variants (Synthesis I & II)

Three primary designs for integrating recurrence are investigated, categorized by where the recurrence is applied: within individual experts (Synthesis I) or across the entire MoE layer (Synthesis II).

Synthesis I: Recurrence at the Expert Level

This family of architectures integrates recurrence at the most granular level: within the experts of an MoE layer. This approach conceptualizes the experts not as simple non-linear transformations but as self-contained reasoning modules.

Option A: Independent Recurrent Experts

Architecture & Layer Order: In this design, each of the N experts in an MoE layer is a complete, independent, Huginn-style recurrent block. When the router selects an expert, the token is processed by that expert for a fixed number of internal recurrence steps (r). This embodies a "deep specialization" model, where each expert can learn a unique, complex internal algorithm.

Token (x) → Router → [ Expert 1 (R₁) 🔁 ×r | Expert 2 (R₂) 🔁 ×r | … | Expert N (Rₙ) 🔁 ×r ]
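The sketch below illustrates Option A under simplifying assumptions: top-1 routing, a flattened token dimension, and a user-supplied `make_block` factory for the recurrent block. It is not a production MoE kernel; it only shows the control flow in which each routed token loops `r` times through its own expert's block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IndependentRecurrentExperts(nn.Module):
    # Option A sketch: every expert owns a full recurrent block and applies
    # it r times to the tokens routed to it (top-1 routing for simplicity).
    def __init__(self, d_model: int, n_experts: int, make_block, r: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([make_block() for _ in range(n_experts)])
        self.r = r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -- flatten batch and sequence dims first
        gates = F.softmax(self.router(x), dim=-1)
        choice = gates.argmax(dim=-1)                  # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                h = x[mask]
                for _ in range(self.r):                # expert-internal recurrence
                    h = expert(h)
                out[mask] = gates[mask, i : i + 1] * h # scale by gate weight
        return out
```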

Option B: Shared Recurrent Block with Projections

Architecture & Layer Order: To address the cost of Option A, this design uses a single recurrent block, R, whose parameters are shared across all N experts. Each expert, E_i, is composed of a unique pair of input and output linear projection layers that "wrap" the shared block. This represents a "shallow specialization" model.

Token (x) → Router → [ Proj_in_1 | Proj_in_2 | … | Proj_in_N ] → Shared Recurrent Block (R) 🔁 ×r → [ Proj_out_1 | Proj_out_2 | … | Proj_out_N ]
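Again a hedged sketch, with top-1 routing and square projections assumed for brevity. The key difference from Option A is the parameter sharing: only the `Proj_in_i` / `Proj_out_i` pairs are expert-specific, while the recurrent block `R` is reused by every expert.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedRecurrentExperts(nn.Module):
    # Option B sketch: one recurrent block R is shared by all experts; each
    # expert owns only an input projection and an output projection.
    def __init__(self, d_model: int, n_experts: int, shared_block: nn.Module, r: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.proj_in = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.proj_out = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.shared_block = shared_block   # reused by every expert
        self.r = r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = F.softmax(self.router(x), dim=-1)
        choice = gates.argmax(dim=-1)                    # top-1 routing
        out = torch.zeros_like(x)
        for i in range(len(self.proj_in)):
            mask = choice == i
            if mask.any():
                h = self.proj_in[i](x[mask])             # expert-specific "read"
                for _ in range(self.r):
                    h = self.shared_block(h)             # shared recurrence
                out[mask] = gates[mask, i : i + 1] * self.proj_out[i](h)  # expert-specific "write"
        return out
```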

Synthesis II: Recurrence at the MoE Layer Level

This architecture elevates the scope of recurrence, applying it to the entire Mixture of Experts layer, transforming it into a dynamic, iterative processing unit.

Option C: The MoE Layer as the Recurrent Unit

Architecture & Layer Order: This is the most direct fusion of the two paradigms. The *entire MoE layer*—including its gating network and the full set of N experts—is treated as the single recurrent block. The output of the layer is fed back as input for the next iteration, typically with a residual connection.

Prelude → [ MoE Layer: Router → Expert 1 | Expert 2 | … ] 🔁 ×r → Coda
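A sketch of Option C, assuming `moe_layer` is any MoE module with a gating network and N experts (such as the ones sketched above). Because the whole layer sits inside the loop, tokens are re-routed on every iteration and may visit different experts at different depths.

```python
import torch
import torch.nn as nn

class RecurrentMoELayer(nn.Module):
    # Option C sketch: the whole MoE layer (router + experts) is the
    # recurrent unit; its output is fed back as its input r times,
    # with a residual connection at every iteration.
    def __init__(self, moe_layer: nn.Module, r: int):
        super().__init__()
        self.moe_layer = moe_layer  # any MoE layer: gating network + N experts
        self.r = r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = x
        for _ in range(self.r):
            s = s + self.moe_layer(s)  # tokens are re-routed every iteration
        return s
```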

For all of these variants I will use 8 routed experts plus one shared expert, following the shared-expert MoE design that has been validated at scale in DeepSeek's DeepSeek-V3 and Moonshot AI's Kimi K2.
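For reference, a minimal sketch of that expert layout: eight routed experts selected top-k per token, plus one always-active shared expert whose output is added unconditionally. The `top_k` value and the `make_expert` factory are placeholders, not settled design choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertMoE(nn.Module):
    # Sketch of the planned layout: 8 routed experts plus one always-active
    # shared expert whose output is added to every token.
    def __init__(self, d_model: int, make_expert, n_routed: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_routed)
        self.routed = nn.ModuleList([make_expert() for _ in range(n_routed)])
        self.shared = make_expert()          # bypasses the router entirely
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)
        weights, idx = gates.topk(self.top_k, dim=-1)    # top-k routed experts
        routed_out = torch.zeros_like(x)
        for k in range(self.top_k):
            for i, expert in enumerate(self.routed):
                mask = idx[:, k] == i
                if mask.any():
                    routed_out[mask] += weights[mask, k : k + 1] * expert(x[mask])
        return self.shared(x) + routed_out   # shared path is always added
```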