A New Architecture for Reasoning
Traditional Transformers scale by stacking more unique layers, which fixes their computational depth at design time. Huginn introduces a "recurrent-depth" approach: a single, shared block of layers is applied iteratively, so effective depth can be raised without adding parameters. This decouples model size from reasoning capacity.
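As a rough illustration of the idea, here is a minimal weight-tied loop; the block contents, class name, and dimensions are placeholders of my own, not Huginn's actual implementation:

```python
import torch
import torch.nn as nn

class RecurrentDepthSketch(nn.Module):
    """Weight-tied recurrence: one shared block, applied r times."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        # A single shared block; parameter count does not grow with depth.
        self.core_block = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor, r: int) -> torch.Tensor:
        # Effective depth is r, chosen per forward pass.
        for _ in range(r):
            x = x + self.core_block(x)  # residual keeps repeated application stable
        return x

model = RecurrentDepthSketch()
x = torch.randn(2, 16, 512)   # (batch, sequence, d_model)
shallow = model(x, r=4)       # "thinks" for 4 steps
deep = model(x, r=32)         # same weights, 8x the effective depth
```

The same parameters are reused at every step, so choosing a larger `r` buys more compute (more "thinking") without a larger model.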
Anatomy of the Recurrent Block
The core of Huginn's recurrence is the `SandwichBlock`. It's a sophisticated variant of a standard Transformer block, specifically engineered with four normalization layers to ensure stability when applied repeatedly. This design prevents the model's internal state from becoming chaotic during deep "thinking".
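A minimal PyTorch sketch of that sandwich layout, with two norms around attention and two around the MLP; the use of `torch.nn.RMSNorm` (PyTorch ≥ 2.4), standard multi-head attention, and the chosen sizes are illustrative assumptions rather than Huginn's exact block:

```python
import torch
import torch.nn as nn

class SandwichBlockSketch(nn.Module):
    """Sandwich-style block: four norms, one attention, one MLP.
    Normalizing both before and after each sub-layer keeps the hidden
    state bounded when the block is applied many times in a row."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm_1 = nn.RMSNorm(d_model)   # pre-attention norm
        self.norm_2 = nn.RMSNorm(d_model)   # post-attention norm
        self.norm_3 = nn.RMSNorm(d_model)   # pre-MLP norm
        self.norm_4 = nn.RMSNorm(d_model)   # post-MLP norm
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm_1(x)
        attn_out, _ = self.attn(h, h, h)
        x = self.norm_2(attn_out + x)                  # norm applied to the residual sum
        x = self.norm_4(self.mlp(self.norm_3(x)) + x)  # same pattern around the MLP
        return x

block = SandwichBlockSketch()
state = torch.randn(2, 16, 512)
for _ in range(16):   # repeated "thinking" steps stay well-scaled
    state = block(state)
```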
Proposed MoRE Architecture Variants (Synthesis I & II)
I investigate three primary designs for integrating recurrence, categorized by the level at which recurrence is applied: the individual expert level (Synthesis I) or the entire MoE layer level (Synthesis II).
Synthesis I: Recurrence at the Expert Level
This family of architectures integrates recurrence at the most granular level: within the experts of an MoE layer. This approach conceptualizes the experts not as simple non-linear transformations but as self-contained reasoning modules.
Option A: Independent Recurrent Experts
Architecture & Layer Order: In this design, each of the N experts in an MoE layer is a complete, independent, Huginn-style recurrent block. When the router selects an expert, the token is processed by that expert for a fixed number of internal recurrence steps (r). This embodies a "deep specialization" model, where each expert can learn a unique, complex internal algorithm.
Option B: Shared Recurrent Block with Projections
Architecture & Layer Order: To address the cost of Option A, this design uses a single recurrent block, R, whose parameters are shared across all N experts. Each expert, E_i, is composed of a unique pair of input and output linear projection layers that "wrap" the shared block. This represents a "shallow specialization" model.
Synthesis II: Recurrence at the MoE Layer Level
This architecture elevates the scope of recurrence, applying it to the entire Mixture of Experts layer, transforming it into a dynamic, iterative processing unit.
Option C: The MoE Layer as the Recurrent Unit
Architecture & Layer Order: This is the most direct fusion of the two paradigms. The *entire MoE layer*—including its gating network and the full set of N experts—is treated as the single recurrent block. The output of the layer is fed back as input for the next iteration, typically with a residual connection.
MoE Layer Configuration
For all of these variants I will be using 8 routed experts plus one shared expert, as this shared-expert MoE design has been validated at scale in DeepSeek's DeepSeek-V3 and Moonshot AI's Kimi K2 implementations.
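For reference, a shared-expert MoE layer along these lines could look like the sketch below; the top-2 routing and softmax gating are my own simplifying assumptions, not a fixed design choice:

```python
import torch
import torch.nn as nn

class SharedExpertMoE(nn.Module):
    """Sketch of the configuration used for all variants: 8 routed experts
    plus 1 shared expert that every token always passes through."""
    def __init__(self, d_model: int = 512, n_routed: int = 8, top_k: int = 2):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.router = nn.Linear(d_model, n_routed)
        self.routed_experts = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.shared_expert = make_expert()   # always active, never routed
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])
        gate = self.router(tokens).softmax(dim=-1)
        weights, indices = gate.topk(self.top_k, dim=-1)   # pick top-k routed experts
        routed = torch.zeros_like(tokens)
        for i, expert in enumerate(self.routed_experts):
            for slot in range(self.top_k):
                mask = indices[:, slot] == i
                if mask.any():
                    routed[mask] += weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        # The shared expert processes every token unconditionally.
        return (self.shared_expert(tokens) + routed).reshape_as(x)
```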