
OpenMythos: An Open-Source Theoretical Reconstruction of the Mythos Architecture


The OpenMythos project has been officially open-sourced, presenting a theoretical reconstruction of the Claude Mythos architecture. The implementation adopts a Recurrent-Depth Transformer (RDT) design, offering a novel approach to deep reasoning tasks.

What is OpenMythos?

OpenMythos is an open-source, theoretical model implementation designed to explore the Recurrent-Depth Transformer (RDT) architecture. Built upon publicly available research literature, it implements a three-stage architecture: a Prelude, a Recurrent Block, and a Coda.

Core Architecture Design

OpenMythos employs an innovative three-stage structural design:

Prelude

  • Consists of standard Transformer layers.

  • Runs once per forward pass.

  • Encodes the input and produces the initial hidden state.

Recurrent Block

  • Updates the hidden state during each cycle.

  • Injects the original encoded input at every cycle.

Coda

  • Consists of standard Transformer layers.

  • Runs once per forward pass.

  • Converts the final hidden state into output logits.

Recurrent Update Rule
In each recurrent step $t$, the hidden state is updated according to the following rule:

$$h_t = \mathcal{B}(h_{t-1} + e)$$

Where:

  • $\mathcal{B}$ is the shared recurrent Transformer block, which applies attention and MLP.

  • $e$ is the input encoding produced by the Prelude, re-injected at every step.

This design ensures the original input signal remains active throughout the recurrent depth, preventing the model from drifting during iterations.
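
To make the data flow concrete, below is a minimal sketch of the three-stage forward pass, assuming PyTorch. The module layout and names (prelude, recurrent_block, coda, lm_head) are illustrative stand-ins based on the description above, not the actual OpenMythos API.

import torch
import torch.nn as nn

class RecurrentDepthSketch(nn.Module):
    # Minimal sketch of the Prelude -> Recurrent Block -> Coda data flow.
    # Module names are illustrative, not the actual OpenMythos API.
    def __init__(self, vocab_size, dim, n_heads=8, n_loops=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Stand-ins for stacks of standard Transformer layers.
        self.prelude = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.recurrent_block = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.coda = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.lm_head = nn.Linear(dim, vocab_size)
        self.n_loops = n_loops

    def forward(self, ids, n_loops=None):
        n_loops = n_loops or self.n_loops
        e = self.prelude(self.embed(ids))    # Prelude: runs once, encodes the input
        h = torch.zeros_like(e)              # initial hidden state; e seeds the loop via injection
        for _ in range(n_loops):             # one shared block, reused every cycle
            h = self.recurrent_block(h + e)  # re-inject the original encoding e
        return self.lm_head(self.coda(h))    # Coda: runs once, hidden state -> logits

ids = torch.randint(0, 1000, (2, 16))
logits = RecurrentDepthSketch(1000, 256)(ids, n_loops=8)  # deeper reasoning: more loops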

Key Technical Innovations

  1. LTI Stability Constraints
    Training recurrent models is notoriously unstable, primarily facing two failure modes: Residual Explosion (unbounded growth of the hidden state $h_t$) and Loss Spikes (sudden divergence due to large spectral norms of the injection parameters). OpenMythos employs Linear Time-Invariant (LTI) system theory to guarantee stability by constraining the injection parameters:

    • Use ZOH/Euler discretization with a learned scalar step size $\Delta t$.

    • Enforce negative eigenvalues by parameterizing $A := \mathrm{Diag}(-\exp(A_{\log}))$.
      Under this discretization, the transition matrix $\bar{A} = \exp(\Delta t \, A)$ has spectral radius $\rho(\bar{A}) < 1$, making the recurrent model more robust to hyperparameter choices and allowing stable training even at high learning rates.
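
A minimal sketch of this constraint, assuming PyTorch; the parameter names (A_log, log_dt) and the exact injection form are illustrative assumptions, not the project's actual code.

import torch
import torch.nn as nn

class StableInjection(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # A = Diag(-exp(A_log)): every eigenvalue is strictly negative by construction.
        self.A_log = nn.Parameter(torch.zeros(dim))
        # Learned scalar step size: dt = exp(log_dt) > 0.
        self.log_dt = nn.Parameter(torch.zeros(()))

    def transition(self):
        A = -torch.exp(self.A_log)   # strictly negative diagonal
        dt = torch.exp(self.log_dt)  # positive scalar step
        # ZOH discretization of a diagonal LTI system: A_bar = exp(dt * A).
        # Since A < 0 and dt > 0, every entry of A_bar lies in (0, 1),
        # so the spectral radius rho(A_bar) < 1 and the update is contractive.
        return torch.exp(dt * A)

    def forward(self, h, e):
        A_bar = self.transition()
        # The hidden state decays geometrically while the injected input anchors it,
        # ruling out residual explosion across recurrent iterations.
        return A_bar * h + (1.0 - A_bar) * e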

  2. Adaptive Computation Time (ACT)
    More loops are not always better. Beyond a certain depth, excessive looping degrades predictive performance—a failure mode known as "over-thinking." OpenMythos adopts an Adaptive Computation Time (ACT) mechanism:

    • Learns a scalar halting probability for each position.

    • Dynamically decides when to stop looping.

    • Difficult positions receive more compute; simple tokens stop early.
      This allows the model to learn a halting signal when the answer converges, rendering the model Turing complete under certain assumptions.
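
Below is a minimal sketch of such an ACT-style loop, assuming PyTorch. The halting head and the $1 - \epsilon$ threshold follow the standard ACT recipe; the names and details here are assumptions, not the project's exact implementation.

import torch
import torch.nn as nn

def act_loop(h, e, block, halt_head, max_iters=8, eps=0.01):
    # Sketch of Adaptive Computation Time over the shared recurrent block.
    # `block` is the recurrent Transformer block; `halt_head` is a learned
    # nn.Linear(dim, 1). Both names are illustrative stand-ins.
    B, T, _ = h.shape
    cum_halt = torch.zeros(B, T, device=h.device)                  # accumulated halting prob
    running = torch.ones(B, T, dtype=torch.bool, device=h.device)  # positions still looping
    for _ in range(max_iters):
        h_new = block(h + e)                                  # shared-weight update
        p_halt = torch.sigmoid(halt_head(h_new)).squeeze(-1)  # scalar per position
        h = torch.where(running.unsqueeze(-1), h_new, h)      # freeze halted positions
        cum_halt = cum_halt + p_halt * running                # only running positions accumulate
        running = running & (cum_halt < 1.0 - eps)            # halt once mass is ~1
        if not running.any():                                 # every position has halted
            break
    return h

# Example wiring (illustrative): block = nn.TransformerEncoderLayer(256, 8, batch_first=True)
# and halt_head = nn.Linear(256, 1); difficult positions then loop longer than easy ones.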

  3. Mixture of Experts (MoE) Design
    Each FFN in the recurrent block is replaced by a fine-grained MoE layer:

    • Splits the FFN into many small experts (each 1/m of the normal FFN size).

    • Routes each token to a small number of experts (top-$k$), alongside always-active shared experts, matching the n_experts, n_experts_per_tok, and n_shared_experts fields in the configuration below.
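
A hedged sketch of such a fine-grained MoE FFN, assuming PyTorch. It mirrors the n_experts / n_experts_per_tok / n_shared_experts / expert_dim fields of MythosConfig shown later, but the routing details are an assumption, and compute here is dense (masked) for clarity rather than sparsely dispatched.

import torch
import torch.nn as nn

class FineGrainedMoE(nn.Module):
    def __init__(self, dim, expert_dim, n_experts=8, n_experts_per_tok=2, n_shared_experts=1):
        super().__init__()
        def make_expert():
            # Each expert is a small FFN, a fraction of the usual width.
            return nn.Sequential(nn.Linear(dim, expert_dim), nn.GELU(),
                                 nn.Linear(expert_dim, dim))
        self.experts = nn.ModuleList(make_expert() for _ in range(n_experts))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared_experts))
        self.router = nn.Linear(dim, n_experts)
        self.k = n_experts_per_tok

    def forward(self, x):
        # Top-k routing weights per token.
        weights, idx = torch.topk(self.router(x).softmax(dim=-1), self.k, dim=-1)
        out = sum(s(x) for s in self.shared)            # shared experts see every token
        for slot in range(self.k):
            for e_id, expert in enumerate(self.experts):
                routed = (idx[..., slot] == e_id).unsqueeze(-1)  # tokens picking this expert
                w = weights[..., slot].unsqueeze(-1)
                out = out + routed * w * expert(x)      # dense masked compute, for clarity
        return out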


  4. Continuous Depth Batching
    Since all tokens share the same recurrent block, the model can exit the loop at different depths, processing simple inputs quickly within the same batch while using more iterations for difficult inputs. Theoretical analysis suggests inference throughput can be improved by 2-3x.
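
The sketch below illustrates the idea, assuming PyTorch; the convergence test (hidden-state change below a tolerance) is an illustrative stand-in for whatever learned halting criterion actually decides the exit depth.

import torch

def batched_variable_depth(h, e, block, max_iters=16, tol=1e-3):
    # All sequences share one recurrent block, but each exits at its own depth.
    B = h.shape[0]
    active = torch.ones(B, dtype=torch.bool, device=h.device)
    depths = torch.zeros(B, dtype=torch.long, device=h.device)
    for _ in range(max_iters):
        h_new = block(h + e)
        delta = (h_new - h).flatten(1).norm(dim=1)       # per-sequence update size
        h = torch.where(active.view(B, 1, 1), h_new, h)  # frozen sequences stop changing
        depths = depths + active.long()                  # count iterations actually used
        active = active & (delta > tol)                  # converged sequences drop out
        if not active.any():                             # easy inputs exit early
            break
    return h, depths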

Systematic Generalization Capability

Recurrent models exhibit unique advantages in handling knowledge combinations never seen during training, acquiring this capability through a three-phase process:

  1. Memorization Phase: The model fits the training distribution.

  2. In-Distribution Generalization: The model handles known combinations.

  3. Systematic Generalization: The model handles novel, out-of-distribution combinations.

This is why recurrent models feel qualitatively different from standard models on new problems: capabilities emerge via a phase transition rather than appearing gradually. When trained on 5-hop reasoning chains and tested on 10-hop chains, standard Transformers fail, whereas recurrent models succeed by running more loops at inference time.

Inference-Time Scaling

More loops at inference time improve quality, with gains following a predictable saturating-exponential curve: real but diminishing. This mirrors the inference-time scaling behavior of Chain-of-Thought. At 770M parameters, the recurrent model matches the downstream quality of a 1.3B fixed-depth Transformer trained on the same data, reaching similar quality with roughly half the parameters.
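
One way to write down this saturating behavior (the functional form is a common fit for such curves and an illustrative assumption here, not a result reported by the project):

$$Q(n) \approx Q_\infty - (Q_\infty - Q_0)\, e^{-\lambda n}$$

where $Q(n)$ is downstream quality after $n$ inference loops, $Q_0$ is the single-pass quality, $Q_\infty$ is the saturation ceiling, and $\lambda > 0$ sets how quickly each extra loop closes a constant fraction of the remaining gap.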

Implementation Features

Pre-configured Scales
OpenMythos provides pre-configured scales ranging from 1B to 1T parameters:

| Variant     | Dim   | Experts | Expert Dim | Recurrent Iters | Context | Max Output |
|-------------|-------|---------|------------|-----------------|---------|------------|
| mythos_1b   | 2048  | 64      | 2048       | 16              | 4K      | 4K         |
| mythos_3b   | 3072  | 64      | 4096       | 16              | 4K      | 4K         |
| mythos_10b  | 4096  | 128     | 5632       | 24              | 8K      | 4K         |
| mythos_50b  | 6144  | 256     | 9728       | 32              | 8K      | 4K         |
| mythos_100b | 8192  | 256     | 13568      | 32              | 1M      | 128K       |
| mythos_500b | 12288 | 512     | 23040      | 48              | 1M      | 128K       |
| mythos_1t   | 16384 | 512     | 34560      | 64              | 1M      | 128K       |

Quick Start

Install OpenMythos:

pip install open-mythos


Basic Usage Example:

import torch
from open_mythos.main import OpenMythos, MythosConfig

# Configure model
cfg = MythosConfig(
    vocab_size=1000,
    dim=256,
    n_heads=8,
    max_seq_len=128,
    max_loop_iters=4,
    prelude_layers=1,
    coda_layers=1,
    n_experts=8,
    n_shared_experts=1,
    n_experts_per_tok=2,
    expert_dim=64,
    attn_type="mla"  # or "gqa"
)

# Initialize model
model = OpenMythos(cfg)

# Forward propagation
ids = torch.randint(0, cfg.vocab_size, (2, 16))
logits = model(ids, n_loops=4)

# Generate text
out = model.generate(ids, max_new_tokens=8, n_loops=8)

Training Configuration

OpenMythos provides complete training scripts supporting single and multi-GPU setups. Key design choices include:

  • Optimizer: Muon for 2D weight matrices, AdamW for embeddings and norms.

  • Dataset: HuggingFaceFW/fineweb-edu (default sample-10BT).

  • Tokenizer: Uses OpenAI/gpt-oss-20b via MythosTokenizer.

  • Parallelization: PyTorch DDP via torchrun.

  • Precision: bfloat16 on H100/A100, float16 + GradScaler on older GPUs.
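
A minimal sketch of the described optimizer split, assuming a Muon implementation that exposes a torch.optim-style constructor (as in common open-source Muon releases); the import path, learning rates, and the embedding exclusion rule are illustrative assumptions, not the project's actual settings.

import torch

def build_optimizers(model):
    # 2D weight matrices go to Muon; embeddings, norms, and other
    # non-matrix parameters go to AdamW, as described above.
    matrix_params, other_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if p.ndim == 2 and "embed" not in name:  # assumption: embeddings stay on AdamW
            matrix_params.append(p)
        else:
            other_params.append(p)
    from muon import Muon  # illustrative import path
    opt_muon = Muon(matrix_params, lr=0.02, momentum=0.95)  # hyperparameters illustrative
    opt_adamw = torch.optim.AdamW(other_params, lr=3e-4, weight_decay=0.1)
    return opt_muon, opt_adamw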

Technical Advantages Summary

  1. Inference Scaling: Inference-time compute scales with loop count, not model size.

  2. Deep Reasoning: Additional reasoning depth is "free" in parameter terms, since the same recurrent block is reused at every iteration.

  3. Training Stability: LTI constraints guarantee training stability.

  4. Adaptive Compute: ACT mechanism enables on-demand computation.

Application Scenarios

The OpenMythos architecture is particularly suitable for:

  • Multi-step mathematical reasoning.

  • Long-horizon planning.

  • Hierarchical argumentation.

  • Code generation and debugging.

  • Scientific reasoning.

Open Source Statement

OpenMythos is an independent, community-driven theoretical reconstruction project built solely on publicly available research and speculation. It is not affiliated with Anthropic or any of its proprietary systems. The project is open-sourced under the MIT license, and community contributions are welcome.
