A 51.1M-parameter language model designed by Konpep, built on the RWKV recurrent architecture — drawing on the linear-attention formulation introduced in RWKV v4 and later refinements from RWKV v7 — and trained from scratch on a 500MB synthetic English & Greek conversational corpus. Linear-time recurrence, no attention matrix, runs on CPU.
Aether is configured and trained by train.py in this repository using the RWKV v4 architecture defined in model.py. It is an autoregressive language model with no self-attention and no KV cache: every layer carries a fixed-size recurrent state forward through the sequence, so memory and compute per token stay constant regardless of context length.
The model is trained on aether_dataset_mixed.jsonl, a synthetic corpus of roughly 500MB mixing English and Greek conversational text, tokenized with an 8,192-vocabulary byte-level BPE tokenizer built on raw UTF-8 bytes — meaning any language or symbol is representable without an unknown-token fallback.
Each of the 14 blocks is a pre-norm residual unit containing two sub-layers: Time-Mix (the recurrent attention replacement) and Channel-Mix (a gated feed-forward layer). Token embeddings are tied to the output projection head.
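The block layout described above can be sketched in a few lines of plain Python. This is only an illustration of the pre-norm residual pattern and of weight tying, not the code in model.py: the function names, the stand-in sub-layers, and the toy 4-wide vectors are all assumptions, and the LayerNorm gain and bias are left out for brevity.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (gain and bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def block(x, time_mix, channel_mix):
    """One pre-norm residual block: normalize, apply the sub-layer, add to the input."""
    h = layer_norm(x)
    x = [xi + ti for xi, ti in zip(x, time_mix(h))]
    h = layer_norm(x)
    x = [xi + ci for xi, ci in zip(x, channel_mix(h))]
    return x

def tied_logits(x, embedding):
    """Output head that reuses the embedding table: logits[v] = embedding[v] . x"""
    return [sum(e * xi for e, xi in zip(row, x)) for row in embedding]

# Toy usage: stand-in sub-layers that contribute nothing, so the residual
# path passes the input through unchanged.
zero = lambda h: [0.0] * len(h)
x = [1.0, -2.0, 0.5, 3.0]
y = block(x, zero, zero)

# Hypothetical 2-token vocabulary sharing one table for input and output.
embedding = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
logits = tied_logits(y, embedding)
```

In the real model the Time-Mix callable would also carry its recurrent state from token to token; that detail is omitted here to keep the residual structure visible.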
The WKV operation replaces softmax attention with a numerator/denominator running state, updated once per token at O(1) cost — no quadratic attention matrix is ever formed.
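For reference, the WKV operation in the published RWKV v4 formulation can be written as below, using the symbols w, u, k_t, v_t, r_t and σ from the legend. This is the standard v4 form and is offered as an assumption about what model.py computes; it has not been checked against the repository. The published version also applies an output projection after the gate, which is left out here because the parameter table lists no such matrix.

$$
\mathrm{wkv}_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)\,w + k_i} \odot v_i \;+\; e^{u + k_t} \odot v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)\,w + k_i} \;+\; e^{u + k_t}},
\qquad
o_t = \sigma(r_t) \odot \mathrm{wkv}_t
$$

The same quantity follows from a running numerator $a_t$ and denominator $b_t$, with $a_0 = b_0 = 0$:

$$
\mathrm{wkv}_t = \frac{a_{t-1} + e^{u + k_t} \odot v_t}{b_{t-1} + e^{u + k_t}},
\qquad
a_t = e^{-w} \odot a_{t-1} + e^{k_t} \odot v_t,
\qquad
b_t = e^{-w} \odot b_{t-1} + e^{k_t}
$$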
w: learned per-channel decay · u: learned per-channel bonus for the current token · k_t, v_t, r_t: key, value, receptance projections of token t · σ: sigmoid
| Parameter | Value |
|---|---|
| Hidden size | 640 |
| Layers | 14 |
| Head size | 64 |
| Vocabulary | 8,192 (byte-level BPE) |
| Dropout | 0.1 |
| Max sequence length | 256 tokens |
| Micro-batch size | 36 |
| Gradient accumulation | 4 steps |
| Effective batch size | 144 |
| Epochs | 3 |
| Learning rate | 3e-4, cosine decay to 1e-5 |
| Warmup steps | 100 |
| Optimizer | AdamW (decoupled weight decay, 0.1) |
| Gradient clipping | 1.0 (global norm) |
| Loss function | Cross-entropy, ignore_index = pad (0) |
| Precision | AMP fp16 on CUDA, fp32 on CPU |
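The schedule implied by the table can be sketched as follows. The table gives the peak rate, the floor, and the warmup length, but not the warmup shape or the total step count, so the linear warmup and the step count derived from the 695,447 dataset entries are assumptions rather than a description of train.py.

```python
import math

PEAK_LR = 3e-4
FINAL_LR = 1e-5
WARMUP_STEPS = 100
MICRO_BATCH = 36
GRAD_ACCUM = 4
EPOCHS = 3
NUM_ENTRIES = 695_447  # total entries reported for the dataset

EFFECTIVE_BATCH = MICRO_BATCH * GRAD_ACCUM                   # 144
STEPS_PER_EPOCH = math.ceil(NUM_ENTRIES / EFFECTIVE_BATCH)   # optimizer steps
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS

def learning_rate(step):
    """Linear warmup (assumed) to PEAK_LR, then cosine decay to FINAL_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine

for step in (0, 99, 100, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{learning_rate(step):.2e}")
```

Under these assumptions one epoch is 4,830 optimizer steps, so the 100 warmup steps cover under 1% of training.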
Training source: aether_dataset_mixed.jsonl, ~505MB of synthetic, bilingual English/Greek text produced by a rule-based generator built on a curated knowledge base rather than fill-in-the-blank templates. 40 subjects (20 English, 20 Greek) each carry 10–15 hand-grounded facts and 2–4 real-world applications, which the generator recombines into flowing paragraphs, Q&A pairs, and multi-turn conversations.
Each line is a JSON object with a single text field; sequences are clipped to 256 tokens with <bos>/<eos> markers and right-padded for batching.
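A minimal sketch of that preprocessing, assuming the 256-token limit includes the two markers and that a batch is padded to its longest member. Neither detail is stated above, and `toy_encode` is a hypothetical stand-in for the real tokenizer.

```python
import json

PAD, BOS, EOS = 0, 1, 2
MAX_LEN = 256

def frame(token_ids, max_len=MAX_LEN):
    """Wrap ids in <bos>/<eos>, clipping the body so the total fits max_len."""
    body = token_ids[: max_len - 2]
    return [BOS] + body + [EOS]

def pad_batch(sequences, max_len=MAX_LEN):
    """Right-pad every sequence with <pad> up to the longest one in the batch."""
    width = min(max_len, max(len(s) for s in sequences))
    return [s + [PAD] * (width - len(s)) for s in sequences]

def toy_encode(text):
    """Hypothetical stand-in tokenizer: UTF-8 bytes shifted past the 4 special ids."""
    return [b + 4 for b in text.encode("utf-8")]

# Two lines in the dataset's one-field JSON format (contents are illustrative).
lines = ['{"text": "Hello"}', '{"text": "Γειά"}']
batch = pad_batch([frame(toy_encode(json.loads(line)["text"])) for line in lines])
```

Because padding uses id 0, the loss's `ignore_index` of 0 keeps the padded positions out of the gradient.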
| Property | Value |
|---|---|
| Subjects | 40 (20 English + 20 Greek) |
| Facts per subject | 10–15 |
| Total unique facts | ~500–600 |
| Applications per subject | 2–4 |
| Total entries | 695,447 |
| Raw entries | 278,178 (40%) |
| Q&A entries | 417,269 (60%) |
| File size | ~505 MB |
| Interleave order | 2 raw : 3 Q&A (no shuffle) |
| Language split | ~50% English / ~50% Greek |
| Facts sampled per paragraph | 3–7, shuffled |
| Connector / closer variety | ~30 connectors, ~20 closers |
Diversity comes from combinatorics rather than volume of source text: each paragraph draws a random 3–7 facts from its subject's pool, shuffles their order, and joins them with one of ~30 connectors ("Moreover", "In addition", "What's more"), closing 35–55% of the time with one of ~20 natural closers. A capitalization-aware joiner keeps acronyms (DNA, CRISPR, GPS) and proper nouns (Einstein, Athens, Renaissance) correctly cased regardless of where they land in the sentence.
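The recombination step might look roughly like the sketch below. The generator itself is not shown in this document, so every name here is hypothetical, the four facts and the closer are illustrative placeholders rather than dataset content, and the 0.45 closer probability is simply a value inside the stated 35–55% range.

```python
import random

CONNECTORS = ["Moreover", "In addition", "What's more"]   # 3 of the ~30
CLOSERS = ["That is the core idea."]                       # placeholder closer
KEEP_CASE = {"DNA", "CRISPR", "GPS", "Einstein", "Athens", "Renaissance"}

FACTS = [  # illustrative placeholders, not taken from the dataset
    "DNA stores genetic information in sequences of four bases.",
    "GPS receivers estimate position from satellite signal timing.",
    "Einstein published general relativity in 1915.",
    "The double helix structure was described in 1953.",
]

def lower_first(sentence):
    """Lowercase the first letter unless the first word must keep its casing."""
    first_word = sentence.split(" ", 1)[0].strip(",.;:")
    if first_word in KEEP_CASE:
        return sentence
    return sentence[:1].lower() + sentence[1:]

def make_paragraph(facts, rng, closer_prob=0.45):
    """Pick 3-7 facts, shuffle them, and join them with random connectors."""
    count = rng.randint(3, min(7, len(facts)))   # needs at least 3 facts
    chosen = rng.sample(facts, count)
    rng.shuffle(chosen)
    parts = [chosen[0]]
    for fact in chosen[1:]:
        parts.append(f"{rng.choice(CONNECTORS)}, {lower_first(fact)}")
    if rng.random() < closer_prob:
        parts.append(rng.choice(CLOSERS))
    return " ".join(parts)

print(make_paragraph(FACTS, random.Random(0)))
```

With a pool of 10–15 facts, the number of ordered selections of 3–7 facts is already in the millions before connectors and closers are varied, which is where the combinatorial diversity comes from.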
The tokenizer (tokenizer.py) is a byte-level BPE built directly on UTF-8 bytes as the base vocabulary, with learned merges on top. Because the base alphabet is raw bytes rather than a fixed character set, it covers any Unicode language, including the mixed English/Greek corpus, without an out-of-vocabulary fallback.
| Property | Value |
|---|---|
| Base alphabet | 256 raw bytes |
| Learned BPE merges | 7,936 |
| Total vocabulary | 8,192 |
| Special tokens | <pad>=0, <bos>=1, <eos>=2, <trn>=3 |
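A compact sketch of byte-level BPE shows why no unknown-token fallback is needed. It is not the code in tokenizer.py. How the four special ids share the 8,192 slots with the 256 bytes and 7,936 merges is not stated above, so this sketch simply reserves ids 0–3 and shifts every byte up by 4, which is an assumption.

```python
from collections import Counter

SPECIALS = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "<trn>": 3}
BYTE_OFFSET = len(SPECIALS)  # assumption: byte b is stored as id b + 4

def to_ids(text):
    return [b + BYTE_OFFSET for b in text.encode("utf-8")]

def apply_merge(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(corpus, num_merges):
    """Greedy BPE: repeatedly fuse the most frequent adjacent pair of ids."""
    seqs = [to_ids(text) for text in corpus]
    merges = {}                      # (left, right) -> new id, in learned order
    next_id = BYTE_OFFSET + 256
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:                # nothing left worth merging
            break
        merges[best] = next_id
        seqs = [apply_merge(seq, best, next_id) for seq in seqs]
        next_id += 1
    return merges

def encode(text, merges):
    seq = to_ids(text)
    for pair, new_id in merges.items():   # dicts keep the learned order
        seq = apply_merge(seq, pair, new_id)
    return seq

def decode(ids, merges):
    expand = {new_id: pair for pair, new_id in merges.items()}
    def to_bytes(i):
        if i in expand:
            left, right = expand[i]
            return to_bytes(left) + to_bytes(right)
        return bytes([i - BYTE_OFFSET])
    # Special ids (below BYTE_OFFSET) carry no text and are skipped.
    return b"".join(to_bytes(i) for i in ids if i >= BYTE_OFFSET).decode("utf-8")

merges = train_bpe(["abab abab abab", "καλημέρα καλημέρα"], num_merges=20)
text = "abab καλημέρα Ω"             # Ω never appeared in the training corpus
ids = encode(text, merges)
```

Text the tokenizer never saw during training still encodes, because every character falls back to its raw bytes.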
RWKV — Receptance, Weight, Key, Value — was designed to keep the parallel trainability of Transformer attention while replacing its quadratic O(T²) cost with a linear O(T) recurrence. Instead of every token attending to every previous token through a softmax score matrix, RWKV carries a small, fixed-size running state forward through the sequence — much like an RNN's hidden state, but built from the same R/K/V vocabulary as attention.
The practical consequence: generating token 10,000 costs exactly as much as generating token 10 — no growing cache, no growing compute. This is what lets a 51M-parameter model run inference on a CPU at constant speed.
The running state is the entire "memory" of every token seen so far, compressed into a fixed-size numerator and denominator per channel.

Because each update only ever touches the current token plus a constant-size running state, RWKV can be unrolled two ways: in parallel across a whole training sequence (fast on GPU, just like a Transformer), or one token at a time during inference, carrying forward only (numerator, denominator) per channel per layer and never re-reading the past. Across 14 layers and 640 channels, that state is small enough to update on a CPU in real time, which is what makes a model with no GPU dependency and no attention cache possible at this size.
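That the two unrollings agree can be checked numerically for a single channel. The sketch below uses the standard RWKV v4 WKV definition, which is assumed to match model.py, and is deliberately naive: it exponentiates keys directly, whereas practical v4 implementations usually track a running maximum exponent to avoid overflow.

```python
import math

def wkv_closed_form(w, u, ks, vs, t):
    """wkv at 0-based position t for one channel, summing over every past token."""
    bonus = math.exp(u + ks[t])
    num = bonus * vs[t]
    den = bonus
    for i in range(t):
        weight = math.exp(-(t - 1 - i) * w + ks[i])  # older tokens decay more
        num += weight * vs[i]
        den += weight
    return num / den

def wkv_recurrent(w, u, ks, vs):
    """Same outputs, left to right, carrying only (num, den) between tokens."""
    num, den = 0.0, 0.0
    decay = math.exp(-w)
    outputs = []
    for k, v in zip(ks, vs):
        bonus = math.exp(u + k)
        outputs.append((num + bonus * v) / (den + bonus))  # read
        num = decay * num + math.exp(k) * v                # update
        den = decay * den + math.exp(k)
    return outputs

# Arbitrary toy values for one channel.
w, u = 0.5, 0.3
ks = [0.1, -0.2, 0.4, 0.0]
vs = [1.0, 2.0, -1.0, 0.5]

recurrent = wkv_recurrent(w, u, ks, vs)
closed = [wkv_closed_form(w, u, ks, vs, t) for t in range(len(ks))]
max_gap = max(abs(a - b) for a, b in zip(recurrent, closed))
```

The closed form does work proportional to t at position t, while the recurrent loop does a fixed amount per token. That is the difference between re-reading the past and carrying a state.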
Breaking the model down by sub-layer shows where capacity is spent. Channel-Mix carries the most weight per block (it projects up to a 1,280-wide hidden layer and back), Time-Mix is leaner, and the tied embedding/output table accounts for roughly a tenth of the model.
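The structural payoff of the recurrence shows up directly in how compute scales with context length. In a standard Transformer each new token attends to every earlier one, so attention cost per token grows linearly with position and total cost grows with the square of sequence length; RWKV's cost per token stays flat, so its total grows only linearly.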
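A rough count makes the scaling concrete. The comparison below is against a hypothetical single attention layer of the same width (640), counting only the key dot products and the value-weighted sum. Projections, heads, and softmax are ignored on both sides, so the numbers illustrate growth rates rather than measured cost.

```python
D_MODEL = 640

def attention_macs_per_token(position, d_model=D_MODEL):
    """One attention layer at a 1-based position: a dot product with every
    cached key, then a weighted sum over every cached value."""
    return 2 * position * d_model

def wkv_ops_per_token(position, d_model=D_MODEL):
    """WKV touches each channel's running state once, whatever the position."""
    return d_model

def cumulative(cost_fn, seq_len):
    """Total cost of processing positions 1..seq_len one token at a time."""
    return sum(cost_fn(t) for t in range(1, seq_len + 1))

for seq_len in (256, 1024, 4096):
    attn = cumulative(attention_macs_per_token, seq_len)
    wkv = cumulative(wkv_ops_per_token, seq_len)
    print(f"T={seq_len:>5}  attention={attn:>14,}  wkv={wkv:>10,}")
```

Quadrupling the context multiplies the attention total by roughly sixteen and the WKV total by exactly four.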
The full parameter count follows directly from the config above: vocabulary 8,192, hidden 640, 14 layers, FFN hidden 1,280 (2×640). The embedding table and the output head are the two largest single tensors in the model, and tying them saves 5.24M parameters that would otherwise be stored twice.
| Component | Shape | Parameters |
|---|---|---|
| Token embedding | 8,192 × 640 | 5,242,880 |
| Per block: Time-Mix | | |
| time_decay (w) | 640 | 640 |
| time_first (u) | 640 | 640 |
| receptance, key, value weights | 3 × 640 × 640 | 1,228,800 |
| LayerNorm (pre Time-Mix) | 2 × 640 | 1,280 |
| Time-Mix subtotal | | 1,231,360 |
| Per block: Channel-Mix | | |
| key projection (up) | 640 × 1,280 | 819,200 |
| value projection (down) | 1,280 × 640 | 819,200 |
| receptance weight | 640 × 640 | 409,600 |
| LayerNorm (pre Channel-Mix) | 2 × 640 | 1,280 |
| Channel-Mix subtotal | | 2,049,280 |
| Per block, total | | 3,280,640 |
| × 14 blocks | | 45,928,960 |
| Output | | |
| Final LayerNorm | 2 × 640 | 1,280 |
| Output head (tied to embedding) | shared | +0 |
| TOTAL | | 51,173,120 |
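The table's arithmetic can be reproduced from the four config values alone. The variable names below are illustrative and do not come from model.py; the sketch only re-adds the shapes listed in the table.

```python
VOCAB, D, LAYERS = 8_192, 640, 14
FFN = 2 * D  # 1,280

def layer_norm_params(d):
    return 2 * d  # gain + bias

# Time-Mix: decay w, bonus u, three d x d projections, one LayerNorm.
time_mix = D + D + 3 * D * D + layer_norm_params(D)

# Channel-Mix: up and down projections, receptance, one LayerNorm.
channel_mix = D * FFN + FFN * D + D * D + layer_norm_params(D)

per_block = time_mix + channel_mix
embedding = VOCAB * D            # shared with the output head, counted once
total = embedding + LAYERS * per_block + layer_norm_params(D)

print(f"time_mix    = {time_mix:,}")
print(f"channel_mix = {channel_mix:,}")
print(f"per_block   = {per_block:,}")
print(f"total       = {total:,}")   # 51,173,120
```

An untied output head would add another `VOCAB * D` = 5,242,880 parameters to the total.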
At inference, Aether never re-reads past tokens. Each of the 14 layers keeps exactly two vectors — a numerator and a denominator, each 640-wide — and that's the entire memory of everything generated so far.
| Quantity | Size |
|---|---|
| State per layer (num + den) | 2 × 640 = 1,280 floats |
| State, full model (× 14 layers) | 17,920 floats |
| State memory, FP32 | ≈ 70 KB |
| State memory, FP16 | ≈ 35 KB |
That ~70KB is constant — it does not grow whether the model has generated 10 tokens or 100,000. A Transformer's KV cache, by contrast, grows linearly with every generated token.
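The contrast can be put in numbers with a small calculation. The Transformer here is hypothetical: it is given the same depth and width as Aether and stores one full-width key and one full-width value vector per layer per token, with no cache compression.

```python
LAYERS, D = 14, 640

def rwkv_state_bytes(bytes_per_float=4):
    """Numerator + denominator, each D-wide, in every layer."""
    return LAYERS * 2 * D * bytes_per_float

def kv_cache_bytes(tokens, bytes_per_float=4):
    """Hypothetical same-size Transformer: one key and one value vector
    per layer for every token generated so far."""
    return LAYERS * 2 * D * tokens * bytes_per_float

print(f"RWKV state (any length): {rwkv_state_bytes() / 1024:.0f} KiB")
for tokens in (10, 1_000, 100_000):
    print(f"KV cache at {tokens:>7,} tokens: {kv_cache_bytes(tokens) / 1024**2:,.1f} MiB")
```

Under these assumptions the cache entry for a single token is already as large as Aether's entire recurrent state.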
Inference cost per token is dominated by the block's matrix multiplications, not by the WKV recurrence itself, which is comparatively cheap. Matrix costs are counted as multiply-accumulates (MACs), one per weight:
| Stage | Cost |
|---|---|
| Time-Mix projections (R, K, V) | O(d²) ≈ 3 × 640² = 1.23M MACs |
| WKV state update & read | O(d) ≈ 640 ops |
| Channel-Mix projections (key, value, receptance) | O(d²) ≈ 2 × 640 × 1,280 + 640² = 2.05M MACs |
| Per block, per token | ≈ 3.28M MACs |
| All 14 blocks, per token | ≈ 45.9M MACs |
| Output head (640 → 8,192) | ≈ 5.2M MACs |
| Total, per generated token | ≈ 51.1M MACs |
| Full forward pass, T=256 (training) | ≈ 13.1 billion MACs (≈ 26 billion FLOPs) |
Because inference cost per step never depends on how many tokens came before, generating the 50,000th token costs the same ~51M multiply-accumulates as generating the 5th. This is the defining property that separates RWKV from quadratic-attention Transformers at long context lengths.
Aether's blocks follow the v4 formulation. RWKV v7 keeps the linear-time recurrent design but replaces v4's static per-channel decay and static projections with token-dependent, low-rank ("LoRA-style") modulation of the decay, gating, and value paths, trading roughly 10–15% more parameters for materially better long-context recall.
| | RWKV v4 (this model) | RWKV v7 |
|---|---|---|
| Time-decay | Single learned vector w ∈ ℝ^d, static per channel | Token-dependent decay via low-rank projection (w0, w1, w2) |
| Token mixing | None — direct projection of x_t | Data-dependent blend x_t + μ⊙(x_{t-1}-x_t), learned per channel |
| Value path | Static value projection | Value residual mixing across layers (v0, v1, v2) |
| Gating | Single sigmoid receptance gate | Additional learned gate (g1, g2) on top of receptance |
| Extra params / layer | — (baseline) | ~10–15% more, via rank-64 to rank-128 LoRA factors |