Model Card · Build Spec

Aether

A 51.1M-parameter language model designed by Konpep, built on the RWKV recurrent architecture — drawing on the linear-attention formulation introduced in RWKV v4 and later refinements from RWKV v7 — and trained from scratch on a 500MB synthetic English & Greek conversational corpus. Linear-time recurrence, no attention matrix, runs on CPU.

51,173,120 params · ~195 MB fp32 · 14 layers · 640 hidden · head 64 · vocab 8,192
01 — Overview

What this build is

Aether is configured and trained by train.py in this repository using the RWKV v4 architecture defined in model.py. It is an autoregressive (next-token) language model with no self-attention and no KV cache: every layer carries a fixed-size recurrent state forward through the sequence, so memory and compute per token stay constant regardless of context length.

The model is trained on aether_dataset_mixed.jsonl, a synthetic corpus of roughly 500MB mixing English and Greek conversational text, tokenized with an 8,192-vocabulary byte-level BPE tokenizer built on raw UTF-8 bytes — meaning any language or symbol is representable without an unknown-token fallback.

Total Parameters: 51.17M
RWKV Blocks: 14
Hidden Size: 640
Head Size: 64
Vocabulary: 8,192
Weights (fp32): ~195 MB
Max Sequence Length: 256
Inference Complexity: O(T)
02 — Architecture

RWKV v4 block stack

Each of the 14 blocks is a pre-norm residual unit containing two sub-layers: Time-Mix (the recurrent attention replacement) and Channel-Mix (a gated feed-forward layer). Token embeddings are tied to the output projection head.

Input: Token IDs → Embedding (8,192 × 640)
Block × 14: LayerNorm → Time-Mix → Residual Add → LayerNorm → Channel-Mix → Residual Add
Time-Mix: token-shift blend, per-channel time-decay, WKV linear-attention recurrence, sigmoid receptance gate.
Channel-Mix: squared-ReLU key projection (gated MLP), sigmoid receptance gate, zero-initialized output for stable early training.
Output: Final LayerNorm → Linear Head (tied weights) → Logits (8,192)
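
The layout above can be sketched in NumPy as follows. This is a structural sketch only, not the code in model.py: `time_mix` and `channel_mix` are hypothetical stand-in callables, the LayerNorm omits its learned scale and bias, and the sizes are shrunk so the example runs instantly.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis (learned scale/bias omitted for brevity)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def block_forward(x, time_mix, channel_mix):
    """One pre-norm residual block: normalize, transform, add back."""
    x = x + time_mix(layer_norm(x))
    x = x + channel_mix(layer_norm(x))
    return x

def model_forward(token_ids, embedding, blocks):
    """Embedding -> N blocks -> final LayerNorm -> tied linear head."""
    x = embedding[token_ids]                  # (T, d)
    for time_mix, channel_mix in blocks:
        x = block_forward(x, time_mix, channel_mix)
    x = layer_norm(x)
    return x @ embedding.T                    # (T, vocab): head reuses the embedding table

# Tiny stand-in sizes; Aether itself uses vocab=8192, d=640, 14 blocks.
rng = np.random.default_rng(0)
vocab, d, n_blocks = 32, 8, 2
embedding = rng.normal(scale=1 / np.sqrt(d), size=(vocab, d))
zero_sublayer = lambda h: np.zeros_like(h)    # hypothetical placeholder sub-layer
blocks = [(zero_sublayer, zero_sublayer)] * n_blocks
logits = model_forward(np.array([1, 5, 7, 2]), embedding, blocks)
```

Because the head multiplies by `embedding.T`, no separate output matrix is stored, which is what weight tying means in practice.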

Time-Mix recurrence

The WKV operation replaces softmax attention with a numerator/denominator running state, updated once per token at O(1) cost — no quadratic attention matrix is ever formed.

$x' = \mu \cdot x + (1 - \mu) \cdot x_{\mathrm{prev}}$ (token-shift blend)
$\mathrm{num}_t = w \cdot \mathrm{num}_{t-1} + e^{k_t} \cdot v_t$ (numerator state: weighted running total)
$\mathrm{den}_t = w \cdot \mathrm{den}_{t-1} + e^{k_t}$ (denominator state: weighted count)
$\mathrm{wkv}_t = \dfrac{e^{u + k_t} \cdot v_t + \mathrm{num}_{t-1}}{e^{u + k_t} + \mathrm{den}_{t-1}}$ (WKV read-out: weighted average of all past values)
$y_t = \sigma(r_t) \cdot \mathrm{wkv}_t$ (receptance gate applied to the output)

$w$: learned per-channel decay · $u$: learned per-channel bonus for the current token · $k_t, v_t, r_t$: key, value, receptance projections of token $t$ · $\sigma$: sigmoid
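
A minimal NumPy sketch of the recurrence above, written directly from these equations. It is an illustration rather than the kernel in model.py: the function name and argument layout are assumptions, and it exponentiates keys naively, whereas practical implementations usually track a running maximum exponent to avoid overflow.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def wkv_recurrent(k, v, r, w, u):
    """Run the Time-Mix recurrence one token at a time.

    k, v, r: (T, d) key, value, and receptance projections
    w: (d,) per-channel decay, assumed to lie in (0, 1)
    u: (d,) per-channel bonus for the current token
    Returns y: (T, d) gated outputs.
    """
    T, d = k.shape
    num = np.zeros(d)                 # numerator state
    den = np.zeros(d)                 # denominator state
    y = np.empty((T, d))
    for t in range(T):
        ek = np.exp(k[t])
        bonus = np.exp(u + k[t])
        # Read-out combines the current token with the state from step t-1.
        wkv = (bonus * v[t] + num) / (bonus + den)
        y[t] = sigmoid(r[t]) * wkv
        # Only then is token t folded into the running state.
        num = w * num + ek * v[t]
        den = w * den + ek
    return y
```

At the first position the state is empty, so the read-out reduces to `v[0]` and the output is simply `sigmoid(r[0]) * v[0]`.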

03 — Training Configuration

Hyperparameters (train.py)

Parameter | Value
Hidden size | 640
Layers | 14
Head size | 64
Vocabulary | 8,192 (byte-level BPE)
Dropout | 0.1
Max sequence length | 256 tokens
Micro-batch size | 36
Gradient accumulation | 4 steps
Effective batch size | 144
Epochs | 3
Learning rate | 3e-4, cosine decay to 1e-5
Warmup steps | 100
Optimizer | AdamW (decoupled weight decay, 0.1)
Gradient clipping | 1.0 (global norm)
Loss function | Cross-entropy, ignore_index = pad (0)
Precision | AMP fp16 on CUDA, fp32 on CPU
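
The batch arithmetic and learning-rate schedule in this table can be sketched as below. Several details are assumptions, not taken from train.py: the warmup is taken to be linear, `lr_at` and `total_steps` are hypothetical names, and the step estimate assumes each dataset entry yields exactly one training sequence.

```python
import math

MICRO_BATCH = 36
GRAD_ACCUM = 4
EFFECTIVE_BATCH = MICRO_BATCH * GRAD_ACCUM            # 36 * 4 = 144

def lr_at(step, total_steps, peak=3e-4, floor=1e-5, warmup=100):
    """Linear warmup to `peak` (assumed), then cosine decay to `floor`."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    progress = min(progress, 1.0)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

# Rough optimizer-step estimate, assuming one sequence per dataset entry.
steps_per_epoch = math.ceil(695_447 / EFFECTIVE_BATCH)
total_steps = 3 * steps_per_epoch
```

Under those assumptions the schedule peaks at 3e-4 once warmup ends and reaches 1e-5 at the final step.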
04 — Dataset & Tokenizer

Corpus: Aether Dataset Generator v6

Training source: aether_dataset_mixed.jsonl, ~505MB of synthetic, bilingual English/Greek text produced by a rule-based generator built on a curated knowledge base rather than fill-in-the-blank templates. 40 subjects (20 English, 20 Greek) each carry 10–15 hand-grounded facts and 2–4 real-world applications, which the generator recombines into flowing paragraphs, Q&A pairs, and multi-turn conversations.

Each line is a JSON object with a single text field; sequences are clipped to 256 tokens with <bos>/<eos> markers and right-padded for batching.
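
A sketch of how one JSONL line could be turned into a fixed-length training sequence. The helper names are hypothetical, the byte-offset tokenizer is a stand-in for tokenizer.py, and keeping `<eos>` after clipping is an assumption about how the clipping is done.

```python
import json

PAD, BOS, EOS = 0, 1, 2
MAX_LEN = 256

def encode_line(line, tokenize, max_len=MAX_LEN):
    """Parse one JSONL line and return exactly `max_len` token IDs."""
    text = json.loads(line)["text"]
    ids = [BOS] + tokenize(text)
    ids = ids[: max_len - 1] + [EOS]          # clip, keeping room for <eos> (assumed)
    return ids + [PAD] * (max_len - len(ids)) # right-pad for batching

# Hypothetical stand-in tokenizer: raw UTF-8 bytes shifted past the 4 special IDs.
byte_tokenize = lambda s: [b + 4 for b in s.encode("utf-8")]

example = encode_line(json.dumps({"text": "Γεια"}), byte_tokenize)
```

With `ignore_index = pad (0)` in the loss, the padded tail contributes nothing to the gradient.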

Dataset composition
Composition chart: 695,447 total entries (~505 MB), split into 278K raw paragraphs (40%) and 417K Q&A pairs (60%), interleaved 2 raw : 3 Q&A with no shuffle.
Property | Value
Subjects | 40 (20 English + 20 Greek)
Facts per subject | 10–15
Total unique facts | ~500–600
Applications per subject | 2–4
Total entries | 695,447
Raw entries | 278,178 (40%)
Q&A entries | 417,269 (60%)
File size | ~505 MB
Interleave order | 2 raw : 3 Q&A (no shuffle)
Language split | ~50% English / ~50% Greek
Facts sampled per paragraph | 3–7, shuffled
Connector / closer variety | ~30 connectors, ~20 closers

Diversity comes from combinatorics rather than volume of source text: each paragraph draws a random 3–7 facts from its subject's pool, shuffles their order, and joins them with one of ~30 connectors ("Moreover", "In addition", "What's more"), closing 35–55% of the time with one of ~20 natural closers. A capitalization-aware joiner keeps acronyms (DNA, CRISPR, GPS) and proper nouns (Einstein, Athens, Renaissance) correctly cased regardless of where they land in the sentence.
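
The recombination step described above might look roughly like this. Everything here is illustrative and not the generator's actual code: the function names, the sample facts, the single closer, and the fixed 45% closer probability (a midpoint of the stated 35–55% range) are all assumptions.

```python
import random

PROPER_NOUNS = {"Einstein", "Athens", "Renaissance"}   # examples named in the text

def lower_first(sentence):
    """Lowercase the first letter unless the first word is an acronym or proper noun."""
    first = sentence.split(" ", 1)[0].strip(".,;:")
    if first in PROPER_NOUNS or (len(first) > 1 and first.isupper()):
        return sentence
    return sentence[0].lower() + sentence[1:]

def build_paragraph(facts, connectors, closers, rng, closer_prob=0.45):
    """Sample 3-7 facts in random order and join them with connectors."""
    n = rng.randint(3, min(7, len(facts)))
    chosen = rng.sample(facts, n)              # random subset, in random order
    sentences = [chosen[0]]
    for fact in chosen[1:]:
        sentences.append(f"{rng.choice(connectors)}, {lower_first(fact)}")
    if rng.random() < closer_prob:
        sentences.append(rng.choice(closers))
    return " ".join(sentences)

# Hypothetical sample data for illustration only.
facts = [
    "DNA stores genetic information.",
    "The cell nucleus holds most of that DNA.",
    "CRISPR can edit specific sequences.",
    "Proteins are built from amino acids.",
    "Einstein is not relevant to this subject.",
]
connectors = ["Moreover", "In addition", "What's more"]
closers = ["That is why the topic matters in practice."]
paragraph = build_paragraph(facts, connectors, closers, random.Random(0))
```

Passing a seeded `random.Random` keeps each generated paragraph reproducible.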

Tokenizer

A byte-level BPE tokenizer (tokenizer.py) built directly on UTF-8 bytes as the base vocabulary, with learned merges on top. Because the base alphabet is raw bytes rather than a fixed character set, the tokenizer covers any Unicode language — including the mixed English/Greek corpus — without an out-of-vocabulary fallback.

Property | Value
Base alphabet | 256 raw bytes
Learned BPE merges | 7,936
Total vocabulary | 8,192
Special tokens | <pad>=0, <bos>=1, <eos>=2, <trn>=3
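
To illustrate why a byte-level base alphabet needs no unknown-token fallback, here is a small encoder sketch. It is not tokenizer.py: `bpe_encode` and the two-entry merge table are hypothetical, and the sketch ignores special tokens because their placement relative to the byte IDs is not specified here.

```python
def bpe_encode(text, merges):
    """Encode text with byte-level BPE.

    merges: list of ((left_id, right_id), new_id) in the order they were learned.
    """
    ids = list(text.encode("utf-8"))           # base alphabet: one ID per raw byte
    for (left, right), new_id in merges:
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and ids[i] == left and ids[i + 1] == right:
                out.append(new_id)             # replace the pair with its merged ID
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

# Hypothetical merges: the two UTF-8 bytes of Greek "α" (0xCE, 0xB1), then "αα".
merges = [((0xCE, 0xB1), 256), ((256, 256), 257)]
greek = bpe_encode("αα", merges)               # both letters collapse into one token
ascii_only = bpe_encode("a", merges)           # no merge applies, raw byte survives
```

Any string still encodes when no merge applies, because every UTF-8 byte is already in the vocabulary.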
05 — How RWKV Works

An RNN that trains like a Transformer

RWKV — Receptance, Weight, Key, Value — was designed to keep the parallel trainability of Transformer attention while replacing its quadratic O(T²) cost with a linear O(T) recurrence. Instead of every token attending to every previous token through a softmax score matrix, RWKV carries a small, fixed-size running state forward through the sequence — much like an RNN's hidden state, but built from the same R/K/V vocabulary as attention.

The practical consequence: generating token 10,000 costs exactly as much as generating token 10 — no growing cache, no growing compute. This is what lets a 51M-parameter model run inference on a CPU at constant speed.

Transformer Attention: Each token computes a score against every previous token, then softmax-weights all their values together. O(T²) compute · O(T) memory cache
RWKV (this model): Each token updates one running numerator/denominator state, carried from the previous step; no score matrix is ever formed. O(T) compute · O(1) memory state

The four letters

R (Receptance): A learned sigmoid gate that decides how much of the newly mixed information is allowed to pass through to the output; the model's "how much do I accept this" valve.
W (Weight, decay): A per-channel, learned decay rate. Each of the 640 hidden channels forgets its running state at its own trained pace, so some channels hold information longer than others.
K (Key): Plays the same role as the key vector in standard attention: it determines how strongly the current token's content should weigh into the running state.
V (Value): The content actually being carried forward and blended: the payload that the key vector weights and the receptance gate ultimately releases.

Step by step, inside one block

1. Token-shift. Each channel blends the current token's hidden value with the previous token's, giving a cheap, built-in sense of "what just happened" before the R, K, V projections are applied.
2. Project to R, K, V. Three independent linear layers turn the shifted hidden state into receptance, key, and value vectors, much as a Transformer computes its query, key, and value projections, but without forming a score matrix.
3. Update the WKV state. The key/value pair is folded into two running totals, a numerator and a denominator, weighted by the learned per-channel decay w. This is the entire "memory" of every token seen so far, compressed into one fixed-size numerator/denominator pair per channel.
4. Read the state, gate it. The numerator divided by the denominator, with the current token weighted in through the bonus u, gives a weighted average of all past values; this is the WKV output. The receptance gate (sigmoid of R) then decides how much of it actually passes through.
5. Channel-Mix. A second sub-layer, a small gated feed-forward network with a squared-ReLU nonlinearity, mixes information across the 640 channels themselves, refining the representation before it moves to the next block.
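
Steps 1 and 5 can be sketched in NumPy as follows. This is an illustration, not model.py: the function and weight names are hypothetical, using separate blend vectors for the key and receptance paths is an assumption, and so is treating the "previous token" of the first position as zeros.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def token_shift(x, mu):
    """Blend each position with the previous one: x * mu + x_prev * (1 - mu)."""
    x_prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])   # zeros before token 0 (assumed)
    return x * mu + x_prev * (1.0 - mu)

def channel_mix(x, mu_k, mu_r, W_key, W_value, W_recept):
    """Gated feed-forward sub-layer with a squared-ReLU nonlinearity.

    x: (T, d); W_key: (d, 2d); W_value: (2d, d); W_recept: (d, d)
    """
    k = token_shift(x, mu_k) @ W_key                 # up-projection to 2d
    hidden = np.maximum(k, 0.0) ** 2                 # squared ReLU
    gate = sigmoid(token_shift(x, mu_r) @ W_recept)  # receptance gate
    return gate * (hidden @ W_value)                 # down-projection, then gating
```

With `W_value` initialized to zeros, the sub-layer outputs exactly zero, so each block starts out as an identity on the residual stream.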

Why it stays cheap forever

Because steps 3–4 only ever touch the current token plus a constant-size running state, RWKV can be unrolled two ways: in parallel across a whole training sequence (fast on GPU, just like a Transformer), or one token at a time during inference, carrying forward only (numerator, denominator) per channel per layer — never re-reading the past. Across 14 layers and 640 channels, that state is small enough to update on a CPU in real time, which is what makes a model with no GPU dependency and no attention cache possible at this size.
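
The claim that a carried (numerator, denominator) state loses nothing can be checked numerically. The sketch below streams tokens through a hypothetical `wkv_step` function and compares the result with an explicit sum over all past tokens. The explicit version is a quadratic-time reference for checking only, not the parallel kernel used in training, and the sizes are arbitrary.

```python
import numpy as np

def wkv_step(state, k_t, v_t, w, u):
    """Process one token: read out with the old state, then update it."""
    num, den = state
    bonus = np.exp(u + k_t)
    out = (bonus * v_t + num) / (bonus + den)
    new_state = (w * num + np.exp(k_t) * v_t, w * den + np.exp(k_t))
    return out, new_state

def wkv_explicit(k, v, w, u):
    """Reference: recompute the decayed weighted average from scratch at every position."""
    T, d = k.shape
    out = np.empty((T, d))
    for t in range(T):
        num, den = np.zeros(d), np.zeros(d)
        for i in range(t):
            weight = w ** (t - 1 - i) * np.exp(k[i])   # older tokens decay more
            num += weight * v[i]
            den += weight
        bonus = np.exp(u + k[t])
        out[t] = (bonus * v[t] + num) / (bonus + den)
    return out

rng = np.random.default_rng(0)
T, d = 6, 4
k, v = rng.normal(size=(T, d)), rng.normal(size=(T, d))
w, u = rng.uniform(0.5, 0.99, size=d), rng.normal(size=d)

state = (np.zeros(d), np.zeros(d))
streamed = []
for t in range(T):
    out, state = wkv_step(state, k[t], v[t], w, u)
    streamed.append(out)
streamed = np.array(streamed)
reference = wkv_explicit(k, v, w, u)
```

The two results agree to floating-point tolerance, even though the streaming loop never looks at earlier tokens again.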

06 — Parameters & Cost, Visualized

Where the 51.1M parameters live

Breaking the model down by sub-layer shows where capacity is spent. Channel-Mix carries the most weight per block (it projects up to a 1,280-wide hidden layer and back), Time-Mix is leaner, and the tied embedding/output table accounts for roughly a tenth of the model.

Share of the 51.17M parameters: Channel-Mix (×14) 56.1% · Time-Mix (×14) 33.7% · Embedding / Head, tied 10.2%
Per-block parameter count (one block of 14): Channel-Mix 2.05M · Time-Mix 1.23M · LayerNorm ~0M
Channel-Mix's 640→1,280→640 projection costs more than Time-Mix's three 640×640 matrices.

Inference cost: RWKV vs. self-attention

The structural payoff of the recurrence shows up directly in how compute scales with context length. A standard Transformer's attention cost grows with the square of sequence length; RWKV's stays flat per token.

Chart: relative cost against sequence length (256, 1,024, 4,096, and 16,384 tokens) for self-attention, O(T²), and the RWKV WKV state, O(T). Bar heights are illustrative of relative growth rate, not measured wall-clock; the point is the curve shape: quadratic vs. linear.
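
The growth rates in the chart can be reproduced with a few lines of arithmetic. Like the chart, this is purely illustrative: it normalizes both curves to 1 at 256 tokens and ignores constant factors, and `relative_cost` is a hypothetical helper.

```python
LENGTHS = [256, 1024, 4096, 16384]

def relative_cost(lengths, exponent):
    """Cost of processing T tokens, proportional to T**exponent, normalized to the first length."""
    base = lengths[0] ** exponent
    return [(t ** exponent) / base for t in lengths]

attention = relative_cost(LENGTHS, 2)   # quadratic in sequence length
rwkv = relative_cost(LENGTHS, 1)        # linear in sequence length

for t, a, r in zip(LENGTHS, attention, rwkv):
    print(f"T={t:>6}: attention x{a:>6.0f}, RWKV x{r:>3.0f}")
```

Going from 256 to 16,384 tokens multiplies the linear curve by 64 and the quadratic one by 4,096.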
07 — Exact Parameter Ledger

Every weight, accounted for

The full parameter count follows directly from the config above: vocabulary 8,192, hidden 640, 14 layers, FFN hidden 1,280 (2×640). Weight tying between the embedding table and the output head, the two largest single tensors in the model, saves 5.24M parameters that would otherwise be stored twice.

Component | Shape | Parameters
Token embedding | 8,192 × 640 | 5,242,880
Per block: Time-Mix
time_decay (w) | 640 | 640
time_first (u) | 640 | 640
receptance, key, value weights | 3 × 640 × 640 | 1,228,800
LayerNorm (pre Time-Mix) | 2 × 640 | 1,280
Time-Mix subtotal | | 1,231,360
Per block: Channel-Mix
key projection (up) | 640 × 1,280 | 819,200
value projection (down) | 1,280 × 640 | 819,200
receptance weight | 640 × 640 | 409,600
LayerNorm (pre Channel-Mix) | 2 × 640 | 1,280
Channel-Mix subtotal | | 2,049,280
Per block, total | | 3,280,640
× 14 blocks | | 45,928,960
Output
Final LayerNorm | 2 × 640 | 1,280
Output head (tied to embedding) | shared | +0
TOTAL | | 51,173,120
Weights, FP32: 195.2 MB
Weights, FP16: 97.6 MB
Saved via weight tying: 5.24M
Embedding init scale: 1/√640
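
The ledger can be re-derived from the four config values. The sketch below uses only the components listed in the table; the variable names are illustrative, and the megabyte figures use 2^20 bytes per MB, which is what reproduces the 195.2 and 97.6 figures.

```python
VOCAB, D, LAYERS, FFN = 8192, 640, 14, 1280

embedding = VOCAB * D                               # tied with the output head
time_mix = D + D + 3 * D * D + 2 * D                # w, u, R/K/V weights, LayerNorm
channel_mix = D * FFN + FFN * D + D * D + 2 * D     # up, down, receptance, LayerNorm
per_block = time_mix + channel_mix
final_ln = 2 * D
total = embedding + LAYERS * per_block + final_ln

fp32_mb = total * 4 / 2**20
fp16_mb = total * 2 / 2**20

shares = {
    "channel_mix": LAYERS * channel_mix / total,
    "time_mix": LAYERS * time_mix / total,
    "embedding": embedding / total,
}
print(f"{total:,} params, {fp32_mb:.1f} MB fp32, {fp16_mb:.1f} MB fp16")
```

Changing any one of the four constants shows immediately how the total and the per-component shares would move.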
08 — State, FLOPs & Scaling Behavior

What the model actually carries between tokens

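At inference, Aether never re-reads past tokens. For the WKV recurrence, each of the 14 layers keeps exactly two vectors, a numerator and a denominator, each 640-wide, and together they summarize everything generated so far. The token-shift blend also needs the previous token's input to each sub-layer; that buffer is likewise fixed-size and is not counted in the table below.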

Quantity | Size
State per layer (num + den) | 2 × 640 = 1,280 floats
State, full model (× 14 layers) | 17,920 floats
State memory, FP32 | ≈ 70 KB
State memory, FP16 | ≈ 35 KB

That ~70KB is constant — it does not grow whether the model has generated 10 tokens or 100,000. A Transformer's KV cache, by contrast, grows linearly with every generated token.
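
The contrast can be made concrete with a little arithmetic. The comparison Transformer here is hypothetical: it is assumed to have the same depth and width as Aether and to cache one key and one value vector per layer per token. The function names are illustrative.

```python
LAYERS, D = 14, 640

def rwkv_state_bytes(bytes_per_float=4):
    """Numerator + denominator per layer; independent of how many tokens were seen."""
    return LAYERS * 2 * D * bytes_per_float

def kv_cache_bytes(tokens, bytes_per_float=4):
    """Hypothetical same-size Transformer: one K and one V vector per layer per token."""
    return LAYERS * 2 * D * tokens * bytes_per_float

for tokens in (10, 1_000, 100_000):
    print(f"{tokens:>7} tokens: RWKV {rwkv_state_bytes() / 1024:.0f} KB, "
          f"KV cache {kv_cache_bytes(tokens) / 1024:,.0f} KB")
```

Under these assumptions the cache matches the RWKV state after a single token and is 100,000 times larger after 100,000 tokens.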

Compute per token vs. per sequence

Inference cost per token is dominated by the block's matrix multiplications — not by the WKV recurrence itself, which is comparatively cheap:

Stage | Cost
Time-Mix projections (R, K, V) | O(d²) ≈ 3 × 640² = 1.23M ops
WKV state update & read | O(d) ≈ 640 ops
Channel-Mix projections | O(d²) ≈ 640 × 1,280 × 2 = 1.64M ops
Per block, per token | ≈ 2.9M ops
All 14 blocks, per token | ≈ 41M ops
Output head (640 → 8,192) | ≈ 5.2M ops
Total, per generated token | ≈ 46M ops
Full forward pass, T=256 (training) | ≈ 11.8 billion ops (46M × 256)

Because inference cost per step never depends on how many tokens came before, generating the 50,000th token costs the same ~46M operations as generating the 5th — the defining property that separates RWKV from quadratic-attention Transformers at long context lengths.
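
The per-token figures can be recomputed without intermediate rounding. This sketch counts one operation per multiply-accumulate and includes only the stages listed in the table; the variable names are illustrative. Because the table rounds at each step, the unrounded sums come out slightly lower.

```python
D, FFN, VOCAB, LAYERS = 640, 1280, 8192, 14

time_mix_ops = 3 * D * D              # R, K, V projections
wkv_ops = D                           # state update and read, one op per channel
channel_mix_ops = 2 * D * FFN         # up and down projections
per_block = time_mix_ops + wkv_ops + channel_mix_ops
head_ops = D * VOCAB                  # output head
per_token = LAYERS * per_block + head_ops

def ops_for_token(position):
    """Cost of generating the token at `position`; the position never enters the formula."""
    return per_token

print(f"per block: {per_block:,}  per token: {per_token:,}")
```

`ops_for_token` ignores its argument on purpose: the 5th and the 50,000th token cost the same.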

RWKV v4 vs. v7 — what changes

Aether's blocks follow the v4 formulation. RWKV v7 keeps the linear-time recurrence, with a matrix-valued per-head state in place of the numerator/denominator pair, but replaces v4's static per-channel decay and static projections with token-dependent, low-rank ("LoRA-style") modulation of the decay, gating, and value paths, trading roughly 10–15% more parameters for materially better long-context recall.

Aspect | RWKV v4 (this model) | RWKV v7
Time-decay | Single learned vector w ∈ ℝ^d, static per channel | Token-dependent decay via low-rank projection (w0, w1, w2)
Token mixing | Static learned blend μ·x_t + (1−μ)·x_{t-1}, per channel | Data-dependent blend x_t + μ⊙(x_{t-1}-x_t), learned per channel
Value path | Static value projection | Value residual mixing across layers (v0, v1, v2)
Gating | Single sigmoid receptance gate | Additional learned gate (g1, g2) on top of receptance
Extra params / layer | baseline | ~10–15% more, via rank-64 to rank-128 LoRA factors
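
To make the "static vs. token-dependent decay" row concrete, here is a generic low-rank sketch. It is not RWKV v7's exact formula: the tanh bottleneck, the double-exponential squashing, and the names `low_rank_decay`, `W1`, and `W2` are assumptions chosen only to show the idea of a base vector plus a rank-r correction computed from the current token.

```python
import numpy as np

def static_decay(w, T):
    """v4-style: the same per-channel decay vector at every position."""
    return np.tile(w, (T, 1))

def low_rank_decay(x, w0, W1, W2):
    """Token-dependent decay in (0, 1): base vector w0 plus a rank-r correction from x.

    x: (T, d); w0: (d,); W1: (d, r); W2: (r, d)
    """
    logits = w0 + np.tanh(x @ W1) @ W2       # (T, d), differs per token
    return np.exp(-np.exp(logits))           # squashed into (0, 1)

def low_rank_params(d, r):
    """Parameter count: base vector plus the two low-rank factors."""
    return d + 2 * d * r

rng = np.random.default_rng(0)
T, d, r = 5, 640, 64
x = rng.normal(size=(T, d))
decay = low_rank_decay(x, np.zeros(d),
                       rng.normal(scale=0.01, size=(d, r)),
                       rng.normal(scale=0.01, size=(r, d)))
```

At d = 640 and r = 64, one such factor pair adds 81,920 weights on top of the 640-entry base vector, which gives a feel for where the extra parameters in the last table row come from.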