Model Card · Build Spec

Aether

A 51.2M-parameter language model designed by Konpep and built on the RWKV recurrent architecture, drawing on the linear-attention formulation introduced in RWKV v4 and later refinements from RWKV v7. It is trained from scratch on a roughly 500 MB synthetic English and Greek conversational corpus. Linear-time recurrence, no attention matrix, runs on CPU.

By Konpep · RWKV v4 / v7-derived architecture

51,173,120 params · ~195 MB fp32 · 14 layers · 640 hidden · head 64 · vocab 8,192

01 — Overview

What this build is

Aether is configured and trained by train.py in this repository using the RWKV v4 architecture defined in model.py. It is an autoregressive language model with no self-attention and no KV cache: every layer carries a fixed-size recurrent state forward through the sequence, so memory and compute per token stay constant regardless of context length.

The model is trained on aether_dataset_mixed.jsonl, a synthetic corpus of roughly 500MB mixing English and Greek conversational text, tokenized with an 8,192-vocabulary byte-level BPE tokenizer built on raw UTF-8 bytes — meaning any language or symbol is representable without an unknown-token fallback.

51.17M

Total Parameters

14

RWKV Blocks

640

Hidden Size

64

Head Size

8,192

Vocabulary

~195 MB

Weights (fp32)

256

Max Sequence Length

O(T)

Inference Complexity

02 — Architecture

RWKV v4 block stack

Each of the 14 blocks is a pre-norm residual unit containing two sub-layers: Time-Mix (the recurrent attention replacement) and Channel-Mix (a gated feed-forward layer). Token embeddings are tied to the output projection head.

Input

Token IDs → Embedding (8,192 × 640)

Block × 14

LayerNorm → Time-Mix → Residual Add → LayerNorm → Channel-Mix → Residual Add

Time-Mix: token-shift blend, per-channel time-decay, WKV linear-attention recurrence, sigmoid receptance gate.
Channel-Mix: squared-ReLU key projection (gated MLP), sigmoid receptance gate, zero-initialized output for stable early training.

Output

Final LayerNorm → Linear Head (tied weights) → Logits (8,192)

Time-Mix recurrence

The WKV operation replaces softmax attention with a numerator/denominator running state, updated once per token at O(1) cost — no quadratic attention matrix is ever formed.

Token-shift blend:

$$x'_t = \mu \odot x_t + (1 - \mu) \odot x_{t-1}$$

Numerator state (weighted running total):

$$\mathrm{num}_t = w \cdot \mathrm{num}_{t-1} + e^{k_t} \cdot v_t$$

Denominator state (weighted count):

$$\mathrm{den}_t = w \cdot \mathrm{den}_{t-1} + e^{k_t}$$

WKV read-out (weighted average of all past values):

$$\mathrm{wkv}_t = \frac{e^{u + k_t} \cdot v_t + \mathrm{num}_{t-1}}{e^{u + k_t} + \mathrm{den}_{t-1}}$$

Receptance gate applied to the output:

$$y_t = \sigma(r_t) \cdot \mathrm{wkv}_t$$

w — learned per-channel decay · u — learned per-channel bonus for the current token · k_t, v_t, r_t — key, value, receptance projections of token t · σ — sigmoid
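The recurrence above can be traced for a single channel in a few lines of Python. This is a minimal sketch, not the code in model.py: the function name `time_mix_step` and its scalar, one-channel signature are assumptions made for illustration. It leaves out the token-shift blend and the running-maximum trick that practical implementations typically use to keep the exponentials from overflowing.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def time_mix_step(k, v, r, w, u, num_prev, den_prev):
    """Advance one channel of the WKV recurrence by one token.

    k, v, r: key, value, receptance for the current token (scalars here).
    w, u: decay multiplier and current-token bonus for this channel.
    num_prev, den_prev: running state left by the previous token.
    """
    bonus = math.exp(u + k)
    # The read-out uses the state *before* the current token is folded in.
    wkv = (bonus * v + num_prev) / (bonus + den_prev)
    y = sigmoid(r) * wkv
    # Fold the current token into the state for the next step.
    num = w * num_prev + math.exp(k) * v
    den = w * den_prev + math.exp(k)
    return y, num, den


# Hypothetical inputs: three tokens through one channel, from an empty state.
num, den = 0.0, 0.0
for k, v, r in [(0.2, 1.0, 0.0), (-0.5, 3.0, 1.0), (0.1, 2.0, -1.0)]:
    y, num, den = time_mix_step(k, v, r, w=0.9, u=0.3, num_prev=num, den_prev=den)
```

On the first token the state is empty, so the read-out reduces to `v` and the output is `sigmoid(r) * v`. From the second token on, `wkv` is a weighted average of the current value and the decayed history.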

03 — Training Configuration

Hyperparameters (train.py)

Parameter	Value
Hidden size	640
Layers	14
Head size	64
Vocabulary	8,192 (byte-level BPE)
Dropout	0.1
Max sequence length	256 tokens
Micro-batch size	36
Gradient accumulation	4 steps
Effective batch size	144
Epochs	3
Learning rate	3e-4, cosine decay to 1e-5
Warmup steps	100
Optimizer	AdamW (decoupled weight decay, 0.1)
Gradient clipping	1.0 (global norm)
Loss function	Cross-entropy, ignore_index = pad (0)
Precision	AMP fp16 on CUDA, fp32 on CPU
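The learning-rate row can be read as the schedule sketched below. The table gives the peak (3e-4), the floor (1e-5), and the warmup length (100 steps), but not the warmup shape or the total step count. Linear warmup, the `total_steps` argument, and the function name `lr_at` are therefore assumptions, not a description of train.py.

```python
import math


def lr_at(step, total_steps, peak=3e-4, floor=1e-5, warmup=100):
    """Learning rate at a 0-indexed optimizer step."""
    if step < warmup:
        # Assumed linear warmup: reaches `peak` on the last warmup step.
        return peak * (step + 1) / warmup
    # Cosine decay from `peak` down to `floor` over the remaining steps.
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


# Effective batch size from the table: micro-batch 36 x 4 accumulation steps.
effective_batch = 36 * 4
```

Each optimizer step consumes `effective_batch` sequences, so the schedule advances once per four micro-batches rather than once per forward pass.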

04 — Dataset & Tokenizer

Corpus: Aether Dataset Generator v6

Training source: aether_dataset_mixed.jsonl, ~505MB of synthetic, bilingual English/Greek text produced by a rule-based generator built on a curated knowledge base rather than fill-in-the-blank templates. 40 subjects (20 English, 20 Greek) each carry 10–15 hand-grounded facts and 2–4 real-world applications, which the generator recombines into flowing paragraphs, Q&A pairs, and multi-turn conversations.

Each line is a JSON object with a single text field; sequences are clipped to 256 tokens with <bos>/<eos> markers and right-padded for batching.
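A minimal sketch of that clipping and padding step, assuming the special-token IDs listed in the tokenizer table (<pad>=0, <bos>=1, <eos>=2). The helper name `pack_sequence` is hypothetical. This card does not say whether train.py keeps <eos> when a sequence is truncated; the sketch assumes it does.

```python
PAD, BOS, EOS = 0, 1, 2
MAX_LEN = 256


def pack_sequence(token_ids, max_len=MAX_LEN):
    """Wrap token IDs in <bos>/<eos>, clip to max_len, and right-pad with <pad>."""
    body = token_ids[: max_len - 2]  # leave room for both markers
    seq = [BOS] + body + [EOS]
    return seq + [PAD] * (max_len - len(seq))


short = pack_sequence([10, 11, 12])
# short begins [1, 10, 11, 12, 2] and is followed by 251 pad tokens.
```

Because the loss uses `ignore_index = 0`, the padded positions contribute nothing to the gradient.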

Dataset composition

695,447 total entries, interleaved 2 raw : 3 Q&A, no shuffle. ~505MB total.

Raw paragraphs 40%

Q&A pairs 60%

Property	Value
Subjects	40 (20 English + 20 Greek)
Facts per subject	10–15
Total unique facts	~500–600
Applications per subject	2–4
Total entries	695,447
Raw entries	278,178 (40%)
Q&A entries	417,269 (60%)
File size	~505 MB
Interleave order	2 raw : 3 Q&A (no shuffle)
Language split	~50% English / ~50% Greek
Facts sampled per paragraph	3–7, shuffled
Connector / closer variety	~30 connectors, ~20 closers

Diversity comes from combinatorics rather than volume of source text: each paragraph draws a random 3–7 facts from its subject's pool, shuffles their order, and joins them with one of ~30 connectors ("Moreover", "In addition", "What's more"), closing 35–55% of the time with one of ~20 natural closers. A capitalization-aware joiner keeps acronyms (DNA, CRISPR, GPS) and proper nouns (Einstein, Athens, Renaissance) correctly cased regardless of where they land in the sentence.
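The recombination described above might look roughly like the sketch below. It is not the generator's code: the function names, the sample facts, the single closer, and the fixed `closer_prob` of 0.45 (one value inside the stated 35–55% range) are all invented for illustration. Only the three connectors and the cased-word examples come from this card.

```python
import random

# Stand-ins: the card describes ~30 connectors and ~20 closers. Only three
# connectors are quoted there, and the closer below is made up.
CONNECTORS = ["Moreover", "In addition", "What's more"]
CLOSERS = ["That is the short version."]
KEEP_CASED = {"DNA", "CRISPR", "GPS", "Einstein", "Athens", "Renaissance"}


def lower_first(sentence):
    """Lower-case the leading letter unless the first word must stay cased."""
    first_word = sentence.split(" ", 1)[0].strip(",.;:")
    if first_word in KEEP_CASED:
        return sentence
    return sentence[:1].lower() + sentence[1:]


def build_paragraph(facts, rng, closer_prob=0.45):
    """Sample 3-7 facts, shuffle them, and join them with connectors."""
    picked = rng.sample(facts, rng.randint(3, min(7, len(facts))))
    rng.shuffle(picked)
    parts = [picked[0]]
    for fact in picked[1:]:
        parts.append(f"{rng.choice(CONNECTORS)}, {lower_first(fact)}")
    if rng.random() < closer_prob:
        parts.append(rng.choice(CLOSERS))
    return " ".join(parts)


# Hypothetical fact pool for one subject.
facts = [
    "DNA stores genetic information.",
    "The double helix was described in 1953.",
    "CRISPR can edit specific genes.",
    "Cells copy DNA before dividing.",
]
print(build_paragraph(facts, random.Random(0)))
```

Passing a seeded `random.Random` keeps a run reproducible while still giving a different subset, order, and connector choice for each paragraph.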

Tokenizer

A byte-level BPE tokenizer (tokenizer.py) built directly on UTF-8 bytes as the base vocabulary, with learned merges on top. Because the base alphabet is raw bytes rather than a fixed character set, the tokenizer covers any Unicode language — including the mixed English/Greek corpus — without an out-of-vocabulary fallback.

Property	Value
Base alphabet	256 raw bytes
Learned BPE merges	7,936
Total vocabulary	8,192
Special tokens	<pad>=0, <bos>=1, <eos>=2, <trn>=3
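The two layers of the tokenizer, a raw-byte base plus learned merges, can be illustrated with a single merge step. This is a sketch of the general byte-level BPE idea, not the contents of tokenizer.py: the helper names are hypothetical, and the ID 256 given to the merged pair is illustrative only, since the real ID layout (including where the four special tokens sit) is not specified here.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the adjacent ID pair that occurs most often."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


text = "γεια σου, hello"
base = list(text.encode("utf-8"))  # every symbol becomes bytes in 0..255
assert bytes(base).decode("utf-8") == text  # lossless round trip, no <unk>

pair = most_frequent_pair(base)
merged = merge_pair(base, pair, new_id=256)  # one learned merge, applied once
```

Training repeats the count-and-merge step until the merge budget is used up. Greek letters start as two bytes each, so merges are what bring Greek text down to a token count comparable to English.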

05 — How RWKV Works

An RNN that trains like a Transformer

RWKV — Receptance, Weight, Key, Value — was designed to keep the parallel trainability of Transformer attention while replacing its quadratic O(T²) cost with a linear O(T) recurrence. Instead of every token attending to every previous token through a softmax score matrix, RWKV carries a small, fixed-size running state forward through the sequence — much like an RNN's hidden state, but built from the same R/K/V vocabulary as attention.

The practical consequence: generating token 10,000 costs exactly as much as generating token 10 — no growing cache, no growing compute. This is what lets a 51M-parameter model run inference on a CPU at constant speed.

Transformer Attention

Each token computes a score against every previous token, then softmax-weights all their values together.

O(T²) compute · O(T) memory cache

RWKV (this model)

Each token updates one running numerator/denominator state, carried from the previous step — no score matrix is ever formed.

O(T) compute · O(1) memory state

The four letters

Receptance

A learned sigmoid gate that decides how much of the newly mixed information is allowed to pass through to the output — the model's "how much do I accept this" valve.

Weight (decay)

A per-channel, learned decay rate. Each of the 640 hidden channels forgets its running state at its own trained pace — some channels hold information longer than others.

Key

Plays the same role as the key vector in standard attention — it determines how strongly the current token's content should weigh into the running state.

Value

The content actually being carried forward and blended — the payload that the key vector weights and the receptance gate ultimately releases.

Step by step, inside one block

1. Token-shift. Each channel blends the current token's hidden value with the previous token's: a cheap, built-in sense of "what just happened" before any learned projections are applied.

2. Project to R, K, V. Three independent linear layers turn the shifted hidden state into a receptance, key, and value vector, exactly as a Transformer would, just without forming a score matrix.

3. Update the WKV state. The key/value pair is folded into two running totals, a numerator and a denominator, each decayed by the learned per-channel weight w. Together they are the entire "memory" of every token seen so far, compressed into two fixed-size vectors per layer (one numerator and one denominator entry per channel).

4. Read the state, gate it. The numerator divided by the denominator gives a weighted average of all past values: this is the WKV output. The receptance gate (sigmoid of R) then decides how much of it actually passes through.

5. Channel-Mix. A second sub-layer, a small gated feed-forward network with a squared-ReLU nonlinearity, mixes information across the 640 channels themselves, refining the representation before it moves to the next block.
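The Channel-Mix sub-layer in the last step can be sketched in plain Python with toy sizes. The function names and the tiny weight matrices are hypothetical. LayerNorm, the residual add, and any token-shift on the Channel-Mix input are left out, so this shows only the gated squared-ReLU core.

```python
import math


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def channel_mix(x, w_key, w_value, w_receptance):
    """Gated feed-forward: sigmoid(W_r x) * (W_v relu(W_k x)^2)."""
    hidden = [max(0.0, h) ** 2 for h in matvec(w_key, x)]  # squared ReLU
    gate = [1.0 / (1.0 + math.exp(-g)) for g in matvec(w_receptance, x)]
    return [g * o for g, o in zip(gate, matvec(w_value, hidden))]


# Toy sizes (d=2, hidden=4) standing in for the real 640 and 1,280.
x = [1.0, -2.0]
w_key = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]  # 4 x 2, projects up
w_receptance = [[0.0, 0.0], [0.0, 0.0]]                    # 2 x 2, gate = 0.5
w_value = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]      # 2 x 4, projects down
y = channel_mix(x, w_key, w_value, w_receptance)

# With a zero-initialized value projection the sub-layer outputs zeros, so the
# block starts out as an identity through its residual connection.
y_init = channel_mix(x, w_key, [[0.0] * 4, [0.0] * 4], w_receptance)
```

Here only the first hidden unit survives the ReLU, so `y` is `[0.5, 0.0]`: the gate halves the one active path and the second output channel sees nothing.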

Why it stays cheap forever

Because steps 3–4 only ever touch the current token plus a constant-size running state, RWKV can be unrolled two ways: in parallel across a whole training sequence (fast on GPU, just like a Transformer), or one token at a time during inference, carrying forward only (numerator, denominator) per channel per layer — never re-reading the past. Across 14 layers and 640 channels, that state is small enough to update on a CPU in real time, which is what makes a model with no GPU dependency and no attention cache possible at this size.
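To illustrate the two unrollings, the sketch below computes the same WKV outputs for one channel twice: once token by token from a carried (numerator, denominator) state, and once directly from the full history at each position, which is the form that depends on no carried state and can therefore be evaluated for all positions independently. Function names and sample inputs are hypothetical, and the direct version is written for clarity rather than speed.

```python
import math


def wkv_recurrent(ks, vs, w, u):
    """Inference-style: carry (num, den) forward, O(1) work per token."""
    num = den = 0.0
    out = []
    for k, v in zip(ks, vs):
        bonus = math.exp(u + k)
        out.append((bonus * v + num) / (bonus + den))
        num = w * num + math.exp(k) * v
        den = w * den + math.exp(k)
    return out


def wkv_direct(ks, vs, w, u):
    """Training-style view: each position is a decayed sum over its own past."""
    out = []
    for t in range(len(ks)):
        bonus = math.exp(u + ks[t])
        # Token i contributes with decay w ** (t - 1 - i) by the time t reads it.
        num = sum(w ** (t - 1 - i) * math.exp(ks[i]) * vs[i] for i in range(t))
        den = sum(w ** (t - 1 - i) * math.exp(ks[i]) for i in range(t))
        out.append((bonus * vs[t] + num) / (bonus + den))
    return out


ks = [0.2, -0.5, 0.1, 0.4]
vs = [1.0, 3.0, 2.0, -1.0]
a = wkv_recurrent(ks, vs, w=0.9, u=0.3)
b = wkv_direct(ks, vs, w=0.9, u=0.3)
```

The two lists agree to floating-point precision, which is the property that lets the same weights be trained over whole sequences and then served one token at a time.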

06 — Parameters & Cost, Visualized

Where the 51.2M parameters live

Breaking the model down by sub-layer shows where capacity is spent. Channel-Mix carries the most weight per block (it projects up to a 1,280-wide hidden layer and back), Time-Mix is leaner, and the tied embedding/output table accounts for roughly a tenth of the model.

Channel-Mix (×14) 56.1%

Time-Mix (×14) 33.7%

Embedding / Head, tied 10.2%

Per-block parameter count

Per block (14 in total), Channel-Mix holds 2,049,280 parameters against Time-Mix's 1,231,360: its 640→1,280→640 projection plus a 640×640 receptance matrix outweighs Time-Mix's three 640×640 matrices.

Inference cost: RWKV vs. self-attention

The structural payoff of the recurrence shows up directly in how compute scales with context length. A standard Transformer's attention cost grows with the square of sequence length; RWKV's stays flat per token.

Self-attention · O(T²)

RWKV WKV state · O(T)

Bar heights are illustrative of relative growth rate, not measured wall-clock — the point is the curve shape: quadratic vs. linear.
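The curve shapes can be reproduced with a simple operation count. This is a back-of-the-envelope sketch with hypothetical function names: it counts only the token-mixing work (attention scores versus WKV state updates) and ignores the linear projections, which cost the same per token in both designs.

```python
def attention_score_ops(T, d=640):
    """Dot-product work for T tokens when token t scores against t keys of width d."""
    return sum(t * d for t in range(1, T + 1))


def wkv_state_ops(T, d=640):
    """State updates for T tokens: one pass over d channels per token."""
    return T * d


for T in (256, 512, 1024):
    print(T, attention_score_ops(T), wkv_state_ops(T))
```

Doubling `T` roughly quadruples the first count and exactly doubles the second, which is the quadratic-versus-linear contrast the chart is meant to convey.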

07 — Exact Parameter Ledger

Every weight, accounted for

The full parameter count follows directly from the config above: vocabulary 8,192, hidden 640, 14 layers, FFN hidden 1,280 (2 × 640). Tying the embedding table to the output head, which would otherwise be the two largest single tensors in the model, saves 5.24M parameters by storing the 8,192 × 640 matrix once instead of twice.

Component	Shape	Parameters
Token embedding	8,192 × 640	5,242,880
Per block — Time-Mix
time_decay (w)	640	640
time_first (u)	640	640
receptance, key, value weights	3 × 640 × 640	1,228,800
LayerNorm (pre Time-Mix)	2 × 640	1,280
Time-Mix subtotal		1,231,360
Per block — Channel-Mix
key projection (up)	640 × 1,280	819,200
value projection (down)	1,280 × 640	819,200
receptance weight	640 × 640	409,600
LayerNorm (pre Channel-Mix)	2 × 640	1,280
Channel-Mix subtotal		2,049,280
Per block, total		3,280,640
× 14 blocks		45,928,960
Output
Final LayerNorm	2 × 640	1,280
Output head (tied to embedding)	shared	+0
TOTAL		51,173,120
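The ledger can be re-derived from the four config values in a few lines. The variable names are illustrative. The byte counts are reported in MiB (2^20 bytes), which is the reading under which the 195.2 and 97.6 figures quoted in this card hold.

```python
VOCAB, D, LAYERS, FFN = 8192, 640, 14, 1280

embedding = VOCAB * D                            # shared with the output head
time_mix = D + D + 3 * D * D + 2 * D             # w, u, R/K/V weights, LayerNorm
channel_mix = D * FFN + FFN * D + D * D + 2 * D  # key, value, receptance, LayerNorm
final_norm = 2 * D

total = embedding + LAYERS * (time_mix + channel_mix) + final_norm
print(f"{total:,} parameters")                   # 51,173,120 parameters
print(f"{total * 4 / 2**20:.1f} MiB in fp32")    # 195.2 MiB in fp32
print(f"{total * 2 / 2**20:.1f} MiB in fp16")    # 97.6 MiB in fp16
```

Without weight tying the total would include a second `VOCAB * D` term, adding 5,242,880 parameters.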

195.2 MB

Weights, FP32

97.6 MB

Weights, FP16

5.24M saved

Via Weight Tying

1/√640

Embedding Init Scale

08 — State, FLOPs & Scaling Behavior

What the model actually carries between tokens

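At inference, Aether never re-reads past tokens. Each of the 14 layers keeps two WKV vectors, a numerator and a denominator, each 640 wide, and these summarize everything generated so far. (The token-shift blend also needs the previous token's input to the layer, one further 640-wide vector per layer that the table below does not count.)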

Quantity	Size
State per layer (num + den)	2 × 640 = 1,280 floats
State, full model (× 14 layers)	17,920 floats
State memory, FP32	≈ 70 KB
State memory, FP16	≈ 35 KB

That ~70KB is constant — it does not grow whether the model has generated 10 tokens or 100,000. A Transformer's KV cache, by contrast, grows linearly with every generated token.
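The contrast can be made concrete with a short calculation. The RWKV figure follows the table above. The Transformer figure is for a hypothetical model of the same width and depth that caches one full-width key and one full-width value per token per layer, which is an assumption for comparison and not a measurement of any particular model.

```python
D, LAYERS, BYTES_FP32 = 640, 14, 4


def rwkv_state_bytes():
    """Numerator + denominator per channel per layer; independent of length."""
    return 2 * D * LAYERS * BYTES_FP32


def kv_cache_bytes(tokens):
    """Keys + values per token per layer for the hypothetical Transformer."""
    return 2 * D * LAYERS * BYTES_FP32 * tokens


for tokens in (10, 1_000, 100_000):
    print(tokens, rwkv_state_bytes(), kv_cache_bytes(tokens))
```

Under these assumptions the cache already matches the entire RWKV state after a single token, and it keeps growing by the same amount with every token after that.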

Compute per token vs. per sequence

Inference cost per token is dominated by the block's matrix multiplications — not by the WKV recurrence itself, which is comparatively cheap:

Stage	Cost
Time-Mix projections (R, K, V)	O(d²) ≈ 3 × 640² = 1.23M ops
WKV state update & read	O(d) ≈ 640 ops
Channel-Mix projections (key, value, receptance)	O(d²) ≈ 2 × 640 × 1,280 + 640² = 2.05M ops
Per block, per token	≈ 3.28M ops
All 14 blocks, per token	≈ 45.9M ops
Output head (640 → 8,192)	≈ 5.2M ops
Total, per generated token	≈ 51.1M ops
Full forward pass, T=256 (training)	≈ 13.1 billion multiply-accumulates (≈ 26.2 billion FLOPs)

Because inference cost per step never depends on how many tokens came before, generating the 50,000th token costs the same ~51M operations as generating the 5th. This is the defining property that separates RWKV from quadratic-attention Transformers at long context lengths.

RWKV v4 vs. v7 — what changes

Aether's blocks follow the v4 formulation. RWKV v7 keeps the linear-time recurrent design but changes the state update: it uses a matrix-valued state per head in place of v4's per-channel numerator/denominator pair, and it replaces v4's static per-channel decay and static projections with token-dependent, low-rank ("LoRA-style") modulation of the decay, gating, and value paths. The trade is roughly 10–15% more parameters for materially better long-context recall.

	RWKV v4 (this model)	RWKV v7
Time-decay	Single learned vector `w ∈ ℝ^d`, static per channel	Token-dependent decay via low-rank projection (`w0, w1, w2`)
Token mixing	Token-shift blend with a static learned `μ` per channel	Data-dependent blend `x_t + μ⊙(x_{t-1}-x_t)`, learned per channel
Value path	Static value projection	Value residual mixing across layers (`v0, v1, v2`)
Gating	Single sigmoid receptance gate	Additional learned gate (`g1, g2`) on top of receptance
Extra params / layer	— (baseline)	~10–15% more, via rank-64 to rank-128 LoRA factors