Architecture
CPU vs GPU computation map. Every component — tensors, autograd, model, tokenizers, training loop — is hand-written TypeScript with zero ML dependencies.
Legend: CPU — TypeScript · GPU — Vulkan / SPIR-V · Backend Dispatch · Data Transfer
CPU-Only Operations
Always run on CPU regardless of tensor size. Orchestration, I/O, and control flow.
Data Loading
Tokenization (BPE / char / word), batch sampling, text I/O.
DataLoader.nextBatch() picks random offsets → [B, T] token windows.
RNG & Seeding
Deterministic seeded RNG (seed=42). Weight init N(0, 0.02), dropout masks, data sampling order.
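The sampling path above can be sketched in plain TypeScript. This is illustrative, not Alpha's actual code: makeRng is a mulberry32-style generator standing in for the project's seeded RNG, and the nextBatch signature is an assumption for the sketch.

```typescript
// Deterministic seeded RNG (mulberry32-style sketch; Alpha's generator may differ).
// seed=42 reproduces the same sample order on every run.
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// nextBatch sketch: pick B random offsets into the token stream, slice [B, T]
// input windows plus the shifted next-token targets.
function nextBatch(tokens: number[], B: number, T: number, rng: () => number) {
  const inputs: number[][] = [];
  const targets: number[][] = [];
  for (let b = 0; b < B; b++) {
    const off = Math.floor(rng() * (tokens.length - T - 1));
    inputs.push(tokens.slice(off, off + T));
    targets.push(tokens.slice(off + 1, off + T + 1)); // shifted by one token
  }
  return { inputs, targets };
}
```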
LR Schedule
Warmup: linear ramp over ~10% of steps. Cosine decay: lr × 0.5·(1 + cos(π·decay)).
Checkpoint I/O
Binary v2: ALPH magic + JSON header + packed Float32. Saves params, optimizer m/v buffers, RNG state.
Metrics & Logging
JSONL step metrics (loss, lr, grad norm, tokens/sec). Remote sync to alpha.omegaai.dev.
Backend Dispatch
≥ 1,000,000 elements → GPU | < 1M → CPU
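The dispatch rule reduces to an element count against a fixed threshold. A minimal sketch, with a hypothetical pickDevice helper (the threshold value is from the rule above; the function name is not Alpha's):

```typescript
// Size-based backend dispatch (sketch). Tensors at or above 1M elements go to
// the GPU; everything smaller stays on the CPU.
const GPU_THRESHOLD = 1_000_000;

type Device = "cpu" | "gpu";

function pickDevice(shape: number[], gpuAvailable: boolean): Device {
  const elements = shape.reduce((a, d) => a * d, 1);
  return gpuAvailable && elements >= GPU_THRESHOLD ? "gpu" : "cpu";
}
```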
CPU path: TypeScript | GPU path: SPIR-V
CPU → GPU: uploadBuffer — staging → device-local
▹ Forward Pass
CPU — TypeScript
Token Embedding [B,T] → [B,T,nEmbd]
embedding(wte, tokens) — gather rows from the weight table. Tape records backward for gradient scatter.
Position Embedding [B,T] → [B,T,nEmbd]
embedding(wpe, 0..T-1) — positional encoding lookup.
Add Embeddings [B,T,nEmbd]
tokEmb + posEmb — element-wise sum. Small models stay on CPU.
Causal Mask [T,T]
causalMask(T) — lower triangle = 0, upper = −∞. Always CPU (small, created once).
Reshape / Transpose
View operations: reshape to multi-head [B, nHead, T, headDim], transpose for attention, concat heads. No computation — pointer reinterpretation.
Residual Connections
x = x + attnOut, x = x + mlpOut — identity shortcuts stabilizing deep nets.
Tape Recording
Every op appends to global Tape: output Variable, input Variables, backward closure. Enables reverse-mode autodiff.
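The tape mechanism can be sketched as follows. Variable, mul, and backward here are simplified stand-ins for Alpha's richer types; the point is the per-op backward closure and the += accumulation that handles multi-use variables like residuals.

```typescript
// Minimal tape-based reverse-mode autodiff (sketch, not Alpha's actual API).
class Variable {
  grad: number[];
  constructor(public data: number[]) {
    this.grad = data.map(() => 0);
  }
}

type TapeEntry = { backward: () => void };
const tape: TapeEntry[] = [];

// Each op computes its output AND appends a backward closure to the tape.
function mul(a: Variable, b: Variable): Variable {
  const out = new Variable(a.data.map((x, i) => x * b.data[i]));
  tape.push({
    backward: () => {
      for (let i = 0; i < out.grad.length; i++) {
        a.grad[i] += out.grad[i] * b.data[i]; // d(a·b)/da = b, accumulated with +=
        b.grad[i] += out.grad[i] * a.data[i]; // d(a·b)/db = a
      }
    },
  });
  return out;
}

// Seed dL/dL = 1, then walk the tape newest → oldest.
function backward(loss: Variable): void {
  loss.grad.fill(1);
  for (let i = tape.length - 1; i >= 0; i--) tape[i].backward();
}
```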
GPU — Vulkan Compute
Transformer Block ×N layers
— Multi-Head Attention —
Q, K, V Projections [B·T,nEmbd]²
3× tiled matmul: x @ Wq, x @ Wk, x @ Wv. 16×16 shared memory tiles.
Attention Scores [B,nHead,T,T]
Q @ Kᵀ / √headDim — batched matmul + fused scale. Largest tensor in the forward pass.
Softmax [B·nHead,T,T]
Fused row-wise kernel: max → subtract → exp → sum → normalize. One workgroup per row, tree reduction in shared memory.
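The five fused steps have a direct CPU reference, useful for validating the kernel. A scalar sketch, not the shader itself:

```typescript
// CPU reference for the fused row-wise softmax: subtracting the row max before
// exp keeps the kernel numerically stable for large logits.
function softmaxRow(row: number[]): number[] {
  const max = Math.max(...row);                    // 1. row max
  const exps = row.map((x) => Math.exp(x - max));  // 2–3. subtract, exp
  const sum = exps.reduce((a, x) => a + x, 0);     // 4. sum
  return exps.map((x) => x / sum);                 // 5. normalize
}
```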
Value Weighting [B,nHead,T,headDim]
scores @ V — tiled matmul combining attention weights with value vectors.
Output Projection [B·T,nEmbd]
concat(heads) @ Wo — final attention matmul.
— Feed-Forward MLP —
LayerNorm ×2 per layer
Fused kernel: mean → variance → normalize → γx+β. One workgroup per token. Pre-norm architecture.
MLP FC1 [B·T,nEmbd] → [B·T,4·nEmbd]
Tiled matmul expanding to 4× hidden dimension.
GELU Activation [B·T,4·nEmbd]
Fused tanh approximation: 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³))). Uses vec4 kernel when aligned.
MLP FC2 [B·T,4·nEmbd] → [B·T,nEmbd]
Tiled matmul projecting back to model dimension.
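The GELU tanh approximation applied between FC1 and FC2 can be written directly from its formula, as a scalar reference sketch of what the fused kernel computes per element:

```typescript
// GELU tanh approximation: 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³))).
const SQRT_2_OVER_PI = Math.sqrt(2 / Math.PI);

function gelu(x: number): number {
  return 0.5 * x * (1 + Math.tanh(SQRT_2_OVER_PI * (x + 0.044715 * x ** 3)));
}
```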
Final LayerNorm [B,T,nEmbd]
Last normalization before vocabulary projection.
LM Head Projection [B,T,vocabSize]
Matmul projecting to vocabulary logits.
Cross-Entropy Loss → scalar
Log-softmax + NLL: -log(softmax(logits)[target]), averaged over B×T tokens.
Per transformer layer: 8 matmuls + 2 layernorms + 1 softmax + 1 GELU + 2 residual adds + mask fill.
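The loss reduces to log-sum-exp minus the target logit per token. A stable CPU reference sketch (not the GPU kernel):

```typescript
// Cross-entropy over a [tokens, vocab] logit matrix: per row,
// -log(softmax(logits)[target]) = logSumExp(logits) - logits[target],
// averaged over all tokens. The max shift keeps exp from overflowing.
function crossEntropy(logits: number[][], targets: number[]): number {
  let total = 0;
  for (let i = 0; i < logits.length; i++) {
    const row = logits[i];
    const max = Math.max(...row);
    const logSumExp =
      max + Math.log(row.reduce((a, x) => a + Math.exp(x - max), 0));
    total += logSumExp - row[targets[i]];
  }
  return total / logits.length; // mean over B×T tokens
}
```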
GPU → CPU: readBuffer — loss scalar + activations for backward
◃ Backward Pass
CPU — Tape Walker
Initialize Loss Gradient
loss.grad = ones(shape) — seed the backward pass with dL/dL = 1.
Reverse Tape Walk
Walk the recorded tape from newest → oldest entry. Each entry calls its backward closure: entry.backward(grad, backend). Gradients accumulate with += for multi-use variables (residuals).
Broadcasting Undo
reduceBroadcast(grad, originalShape) — sums over broadcast dims. E.g., [1, 512] broadcast to [64, 512] → sum over batch.
Gradient Norm & Clipping
Collect all param gradients → compute global L2 norm. If norm > gradClip: scale by gradClip / norm.
GPU — Gradient Kernels
CrossEntropy Backward
(softmax(logits) − oneHot(targets)) / N. Reuses forward softmax values.
MatMul Backward ×8 per layer
dA = G @ Bᵀ, dB = Aᵀ @ G. Each forward matmul produces 2 backward matmuls. Dominates backward compute.
LayerNorm Backward
Full analytical gradient: normalized input, scale/bias updates. Fused kernel matching forward structure.
Activation Backward
GELU: derivative of the tanh approximation. Softmax: s × (g − Σ(g·s)).
Element-wise Backward
add / sub / mul / div / exp / log / sqrt — each has trivial derivative. Uses same vec4 kernel variants as forward.
↻ Optimizer — AdamW
All optimizer operations run on CPU. In-place parameter updates with decoupled weight decay.
Momentum
m = β₁·m + (1−β₁)·g
Exponential moving average of gradients. Per-parameter buffer.
Variance
v = β₂·v + (1−β₂)·g²
EMA of squared gradients (RMSprop component).
Bias Correction
m̂ = m/(1−β₁ᵗ), v̂ = v/(1−β₂ᵗ)
Corrects for zero-init bias.
Param Update
p −= lr·(m̂/(√v̂ + ε) + wd·p)
Decoupled weight decay applied directly to params.
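Taken together, the four updates above amount to one per-parameter loop. A sketch; the hyperparameter defaults here are typical AdamW values, not confirmed from Alpha:

```typescript
// One AdamW step over a flat parameter array: momentum/variance EMAs, bias
// correction by step count t (1-based), and decoupled weight decay applied
// directly to the parameter rather than folded into the gradient.
function adamwStep(
  p: number[], g: number[], m: number[], v: number[], t: number,
  lr = 1e-3, beta1 = 0.9, beta2 = 0.999, eps = 1e-8, wd = 0.01,
): void {
  for (let i = 0; i < p.length; i++) {
    m[i] = beta1 * m[i] + (1 - beta1) * g[i];        // momentum EMA
    v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i]; // variance EMA
    const mHat = m[i] / (1 - beta1 ** t);            // bias correction
    const vHat = v[i] / (1 - beta2 ** t);
    p[i] -= lr * (mHat / (Math.sqrt(vHat) + eps) + wd * p[i]); // decoupled decay
  }
}
```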
Next training step
GPU Kernel Inventory — SPIR-V Compute Shaders
25+ hand-written compute shaders generated by the TypeScript SPIR-V assembler.
Element-wise
add sub mul div neg exp log sqrt scale
+ vec4 variants (4× throughput via 128-bit loads)
Activations
relu gelu silu
+ vec4 variants. GELU uses fused tanh approx.
Reductions
sum_reduce max_reduce
Parallel tree reduction in shared memory. log₂(WG_SIZE) barriers.
Fused Softmax
Row-wise: max → exp(x−max) → sum → normalize. One workgroup per row. No intermediate buffers.
Fused LayerNorm
mean → var → (x−μ)/√(σ²+ε) → γx+β. One workgroup per token.
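The fused LayerNorm sequence has a simple per-token CPU reference. A scalar sketch of what one workgroup computes for one feature vector (the eps default here is a common choice, not confirmed from Alpha):

```typescript
// CPU reference for fused LayerNorm over one token's features:
// mean → variance → (x−μ)/√(σ²+ε) → γx+β.
function layerNorm(x: number[], gamma: number[], beta: number[], eps = 1e-5): number[] {
  const mean = x.reduce((a, v) => a + v, 0) / x.length;
  const variance = x.reduce((a, v) => a + (v - mean) ** 2, 0) / x.length;
  const inv = 1 / Math.sqrt(variance + eps);
  return x.map((v, i) => gamma[i] * (v - mean) * inv + beta[i]);
}
```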
Tiled MatMul
16×16 shared memory tiles. Loop over K dimension. Barrier-synchronized cooperative loads.
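The blocking scheme has a CPU analogue: iterate the output in TILE×TILE blocks and loop over K in tiles. On the GPU the tiles live in shared memory with cooperative loads; on the CPU the same blocking improves cache locality. A sketch, not the shader:

```typescript
// Tiled matmul C[M,N] = A[M,K] @ B[K,N], row-major, 16-wide blocks.
// Partial products from each K-tile accumulate into C.
const TILE = 16;

function matmulTiled(A: Float32Array, B: Float32Array, M: number, K: number, N: number): Float32Array {
  const C = new Float32Array(M * N); // zero-initialized accumulator
  for (let i0 = 0; i0 < M; i0 += TILE) {
    for (let j0 = 0; j0 < N; j0 += TILE) {
      for (let k0 = 0; k0 < K; k0 += TILE) {
        const iMax = Math.min(i0 + TILE, M);
        const jMax = Math.min(j0 + TILE, N);
        const kMax = Math.min(k0 + TILE, K);
        for (let i = i0; i < iMax; i++) {
          for (let k = k0; k < kMax; k++) {
            const a = A[i * K + k]; // reused across the whole j-tile
            for (let j = j0; j < jMax; j++) {
              C[i * N + j] += a * B[k * N + j];
            }
          }
        }
      }
    }
  }
  return C;
}
```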
Fused Mul-Add
D[i] = A[i]×B[i] + C[i]
Single-pass 3-input kernel.
F16 Storage
add_f16 sub_f16 mul_f16 div_f16 neg_f16 exp_f16
Compute in f32, store as f16. 2× memory savings.
GPU Infrastructure — Vulkan Native Layer
Zero external dependencies. Custom Vulkan bridge, SPIR-V assembler, and memory management.
Vulkan Native Addon
C (~2000 LOC), N-API bridge. Dynamic Vulkan loading via dlopen — no SDK headers needed. Physical device selection + compute queue.
SPIR-V Assembler
Hand-written in TypeScript (~2500 LOC). Binary assembler — no glslc or glslangValidator. Types, decorations, control flow, GLSL.std.450 extended instructions.
Slab Memory Allocator
Bump-pointer allocation in 64MB slabs (up to 512MB per slab). Buffer pooling + recycling via FinalizationRegistry.
Timeline Semaphores
Vulkan 1.2 timeline semaphores for async dispatch. Record many ops → wait only when results needed. Pipelined execution.
Compute Graph Batching
Lazy evaluation — ops recorded to pending queue. Auto-flush at 64 ops. Single command buffer submit: N×100μs → 1×100μs + N×2μs.
Auto-Tuned Workgroups
Benchmarks {64, 128, 256, 512} on init via 256K element add. Optimal WG_SIZE cached for session. Vec4 kernels when elements % 4 == 0.
Alpha — 40 CPU ops · 25+ GPU kernels · 0 external ML dependencies · Pure TypeScript + Hand-written SPIR-V