Architecture
CPU vs GPU computation map. Every component — tensors, autograd, model, tokenizers, training loop — is hand-written TypeScript with zero ML dependencies.
Legend: CPU — TypeScript · GPU — Vulkan / SPIR-V · Backend Dispatch · Data Transfer
CPU-Only Operations
Always run on CPU regardless of tensor size. Orchestration, I/O, and control flow.
Data Loading
Tokenization (BPE / char / word), batch sampling, text I/O.
DataLoader.nextBatch() picks random offsets → [B, T] token windows.
RNG & Seeding
Deterministic seeded RNG (seed=42). Weight init N(0, 0.02), dropout masks, data sampling order.
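The sampling path above can be sketched in plain TypeScript. This is illustrative, not Alpha's actual code: makeRng is a mulberry32-style generator standing in for the project's seeded RNG, and the nextBatch signature is an assumption for the sketch.

```typescript
// Deterministic seeded RNG (mulberry32-style sketch; Alpha's generator may differ).
// seed=42 reproduces the same sample order on every run.
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// nextBatch sketch: pick B random offsets into the token stream, slice [B, T]
// input windows plus the shifted next-token targets.
function nextBatch(tokens: number[], B: number, T: number, rng: () => number) {
  const inputs: number[][] = [];
  const targets: number[][] = [];
  for (let b = 0; b < B; b++) {
    const off = Math.floor(rng() * (tokens.length - T - 1));
    inputs.push(tokens.slice(off, off + T));
    targets.push(tokens.slice(off + 1, off + T + 1)); // shifted by one token
  }
  return { inputs, targets };
}
```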
LR Schedule
Warmup: linear ramp over ~10% of steps. Cosine decay: lr × 0.5·(1 + cos(π·decay)).
Checkpoint I/O
Binary v2: ALPH magic + JSON header + packed Float32. Saves params, optimizer m/v buffers, RNG state.
Metrics & Logging
JSONL step metrics (loss, lr, grad norm, tokens/sec). Remote sync to alpha.omegaai.dev.
Backend Dispatch
≥ 1,000,000 elements → GPU | < 1M → CPU
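The dispatch rule reduces to an element count against a fixed threshold. A minimal sketch, with a hypothetical pickDevice helper (the threshold value is from the rule above; the function name is not Alpha's):

```typescript
// Size-based backend dispatch (sketch). Tensors at or above 1M elements go to
// the GPU; everything smaller stays on the CPU.
const GPU_THRESHOLD = 1_000_000;

type Device = "cpu" | "gpu";

function pickDevice(shape: number[], gpuAvailable: boolean): Device {
  const elements = shape.reduce((a, d) => a * d, 1);
  return gpuAvailable && elements >= GPU_THRESHOLD ? "gpu" : "cpu";
}
```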
CPU path: TypeScript | GPU path: SPIR-V
CPU → GPU: uploadBuffer — staging → device-local
▹ Forward Pass
CPU — TypeScript
Token Embedding [B,T] → [B,T,nEmbd]
embedding(wte, tokens) — gather rows from the weight table. Tape records backward for gradient scatter.
Position Embedding [B,T] → [B,T,nEmbd]
embedding(wpe, 0..T-1) — positional encoding lookup.
Add Embeddings [B,T,nEmbd]
tokEmb + posEmb — element-wise sum. Small models stay on CPU.
Causal Mask [T,T]
causalMask(T) — lower triangle = 0, upper = −∞. Always CPU (small, created once).
Reshape / Transpose
View operations: reshape to multi-head [B, nHead, T, headDim], transpose for attention, concat heads. No computation — pointer reinterpretation.
Residual Connections
x = x + attnOut, x = x + mlpOut — identity shortcuts stabilizing deep nets.
Tape Recording
Every op appends to global Tape: output Variable, input Variables, backward closure. Enables reverse-mode autodiff.
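The tape mechanism can be sketched as follows. Variable, mul, and backward here are simplified stand-ins for Alpha's richer types; the point is the per-op backward closure and the += accumulation that handles multi-use variables like residuals.

```typescript
// Minimal tape-based reverse-mode autodiff (sketch, not Alpha's actual API).
class Variable {
  grad: number[];
  constructor(public data: number[]) {
    this.grad = data.map(() => 0);
  }
}

type TapeEntry = { backward: () => void };
const tape: TapeEntry[] = [];

// Each op computes its output AND appends a backward closure to the tape.
function mul(a: Variable, b: Variable): Variable {
  const out = new Variable(a.data.map((x, i) => x * b.data[i]));
  tape.push({
    backward: () => {
      for (let i = 0; i < out.grad.length; i++) {
        a.grad[i] += out.grad[i] * b.data[i]; // d(a·b)/da = b, accumulated with +=
        b.grad[i] += out.grad[i] * a.data[i]; // d(a·b)/db = a
      }
    },
  });
  return out;
}

// Seed dL/dL = 1, then walk the tape newest → oldest.
function backward(loss: Variable): void {
  loss.grad.fill(1);
  for (let i = tape.length - 1; i >= 0; i--) tape[i].backward();
}
```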
GPU — Vulkan Compute
Transformer Block ×N layers
— Multi-Head Attention —
Q, K, V Projections [B·T,nEmbd]²
3× tiled matmul: x @ Wq, x @ Wk, x @ Wv. 16×16 shared memory tiles.
Attention Scores [B,nHead,T,T]
Q @ Kᵀ / √headDim — batched matmul + fused scale. Largest tensor in the forward pass.
Softmax [B·nHead,T,T]
Fused row-wise kernel: max → subtract → exp → sum → normalize. One workgroup per row, tree reduction in shared memory.
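The five fused steps have a direct CPU reference, useful for validating the kernel. A scalar sketch, not the shader itself:

```typescript
// CPU reference for the fused row-wise softmax: subtracting the row max before
// exp keeps the kernel numerically stable for large logits.
function softmaxRow(row: number[]): number[] {
  const max = Math.max(...row);                    // 1. row max
  const exps = row.map((x) => Math.exp(x - max));  // 2–3. subtract, exp
  const sum = exps.reduce((a, x) => a + x, 0);     // 4. sum
  return exps.map((x) => x / sum);                 // 5. normalize
}
```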
Value Weighting [B,nHead,T,headDim]
scores @ V — tiled matmul combining attention weights with value vectors.
Output Projection [B·T,nEmbd]
concat(heads) @ Wo — final attention matmul.
— Feed-Forward MLP —
LayerNorm ×2 per layer
Fused kernel: mean → variance → normalize → γx+β. One workgroup per token. Pre-norm architecture.
MLP FC1 [B·T,nEmbd] → [B·T,4·nEmbd]
Tiled matmul expanding to 4× hidden dimension.
GELU Activation [B·T,4·nEmbd]
Fused tanh approximation: 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³))). Uses vec4 kernel when aligned.
MLP FC2 [B·T,4·nEmbd] → [B·T,nEmbd]
Tiled matmul projecting back to model dimension.
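The GELU tanh approximation applied between FC1 and FC2 can be written directly from its formula, as a scalar reference sketch of what the fused kernel computes per element:

```typescript
// GELU tanh approximation: 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³))).
const SQRT_2_OVER_PI = Math.sqrt(2 / Math.PI);

function gelu(x: number): number {
  return 0.5 * x * (1 + Math.tanh(SQRT_2_OVER_PI * (x + 0.044715 * x ** 3)));
}
```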
Final LayerNorm [B,T,nEmbd]
Last normalization before vocabulary projection.
LM Head Projection [B,T,vocabSize]
Matmul projecting to vocabulary logits.
Cross-Entropy Loss → scalar
Log-softmax + NLL: -log(softmax(logits)[target]), averaged over B×T tokens.
Per transformer layer: 8 matmuls + 2 layernorms + 1 softmax + 1 GELU + 2 residual adds + mask fill.
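The loss reduces to log-sum-exp minus the target logit per token. A stable CPU reference sketch (not the GPU kernel):

```typescript
// Cross-entropy over a [tokens, vocab] logit matrix: per row,
// -log(softmax(logits)[target]) = logSumExp(logits) - logits[target],
// averaged over all tokens. The max shift keeps exp from overflowing.
function crossEntropy(logits: number[][], targets: number[]): number {
  let total = 0;
  for (let i = 0; i < logits.length; i++) {
    const row = logits[i];
    const max = Math.max(...row);
    const logSumExp =
      max + Math.log(row.reduce((a, x) => a + Math.exp(x - max), 0));
    total += logSumExp - row[targets[i]];
  }
  return total / logits.length; // mean over B×T tokens
}
```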
GPU → CPU: readBuffer — loss scalar + activations for backward
◃ Backward Pass
CPU — Tape Walker
Initialize Loss Gradient
loss.grad = ones(shape) — seed the backward pass with dL/dL = 1.
Reverse Tape Walk
Walk the recorded tape from newest → oldest entry. Each entry calls its backward closure: entry.backward(grad, backend). Gradients accumulate with += for multi-use variables (residuals).
Broadcasting Undo
reduceBroadcast(grad, originalShape) — sums over broadcast dims. E.g., [1, 512] broadcast to [64, 512] → sum over batch.
Gradient Norm & Clipping
Collect all param gradients → compute global L2 norm. If norm > gradClip: scale by gradClip / norm.
GPU — Gradient Kernels
CrossEntropy Backward
(softmax(logits) − oneHot(targets)) / N. Reuses forward softmax values.
MatMul Backward ×8 per layer
dA = G @ Bᵀ, dB = Aᵀ @ G. Each forward matmul produces 2 backward matmuls. Dominates backward compute.
LayerNorm Backward
Full analytical gradient: normalized input, scale/bias updates. Fused kernel matching forward structure.
Activation Backward
GELU: derivative of the tanh approximation. Softmax: s × (g − Σ(g·s)).
Element-wise Backward
add / sub / mul / div / exp / log / sqrt — each has trivial derivative. Uses same vec4 kernel variants as forward.
↻ Optimizer — AdamW
All optimizer operations run on CPU. In-place parameter updates with decoupled weight decay.
Momentum
m = β₁·m + (1−β₁)·g
Exponential moving average of gradients. Per-parameter buffer.
Variance
v = β₂·v + (1−β₂)·g²
EMA of squared gradients (RMSprop component).
Bias Correction
m̂ = m/(1−β₁ᵗ), v̂ = v/(1−β₂ᵗ)
Corrects for zero-init bias.
Param Update
p −= lr·(m̂/(√v̂ + ε) + wd·p)
Decoupled weight decay applied directly to params.
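Taken together, the four updates above amount to one per-parameter loop. A sketch; the hyperparameter defaults here are typical AdamW values, not confirmed from Alpha:

```typescript
// One AdamW step over a flat parameter array: momentum/variance EMAs, bias
// correction by step count t (1-based), and decoupled weight decay applied
// directly to the parameter rather than folded into the gradient.
function adamwStep(
  p: number[], g: number[], m: number[], v: number[], t: number,
  lr = 1e-3, beta1 = 0.9, beta2 = 0.999, eps = 1e-8, wd = 0.01,
): void {
  for (let i = 0; i < p.length; i++) {
    m[i] = beta1 * m[i] + (1 - beta1) * g[i];        // momentum EMA
    v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i]; // variance EMA
    const mHat = m[i] / (1 - beta1 ** t);            // bias correction
    const vHat = v[i] / (1 - beta2 ** t);
    p[i] -= lr * (mHat / (Math.sqrt(vHat) + eps) + wd * p[i]); // decoupled decay
  }
}
```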
Next training step
GPU Kernel Inventory — SPIR-V Compute Shaders
25+ hand-written compute shaders generated by the TypeScript SPIR-V assembler.
Element-wise
add sub mul div neg exp log sqrt scale
+ vec4 variants (4× throughput via 128-bit loads)
Activations
relu gelu silu
+ vec4 variants. GELU uses fused tanh approx.
Reductions
sum_reduce max_reduce
Parallel tree reduction in shared memory. log₂(WG_SIZE) barriers.
Fused Softmax
Row-wise: max → exp(x−max) → sum → normalize. One workgroup per row. No intermediate buffers.
Fused LayerNorm
mean → var → (x−μ)/√(σ²+ε) → γx+β. One workgroup per token.
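The fused LayerNorm sequence has a simple per-token CPU reference. A scalar sketch of what one workgroup computes for one feature vector (the eps default here is a common choice, not confirmed from Alpha):

```typescript
// CPU reference for fused LayerNorm over one token's features:
// mean → variance → (x−μ)/√(σ²+ε) → γx+β.
function layerNorm(x: number[], gamma: number[], beta: number[], eps = 1e-5): number[] {
  const mean = x.reduce((a, v) => a + v, 0) / x.length;
  const variance = x.reduce((a, v) => a + (v - mean) ** 2, 0) / x.length;
  const inv = 1 / Math.sqrt(variance + eps);
  return x.map((v, i) => gamma[i] * (v - mean) * inv + beta[i]);
}
```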
Tiled MatMul
16×16 shared memory tiles. Loop over K dimension. Barrier-synchronized cooperative loads.
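The blocking scheme has a CPU analogue: iterate the output in TILE×TILE blocks and loop over K in tiles. On the GPU the tiles live in shared memory with cooperative loads; on the CPU the same blocking improves cache locality. A sketch, not the shader:

```typescript
// Tiled matmul C[M,N] = A[M,K] @ B[K,N], row-major, 16-wide blocks.
// Partial products from each K-tile accumulate into C.
const TILE = 16;

function matmulTiled(A: Float32Array, B: Float32Array, M: number, K: number, N: number): Float32Array {
  const C = new Float32Array(M * N); // zero-initialized accumulator
  for (let i0 = 0; i0 < M; i0 += TILE) {
    for (let j0 = 0; j0 < N; j0 += TILE) {
      for (let k0 = 0; k0 < K; k0 += TILE) {
        const iMax = Math.min(i0 + TILE, M);
        const jMax = Math.min(j0 + TILE, N);
        const kMax = Math.min(k0 + TILE, K);
        for (let i = i0; i < iMax; i++) {
          for (let k = k0; k < kMax; k++) {
            const a = A[i * K + k]; // reused across the whole j-tile
            for (let j = j0; j < jMax; j++) {
              C[i * N + j] += a * B[k * N + j];
            }
          }
        }
      }
    }
  }
  return C;
}
```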
Fused Mul-Add
D[i] = A[i]×B[i] + C[i]
Single-pass 3-input kernel.
F16 Storage
add_f16 sub_f16 mul_f16 div_f16 neg_f16 exp_f16
Compute in f32, store as f16. 2× memory savings.
GPU Infrastructure — Vulkan Native Layer
Zero external dependencies. Custom Vulkan bridge, SPIR-V assembler, and memory management.
Vulkan Native Addon
C (~2000 LOC), N-API bridge. Dynamic Vulkan loading via dlopen — no SDK headers needed. Physical device selection + compute queue.
SPIR-V Assembler
Hand-written in TypeScript (~2500 LOC). Binary assembler — no glslc or glslangValidator. Types, decorations, control flow, GLSL.std.450 extended instructions.
Slab Memory Allocator
Bump-pointer allocation in 64MB slabs (up to 512MB per slab). Buffer pooling + recycling via FinalizationRegistry.
Timeline Semaphores
Vulkan 1.2 timeline semaphores for async dispatch. Record many ops → wait only when results needed. Pipelined execution.
Compute Graph Batching
Lazy evaluation — ops recorded to pending queue. Auto-flush at 64 ops. Single command buffer submit: N×100μs → 1×100μs + N×2μs.
Auto-Tuned Workgroups
Benchmarks {64, 128, 256, 512} on init via 256K element add. Optimal WG_SIZE cached for session. Vec4 kernels when elements % 4 == 0.
Alpha — 40 CPU ops · 25+ GPU kernels · 0 external ML dependencies · Pure TypeScript + Hand-written SPIR-V