Alpha

Architecture

CPU vs GPU computation map. Every component — tensors, autograd, model, tokenizers, training loop — is hand-written TypeScript with zero ML dependencies.

Legend: CPU — TypeScript · GPU — Vulkan / SPIR-V · Backend Dispatch · Data Transfer

CPU-Only Operations

Always run on CPU regardless of tensor size. Orchestration, I/O, and control flow.

Data Loading
Tokenization (BPE / char / word), batch sampling, text I/O. DataLoader.nextBatch() picks random offsets → [B, T] token windows.
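The sampling step can be sketched in TypeScript. `nextBatch`, `B`, and `T` come from the text; the flat `tokens` array and the RNG signature are illustrative assumptions:

```typescript
// Sketch of DataLoader.nextBatch: pick random offsets into a flat token
// stream and slice [B, T] input windows plus one-shifted targets.
function nextBatch(
  tokens: Uint32Array,
  B: number,
  T: number,
  rand: () => number, // seeded RNG in [0, 1)
): { x: Uint32Array; y: Uint32Array } {
  const x = new Uint32Array(B * T); // inputs  [B, T]
  const y = new Uint32Array(B * T); // targets [B, T], shifted by one
  for (let b = 0; b < B; b++) {
    // Random window start; the -1 leaves room for the shifted target.
    const off = Math.floor(rand() * (tokens.length - T - 1));
    for (let t = 0; t < T; t++) {
      x[b * T + t] = tokens[off + t];
      y[b * T + t] = tokens[off + t + 1];
    }
  }
  return { x, y };
}
```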
RNG & Seeding
Deterministic seeded RNG (seed=42). Weight init N(0, 0.02), dropout masks, data sampling order.
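A sketch of the seeded pipeline. The text fixes seed=42 and N(0, 0.02) but does not name the generator; mulberry32 and a Box–Muller transform stand in here:

```typescript
// Illustrative deterministic RNG — the actual generator in Alpha is
// unspecified; mulberry32 is a common 32-bit choice.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Weight init: N(0, std) samples via the Box–Muller transform.
function gaussianInit(n: number, std: number, rand: () => number): Float32Array {
  const out = new Float32Array(n);
  for (let i = 0; i < n; i++) {
    const u = 1 - rand(); // keep u in (0, 1] to avoid log(0)
    const v = rand();
    out[i] = std * Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
  }
  return out;
}
```

Because the generator is seeded, re-running with seed 42 reproduces the same init, dropout masks, and sampling order.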
LR Schedule
Warmup: linear ramp over the first ~10% of steps. Cosine decay: lr × 0.5·(1 + cos(π·decay)), where decay runs 0 → 1 over the remaining steps.
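The schedule fits in one function; the 0.1 warmup fraction follows the ~10% figure, other details are illustrative:

```typescript
// Warmup + cosine decay learning-rate schedule.
function lrAt(step: number, maxSteps: number, baseLr: number, warmupFrac = 0.1): number {
  const warmup = Math.floor(maxSteps * warmupFrac);
  if (step < warmup) return (baseLr * (step + 1)) / warmup; // linear ramp
  const decay = (step - warmup) / (maxSteps - warmup);      // 0 → 1 after warmup
  return baseLr * 0.5 * (1 + Math.cos(Math.PI * decay));    // cosine to 0
}
```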
Checkpoint I/O
Binary v2: ALPH magic + JSON header + packed Float32. Saves params, optimizer m/v buffers, RNG state.
Metrics & Logging
JSONL step metrics (loss, lr, grad norm, tokens/sec). Remote sync to alpha.omegaai.dev.
Backend
Dispatch

≥ 1,000,000 elements → GPU   |  < 1M → CPU
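The dispatch rule as a sketch (`pickBackend` and the `Backend` type are hypothetical names):

```typescript
// Backend dispatch: large tensors go to the GPU, everything else stays on CPU.
type Backend = "cpu" | "gpu";
const GPU_THRESHOLD = 1_000_000; // elements

function pickBackend(shape: number[], gpuAvailable: boolean): Backend {
  const elements = shape.reduce((a, d) => a * d, 1);
  return gpuAvailable && elements >= GPU_THRESHOLD ? "gpu" : "cpu";
}
```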

← CPU path (TypeScript)  |  GPU path (SPIR-V) →
CPU → GPU: uploadBuffer — staging buffer → device-local memory

▷ Forward Pass

CPU — TypeScript
Token Embedding — [B,T] → [B,T,nEmbd]
embedding(wte, tokens) — gather rows from weight table. Tape records backward for gradient scatter.
Position Embedding — [B,T] → [B,T,nEmbd]
embedding(wpe, 0..T-1) — positional encoding lookup.
Add Embeddings — [B,T,nEmbd]
tokEmb + posEmb — element-wise sum. Small models stay on CPU.
Causal Mask — [T,T]
causalMask(T) — lower triangle = 0, upper = −∞. Always CPU (small, created once).
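A CPU sketch of the mask construction described above:

```typescript
// causalMask(T): lower triangle (including diagonal) = 0, upper = -Infinity,
// so the later softmax assigns zero weight to future positions.
function causalMask(T: number): Float32Array {
  const m = new Float32Array(T * T);
  for (let i = 0; i < T; i++)
    for (let j = 0; j < T; j++)
      m[i * T + j] = j <= i ? 0 : -Infinity;
  return m;
}
```

Since the mask depends only on T, it is built once and reused every step.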
Reshape / Transpose
View operations: reshape to multi-head [B, nHead, T, headDim], transpose for attention, concat heads. No computation — pointer reinterpretation.
Residual Connections
x = x + attnOut, x = x + mlpOut — identity shortcuts stabilizing deep nets.
Tape Recording
Every op appends to global Tape: output Variable, input Variables, backward closure. Enables reverse-mode autodiff.
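A minimal sketch of the tape mechanism, using element-wise add as the recorded op. The `Variable` and `TapeEntry` shapes are assumptions based on the description:

```typescript
// Reverse-mode autodiff tape: every op pushes an entry with its output,
// inputs, and a backward closure.
interface Variable {
  data: Float32Array;
  grad: Float32Array | null;
}

interface TapeEntry {
  output: Variable;
  inputs: Variable[];
  backward: (grad: Float32Array) => void; // routes grad into inputs
}

const tape: TapeEntry[] = [];

// Example op: element-wise add. d(a+b)/da = d(a+b)/db = 1, and gradients
// accumulate with += so multi-use variables (residuals) sum contributions.
function add(a: Variable, b: Variable): Variable {
  const data = a.data.map((v, i) => v + b.data[i]);
  const out: Variable = { data, grad: null };
  tape.push({
    output: out,
    inputs: [a, b],
    backward: (g) => {
      const ga = (a.grad ??= new Float32Array(a.data.length));
      const gb = (b.grad ??= new Float32Array(b.data.length));
      for (let i = 0; i < g.length; i++) {
        ga[i] += g[i];
        gb[i] += g[i];
      }
    },
  });
  return out;
}
```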
GPU — Vulkan Compute
Transformer Block — ×N layers
— Multi-Head Attention —
Q, K, V Projections — [B·T,nEmbd] @ [nEmbd,nEmbd]
3× tiled matmul: x @ Wq, x @ Wk, x @ Wv. 16×16 shared memory tiles.
Attention Scores — [B,nHead,T,T]
Q @ Kᵀ / √headDim — batched matmul + fused scale. Largest tensor in the forward pass.
Softmax — [B·nHead,T,T]
Fused row-wise kernel: max → subtract → exp → sum → normalize. One workgroup per row, tree reduction in shared memory.
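A CPU reference for what the fused kernel computes; the GPU version replaces the inner loops with shared-memory tree reductions:

```typescript
// Row-wise softmax over a [rows, cols] matrix:
// max → subtract → exp → sum → normalize, per row.
function softmaxRows(x: Float32Array, rows: number, cols: number): Float32Array {
  const out = new Float32Array(x.length);
  for (let r = 0; r < rows; r++) {
    let max = -Infinity;
    for (let c = 0; c < cols; c++) max = Math.max(max, x[r * cols + c]);
    let sum = 0;
    for (let c = 0; c < cols; c++) {
      const e = Math.exp(x[r * cols + c] - max); // max-shift for stability
      out[r * cols + c] = e;
      sum += e;
    }
    for (let c = 0; c < cols; c++) out[r * cols + c] /= sum;
  }
  return out;
}
```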
Value Weighting — [B,nHead,T,headDim]
scores @ V — tiled matmul combining attention weights with value vectors.
Output Projection — [B·T,nEmbd]
concat(heads) @ Wo — final attention matmul.
— Feed-Forward MLP —
LayerNorm — ×2 per layer
Fused kernel: mean → variance → normalize → γx+β. One workgroup per token. Pre-norm architecture.
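A CPU reference for the fused LayerNorm stages (mean → variance → normalize → γx+β), one token at a time:

```typescript
// LayerNorm over the last dimension of a flattened [tokens, nEmbd] buffer.
function layerNorm(
  x: Float32Array,
  gamma: Float32Array,
  beta: Float32Array,
  nEmbd: number,
  eps = 1e-5,
): Float32Array {
  const tokens = x.length / nEmbd;
  const out = new Float32Array(x.length);
  for (let t = 0; t < tokens; t++) {
    let mean = 0;
    for (let i = 0; i < nEmbd; i++) mean += x[t * nEmbd + i];
    mean /= nEmbd;
    let variance = 0;
    for (let i = 0; i < nEmbd; i++) {
      const d = x[t * nEmbd + i] - mean;
      variance += d * d;
    }
    variance /= nEmbd;
    const inv = 1 / Math.sqrt(variance + eps); // (x−μ)/√(σ²+ε)
    for (let i = 0; i < nEmbd; i++)
      out[t * nEmbd + i] = gamma[i] * (x[t * nEmbd + i] - mean) * inv + beta[i];
  }
  return out;
}
```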
MLP FC1 — [B·T,nEmbd] → [B·T,4·nEmbd]
Tiled matmul expanding to 4× hidden dimension.
GELU Activation — [B·T,4·nEmbd]
Fused tanh approximation: 0.5x(1+tanh(√(2/π)(x+0.044715x³))). Uses vec4 kernel when aligned.
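The approximation above, directly in TypeScript:

```typescript
// GELU via the fused tanh approximation:
// gelu(x) = 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³)))
function gelu(x: number): number {
  const c = Math.sqrt(2 / Math.PI);
  return 0.5 * x * (1 + Math.tanh(c * (x + 0.044715 * x * x * x)));
}
```

The GPU kernel evaluates this element-wise, four elements at a time in the vec4 variant.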
MLP FC2 — [B·T,4·nEmbd] → [B·T,nEmbd]
Tiled matmul projecting back to model dimension.
Final LayerNorm — [B,T,nEmbd]
Last normalization before vocabulary projection.
LM Head Projection — [B,T,vocabSize]
Matmul projecting to vocabulary logits.
Cross-Entropy Loss → scalar
Log-softmax + NLL: -log(softmax(logits)[target]) averaged over B×T tokens.
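A reference implementation of the loss over flattened [N, V] logits, using a numerically stable log-softmax:

```typescript
// Cross-entropy: mean over N tokens of -log(softmax(logits)[target]).
function crossEntropy(logits: Float32Array, targets: Uint32Array, V: number): number {
  const N = targets.length;
  let loss = 0;
  for (let n = 0; n < N; n++) {
    let max = -Infinity;
    for (let v = 0; v < V; v++) max = Math.max(max, logits[n * V + v]);
    let sumExp = 0;
    for (let v = 0; v < V; v++) sumExp += Math.exp(logits[n * V + v] - max);
    // log-softmax at the target index, then negate
    loss -= logits[n * V + targets[n]] - max - Math.log(sumExp);
  }
  return loss / N;
}
```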
Per transformer layer: 8 matmuls + 2 layernorms + 1 softmax + 1 GELU + 2 residual adds + mask fill
GPU → CPU: readBuffer — loss scalar + activations for backward

◁ Backward Pass

CPU — Tape Walker
Initialize Loss Gradient
loss.grad = ones(shape) — seed the backward pass with dL/dL = 1.
Reverse Tape Walk
Walk recorded tape from newest → oldest entry. Each entry calls its backward closure: entry.backward(grad, backend). Gradients accumulate with += for multi-use variables (residuals).
Broadcasting Undo
reduceBroadcast(grad, originalShape) — sums over broadcast dims. E.g., [1, 512] broadcast to [64, 512] → sum over batch.
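A sketch for the 2-D case in the example; `reduceBroadcast`'s real signature may differ:

```typescript
// Undo broadcasting during backward: sum the gradient over every dimension
// that was size 1 in the original shape, e.g. [64,512] grad → [1,512] param.
function reduceBroadcast(
  grad: Float32Array,
  gradShape: [number, number],
  origShape: [number, number],
): Float32Array {
  const [gRows, gCols] = gradShape;
  const [oRows, oCols] = origShape;
  const out = new Float32Array(oRows * oCols);
  for (let i = 0; i < gRows; i++)
    for (let j = 0; j < gCols; j++)
      // Broadcast dims are size 1, so i % oRows (or j % oCols) maps back to 0.
      out[(i % oRows) * oCols + (j % oCols)] += grad[i * gCols + j];
  return out;
}
```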
Gradient Norm & Clipping
Collect all param gradients → compute global L2 norm. If norm > gradClip: scale by gradClip / norm.
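The clipping step as a sketch over a list of gradient buffers:

```typescript
// Global-norm clipping: one L2 norm across all parameter gradients,
// then a uniform in-place rescale if it exceeds gradClip.
function clipGradNorm(grads: Float32Array[], gradClip: number): number {
  let sq = 0;
  for (const g of grads) for (let i = 0; i < g.length; i++) sq += g[i] * g[i];
  const norm = Math.sqrt(sq);
  if (norm > gradClip) {
    const scale = gradClip / norm;
    for (const g of grads) for (let i = 0; i < g.length; i++) g[i] *= scale;
  }
  return norm; // pre-clip norm, useful for the logged grad-norm metric
}
```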
GPU — Gradient Kernels
CrossEntropy Backward
(softmax(logits) − oneHot(targets)) / N. Reuses forward softmax values.
MatMul Backward — ×8 per layer
dA = G @ Bᵀ, dB = Aᵀ @ G. Each forward matmul produces 2 backward matmuls. Dominates backward compute.
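A naive reference for the pair of backward matmuls (the real kernels are tiled):

```typescript
// For forward C = A @ B with upstream gradient G = dL/dC:
// dA = G @ Bᵀ and dB = Aᵀ @ G, computed here in one fused triple loop.
function matmulBackward(
  A: Float32Array, B: Float32Array, G: Float32Array,
  M: number, K: number, N: number, // A:[M,K], B:[K,N], G:[M,N]
): { dA: Float32Array; dB: Float32Array } {
  const dA = new Float32Array(M * K);
  const dB = new Float32Array(K * N);
  for (let i = 0; i < M; i++)
    for (let k = 0; k < K; k++)
      for (let j = 0; j < N; j++) {
        dA[i * K + k] += G[i * N + j] * B[k * N + j]; // G @ Bᵀ
        dB[k * N + j] += A[i * K + k] * G[i * N + j]; // Aᵀ @ G
      }
  return { dA, dB };
}
```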
LayerNorm Backward
Full analytical gradient: normalized input, scale/bias updates. Fused kernel matching forward structure.
Activation Backward
GELU: derivative of tanh approximation. Softmax: s × (g − Σ(g·s)).
Element-wise Backward
add / sub / mul / div / exp / log / sqrt — each has trivial derivative. Uses same vec4 kernel variants as forward.

↻ Optimizer — AdamW

All optimizer operations run on CPU. In-place parameter updates with decoupled weight decay.

Momentum
m = β₁·m + (1−β₁)·g
Exponential moving average of gradients. Per-parameter buffer.
Variance
v = β₂·v + (1−β₂)·g²
EMA of squared gradients (RMSprop component).
Bias Correction
m̂ = m/(1−β₁ᵗ)
v̂ = v/(1−β₂ᵗ)
Corrects for zero-init bias.
Param Update
p −= lr·(m̂/(√v̂+ε) + wd·p)
Decoupled weight decay applied directly to params.
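The four stages combine into one in-place step per parameter tensor. The β₁/β₂/ε defaults below are conventional choices, not confirmed by the text:

```typescript
// One AdamW step: momentum EMA, variance EMA, bias correction, and the
// update with decoupled weight decay, all in-place.
function adamwStep(
  p: Float32Array, g: Float32Array,
  m: Float32Array, v: Float32Array,
  t: number, lr: number, wd: number,
  b1 = 0.9, b2 = 0.999, eps = 1e-8,
): void {
  const bc1 = 1 - Math.pow(b1, t); // bias corrections; t starts at 1
  const bc2 = 1 - Math.pow(b2, t);
  for (let i = 0; i < p.length; i++) {
    m[i] = b1 * m[i] + (1 - b1) * g[i];        // momentum: m = β₁·m + (1−β₁)·g
    v[i] = b2 * v[i] + (1 - b2) * g[i] * g[i]; // variance: v = β₂·v + (1−β₂)·g²
    const mHat = m[i] / bc1;
    const vHat = v[i] / bc2;
    // Decoupled weight decay: wd·p sits outside the adaptive term.
    p[i] -= lr * (mHat / (Math.sqrt(vHat) + eps) + wd * p[i]);
  }
}
```

The m/v buffers are exactly what the checkpoint format saves alongside the params.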
Next training step

GPU Kernel Inventory — SPIR-V Compute Shaders

25+ hand-written compute shaders generated by the TypeScript SPIR-V assembler.

Element-wise
add sub mul div neg exp log sqrt scale
+ vec4 variants (4× throughput via 128-bit loads)
Activations
relu gelu silu
+ vec4 variants. GELU uses fused tanh approx.
Reductions
sum_reduce max_reduce
Parallel tree reduction in shared memory. log₂(WG_SIZE) barriers.
Fused Softmax
Row-wise: max → exp(x−max) → sum → normalize. One workgroup per row. No intermediate buffers.
Fused LayerNorm
mean → var → (x−μ)/√(σ²+ε) → γx+β. One workgroup per token.
Tiled MatMul
16×16 shared memory tiles. Loop over K dimension. Barrier-synchronized cooperative loads.
Fused Mul-Add
D[i] = A[i]×B[i] + C[i]
Single-pass 3-input kernel.
F16 Storage
add_f16 sub_f16 mul_f16 div_f16 neg_f16 exp_f16
Compute in f32, store as f16. 2× memory savings.

GPU Infrastructure — Vulkan Native Layer

Zero external dependencies. Custom Vulkan bridge, SPIR-V assembler, and memory management.

Vulkan Native Addon
C (~2000 LOC), N-API bridge. Dynamic Vulkan loading via dlopen — no SDK headers needed. Physical device selection + compute queue.
SPIR-V Assembler
Hand-written in TypeScript (~2500 LOC). Binary assembler — no glslc or glslangValidator. Types, decorations, control flow, GLSL.std.450 extended instructions.
Slab Memory Allocator
Bump-pointer allocation in 64MB slabs (up to 512MB per slab). Buffer pooling + recycling via FinalizationRegistry.
Timeline Semaphores
Vulkan 1.2 timeline semaphores for async dispatch. Record many ops → wait only when results needed. Pipelined execution.
Compute Graph Batching
Lazy evaluation — ops recorded to pending queue. Auto-flush at 64 ops. Single command buffer submit: N×100μs → 1×100μs + N×2μs.
Auto-Tuned Workgroups
Benchmarks {64, 128, 256, 512} on init via 256K element add. Optimal WG_SIZE cached for session. Vec4 kernels when elements % 4 == 0.
Alpha — 40 CPU ops · 25+ GPU kernels · 0 external ML dependencies · Pure TypeScript + Hand-written SPIR-V