polygrad

Polygrad Python

Python bindings for polygrad, a C11 port of tinygrad’s compiler core.

Polygrad moves the compiler and runtime out of Python and into a reusable native library that can be called from multiple frontends, including Python, JavaScript, and R.

This package exposes that core as a lazy Tensor API with autograd, neural network layers, compiled training steps, and HuggingFace model loading.

Project home and docs: https://polygrad.org
Source code and issue tracker: https://github.com/polygrad/polygrad

Installation

pip install polygrad

Build requirements: A C compiler (gcc or clang) and Python development headers (python3-dev).

Runtime requirement: clang must be on PATH. polygrad compiles compute kernels at runtime.

Package requirements: Python >= 3.9 and numpy.

Platform support: Linux only. The current CPU runtime uses POSIX fork() and dlopen().

PyPI package notes:

float32 and float64 dtypes
tinygrad-style device= and .to(...)
CPU works by default, CUDA is detected at runtime
download_hf() additionally requires huggingface_hub

For development:

# Editable install (compiles C sources, auto-syncs from repo)
pip install -e py/

# Or build the shared library manually and point to it
make
export POLYGRAD_LIB=/path/to/build/libpolygrad.so

Quick Start

from polygrad import Tensor

# Create tensors
a = Tensor.rand(3, 4)
b = Tensor.rand(4, 5)

# Matrix multiply + softmax
c = (a @ b).softmax(-1)
print(c.numpy())

# Autograd
x = Tensor([1.0, 2.0, 3.0])
x.requires_grad = True
loss = (x * x).sum()
loss.backward()
print(x.grad.numpy())  # [2.0, 4.0, 6.0]

Devices

from polygrad import Device, Tensor

x = Tensor.rand(4)

if Device.cuda_available():
    y = (x * 2).to('cuda')
    print(y.device)      # CUDA
    print(y.numpy())     # executes on CUDA, returns host numpy array
else:
    y = (x * 2).to('cpu')
    print(y.device)      # CPU

Training Example

from polygrad import Tensor
from polygrad.nn import Linear, SGD, get_parameters

Tensor.manual_seed(42)
model = Linear(2, 1)
opt = SGD(get_parameters(model), lr=0.01)

for i in range(100):
    opt.zero_grad()
    x = Tensor([[1.0, 2.0], [3.0, 4.0]])
    target = Tensor([[5.0], [11.0]])
    loss = (model(x) - target).square().mean()
    loss.backward()
    opt.step()

print(f"loss: {loss.item():.4f}")

Tensor API

Construction

Method	Description
`Tensor(data)`	From list, numpy array, or scalar
`Tensor.zeros(*shape)`	Tensor of zeros
`Tensor.ones(*shape)`	Tensor of ones
`Tensor.full(shape, val)`	Tensor filled with value
`Tensor.rand(*shape)`	Uniform random [0, 1)
`Tensor.randn(*shape)`	Standard normal
`Tensor.randint(low, high, shape)`	Random integers [low, high)
`Tensor.arange(stop, start=0, step=1)`	Arithmetic progression
`Tensor.linspace(start, stop, steps)`	Evenly spaced values
`Tensor.eye(n)`	Identity matrix
`Tensor.empty(*shape)`	Uninitialized tensor
`Tensor.manual_seed(seed)`	Set random seed

Properties

Property	Type	Description
`shape`	tuple	Dimension sizes
`ndim`	int	Number of dimensions
`dtype`	str	`'float32'` or `'float64'`
`device`	str	Current tensor device (`'CPU'` or `'CUDA'`)
`T`	Tensor	Transpose of last two dims
`requires_grad`	bool	Settable; enables autograd
`grad`	Tensor/None	Gradient after `.backward()`

Realization & Conversion

Method	Returns	Description
`realize()`	Tensor	Execute lazy graph, return self
`numpy()`	ndarray	Realize and return numpy array
`item()`	float	Scalar value
`tolist()`	list	Nested Python list
`to(device)`	Tensor	Copy tensor view to `'cpu'` or `'cuda'`
`cpu()` / `cuda()`	Tensor	Convenience wrappers for `to(...)`
`numel()`	int	Total elements
`size(dim=None)`	tuple/int	Shape or dimension size
`detach()`	Tensor	Copy without graph
`clone()`	Tensor	Copy preserving requires_grad

Arithmetic

a + b, a - b, a * b, a / b, -a, a ** b

All support broadcasting and scalar operands.

Comparisons

a < b, a == b, a != b, a > b, a >= b, a <= b

Returns float tensor (1.0 = true, 0.0 = false).

Element-wise Math

Method	Description
`exp()`	e^x
`log()`	ln(x)
`sqrt()`	Square root
`square()`	x^2
`abs()`	Absolute value
`sign()`	Sign (-1, 0, +1)
`reciprocal()`	1/x
`rsqrt()`	1/sqrt(x)
`sin()`, `cos()`, `tan()`	Trigonometric
`ceil()`, `floor()`, `round()`, `trunc()`	Rounding
`isnan()`, `isinf()`	NaN/Inf detection
`exp2()`, `log2()`	Base-2 functions
`where(x, y)`	Conditional: self ? x : y
`maximum(other)`	Element-wise max
`minimum(other)`	Element-wise min
`clamp(min_=None, max_=None)`	Clamp to range

Activations

Method	Description
`relu()`	max(0, x)
`relu6()`	clamp(relu(x), 0, 6)
`leaky_relu(neg_slope=0.01)`	Leaky ReLU
`sigmoid()`	1 / (1 + e^-x)
`tanh()`	Hyperbolic tangent
`gelu()`	Gaussian Error Linear Unit
`quick_gelu()`	Fast GELU approximation
`silu()` / `swish()`	x * sigmoid(x)
`elu(alpha=1.0)`	Exponential Linear Unit
`softplus(beta=1.0)`	log(1 + e^(beta*x)) / beta
`mish()`	x * tanh(softplus(x))
`hardtanh(min_val=-1, max_val=1)`	Clamped linear
`hardswish()`	Hard swish
`hardsigmoid()`	Hard sigmoid

Reductions

Method	Description
`sum(axis=None, keepdim=False)`	Sum along axes
`max(axis=None, keepdim=False)`	Maximum along axes
`min(axis=None, keepdim=False)`	Minimum along axes
`mean(axis=None, keepdim=False)`	Mean along axes
`var(axis=None, keepdim=False, correction=1)`	Variance
`std(axis=None, keepdim=False, correction=1)`	Standard deviation

Movement / Shape

Method	Description
`reshape(shape)` / `view(shape)`	Reshape (supports -1)
`permute(*order)`	Permute dimensions
`transpose(dim0=-2, dim1=-1)`	Swap two dimensions
`expand(*shape)`	Broadcast to shape
`squeeze(dim=None)`	Remove size-1 dims
`unsqueeze(dim)`	Add size-1 dim
`flatten(start_dim=0, end_dim=-1)`	Flatten dim range
`unflatten(dim, sizes)`	Split dim into multiple
`shrink(arg)`	Slice: [(start, end), …]
`pad(arg)`	Pad: [(before, after), …]
`flip(axis)`	Reverse along axes
`repeat(*repeats)`	Tile tensor

Linear Algebra

Method	Description
`matmul(other)` / `dot(other)` / `@`	Matrix multiplication
`linear(weight, bias=None)`	x @ weight.T + bias

Normalization & Loss

Method	Description
`softmax(axis=-1)`	Softmax normalization
`log_softmax(axis=-1)`	Log-softmax
`layernorm(axis=-1, eps=1e-5)`	Layer normalization
`cross_entropy(target, axis=-1)`	Cross-entropy loss
`binary_crossentropy(target)`	Binary cross-entropy

Advanced Operations

Method	Description
`Tensor.einsum(formula, *operands)`	Einstein summation
`rearrange(formula, **kwargs)`	einops-style rearrange
`Tensor.cat(*tensors, dim=0)`	Concatenate along dim
`Tensor.stack(*tensors, dim=0)`	Stack along new dim
`split(sizes, dim=0)`	Split into chunks
`chunk(n, dim=0)`	Split into n chunks
`__getitem__`	Indexing: int, slice, None, Ellipsis

Autograd

x = Tensor([1.0, 2.0])
x.requires_grad = True
loss = (x * x).sum()
loss.backward()
print(x.grad.numpy())  # [2.0, 4.0]

Call backward() on a scalar loss before calling item() or numpy() on the loss.

nn Module

Layers

from polygrad.nn import Linear, LayerNorm, RMSNorm, Embedding, Dropout

Class	Signature	Description
`Linear(in_f, out_f, bias=True)`	y = x @ W.T + b	Fully connected layer
`LayerNorm(shape, eps=1e-5)`	(x - mean) / sqrt(var + eps) * w + b	Layer normalization
`RMSNorm(dim, eps=1e-5)`	x / rms(x) * w	Root mean square normalization
`Embedding(vocab, dim)`	Lookup table	Token embedding
`Dropout(p=0.5)`	Random zeroing	Training-only (controlled by `Tensor.training`)
`GroupNorm(groups, channels)`	Group normalization	Per-group normalization

Optimizers

from polygrad.nn import SGD, Adam, AdamW, get_parameters

Class	Signature
`SGD(params, lr=0.01, momentum=0.0, weight_decay=0.0)`
`Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0)`
`AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01)`

All optimizers have step() and zero_grad() methods.

State Dict

from polygrad.nn import get_parameters, get_state_dict, load_state_dict

params = get_parameters(model)       # List of Tensor
sd = get_state_dict(model)           # {'weight': Tensor, 'bias': Tensor, ...}
load_state_dict(model2, sd)          # Load params into another model

Compiled Training Steps

Compile a training step into a reusable C program. The first call traces the computation graph; subsequent calls execute with zero scheduling overhead.

from polygrad import Tensor
from polygrad.nn import Linear, SGD, get_parameters, compile_step

Tensor.manual_seed(42)
model = Linear(4, 1)
opt = SGD(get_parameters(model), lr=0.01)

# Sample inputs (shapes must match at runtime)
x = Tensor.rand(8, 4)
y = Tensor.rand(8, 1)

def train_step(model, opt, x, y):
    loss = (model(x) - y).square().mean()
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss

# Compile: traces forward + backward + optimizer into one PolyStep
step = compile_step(train_step, model, opt, x, y)

# Run: executes compiled kernels with current buffer data
for i in range(100):
    x._data[:] = ...  # update input data in-place
    y._data[:] = ...
    step.run()
    print(f"step {i}: loss = {step.loss_value():.4f}")

compile_step returns a CompiledTrainingStep with:

run() – execute all compiled kernels (forward + backward + optimizer)
loss_value() – read the loss scalar from the output buffer
n_kernels – number of compiled kernels
n_intermediates – number of pre-allocated intermediate buffers

HuggingFace Model Loading

Load pre-trained models directly from HuggingFace format (config.json + safetensors).

To use download_hf(), install the optional Hub client first:

pip install huggingface_hub

from polygrad.hf import load_hf, download_hf, generate
import numpy as np
import json
from pathlib import Path

# Download a small GPT-2 checkpoint from HuggingFace Hub
model_path = download_hf('hf-internal-testing/tiny-random-gpt2')
config = json.loads((Path(model_path) / 'config.json').read_text())
vocab_size = config['vocab_size']
max_seq_len = 16

# Load into a PolyInstance
inst = load_hf(model_path, max_batch=1, max_seq_len=max_seq_len)

# Run forward pass
tokens = np.array([[1, 2, 3, 4]], dtype=np.float32)
outputs = inst.forward(
    x=tokens,
    positions=np.arange(tokens.shape[1], dtype=np.float32).reshape(1, -1),
    arange=np.arange(max_seq_len, dtype=np.float32)
)
logits = outputs['output'].reshape(1, max_seq_len, vocab_size)

# Autoregressive generation
result = generate(inst, tokens, max_new_tokens=2, temperature=1.0, top_k=10)

HF API

Function	Description
`load_hf(path, max_batch=1, max_seq_len=0)`	Load model from local directory
`load_hf_bytes(config, weights, ...)`	Load from raw bytes (no filesystem)
`download_hf(repo_id, cache_dir=None)`	Download from HuggingFace Hub
`generate(inst, tokens, max_new_tokens, ...)`	Autoregressive text generation

Supported model types: GPT-2. Weight formats: F32, F16, BF16 safetensors (single or sharded).

How It Works

Lazy evaluation: Operations build a UOp graph in the C core. No computation happens until realize(), numpy(), item(), or backward().
One FFI call per op: Each Tensor method calls one C function via ctypes. The C core handles all op composition (e.g., gelu = 0.5*x*(1+tanh(sqrt(2/pi)*(x+0.044715*x^3)))).
Realize boundaries: Some ops (softmax, layernorm, var) insert implicit .realize() calls to create kernel boundaries for the scheduler.
Autograd: backward() calls C’s poly_grad() for each parameter, then realizes the gradient tensors.

Limitations

CPU is the default path. CUDA execution requires a working CUDA runtime (libcuda and libnvrtc) on the host.
Conv2d and BatchNorm are stubs (forward raises NotImplementedError)

Tests

python -m pytest py/tests/ -v   # 130 tests (tensor + nn + compiled step + GPT-2 + HF loading + instance)