Python bindings for polygrad, a C11 port of tinygrad’s compiler core.
Polygrad moves the compiler and runtime out of Python and into a reusable native library that can be called from multiple frontends, including Python, JavaScript, and R.
This package exposes that core as a lazy Tensor API with autograd, neural network layers, compiled training steps, and HuggingFace model loading.
```bash
pip install polygrad
```

- Build requirements: a C compiler (gcc or clang) and Python development headers (`python3-dev`).
- Runtime requirement: `clang` must be on `PATH`; polygrad compiles compute kernels at runtime.
- Package requirements: Python >= 3.9 and numpy.
- Platform support: Linux only. The current CPU runtime uses POSIX `fork()` and `dlopen()`.
PyPI package notes:

- float32 and float64 dtypes
- device selection via `device=` and `.to(...)`
- `download_hf()` additionally requires `huggingface_hub`

For development:

```bash
# Editable install (compiles C sources, auto-syncs from repo)
pip install -e py/

# Or build the shared library manually and point to it
make
export POLYGRAD_LIB=/path/to/build/libpolygrad.so
```
```python
from polygrad import Tensor

# Create tensors
a = Tensor.rand(3, 4)
b = Tensor.rand(4, 5)

# Matrix multiply + softmax
c = (a @ b).softmax(-1)
print(c.numpy())

# Autograd
x = Tensor([1.0, 2.0, 3.0])
x.requires_grad = True
loss = (x * x).sum()
loss.backward()
print(x.grad.numpy())  # [2.0, 4.0, 6.0]
```
```python
from polygrad import Device, Tensor

x = Tensor.rand(4)
if Device.cuda_available():
    y = (x * 2).to('cuda')
    print(y.device)  # CUDA
    print(y.numpy())  # executes on CUDA, returns host numpy array
else:
    y = (x * 2).to('cpu')
    print(y.device)  # CPU
```
```python
from polygrad import Tensor
from polygrad.nn import Linear, SGD, get_parameters

Tensor.manual_seed(42)
model = Linear(2, 1)
opt = SGD(get_parameters(model), lr=0.01)

x = Tensor([[1.0, 2.0], [3.0, 4.0]])
target = Tensor([[5.0], [11.0]])

for i in range(100):
    opt.zero_grad()
    loss = (model(x) - target).square().mean()
    loss.backward()
    opt.step()

print(f"loss: {loss.item():.4f}")
```
| Method | Description |
|---|---|
| `Tensor(data)` | From list, numpy array, or scalar |
| `Tensor.zeros(*shape)` | Tensor of zeros |
| `Tensor.ones(*shape)` | Tensor of ones |
| `Tensor.full(shape, val)` | Tensor filled with value |
| `Tensor.rand(*shape)` | Uniform random [0, 1) |
| `Tensor.randn(*shape)` | Standard normal |
| `Tensor.randint(low, high, shape)` | Random integers [low, high) |
| `Tensor.arange(stop, start=0, step=1)` | Arithmetic progression |
| `Tensor.linspace(start, stop, steps)` | Evenly spaced values |
| `Tensor.eye(n)` | Identity matrix |
| `Tensor.empty(*shape)` | Uninitialized tensor |
| `Tensor.manual_seed(seed)` | Set random seed |
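For example, a quick sketch exercising a few of these constructors (output comments are illustrative):

```python
from polygrad import Tensor

Tensor.manual_seed(0)      # make rand()/randn() reproducible
z = Tensor.zeros(2, 3)     # 2x3 tensor of zeros
r = Tensor.arange(6)       # [0, 1, 2, 3, 4, 5]
e = Tensor.eye(3)          # 3x3 identity matrix
u = Tensor.rand(2, 2)      # uniform samples in [0, 1)
print(z.numpy(), r.numpy(), e.numpy(), u.numpy(), sep="\n")
```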
| Property | Type | Description |
|---|---|---|
| `shape` | tuple | Dimension sizes |
| `ndim` | int | Number of dimensions |
| `dtype` | str | `'float32'` or `'float64'` |
| `device` | str | Current tensor device (`'CPU'` or `'CUDA'`) |
| `T` | Tensor | Transpose of last two dims |
| `requires_grad` | bool | Settable; enables autograd |
| `grad` | Tensor/None | Gradient after `.backward()` |
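For instance, inspecting a freshly created tensor (a sketch; the device string assumes the default CPU runtime):

```python
from polygrad import Tensor

t = Tensor.rand(2, 3, 4)
print(t.shape)    # (2, 3, 4)
print(t.ndim)     # 3
print(t.dtype)    # e.g. 'float32'
print(t.device)   # 'CPU' unless the tensor was moved to CUDA
print(t.T.shape)  # (2, 4, 3) -- transpose of the last two dims
```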
| Method | Returns | Description |
|---|---|---|
| `realize()` | Tensor | Execute lazy graph, return self |
| `numpy()` | ndarray | Realize and return numpy array |
| `item()` | float | Scalar value |
| `tolist()` | list | Nested Python list |
| `to(device)` | Tensor | Copy tensor view to `'cpu'` or `'cuda'` |
| `cpu()` / `cuda()` | Tensor | Convenience wrappers for `to(...)` |
| `numel()` | int | Total elements |
| `size(dim=None)` | tuple/int | Shape or dimension size |
| `detach()` | Tensor | Copy without graph |
| `clone()` | Tensor | Copy preserving `requires_grad` |
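A minimal sketch of the realization path; the graph stays lazy until one of these methods runs it:

```python
from polygrad import Tensor

x = Tensor.ones(2, 2) * 3.0   # still lazy: no kernels have executed yet
x.realize()                   # run the graph; returns x itself
print(x.numpy())              # [[3. 3.] [3. 3.]]
print(x.sum().item())         # 12.0
print(x.tolist())             # [[3.0, 3.0], [3.0, 3.0]]
print(x.numel(), x.size(0))   # 4 2
```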
`a + b`, `a - b`, `a * b`, `a / b`, `-a`, `a ** b`

All support broadcasting and scalar operands.

`a < b`, `a == b`, `a != b`, `a > b`, `a >= b`, `a <= b`

Comparisons return a float tensor (1.0 = true, 0.0 = false).
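For example (a small sketch; comparisons produce the float masks described above):

```python
from polygrad import Tensor

a = Tensor([[1.0, 2.0], [3.0, 4.0]])
b = Tensor([10.0, 20.0])     # broadcast across the rows of a
print((a + b).numpy())       # [[11. 22.] [13. 24.]]
print((a * 2).numpy())       # scalar operand: [[2. 4.] [6. 8.]]
mask = a > 2.5               # float mask: 1.0 where true, 0.0 where false
print(mask.numpy())          # [[0. 0.] [1. 1.]]
```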
| Method | Description |
|---|---|
| `exp()` | e^x |
| `log()` | ln(x) |
| `sqrt()` | Square root |
| `square()` | x^2 |
| `abs()` | Absolute value |
| `sign()` | Sign (-1, 0, +1) |
| `reciprocal()` | 1/x |
| `rsqrt()` | 1/sqrt(x) |
| `sin()`, `cos()`, `tan()` | Trigonometric |
| `ceil()`, `floor()`, `round()`, `trunc()` | Rounding |
| `isnan()`, `isinf()` | NaN/Inf detection |
| `exp2()`, `log2()` | Base-2 functions |
| `where(x, y)` | Conditional: self ? x : y |
| `maximum(other)` | Element-wise max |
| `minimum(other)` | Element-wise min |
| `clamp(min_=None, max_=None)` | Clamp to range |
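A short sketch of a few of these ops:

```python
from polygrad import Tensor

x = Tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(x.abs().numpy())                       # [2.  0.5 0.  1.5 3. ]
print(x.square().numpy())                    # [4.   0.25 0.   2.25 9.  ]
print(x.clamp(min_=-1.0, max_=1.0).numpy())  # values clipped to [-1, 1]
print(x.exp().log().numpy())                 # round-trips back to x (up to float error)
```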
| Method | Description |
|---|---|
| `relu()` | max(0, x) |
| `relu6()` | clamp(relu(x), 0, 6) |
| `leaky_relu(neg_slope=0.01)` | Leaky ReLU |
| `sigmoid()` | 1 / (1 + e^-x) |
| `tanh()` | Hyperbolic tangent |
| `gelu()` | Gaussian Error Linear Unit |
| `quick_gelu()` | Fast GELU approximation |
| `silu()` / `swish()` | x * sigmoid(x) |
| `elu(alpha=1.0)` | Exponential Linear Unit |
| `softplus(beta=1.0)` | log(1 + e^(beta*x)) / beta |
| `mish()` | x * tanh(softplus(x)) |
| `hardtanh(min_val=-1, max_val=1)` | Clamped linear |
| `hardswish()` | Hard swish |
| `hardsigmoid()` | Hard sigmoid |
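A quick sketch comparing a few activations on the same input:

```python
from polygrad import Tensor

x = Tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(x.relu().numpy())     # negatives zeroed: [0.  0.  0.  0.5 2. ]
print(x.sigmoid().numpy())  # squashed into (0, 1)
print(x.tanh().numpy())     # squashed into (-1, 1)
print(x.gelu().numpy())     # smooth, ReLU-like curve
```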
| Method | Description |
|---|---|
| `sum(axis=None, keepdim=False)` | Sum along axes |
| `max(axis=None, keepdim=False)` | Maximum along axes |
| `min(axis=None, keepdim=False)` | Minimum along axes |
| `mean(axis=None, keepdim=False)` | Mean along axes |
| `var(axis=None, keepdim=False, correction=1)` | Variance |
| `std(axis=None, keepdim=False, correction=1)` | Standard deviation |
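A sketch of the axis and keepdim behaviour:

```python
from polygrad import Tensor

x = Tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(x.sum().item())                       # 21.0 -- reduce over all elements
print(x.sum(axis=0).numpy())                # [5. 7. 9.] -- column sums
print(x.max(axis=1).numpy())                # [3. 6.] -- row maxima
print(x.mean(axis=1, keepdim=True).shape)   # (2, 1) -- reduced dim kept as size 1
print(x.std(axis=1, correction=1).numpy())  # [1. 1.] -- sample std per row
```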
| Method | Description |
|---|---|
| `reshape(*shape)` / `view(*shape)` | Reshape (supports -1) |
| `permute(*order)` | Permute dimensions |
| `transpose(dim0=-2, dim1=-1)` | Swap two dimensions |
| `expand(*shape)` | Broadcast to shape |
| `squeeze(dim=None)` | Remove size-1 dims |
| `unsqueeze(dim)` | Add size-1 dim |
| `flatten(start_dim=0, end_dim=-1)` | Flatten dim range |
| `unflatten(dim, sizes)` | Split dim into multiple |
| `shrink(arg)` | Slice: [(start, end), …] |
| `pad(arg)` | Pad: [(before, after), …] |
| `flip(axis)` | Reverse along axes |
| `repeat(*repeats)` | Tile tensor |
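A sketch showing how the shape changes under a few of these ops:

```python
from polygrad import Tensor

x = Tensor.arange(12).reshape(3, 4)
print(x.shape)                           # (3, 4)
print(x.permute(1, 0).shape)             # (4, 3)
print(x.unsqueeze(0).shape)              # (1, 3, 4)
print(x.flatten().shape)                 # (12,)
print(x.pad([(1, 1), (0, 0)]).shape)     # (5, 4) -- one padded row above and below
print(x.shrink([(0, 2), (1, 3)]).shape)  # (2, 2) -- rows 0:2, cols 1:3
```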
| Method | Description |
|---|---|
| `matmul(other)` / `dot(other)` / `@` | Matrix multiplication |
| `linear(weight, bias=None)` | x @ weight.T + bias |
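For example, `linear()` is shorthand for the usual `x @ weight.T + bias` (a sketch with the weight laid out as `(out_features, in_features)`):

```python
from polygrad import Tensor

x = Tensor.rand(2, 3)      # batch of 2 inputs with 3 features
w = Tensor.rand(4, 3)      # weight stored as (out_features, in_features)
b = Tensor.rand(4)
y1 = x @ w.T + b           # explicit matmul
y2 = x.linear(w, b)        # same computation via linear()
print(y1.shape, y2.shape)  # (2, 4) (2, 4)
```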
| Method | Description |
|---|---|
| `softmax(axis=-1)` | Softmax normalization |
| `log_softmax(axis=-1)` | Log-softmax |
| `layernorm(axis=-1, eps=1e-5)` | Layer normalization |
| `cross_entropy(target, axis=-1)` | Cross-entropy loss |
| `binary_crossentropy(target)` | Binary cross-entropy |
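A quick sketch of the normalization ops:

```python
from polygrad import Tensor

logits = Tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
probs = logits.softmax(-1)
print(probs.sum(axis=-1).numpy())         # [1. 1.] -- each row normalizes to 1
print(logits.log_softmax(-1).numpy())     # log of the probabilities above
print(logits.layernorm(axis=-1).numpy())  # per-row zero mean, unit variance
```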
| Method | Description |
|---|---|
| `Tensor.einsum(formula, *operands)` | Einstein summation |
| `rearrange(formula, **kwargs)` | einops-style rearrange |
| `Tensor.cat(*tensors, dim=0)` | Concatenate along dim |
| `Tensor.stack(*tensors, dim=0)` | Stack along new dim |
| `split(sizes, dim=0)` | Split into chunks |
| `chunk(n, dim=0)` | Split into n chunks |
| `__getitem__` | Indexing: int, slice, None, Ellipsis |
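A short sketch of a few of these (the einsum below reproduces a plain matmul):

```python
from polygrad import Tensor

a = Tensor.rand(2, 3)
b = Tensor.rand(3, 4)
c = Tensor.einsum('ij,jk->ik', a, b)    # same result as a @ b
print(c.shape)                          # (2, 4)
print(Tensor.stack(a, a, dim=0).shape)  # (2, 2, 3) -- new leading dim
print(Tensor.cat(a, a, dim=1).shape)    # (2, 6)
print(a[0].shape, a[:, 1:].shape)       # (3,) (2, 2) -- int and slice indexing
```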
```python
from polygrad import Tensor

x = Tensor([1.0, 2.0])
x.requires_grad = True
loss = (x * x).sum()
loss.backward()
print(x.grad.numpy())  # [2.0, 4.0]
```
Call `backward()` on a scalar loss before calling `item()` or `numpy()` on the loss.
```python
from polygrad.nn import Linear, LayerNorm, RMSNorm, Embedding, Dropout
```
| Class | Computes | Description |
|---|---|---|
| `Linear(in_f, out_f, bias=True)` | y = x @ W.T + b | Fully connected layer |
| `LayerNorm(shape, eps=1e-5)` | (x - mean) / sqrt(var + eps) * w + b | Layer normalization |
| `RMSNorm(dim, eps=1e-5)` | x / rms(x) * w | Root mean square normalization |
| `Embedding(vocab, dim)` | Lookup table | Token embedding |
| `Dropout(p=0.5)` | Random zeroing | Training-only (controlled by `Tensor.training`) |
| `GroupNorm(groups, channels)` | Group normalization | Per-group normalization |
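A sketch of composing these layers into a small model, assuming layers can live as plain attributes on an ordinary Python class and are invoked by calling the instance (as `Linear` is in the training example above):

```python
from polygrad import Tensor
from polygrad.nn import Linear, LayerNorm, Dropout

class MLP:
    def __init__(self):
        self.fc1 = Linear(8, 16)
        self.norm = LayerNorm(16)
        self.drop = Dropout(p=0.1)
        self.fc2 = Linear(16, 2)

    def __call__(self, x):
        x = self.norm(self.fc1(x).relu())
        x = self.drop(x)   # only active while Tensor.training is True
        return self.fc2(x)

model = MLP()
print(model(Tensor.rand(4, 8)).shape)  # (4, 2)
```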
```python
from polygrad.nn import SGD, Adam, AdamW, get_parameters
```
| Class | Signature |
|---|---|
| SGD | `SGD(params, lr=0.01, momentum=0.0, weight_decay=0.0)` |
| Adam | `Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0)` |
| AdamW | `AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01)` |
All optimizers have `step()` and `zero_grad()` methods.
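For example, the earlier SGD loop rewritten with Adam (a sketch following the same pattern as the training example above):

```python
from polygrad import Tensor
from polygrad.nn import Linear, Adam, get_parameters

Tensor.manual_seed(0)
model = Linear(4, 1)
opt = Adam(get_parameters(model), lr=0.001)

x, y = Tensor.rand(8, 4), Tensor.rand(8, 1)
for _ in range(20):
    opt.zero_grad()
    loss = (model(x) - y).square().mean()
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.4f}")
```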
```python
from polygrad.nn import get_parameters, get_state_dict, load_state_dict

params = get_parameters(model)   # List of Tensor
sd = get_state_dict(model)       # {'weight': Tensor, 'bias': Tensor, ...}
load_state_dict(model2, sd)      # Load params into another model
```
Compile a training step into a reusable C program. The first call traces the computation graph; subsequent calls execute with zero scheduling overhead.
```python
from polygrad import Tensor
from polygrad.nn import Linear, SGD, get_parameters, compile_step

Tensor.manual_seed(42)
model = Linear(4, 1)
opt = SGD(get_parameters(model), lr=0.01)

# Sample inputs (shapes must match at runtime)
x = Tensor.rand(8, 4)
y = Tensor.rand(8, 1)

def train_step(model, opt, x, y):
    loss = (model(x) - y).square().mean()
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss

# Compile: traces forward + backward + optimizer into one PolyStep
step = compile_step(train_step, model, opt, x, y)

# Run: executes compiled kernels with current buffer data
for i in range(100):
    x._data[:] = ...  # update input data in-place
    y._data[:] = ...
    step.run()
    print(f"step {i}: loss = {step.loss_value():.4f}")
```
`compile_step` returns a `CompiledTrainingStep` with:

- `run()` – execute all compiled kernels (forward + backward + optimizer)
- `loss_value()` – read the loss scalar from the output buffer
- `n_kernels` – number of compiled kernels
- `n_intermediates` – number of pre-allocated intermediate buffers

Load pre-trained models directly from HuggingFace format (config.json + safetensors).
To use `download_hf()`, install the optional Hub client first:

```bash
pip install huggingface_hub
```
```python
from polygrad.hf import load_hf, download_hf, generate
import numpy as np
import json
from pathlib import Path

# Download a small GPT-2 checkpoint from HuggingFace Hub
model_path = download_hf('hf-internal-testing/tiny-random-gpt2')
config = json.loads((Path(model_path) / 'config.json').read_text())
vocab_size = config['vocab_size']
max_seq_len = 16

# Load into a PolyInstance
inst = load_hf(model_path, max_batch=1, max_seq_len=max_seq_len)

# Run forward pass
tokens = np.array([[1, 2, 3, 4]], dtype=np.float32)
outputs = inst.forward(
    x=tokens,
    positions=np.arange(tokens.shape[1], dtype=np.float32).reshape(1, -1),
    arange=np.arange(max_seq_len, dtype=np.float32),
)
logits = outputs['output'].reshape(1, max_seq_len, vocab_size)

# Autoregressive generation
result = generate(inst, tokens, max_new_tokens=2, temperature=1.0, top_k=10)
```
| Function | Description |
|---|---|
| `load_hf(path, max_batch=1, max_seq_len=0)` | Load model from local directory |
| `load_hf_bytes(config, weights, ...)` | Load from raw bytes (no filesystem) |
| `download_hf(repo_id, cache_dir=None)` | Download from HuggingFace Hub |
| `generate(inst, tokens, max_new_tokens, ...)` | Autoregressive text generation |
Supported model types: GPT-2. Weight formats: F32, F16, BF16 safetensors (single or sharded).
- Tensors are lazy; computation runs when you call `realize()`, `numpy()`, `item()`, or `backward()`.
- `gelu()` uses the tanh approximation: gelu = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
- Insert explicit `.realize()` calls to create kernel boundaries for the scheduler.
- `backward()` calls C's `poly_grad()` for each parameter, then realizes the gradient tensors.
- CUDA support requires the NVIDIA driver and runtime compiler libraries (`libcuda` and `libnvrtc`) on the host.

Run the tests:

```bash
python -m pytest py/tests/ -v  # 130 tests (tensor + nn + compiled step + GPT-2 + HF loading + instance)
```