Writing Problem Files¶
A problem file describes the PyTorch operation you want Makora to optimize. It follows a specific format so the platform can validate, benchmark, and generate optimized kernels.
Problem File Format¶
Every problem file must contain three things:
- A class named `Model` that extends `torch.nn.Module`
- A function `get_inputs()` that returns a list of input tensors
- A function `get_init_inputs()` that returns a list of constructor arguments
The Model Class¶
```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)
```
Requirements:
- The class must be named `Model`
- It must inherit from `torch.nn.Module`
- The `forward` method defines the operation to optimize
- Constructor parameters (if any) are provided by `get_init_inputs()`
The get_inputs() Function¶
Returns a list of tensors that will be passed positionally to `Model.forward()`.
Requirements:
- Must return a list (positional args to `forward`)
- Tensors must be on CPU (Makora moves them to GPU automatically)
- Use explicit dtypes when precision matters: `torch.rand(N, N, dtype=torch.float32)`
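For instance, a `get_inputs()` for a square matmul problem might look like this (a sketch; the size `N` is illustrative):

```python
import torch

N = 4096  # illustrative size; use whatever your workload needs

def get_inputs():
    # CPU tensors with an explicit dtype; Makora moves them to the GPU
    A = torch.rand(N, N, dtype=torch.float32)
    B = torch.rand(N, N, dtype=torch.float32)
    return [A, B]
```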
The get_init_inputs() Function¶
Returns a list of arguments passed to the `Model()` constructor.
If your model takes constructor parameters:
```python
class Model(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight.T

def get_init_inputs():
    return [1024, 2048]  # in_features=1024, out_features=2048
```
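To see how the pieces fit together, here is a self-contained sketch (with illustrative sizes) of how the platform wires `get_init_inputs()` into the constructor and `get_inputs()` into `forward`:

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight.T

def get_init_inputs():
    return [16, 32]  # illustrative sizes

def get_inputs():
    return [torch.rand(8, 16)]

# Conceptually, the platform does the equivalent of:
model = Model(*get_init_inputs())
out = model(*get_inputs())
print(out.shape)  # torch.Size([8, 32])
```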
Common mistakes that cause validation failures
- The class must be named exactly `Model`, not `MyModel`, `Net`, or anything else.
- All tensors returned by `get_inputs()` must be on CPU. Remove any `.cuda()` or `.to("cuda")` calls; Makora handles device placement automatically.
Key Requirements¶
| Requirement | Details |
|---|---|
| Class name | Must be `Model` |
| Inheritance | Must extend `torch.nn.Module` |
| Tensor placement | All tensors from `get_inputs()` must be on CPU |
| Explicit dtype | Specify `dtype` when precision matters |
| Deterministic shapes | Dimensions should be fixed constants, not random |
| No side effects | `get_inputs()` should only create tensors |
Solution File Format¶
A solution file contains the optimized implementation. Makora generates these automatically, but you can also write them by hand.
Basic Structure¶
```python
import torch
import torch.nn as nn

class ModelNew(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        # Optimized implementation here
        ...
```
Requirements:
- The class must be named `ModelNew`
- The `forward` method signature must match the original `Model.forward()`
- Constructor signature must match the original `Model.__init__()`
- Output must be numerically equivalent to the original (within tolerance)
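As a degenerate example, the following hand-written solution satisfies every requirement for the matmul problem above without any custom kernel (a sketch; real solutions replace the body with an optimized implementation):

```python
import torch
import torch.nn as nn

class ModelNew(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        # Same math as the reference Model; any implementation is valid
        # as long as the output matches within the configured tolerances
        return A @ B
```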
CUDA Solution Layout¶
```python
import torch
import torch.nn as nn
from torch.utils.cpp_extension import load_inline

cuda_source = """
#include <torch/extension.h>
#include <cuda_runtime.h>

__global__ void matmul_kernel(const float* A, const float* B, float* C, int N) {
    // ... CUDA kernel implementation
}

torch::Tensor matmul_cuda(torch::Tensor A, torch::Tensor B) {
    // ... launch kernel
}
"""

cpp_source = "torch::Tensor matmul_cuda(torch::Tensor A, torch::Tensor B);"

matmul_module = load_inline(
    name="matmul_cuda",
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=["matmul_cuda"],
    verbose=False,
)

class ModelNew(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return matmul_module.matmul_cuda(A.cuda(), B.cuda())
```
Triton Solution Layout¶
```python
import torch
import torch.nn as nn
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, BLOCK_SIZE_K: tl.constexpr,
):
    # ... Triton kernel implementation
    pass

class ModelNew(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        M, K = A.shape
        K, N = B.shape
        C = torch.empty(M, N, device=A.device, dtype=A.dtype)
        # ... launch Triton kernel
        return C
```
Correctness Tolerances¶
Optimized GPU kernels don't always produce bit-exact results. Floating-point operations can be reordered, fused, or computed at different precisions — all of which introduce small numerical differences. Makora uses two tolerance parameters to decide whether a kernel's output is "correct enough":
- Absolute tolerance (`atol`): the maximum allowed absolute difference between any element of the reference output and the optimized output. Default: `0.01`.
- Relative tolerance (`rtol`): the maximum allowed relative difference, scaled by the magnitude of the expected value. Default: `0.01`.
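A quick illustration of why bit-exact comparison is too strict: summing the same float32 values in a different order can change the result slightly, yet both answers agree comfortably within the default tolerances.

```python
import torch

torch.manual_seed(0)
x = torch.rand(100_000, dtype=torch.float32)

forward = x.sum()
reverse = x.flip(0).sum()

# The two sums may differ in the last few bits due to reassociation,
# but they agree well within atol=0.01, rtol=0.01
print(torch.allclose(forward, reverse, atol=0.01, rtol=0.01))  # True
```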
A kernel passes validation when every element satisfies `|optimized - reference| <= atol + rtol * |reference|`, the same formula used by `torch.allclose()`.
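The check is elementwise; this sketch applies the formula by hand and confirms it agrees with `torch.allclose()` (values are illustrative):

```python
import torch

reference = torch.tensor([1.000, 2.000, 4.000])
optimized = torch.tensor([1.005, 2.005, 4.020])

atol, rtol = 0.01, 0.01
manual = (optimized - reference).abs() <= atol + rtol * reference.abs()

print(manual.all().item())  # True
print(torch.allclose(optimized, reference, atol=atol, rtol=rtol))  # True
```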
Configuring Tolerances¶
Set tolerances when generating:
```bash
# Default (atol=0.01, rtol=0.01) — good for most float32 operations
makora generate --file problem.py --device H100

# Tighter tolerances for high-precision workloads
makora generate --file problem.py --device H100 --atol 1e-5 --rtol 1e-5

# Looser tolerances for approximate operations (e.g., softmax, layer norm)
makora generate --file problem.py --device H100 --atol 1e-1 --rtol 1e-1
```
Close-Miss Kernels¶
When a kernel produces results that are close but outside the configured tolerances, Makora flags it as a close-miss. These kernels appear in `makora kernels` with their best achieved `atol` and `rtol` values instead of a simple pass/fail status.
Close-misses often indicate kernels that are functionally correct but need slightly looser tolerances. If you see close-miss results, consider whether your use case can accept the reported tolerance levels, and run `generate` again with adjusted `--atol`/`--rtol` if so.
Choosing Tolerances¶
Tolerance guidance by operation type
Start with the defaults (0.01 / 0.01). Only loosen tolerances if you see close-miss kernels that are functionally correct for your use case.
| Operation Type | Suggested `atol` | Suggested `rtol` |
|---|---|---|
| Matrix multiply (float32) | `1e-2` | `1e-2` |
| Reduction ops (sum, mean) | `1e-2` | `1e-2` |
| Transcendental ops (exp, log, softmax) | `1e-1` | `1e-1` |
| High-precision requirements | `1e-5` | `1e-5` |
| Half-precision (float16) | `1e-1` | `1e-1` |
Validation Checklist¶
Before generating, verify:
- [ ] Class is named `Model` (not `MyModel`, `Net`, etc.)
- [ ] `Model` extends `torch.nn.Module`
- [ ] `get_inputs()` returns a list of CPU tensors
- [ ] `get_init_inputs()` returns a list of constructor arguments (or `[]`)
- [ ] Tensor shapes are fixed (no randomness in dimensions)
- [ ] The model runs correctly: `Model(*get_init_inputs())(*get_inputs())`
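The last item is easy to check locally before submitting. A minimal smoke test (illustrative, and not a substitute for the platform's own validation) looks like:

```python
import torch
import torch.nn as nn

# In practice you would import these from your problem file,
# e.g. `from problem import Model, get_inputs, get_init_inputs`;
# they are defined inline here so the sketch is self-contained.
class Model(nn.Module):
    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)

def get_inputs():
    return [torch.rand(4, 4), torch.rand(4, 4)]

def get_init_inputs():
    return []

inputs = get_inputs()
assert all(t.device.type == "cpu" for t in inputs), "inputs must be CPU tensors"
out = Model(*get_init_inputs())(*inputs)
print(out.shape)  # torch.Size([4, 4])
```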
You can validate your file without starting an optimization session:
Complete Examples¶
Example 1: Square Matrix Multiplication
```python
import torch
import torch.nn as nn

class Model(nn.Module):
    """
    Simple model that performs a single square matrix multiplication (C = A * B)
    """
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)

N = 2048 * 2

def get_inputs():
    A = torch.rand(N, N)
    B = torch.rand(N, N)
    return [A, B]

def get_init_inputs():
    return []  # No special initialization inputs needed
```
Example 2: Rectangular Matrix Multiplication
```python
import torch
import torch.nn as nn

class Model(nn.Module):
    """
    Simple model that performs a single matrix multiplication (C = A * B)
    """
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)

M = 1024 * 2
K = 4096 * 2
N = 2048 * 2

def get_inputs():
    A = torch.rand(M, K)
    B = torch.rand(K, N)
    return [A, B]

def get_init_inputs():
    return []  # No special initialization inputs needed
```
Converting from Other Formats¶
If you have standalone PyTorch code, wrap it in the problem file format:
- Move the operation into `Model.forward()`
- Move input creation into `get_inputs()`, returning CPU tensors as a list
- Move any model constructor arguments into `get_init_inputs()`
- Remove any GPU device placement (`.cuda()`, `.to("cuda")`) from input creation
Before:
```python
A = torch.rand(1024, 1024, device="cuda")
B = torch.rand(1024, 1024, device="cuda")
C = torch.matmul(A, B)
```
After:
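Applying the steps above to that snippet, the converted problem file would look like this (the device arguments move out of input creation, and shapes stay as fixed constants):

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)

def get_inputs():
    # CPU tensors; Makora handles device placement
    A = torch.rand(1024, 1024)
    B = torch.rand(1024, 1024)
    return [A, B]

def get_init_inputs():
    return []
```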