Writing Problem Files¶
A problem file describes the PyTorch operation you want Makora to optimize. It follows a specific format so the platform can validate, benchmark, and generate optimized kernels.
Problem File Format¶
Every problem file must contain three things:
- A class named `Model` that extends `torch.nn.Module`
- A function `get_inputs()` that returns a list of input tensors
- A function `get_init_inputs()` that returns a list of constructor arguments
The Model Class¶
```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)
```
Requirements:
- The class must be named `Model`
- It must inherit from `torch.nn.Module`
- The `forward` method defines the operation to optimize
- Constructor parameters (if any) are provided by `get_init_inputs()`
The get_inputs() Function¶
Returns a list of tensors that will be passed positionally to `Model.forward()`.
Requirements:
- Must return a list (positional args to `forward`)
- Tensors must be on CPU (Makora moves them to GPU automatically)
- Use explicit dtypes when precision matters: `torch.rand(N, N, dtype=torch.float32)`
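For instance, a `get_inputs()` for a square matmul problem might look like this (a sketch; the size `N` is illustrative):

```python
import torch

N = 4096  # illustrative size; use whatever your workload needs

def get_inputs():
    # CPU tensors with an explicit dtype; Makora moves them to the GPU
    A = torch.rand(N, N, dtype=torch.float32)
    B = torch.rand(N, N, dtype=torch.float32)
    return [A, B]
```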
The get_init_inputs() Function¶
Returns a list of arguments passed to the `Model()` constructor.
If your model takes constructor parameters:
```python
class Model(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight.T

def get_init_inputs():
    return [1024, 2048]  # in_features=1024, out_features=2048
```
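To see how the pieces fit together, here is a self-contained sketch (with illustrative sizes) of how the platform wires `get_init_inputs()` into the constructor and `get_inputs()` into `forward`:

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight.T

def get_init_inputs():
    return [16, 32]  # illustrative sizes

def get_inputs():
    return [torch.rand(8, 16)]

# Conceptually, the platform does the equivalent of:
model = Model(*get_init_inputs())
out = model(*get_inputs())
print(out.shape)  # torch.Size([8, 32])
```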
Common mistakes that cause validation failures
- The class must be named exactly `Model`, not `MyModel`, `Net`, or anything else.
- All tensors returned by `get_inputs()` must be on CPU. Remove any `.cuda()` or `.to("cuda")` calls; Makora handles device placement automatically.
Key Requirements¶
| Requirement | Details |
|---|---|
| Class name | Must be `Model` |
| Inheritance | Must extend `torch.nn.Module` |
| Tensor placement | All tensors from `get_inputs()` must be on CPU |
| Explicit dtype | Specify `dtype` when precision matters |
| Deterministic shapes | Dimensions should be fixed constants, not random |
| No side effects | `get_inputs()` should only create tensors |
Solution File Format¶
A solution file contains the optimized implementation. Makora generates these automatically, but you can also write them by hand.
Basic Structure¶
```python
import torch
import torch.nn as nn

class ModelNew(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        # Optimized implementation here
        ...
```
Requirements:
- The class must be named `ModelNew`
- The `forward` method signature must match the original `Model.forward()`
- Constructor signature must match the original `Model.__init__()`
- Output must be numerically equivalent to the original (within tolerance)
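As a degenerate example, the following hand-written solution satisfies every requirement for the matmul problem above without any custom kernel (a sketch; real solutions replace the body with an optimized implementation):

```python
import torch
import torch.nn as nn

class ModelNew(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        # Same math as the reference Model; any implementation is valid
        # as long as the output matches within the configured tolerances
        return A @ B
```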
CUDA Solution Layout¶
```python
import torch
import torch.nn as nn
from torch.utils.cpp_extension import load_inline

cuda_source = """
#include <torch/extension.h>
#include <cuda_runtime.h>

__global__ void matmul_kernel(const float* A, const float* B, float* C, int N) {
    // ... CUDA kernel implementation
}

torch::Tensor matmul_cuda(torch::Tensor A, torch::Tensor B) {
    // ... launch kernel
}
"""

cpp_source = "torch::Tensor matmul_cuda(torch::Tensor A, torch::Tensor B);"

matmul_module = load_inline(
    name="matmul_cuda",
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=["matmul_cuda"],
    verbose=False,
)

class ModelNew(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return matmul_module.matmul_cuda(A.cuda(), B.cuda())
```
Triton Solution Layout¶
```python
import torch
import torch.nn as nn
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, BLOCK_SIZE_K: tl.constexpr,
):
    # ... Triton kernel implementation
    pass

class ModelNew(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        M, K = A.shape
        K, N = B.shape
        C = torch.empty(M, N, device=A.device, dtype=A.dtype)
        # ... launch Triton kernel
        return C
```
Correctness Tolerances¶
Optimized GPU kernels don't always produce bit-exact results. Floating-point operations can be reordered, fused, or computed at different precisions — all of which introduce small numerical differences. Makora uses two tolerance parameters to decide whether a kernel's output is "correct enough":
- Absolute tolerance (`atol`): the maximum allowed absolute difference between any element of the reference output and the optimized output. Default: `0.01`.
- Relative tolerance (`rtol`): the maximum allowed relative difference, scaled by the magnitude of the expected value. Default: `0.01`.
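A quick illustration of why bit-exact comparison is too strict: summing the same float32 values in a different order can change the result slightly, yet both answers agree comfortably within the default tolerances.

```python
import torch

torch.manual_seed(0)
x = torch.rand(100_000, dtype=torch.float32)

forward = x.sum()
reverse = x.flip(0).sum()

# The two sums may differ in the last few bits due to reassociation,
# but they agree well within atol=0.01, rtol=0.01
print(torch.allclose(forward, reverse, atol=0.01, rtol=0.01))  # True
```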
A kernel passes validation when every element satisfies `|optimized - reference| <= atol + rtol * |reference|`, the same formula used by `torch.allclose()`.
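The check is elementwise; this sketch applies the formula by hand and confirms it agrees with `torch.allclose()` (values are illustrative):

```python
import torch

reference = torch.tensor([1.000, 2.000, 4.000])
optimized = torch.tensor([1.005, 2.005, 4.020])

atol, rtol = 0.01, 0.01
manual = (optimized - reference).abs() <= atol + rtol * reference.abs()

print(manual.all().item())  # True
print(torch.allclose(optimized, reference, atol=atol, rtol=rtol))  # True
```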
Configuring Tolerances¶
Set tolerances when generating:
```bash
# Default (atol=0.01, rtol=0.01) — good for most float32 operations
makora generate --file problem.py --device H100

# Tighter tolerances for high-precision workloads
makora generate --file problem.py --device H100 --atol 1e-5 --rtol 1e-5

# Looser tolerances for approximate operations (e.g., softmax, layer norm)
makora generate --file problem.py --device H100 --atol 1e-1 --rtol 1e-1
```
Close-Miss Kernels¶
When a kernel produces results that are close but outside the configured tolerances, Makora flags it as a close-miss. These kernels appear in `makora kernels` with their best achieved `atol` and `rtol` values instead of a simple pass/fail status.
Close-misses often indicate kernels that are functionally correct but need slightly looser tolerances. If you see close-miss results, consider whether your use case can accept the reported tolerance levels, and run `generate` again with adjusted `--atol`/`--rtol` if so.
Choosing Tolerances¶
Tolerance guidance by operation type
Start with the defaults (0.01 / 0.01). Only loosen tolerances if you see close-miss kernels that are functionally correct for your use case.
| Operation Type | Suggested `atol` | Suggested `rtol` |
|---|---|---|
| Matrix multiply (float32) | `1e-2` | `1e-2` |
| Reduction ops (sum, mean) | `1e-2` | `1e-2` |
| Transcendental ops (exp, log, softmax) | `1e-1` | `1e-1` |
| High-precision requirements | `1e-5` | `1e-5` |
| Half-precision (float16) | `1e-1` | `1e-1` |
Validation Checklist¶
Before generating, verify:
- [ ] Class is named `Model` (not `MyModel`, `Net`, etc.)
- [ ] `Model` extends `torch.nn.Module`
- [ ] `get_inputs()` returns a list of CPU tensors
- [ ] `get_init_inputs()` returns a list of constructor arguments (or `[]`)
- [ ] Tensor shapes are fixed (no randomness in dimensions)
- [ ] The model runs correctly: `Model(*get_init_inputs())(*get_inputs())`
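The last item is easy to check locally before submitting. A minimal smoke test (illustrative, and not a substitute for the platform's own validation) looks like:

```python
import torch
import torch.nn as nn

# In practice you would import these from your problem file,
# e.g. `from problem import Model, get_inputs, get_init_inputs`;
# they are defined inline here so the sketch is self-contained.
class Model(nn.Module):
    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)

def get_inputs():
    return [torch.rand(4, 4), torch.rand(4, 4)]

def get_init_inputs():
    return []

inputs = get_inputs()
assert all(t.device.type == "cpu" for t in inputs), "inputs must be CPU tensors"
out = Model(*get_init_inputs())(*inputs)
print(out.shape)  # torch.Size([4, 4])
```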
You can validate your file without starting an optimization session:
Complete Examples¶
Example 1: Square Matrix Multiplication
```python
import torch
import torch.nn as nn

class Model(nn.Module):
    """
    Simple model that performs a single square matrix multiplication (C = A * B)
    """
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)

N = 2048 * 2

def get_inputs():
    A = torch.rand(N, N)
    B = torch.rand(N, N)
    return [A, B]

def get_init_inputs():
    return []  # No special initialization inputs needed
```
Example 2: Rectangular Matrix Multiplication
```python
import torch
import torch.nn as nn

class Model(nn.Module):
    """
    Simple model that performs a single matrix multiplication (C = A * B)
    """
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)

M = 1024 * 2
K = 4096 * 2
N = 2048 * 2

def get_inputs():
    A = torch.rand(M, K)
    B = torch.rand(K, N)
    return [A, B]

def get_init_inputs():
    return []  # No special initialization inputs needed
```
Converting from Other Formats¶
If you have standalone PyTorch code, wrap it in the problem file format:
- Move the operation into `Model.forward()`
- Move input creation into `get_inputs()`, returning CPU tensors as a list
- Move any model constructor arguments into `get_init_inputs()`
- Remove any GPU device placement (`.cuda()`, `.to("cuda")`) from input creation
Before:
```python
A = torch.rand(1024, 1024, device="cuda")
B = torch.rand(1024, 1024, device="cuda")
C = torch.matmul(A, B)
```
After:
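Applying the steps above to that snippet, the converted problem file would look like this (the device arguments move out of input creation, and shapes stay as fixed constants):

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)

def get_inputs():
    # CPU tensors; Makora handles device placement
    A = torch.rand(1024, 1024)
    B = torch.rand(1024, 1024)
    return [A, B]

def get_init_inputs():
    return []
```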