Writing Problem Files

A problem file describes the PyTorch operation you want Makora to optimize. It follows a specific format so the platform can validate, benchmark, and generate optimized kernels.

Problem File Format

Every problem file must contain three things:

  1. A class named Model that extends torch.nn.Module
  2. A function get_inputs() that returns a list of input tensors
  3. A function get_init_inputs() that returns a list of constructor arguments

The Model Class

import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)

Requirements:

  • The class must be named Model
  • It must inherit from torch.nn.Module
  • The forward method defines the operation to optimize
  • Constructor parameters (if any) are provided by get_init_inputs()

The get_inputs() Function

Returns a list of tensors that will be passed to Model.forward():

N = 4096  # example size; dimensions should be fixed module-level constants

def get_inputs():
    A = torch.rand(N, N)
    B = torch.rand(N, N)
    return [A, B]

Requirements:

  • Must return a list (positional args to forward)
  • Tensors must be on CPU (Makora moves them to GPU automatically)
  • Use explicit dtypes when precision matters: torch.rand(N, N, dtype=torch.float32)

The get_init_inputs() Function

Returns a list of arguments passed to the Model() constructor:

def get_init_inputs():
    return []  # No constructor arguments

If your model takes constructor parameters:

class Model(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight.T

def get_init_inputs():
    return [1024, 2048]  # in_features=1024, out_features=2048
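
The arguments are unpacked positionally, so for the example above the construction Makora performs during validation is equivalent to:

# Equivalent to Model(*get_init_inputs()) for the example above
model = Model(1024, 2048)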

Common mistakes that cause validation failures

  • The class must be named exactly Model — not MyModel, Net, or anything else.
  • All tensors returned by get_inputs() must be on CPU. Remove any .cuda() or .to("cuda") calls — Makora handles device placement automatically.

Key Requirements

Requirement            Details
Class name             Must be Model
Inheritance            Must extend torch.nn.Module
Tensor placement       All tensors from get_inputs() must be on CPU
Explicit dtype         Specify dtype when precision matters
Deterministic shapes   Dimensions should be fixed constants, not random (see sketch below)
No side effects        get_inputs() should only create tensors
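
The deterministic-shapes requirement is the one most often missed; as a sketch, pin every dimension to a module-level constant instead of sampling it:

# Avoid: randomized dimensions make benchmark runs unreproducible
def get_inputs():
    n = int(torch.randint(512, 4096, (1,)))
    return [torch.rand(n, n), torch.rand(n, n)]

# Prefer: fixed module-level constants
N = 4096

def get_inputs():
    return [torch.rand(N, N), torch.rand(N, N)]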

Solution File Format

A solution file contains the optimized implementation. Makora generates these automatically, but you can also write them by hand.

Basic Structure

import torch
import torch.nn as nn

class ModelNew(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        # Optimized implementation here
        ...

Requirements:

  • The class must be named ModelNew
  • The forward method signature must match the original Model.forward()
  • Constructor signature must match the original Model.__init__()
  • Output must be numerically equivalent to the original (within tolerance)
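
The smallest solution file that satisfies these requirements is simply the reference model renamed; it should pass the correctness check (it is numerically identical) but offers no speedup, so it is mainly useful as a starting point for hand-written optimizations:

import torch
import torch.nn as nn

class ModelNew(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        # Numerically identical to the reference Model; a real solution would
        # replace this line with a fused or hand-tuned implementation.
        return torch.matmul(A, B)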

CUDA Solution Layout

import torch
import torch.nn as nn
from torch.utils.cpp_extension import load_inline

cuda_source = """
#include <torch/extension.h>
#include <cuda_runtime.h>

__global__ void matmul_kernel(const float* A, const float* B, float* C, int N) {
    // ... CUDA kernel implementation
}

torch::Tensor matmul_cuda(torch::Tensor A, torch::Tensor B) {
    // ... launch kernel
}
"""

cpp_source = "torch::Tensor matmul_cuda(torch::Tensor A, torch::Tensor B);"

matmul_module = load_inline(
    name="matmul_cuda",
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=["matmul_cuda"],
    verbose=False,
)

class ModelNew(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return matmul_module.matmul_cuda(A.cuda(), B.cuda())
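
As a sketch of what the elided pieces might contain, here is a naive square-matmul kernel that fills in the layout above. It is unoptimized and shown only to make the structure concrete; a real solution would typically use tiling, shared memory, or tensor cores:

cuda_source = """
#include <torch/extension.h>
#include <cuda_runtime.h>

__global__ void matmul_kernel(const float* A, const float* B, float* C, int N) {
    // One thread per output element of C
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k) {
            acc += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = acc;
    }
}

torch::Tensor matmul_cuda(torch::Tensor A, torch::Tensor B) {
    int N = A.size(0);
    auto C = torch::empty({N, N}, A.options());
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    matmul_kernel<<<grid, block>>>(
        A.data_ptr<float>(), B.data_ptr<float>(), C.data_ptr<float>(), N);
    return C;
}
"""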

Triton Solution Layout

import torch
import torch.nn as nn
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, BLOCK_SIZE_K: tl.constexpr,
):
    # ... Triton kernel implementation
    pass

class ModelNew(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        M, K = A.shape
        K, N = B.shape
        C = torch.empty(M, N, device=A.device, dtype=A.dtype)
        # ... launch Triton kernel
        return C
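
One common way to fill in the elided launch inside forward(), matching the kernel signature above (block sizes are illustrative, not tuned):

# Launch one Triton program per (BLOCK_SIZE_M x BLOCK_SIZE_N) tile of C
grid = lambda meta: (
    triton.cdiv(M, meta["BLOCK_SIZE_M"]) * triton.cdiv(N, meta["BLOCK_SIZE_N"]),
)
matmul_kernel[grid](
    A, B, C,
    M, N, K,
    A.stride(0), A.stride(1),
    B.stride(0), B.stride(1),
    C.stride(0), C.stride(1),
    BLOCK_SIZE_M=64, BLOCK_SIZE_N=64, BLOCK_SIZE_K=32,
)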

Correctness Tolerances

Optimized GPU kernels don't always produce bit-exact results. Floating-point operations can be reordered, fused, or computed at different precisions — all of which introduce small numerical differences. Makora uses two tolerance parameters to decide whether a kernel's output is "correct enough":

  • Absolute tolerance (atol) — The maximum allowed absolute difference between any element in the reference output and the optimized output. Default: 0.01.
  • Relative tolerance (rtol) — The maximum allowed relative difference, scaled by the magnitude of the expected value. Default: 0.01.

A kernel passes validation when every element satisfies:

|reference - optimized| ≤ atol + rtol × |reference|

This is the same formula used by torch.allclose().
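
In code, an equivalent check with the default tolerances looks like this (optimized_output and reference_output stand for the two tensors being compared):

# Element-wise: |reference - optimized| <= atol + rtol * |reference|
passed = torch.allclose(optimized_output, reference_output, rtol=0.01, atol=0.01)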

Configuring Tolerances

Set tolerances when generating:

# Default (atol=0.01, rtol=0.01) — good for most float32 operations
makora generate --file problem.py --device H100

# Tighter tolerances for high-precision workloads
makora generate --file problem.py --device H100 --atol 1e-5 --rtol 1e-5

# Looser tolerances for approximate operations (e.g., softmax, layer norm)
makora generate --file problem.py --device H100 --atol 1e-1 --rtol 1e-1

Close-Miss Kernels

When a kernel produces results that are close but outside the configured tolerances, Makora flags it as a close-miss. These kernels appear in makora kernels with their best achieved atol and rtol values instead of a simple pass/fail status.

Close-misses often indicate kernels that are functionally correct but need slightly looser tolerances. If you see close-miss results, consider whether your use case can accept the reported tolerance levels; if it can, re-run makora generate with adjusted --atol/--rtol.

Choosing Tolerances

Tolerance guidance by operation type

Start with the defaults (0.01 / 0.01). Only loosen tolerances if you see close-miss kernels that are functionally correct for your use case.

Operation type                           Suggested atol   Suggested rtol
Matrix multiply (float32)                1e-2             1e-2
Reduction ops (sum, mean)                1e-2             1e-2
Transcendental ops (exp, log, softmax)   1e-1             1e-1
High-precision requirements              1e-5             1e-5
Half-precision (float16)                 1e-1             1e-1

Validation Checklist

Before generating, verify:

  • [ ] Class is named Model (not MyModel, Net, etc.)
  • [ ] Model extends torch.nn.Module
  • [ ] get_inputs() returns a list of CPU tensors
  • [ ] get_init_inputs() returns a list of constructor arguments (or [])
  • [ ] Tensor shapes are fixed (no randomness in dimensions)
  • [ ] The model runs correctly: Model(*get_init_inputs())(*get_inputs())

You can validate your file without starting an optimization session:

makora check problem.py
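
You can also exercise the file directly in Python before invoking the CLI; a minimal sketch, assuming the problem file is saved as problem.py:

import torch
from problem import Model, get_inputs, get_init_inputs

model = Model(*get_init_inputs())
output = model(*get_inputs())
print(output.shape if torch.is_tensor(output) else type(output))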

Complete Examples

Example 1: Square Matrix Multiplication

import torch
import torch.nn as nn


class Model(nn.Module):
    """
    Simple model that performs a single square matrix multiplication (C = A * B)
    """

    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)


N = 2048 * 2


def get_inputs():
    A = torch.rand(N, N)
    B = torch.rand(N, N)
    return [A, B]


def get_init_inputs():
    return []  # No special initialization inputs needed

Example 2: Rectangular Matrix Multiplication

import torch
import torch.nn as nn


class Model(nn.Module):
    """
    Simple model that performs a single matrix multiplication (C = A * B)
    """

    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)


M = 1024 * 2
K = 4096 * 2
N = 2048 * 2


def get_inputs():
    A = torch.rand(M, K)
    B = torch.rand(K, N)
    return [A, B]


def get_init_inputs():
    return []  # No special initialization inputs needed

Converting from Other Formats

If you have standalone PyTorch code, wrap it in the problem file format:

  1. Move the operation into Model.forward()
  2. Move input creation into get_inputs() — return CPU tensors as a list
  3. Move any model constructor arguments into get_init_inputs()
  4. Remove any GPU device placement (.cuda(), .to("cuda")) from input creation

Before:

A = torch.rand(1024, 1024, device="cuda")
B = torch.rand(1024, 1024, device="cuda")
C = torch.matmul(A, B)

After:

import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return torch.matmul(A, B)

def get_inputs():
    return [torch.rand(1024, 1024), torch.rand(1024, 1024)]

def get_init_inputs():
    return []