Commands

This page contains all Makora CLI command documentation.

Authentication Commands

makora login

Authenticate with the Makora API. Stores credentials locally for all subsequent commands.

Usage

# Interactive — prompts for token
makora login

# Non-interactive — pass token directly
makora login --token YOUR_TOKEN

Options

Option Type Description
--token string API token (skips interactive prompt)
--user string Username (optional)
--url string Override the Makora API URL
--quiet flag Disable interactive prompts

Where to Get a Token

  1. Go to https://generate.makora.com/tokens
  2. Log in or create an account
  3. Create a new token or copy an existing one

Credential Storage

Tip

Credentials are stored as a plain text file. For CI/CD pipelines, use the MAKORA_USER_FILE environment variable to point to a credentials file managed by your secrets system.

Credentials are saved to ~/.makora/user by default. Override this with the MAKORA_USER_FILE environment variable.

Examples

# Interactive login
makora login

# Script-friendly login
makora login --token mk-abc123def456

makora logout

Remove stored credentials.

Usage

makora logout

Deletes the credential file at ~/.makora/user.


makora info

Display version information, login status, and environment variable settings.

Usage

makora info

Output

Makora version: 0.1.0
    Repo: makora-cli
    Commit: abc123

Logged in as: user@example.com

Env Variable       Value                                  Default
MAKORA_AUTH_URL    https://be.stage.makora.com/api/v1/    https://be.stage.makora.com/api/v1/
MAKORA_NO_RICH
MAKORA_URL         https://generate.stage.makora.com      https://generate.stage.makora.com
MAKORA_USER_FILE   ~/.makora/user                         ~/.makora/user

Environment Variables

Note

These variables are only needed for advanced use cases like pointing at a staging server or custom credential paths. Most users won't need to change them.

Variable Default Description
MAKORA_URL https://generate.stage.makora.com Base URL for the Makora Generate API
MAKORA_USER_FILE ~/.makora/user Path to the credential file
MAKORA_AUTH_URL https://be.stage.makora.com/api/v1/ Base URL for the authentication API
MAKORA_NO_RICH (empty) Set to any value to disable Rich text formatting
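As a rough illustration of the override behavior described in the table, a client could resolve each setting like this (a sketch, not Makora's actual implementation):

```python
import os

# Documented defaults from the table above.
DEFAULTS = {
    "MAKORA_URL": "https://generate.stage.makora.com",
    "MAKORA_USER_FILE": "~/.makora/user",
    "MAKORA_AUTH_URL": "https://be.stage.makora.com/api/v1/",
    "MAKORA_NO_RICH": "",
}

def setting(name: str) -> str:
    """Return the environment override if set, else the documented default."""
    return os.environ.get(name, DEFAULTS[name])
```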

Generate & Optimize

makora generate

Run generation on a problem file for optimization. Makora validates the file, then creates an optimization session that generates progressively faster kernels.

Usage

makora generate --file <path> --device <device> [options]

Options

Option Type Default Description
--file path required Path to the problem file
--device enum required Target device (H100, H200, B200, L40S, MI300X, Adreno 830, Adreno 750, Hexagon v79, Hexagon v75)
--language enum device default Kernel language (cuda, triton, cutedsl, hip, opencl, ripple)
--label string "" Label for the session (visible in makora jobs)
--atol float 0.01 Absolute tolerance for correctness validation (see Tolerances)
--rtol float 0.01 Relative tolerance for correctness validation (see Tolerances)
--fix flag false Enable automatic fix suggestions for validation errors
--instr path(s) none Path(s) to instruction files providing optimization context
--url string none Override the Makora API URL
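The --atol and --rtol options control the elementwise correctness test. As a sketch, assuming the standard combined-tolerance criterion used by tools like torch.allclose (not necessarily Makora's exact code), an output x is accepted against reference y when |x - y| <= atol + rtol * |y|:

```python
def within_tolerance(xs, ys, atol=0.01, rtol=0.01):
    """Elementwise combined-tolerance check: |x - y| <= atol + rtol * |y|.

    Defaults mirror the documented --atol/--rtol defaults. Illustrative only.
    """
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(xs, ys))
```

Tightening the tolerances (e.g. --atol 1e-3 --rtol 1e-3) shrinks the acceptance band accordingly.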

How It Works

When you run makora generate, the following happens:

  1. Validation — Your problem file is uploaded and validated:
    • Compilation check
    • Preparing objects for execution
    • Benchmarking to establish baseline performance
  2. Session creation — If validation passes, an optimization session starts
  3. Kernel generation — The platform generates and benchmarks optimized kernels in the background

The --fix Flag

Tip

Always try --fix when a run fails validation. It can automatically correct common issues like missing imports, wrong class names, or tensor device placement.

If validation fails, use --fix to get automatic fix suggestions:

makora generate --file problem.py --device H100 --fix

When a fix is available, Makora shows the suggested changes and asks if you want to accept them. If you accept, the fixed code is run again automatically.

Without --fix, a failing run prints a hint:

Hint: try generating with --fix to get automatic fix suggestions:
  makora generate --file problem.py --device H100 --fix

Instruction Files (--instr)

The --instr flag is how you steer the optimization agent. Makora's optimizer is an AI agent that generates and iterates on kernel code — instruction files let you inject your own expert knowledge into that process. Think of it as pair-programming with the agent: you bring the domain expertise, it brings the implementation speed. This is also where you'd provide an existing kernel implementation if you want the agent to start from a particular baseline instead of a blank slate.

This is your opportunity to nudge the agent toward specific optimization strategies, low-level techniques, or hardware-specific tricks that you know will work for your problem. Without instructions, the agent explores on its own. With instructions, you can point it directly at the approach you want.

makora generate --file problem.py --device H100 --instr hints.txt

Multiple instruction files can be combined:

makora generate --file problem.py --device H100 --instr technique.txt --instr constraints.txt

Instruction files are plain text. Their contents are concatenated, in the order given, and passed as context to the optimization agent.

What to Put in Instruction Files

You can include anything that helps the agent write better kernels:

  • Specific optimization techniques — "Use double buffering with shared memory" or "Apply register tiling with an 8x8 thread tile"
  • Low-level intrinsics — "Use __ldg() for read-only global memory loads" or "Use warp shuffle __shfl_sync() for the reduction"
  • Memory access patterns — "The input matrices are always power-of-2 aligned, so you can assume 128-byte aligned loads"
  • Architecture-specific knowledge — "On H100, the L2 cache is 50MB — the working set fits entirely in L2"
  • Algorithmic hints — "This is a tall-skinny matmul (M>>N), so parallelize along M and use a serial reduction along K"
  • Constraints — "Do not use torch.compile" or "The solution must be a single fused kernel"
  • Reference implementations — Paste in a known-good approach from a paper or library and tell the agent to build on it
Example: Guiding a Matrix Multiply with Expert CUDA Knowledge

Say you're optimizing a matrix multiply and you know from experience that on H100, the key to peak throughput is using cp.async to overlap global-to-shared-memory copies with computation, combined with warp-specialized persistent kernels (the approach used by CUTLASS 3.x).

Create a file h100-matmul-hints.txt:

Use an asynchronous warp-specialized persistent kernel design for this matmul:

1. Partition warps into producer and consumer roles. Producer warps issue
   cp.async (or TMA on H100) to load tiles from global memory into shared
   memory. Consumer warps compute on the previously loaded tiles using
   tensor core mma instructions (m16n8k16 for fp32 accum).

2. Use multi-stage software pipelining with at least 3 shared memory buffers
   so that loads, computes, and stores can overlap across pipeline stages.

3. Use the following tiling:
   - Thread block tile: 128x256xK
   - Warp tile: 64x64xK
   - Use ldmatrix (PTX: ldmatrix.sync.aligned.m8n8.x4) for shared-to-register
     loads to feed the tensor cores efficiently.

4. Use inline PTX for the cp.async instructions:
   asm volatile("cp.async.cg.shared.global [%0], [%1], %2;" :: "r"(smem_ptr), "l"(gmem_ptr), "n"(16));
   asm volatile("cp.async.commit_group;");
   asm volatile("cp.async.wait_group %0;" :: "n"(stages - 2));

5. Epilogue: use vectorized 128-bit stores (float4) to write the result
   tile back to global memory with full memory coalescing.

Run with the instruction file:

makora generate --file matmul.py --device H100 --instr h100-matmul-hints.txt

Instead of exploring broadly, the agent will focus on implementing the specific warp-specialized persistent kernel approach you described — and it can often get there much faster than discovering this strategy on its own.

Tips

Getting the most out of instruction files

  • Be specific. "Make it faster" doesn't help. "Use 128x128 thread block tiles with 8 pipeline stages" does.
  • Include code snippets. If you know the exact PTX or intrinsic call, paste it in. The agent can incorporate it directly.
  • Combine with --language. If your instructions reference CUDA intrinsics, make sure you're running with --language cuda. If they reference Triton tl.dot tuning, use --language triton.
  • Iterate. Check results with makora kernels, then refine your instructions and run generate again.

Examples

# Basic run on H100
makora generate --file problem.py --device H100

# Generate with Triton on H100
makora generate --file problem.py --device H100 --language triton

# Generate on AMD MI300X
makora generate --file problem.py --device MI300X

# Generate with a label and fix suggestions
makora generate --file problem.py --device H100 --label "matmul-v2" --fix

# Generate with custom tolerances
makora generate --file problem.py --device H100 --atol 1e-3 --rtol 1e-3

# Generate with instruction context
makora generate --file problem.py --device H100 --instr optimization-hints.txt

Output

Device: H100
Language: cuda

✓ Validation passed
  Compilation: passed
  Preparation: passed
  Benchmarking: passed (1.234 ms)

Session created!
  Session ID: a1b2c3d4
  Problem ID: e5f6a7b8

Monitor progress with: makora jobs

Jobs & Sessions

makora jobs

List all your optimization sessions and their current status.

Usage

makora jobs [--fast]

Options

Option Type Default Description
--fast flag false Skip fetching extra data (device, speedup) for faster output

Output Columns

Column Description
Session ID First 8 characters of the session UUID
Status Current status (running, completed, failed, stopped, etc.)
Label Session label (set with --label when running makora generate), truncated to 20 characters
Device Target device (omitted with --fast)
vs torch.compile Best speedup vs torch.compile baseline (omitted with --fast)
Started Relative time since session started

Examples

# List all jobs with full details
makora jobs

# Quick listing (skip device/speedup lookups)
makora jobs --fast

Output

                              Jobs
 Session ID   Status        Label        Device   vs torch.compile   Started
 a1b2c3d4     ● running     matmul-v2    H100     1.94x              5m ago
 e5f6a7b8     ● completed   conv-test    L40S     2.31x              1h ago
 c9d0e1f2     ● failed      -            MI300X   -                  3h ago

makora stop

Stop a running optimization session.

Usage

makora stop <job_uuid>

Arguments

Argument Description
job_uuid The UUID (or UUID prefix) of the session to stop

Tip

UUID prefix matching is supported — you only need enough characters to uniquely identify the session. In most cases the first 4-8 characters are enough.
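The matching rule can be sketched as: a prefix resolves only when it identifies exactly one known session (assumed behavior):

```python
def resolve_prefix(prefix, known_ids):
    """Resolve a UUID prefix to a full ID; ambiguous or unknown prefixes fail.

    Illustrative sketch of the documented prefix-matching rule.
    """
    matches = [i for i in known_ids if i.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(f"{prefix!r} matched {len(matches)} sessions, need exactly 1")
    return matches[0]
```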

Examples

# Stop using full UUID
makora stop a1b2c3d4-e5f6-7890-abcd-ef1234567890

# Stop using prefix (must be unique)
makora stop a1b2c3d4

# Stop using short prefix
makora stop a1b2

Output

Found job: a1b2c3d4-e5f6-7890-abcd-ef1234567890
Job a1b2c3d4 stopped successfully.

Kernels & Results

makora kernels

View the optimized kernels generated by an optimization session.

Usage

# List all kernels for a session
makora kernels <session_id>

# View a specific kernel's code and performance
makora kernels <session_id> <kernel_id>

# Save kernel code to a file
makora kernels <session_id> <kernel_id> -o <output_file>

Arguments

Argument Description
session_id Session ID or prefix
kernel_id (Optional) Kernel ID or prefix — shows code and performance details

Options

Option Type Description
-o, --output path Save kernel code to a file instead of printing it

Prefix matching

Both session_id and kernel_id support prefix matching — you only need enough characters to uniquely identify the target. The first 4-8 characters usually suffice.

Listing Kernels

makora kernels a1b2c3d4

Output Columns

Column Description
Attempt Which optimization attempt generated this kernel
Kernel ID First 8 characters of the kernel UUID
Name Kernel name (truncated to 15 characters)
Status Evaluation status (completed, failed, or close-miss with tolerance info)
Time Execution time (with unit)
vs torch.compile Speedup compared to torch.compile baseline

Example Output
              Kernels for a1b2c3d4 (matmul-v2)
 Attempt   Kernel ID   Name          Status        Time       vs torch.compile
 1         f1e2d3c4    kernel_v1     ● completed   0.523 ms   1.82x
 2         b5a6c7d8    kernel_v2     ● completed   0.491 ms   1.94x
 3         a9b0c1d2    kernel_v3     ● failed      -          -

Viewing Kernel Code

makora kernels a1b2c3d4 b5a6c7d8

Displays the full kernel source code with syntax highlighting, followed by performance metrics:

── kernel_v2 (b5a6c7d8) ──
  Kernel time:      0.491 ms
  Reference eager:  1.234 ms
  torch.compile:    0.952 ms
  vs eager:         2.51x
  vs torch.compile: 1.94x

Saving Kernel Code

makora kernels a1b2c3d4 b5a6c7d8 -o solution.py

Kernel saved to: solution.py

Examples

# List all kernels from a session
makora kernels a1b2c3d4

# View best kernel's code
makora kernels a1b2c3d4 b5a6c7d8

# Save kernel for evaluation
makora kernels a1b2c3d4 b5a6c7d8 -o solution.py

makora refcode

View the original reference code (problem file) that was used for a session.

Usage

makora refcode <session_id> [-o <output_file>]

Arguments

Argument Description
session_id Session ID or prefix

Options

Option Type Description
-o, --output path Save reference code to a file

Examples

# View the original problem code
makora refcode a1b2c3d4

# Save it to a file
makora refcode a1b2c3d4 -o original_problem.py

Evaluate

makora evaluate

Benchmark an optimized kernel against a reference implementation on remote hardware. Returns execution times and speedup.

Usage

makora evaluate <reference_file> <optimized_file> [options]

Arguments

Argument Description
reference_file Path to the reference/problem file
optimized_file Path to the optimized solution file

Options

Option Type Default Description
-d, --device string L40S See all devices in Supported Hardware
--url string none Override the Makora API URL

Device names are case-insensitive.

Output

Evaluating code...

✓ Evaluation successful!

Benchmark Results:
  Reference time: 1.234567 ms
  Solution time:  0.491234 ms
  Speedup:        2.51x
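The reported speedup is simply the ratio of the two timings:

```python
def speedup(reference_ms: float, solution_ms: float) -> float:
    """Speedup of the solution over the reference: reference time / solution time."""
    return reference_ms / solution_ms
```

For the numbers above, 1.234567 / 0.491234 rounds to 2.51.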

Examples

# Evaluate on default device (L40S)
makora evaluate problem.py solution.py

# Evaluate on H100
makora evaluate problem.py solution.py --device H100

# Evaluate on AMD MI300X
makora evaluate problem.py solution.py --device MI300X

makora check

Tip

Use makora check to validate your problem file before committing to a full generate. It catches errors quickly without creating an optimization session.

Validate a problem file without starting an optimization session. Runs compilation, preparation, and benchmarking checks.

Usage

makora check <file> [--device <device>]

Arguments

Argument Description
file Path to the problem file to validate

Options

Option Type Default Description
--device enum H100 Target device for validation (H100, H200, B200, L40S, MI300X, etc.)

Examples

# Validate on default device (H100)
makora check problem.py

# Validate for a specific device
makora check problem.py --device L40S
makora check problem.py --device MI300X

Output

Shows validation results including compilation, preparation, and benchmarking status. If validation fails, error logs are displayed.

Profile

makora profile

Profile an optimized kernel on remote hardware. While makora evaluate tells you how fast your kernel is, makora profile tells you why — returning hardware counters, occupancy data, Nsight Systems and Nsight Compute traces, and even the generated SASS assembly so you can see exactly what the GPU is doing.

Use profiling when you need to diagnose performance bottlenecks, verify that your optimization strategy is working at the hardware level, or gather data to inform your next round of makora generate ... --instr hints.

Currently, only the NVIDIA H100 is supported by the profiler.

Usage

makora profile <reference_file> <optimized_file> [options]

Arguments

Argument Description
reference_file Path to the reference/problem file
optimized_file Path to the optimized solution file

Options

Option Type Default Description
-d, --device string H100 Currently, only H100 is supported
--url string none Override the Makora API URL

What You Get Back

Profiling runs in full mode, which returns the most comprehensive data available. For each GPU kernel launched by your code, the output can include:

Raw Metrics

Hardware performance counters and execution statistics:

Metrics:
  duration_ns: 491234
  registers_per_thread: 32
  shared_memory_bytes: 8192
  grid_size: [128, 1, 1]
  block_size: [256, 1, 1]
  occupancy: 0.75

Interpreting raw metrics

These are the numbers you need to diagnose performance. For example:

  • High register count (e.g., 128+ per thread) → low occupancy, consider reducing register pressure
  • Low occupancy → not enough warps to hide memory latency, try smaller tile sizes or less shared memory per block
  • Short duration but many kernel launches → launch overhead is significant, consider fusing kernels
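To make the first bullet concrete, here is a back-of-envelope bound on occupancy from register pressure alone, assuming an SM with a 65,536-register file and a 64-warp ceiling (the H100 figures; real occupancy also depends on shared memory and block size — treat this as a sketch, not the profiler's calculation):

```python
def occupancy_bound_from_registers(regs_per_thread, regfile=65536, max_warps=64):
    """Upper bound on achieved occupancy imposed by register usage alone.

    Assumes 32 threads per warp and H100-like per-SM limits by default.
    """
    warps_by_regs = regfile // (regs_per_thread * 32)
    return min(warps_by_regs, max_warps) / max_warps
```

At 32 registers per thread the register file supports the full 64 warps; at 128 registers per thread it supports only 16, capping occupancy at 25%.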
Details Page

Detailed kernel execution breakdown from the profiler, including timing per operation, memory throughput, and compute utilization.

Nsight Systems (nsys) Report

The full Nsight Systems trace output showing the timeline of GPU activity — kernel launches, memory transfers, synchronization points, and idle gaps. This is the same data you'd get from running nsys profile locally, but executed on remote hardware.

Additional Data in the API Response

Note

The data below is available through the API response even if not all fields are printed by the CLI. Use the API directly if you need access to SASS assembly or annotated source.

The profiling API also captures the following data:

Data Description
CUDA source The compiled CUDA source code as seen by the profiler
SASS assembly The actual GPU assembly (SASS) that ran on the hardware — the ground truth of what your kernel compiled to
Annotated source Source code annotated with profiling data (hotspots, stall reasons)
Torch trace PyTorch execution trace for understanding the operator-level breakdown

Example Output

Profiling code...

Profiling successful!

Profiled 2 kernel(s):

--- Kernel 1 ---

Metrics:
  duration_ns: 491234
  registers_per_thread: 32
  shared_memory_bytes: 49152
  grid_size: [128, 1, 1]
  block_size: [256, 1, 1]

Details:
  Compute Throughput:     78.3%
  Memory Throughput:      45.2%
  Achieved Occupancy:     75.0%
  Warp Execution Eff:     98.4%

Nsys Report:
  Time(%)  Total Time (ns)  Instances  Avg (ns)   Kernel Name
  -------  ---------------  ---------  ---------  -----------
   85.2%          491234          1    491234     matmul_kernel
   14.8%           85432          1     85432     elementwise_add

--- Kernel 2 ---
  ...

When to Use Profile vs Evaluate

makora evaluate makora profile
Purpose Get speedup number Understand why it's fast/slow
Speed Fast Slower (runs profiling tools)
Output Reference time, solution time, speedup Hardware counters, nsys trace, SASS, source annotations
Use when Checking if your kernel is faster Diagnosing bottlenecks, planning next optimization

Recommended workflow

  1. evaluate first to see the speedup
  2. If the speedup isn't what you expected, profile to find out why
  3. Use profiling data to write better --instr hints for your next run

Examples

# Profile on default device (H100)
makora profile problem.py solution.py

# Profile with the device set explicitly
makora profile problem.py solution.py --device H100

Expert Generate

makora expert-generate

Generate a single optimized GPU kernel using AI-powered expert optimization patterns. Unlike makora generate which runs a full optimization loop, this command generates a single optimized kernel and prints the code to stdout.

Usage

makora expert-generate <file> [options]

Arguments

Argument Description
file Path to the kernel file to optimize

Options

Option Type Default Description
-p, --problem path none Path to the reference/problem file for additional context
-d, --device string L40S See full device list in Supported Hardware
-l, --language string cuda Target language (cuda, triton, cutedsl, hip, opencl, ripple). Must be compatible with the selected device.
--speedup float none Current speedup vs baseline (provides context for further optimization)
--url string none Override the Makora API URL

Output

Piping to a file

Kernel code goes to stdout and status messages go to stderr, so you can pipe directly to a file with > solution.py without capturing log noise.

The generated kernel code is printed to stdout. Status messages and summaries go to stderr. This makes it easy to pipe the output to a file:

makora expert-generate kernel.py > optimized_kernel.py

Examples

# Generate optimized CUDA kernel for L40S (default)
makora expert-generate kernel.py

# Generate with problem file context
makora expert-generate kernel.py --problem problem.py

# Generate Triton kernel for H100
makora expert-generate kernel.py --device H100 --language triton

# Generate HIP kernel for MI300X
makora expert-generate kernel.py --device MI300X --language hip

# Provide current speedup for context
makora expert-generate kernel.py --problem problem.py --speedup 1.5

# Pipe output directly to a file
makora expert-generate kernel.py --problem problem.py > solution.py

Output Example

# stderr:
Generating optimized kernel...
Summary: Applied tiling and shared memory optimization for matrix multiplication

# stdout:
import torch
import torch.nn as nn
from torch.utils.cpp_extension import load_inline

cuda_source = """
// ... optimized CUDA kernel code
"""

class ModelNew(nn.Module):
    ...

Search

Search tools for finding GPU documentation, optimization snippets, and technical references.

makora document-search

Search Makora's document database for GPU programming references.

Usage

makora document-search <query> [options]

Arguments

Argument Description
query Search query string

Options

Option Type Default Description
-n, --max-entries int 5 Maximum number of documents to return (1–49)
--url string none Override the Makora API URL

Examples

# Search for shared memory documentation
makora document-search "CUDA shared memory bank conflicts"

# Get more results
makora document-search "matrix multiplication optimization" --max-entries 10

Output

Searching documents...
Found 3 document(s):

--- Document 1 ---
id: abc123
score: 0.92
meta: {"source": "cuda_guide", "section": "shared_memory"}
content:
  [Document content...]

--- Document 2 ---
  ...

Companion CLI: makora-skills

Note

The commands below require the makora-skills package, which is separate from the main makora CLI. Install it with pip install makora-skills.

The makora-skills package provides additional search commands. Install it separately:

pip install makora-skills

makora search-snippets

Search for GPU code optimization snippets and techniques.

Usage

makora search-snippets <query> [options]

Options

Option Type Default Description
-n, --max-entries int 5 Maximum number of results
-l, --language string cuda Programming language filter
-a, --architecture string none GPU architecture filter (e.g., H100, MI300X)

Examples
# Search for CUDA optimization snippets
makora search-snippets "matrix multiplication tiling"

# Search for Triton snippets for H100
makora search-snippets "fused attention kernel" --language triton --architecture H100

# Get more results
makora search-snippets "memory coalescing" --max-entries 10

makora search-docs

Search for GPU documentation and API references.

Usage

makora search-docs <query> [options]

Options

Option Type Default Description
-n, --max-entries int 5 Maximum number of results
-l, --language string none Programming language filter
-a, --architecture string none GPU architecture filter (e.g., H100, MI300X)

Examples
# Search for documentation
makora search-docs "warp shuffle instructions"

# Filter by architecture
makora search-docs "memory hierarchy" --architecture MI300X

# Filter by language
makora search-docs "kernel launch configuration" --language cuda

Plugin Install

makora install

Install the Makora plugin for supported platforms.

Usage

makora install <target>

Arguments

Argument Description
target Platform to install for. Currently only claude is supported.

makora install claude

Installs the Makora plugin into Claude Code, giving Claude access to GPU optimization tools directly in your coding sessions.

makora install claude

Login required

You must be logged in (makora login) before running this command. The installer needs your credentials to configure the plugin.

What It Does

  1. Removes any previously cached Makora plugin
  2. Installs makora-plugin as a Claude Code MCP server
  3. Registers available Makora skills for Claude to use

Available Plugin Commands After Install

Once installed, Claude Code gains access to:

  • Evaluate — Benchmark optimized code against reference implementations
  • Generate — Generate optimized GPU kernels from problem descriptions
  • Optimize — Iteratively optimize CUDA/Triton kernels
  • Search docs — Search GPU documentation and API references
  • Search snippets — Find GPU optimization code snippets

Example

# Log in first
makora login

# Install the Claude Code plugin
makora install claude

Installing Makora plugin for Claude Code...
Installing makora-plugin...

Makora plugin installed successfully for Claude Code!