Commands¶
This page contains all Makora CLI command documentation.
Command Sections¶
- Authentication
- Generate & Optimize
- Jobs & Sessions
- Kernels & Results
- Evaluate
- Profile
- Expert Generate
- Search
- Plugin Install
Authentication Commands¶
makora login¶
Authenticate with the Makora API. Stores credentials locally for all subsequent commands.
Usage¶
# Interactive — prompts for token
makora login
# Non-interactive — pass token directly
makora login --token YOUR_TOKEN
Options¶
| Option | Type | Description |
|---|---|---|
| `--token` | string | API token (skips interactive prompt) |
| `--user` | string | Username (optional) |
| `--url` | string | Override the Makora API URL |
| `--quiet` | flag | Disable interactive prompts |
Where to Get a Token¶
- Go to https://generate.makora.com/tokens
- Log in or create an account
- Create a new token or copy an existing one
Credential Storage¶
Tip
Credentials are stored as a plain text file. For CI/CD pipelines, use the MAKORA_USER_FILE environment variable to point to a credentials file managed by your secrets system.
Credentials are saved to ~/.makora/user by default. Override this with the MAKORA_USER_FILE environment variable.
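The override behaves like a standard path-resolution fallback. A minimal Python sketch of that logic (illustrative only, not the CLI's actual source; the function name `credential_path` is made up):

```python
import os
from pathlib import Path

def credential_path() -> Path:
    # MAKORA_USER_FILE, if set, overrides the default ~/.makora/user location
    override = os.environ.get("MAKORA_USER_FILE")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".makora" / "user"
```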
Examples¶
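For example, in a CI job the token can come from a secret (the variable name `CI_MAKORA_TOKEN` is a placeholder for whatever your secrets system provides):

```shell
# Interactive login
makora login

# Non-interactive login in CI
makora login --token "$CI_MAKORA_TOKEN" --quiet
```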
makora logout¶
Remove stored credentials.
Usage¶
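The command takes no arguments:

```shell
makora logout
```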
Deletes the credential file at ~/.makora/user.
makora info¶
Display version information, login status, and environment variable settings.
Usage¶
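Run it with no arguments:

```shell
makora info
```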
Output¶
Makora version: 0.1.0
Repo: makora-cli
Commit: abc123
Logged in as: user@example.com
Env Variable Value Default
MAKORA_AUTH_URL https://be.stage.makora.com/api/v1/ https://be.stage.makora.com/api/v1/
MAKORA_NO_RICH
MAKORA_URL https://generate.stage.makora.com https://generate.stage.makora.com
MAKORA_USER_FILE ~/.makora/user ~/.makora/user
Environment Variables¶
Note
These variables are only needed for advanced use cases like pointing at a staging server or custom credential paths. Most users won't need to change them.
| Variable | Default | Description |
|---|---|---|
| `MAKORA_URL` | `https://generate.stage.makora.com` | Base URL for the Makora Generate API |
| `MAKORA_USER_FILE` | `~/.makora/user` | Path to the credential file |
| `MAKORA_AUTH_URL` | `https://be.stage.makora.com/api/v1/` | Base URL for the authentication API |
| `MAKORA_NO_RICH` | (empty) | Set to any value to disable Rich text formatting |
Generate & Optimize¶
makora generate¶
Run generation on a problem file for optimization. Makora validates the file, then creates an optimization session that generates progressively faster kernels.
Usage¶
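`--file` and `--device` are required; everything else is optional:

```shell
makora generate --file <problem.py> --device <DEVICE> [options]
```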
Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| `--file` | path | required | Path to the problem file |
| `--device` | enum | required | Target device (H100, H200, B200, L40S, MI300X, Adreno 830, Adreno 750, Hexagon v79, Hexagon v75) |
| `--language` | enum | device default | Kernel language (cuda, triton, cutedsl, hip, opencl, ripple) |
| `--label` | string | `""` | Label for the session (visible in `makora jobs`) |
| `--atol` | float | `0.01` | Absolute tolerance for correctness validation (see Tolerances) |
| `--rtol` | float | `0.01` | Relative tolerance for correctness validation (see Tolerances) |
| `--fix` | flag | `false` | Enable automatic fix suggestions for validation errors |
| `--instr` | path(s) | none | Path(s) to instruction files providing optimization context |
| `--url` | string | none | Override the Makora API URL |
How It Works¶
When you run makora generate, the following happens:
- Validation — Your problem file is uploaded and validated:
- Compilation check
- Preparing objects for execution
- Benchmarking to establish baseline performance
- Session creation — If validation passes, an optimization session starts
- Kernel generation — The platform generates and benchmarks optimized kernels in the background
The --fix Flag¶
Tip
Always try --fix when a run fails validation. It can automatically correct common issues like missing imports, wrong class names, or tensor device placement.
If validation fails, use --fix to get automatic fix suggestions:
When a fix is available, Makora shows the suggested changes and asks if you want to accept them. If you accept, the fixed code is run again automatically.
Without --fix, a failing run prints a hint:
Hint: try generating with --fix to get automatic fix suggestions:
makora generate --file problem.py --device H100 --fix
Instruction Files (--instr)¶
The --instr flag is how you steer the optimization agent. Makora's optimizer is an AI agent that generates and iterates on kernel code — instruction files let you inject your own expert knowledge into that process. Think of it as pair-programming with the agent: you bring the domain expertise, it brings the implementation speed. This is also where you'd provide an existing kernel implementation if you want the agent to start from a particular baseline instead of a blank slate.
This is your opportunity to nudge the agent toward specific optimization strategies, low-level techniques, or hardware-specific tricks that you know will work for your problem. Without instructions, the agent explores on its own. With instructions, you can point it directly at the approach you want.
Multiple instruction files can be combined:
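For example, assuming the flag is repeated once per file (the file names here are placeholders):

```shell
makora generate --file problem.py --device H100 \
  --instr general-hints.txt --instr h100-hints.txt
```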
Instruction files are plain text. Their contents are concatenated and passed as context to the optimization agent.
What to Put in Instruction Files¶
You can include anything that helps the agent write better kernels:
- Specific optimization techniques — "Use double buffering with shared memory" or "Apply register tiling with an 8x8 thread tile"
- Low-level intrinsics — "Use `__ldg()` for read-only global memory loads" or "Use the warp shuffle `__shfl_sync()` for the reduction"
- Memory access patterns — "The input matrices are always power-of-2 aligned, so you can assume 128-byte aligned loads"
- Architecture-specific knowledge — "On H100, the L2 cache is 50MB — the working set fits entirely in L2"
- Algorithmic hints — "This is a tall-skinny matmul (M>>N), so parallelize along M and use a serial reduction along K"
- Constraints — "Do not use `torch.compile`" or "The solution must be a single fused kernel"
- Reference implementations — Paste in a known-good approach from a paper or library and tell the agent to build on it
Example: Guiding a Matrix Multiply with Expert CUDA Knowledge¶
Say you're optimizing a matrix multiply and you know from experience that on H100, the key to peak throughput is using cp.async to overlap global-to-shared-memory copies with computation, combined with warp-specialized persistent kernels (the approach used by CUTLASS 3.x).
Create a file h100-matmul-hints.txt:
Use an asynchronous warp-specialized persistent kernel design for this matmul:
1. Partition warps into producer and consumer roles. Producer warps issue
cp.async (or TMA on H100) to load tiles from global memory into shared
memory. Consumer warps compute on the previously loaded tiles using
tensor core mma instructions (m16n8k16 for fp32 accum).
2. Use multi-stage software pipelining with at least 3 shared memory buffers
so that loads, computes, and stores can overlap across pipeline stages.
3. Use the following tiling:
- Thread block tile: 128x256xK
- Warp tile: 64x64xK
- Use ldmatrix (PTX: ldmatrix.sync.aligned.m8n8.x4) for shared-to-register
loads to feed the tensor cores efficiently.
4. Use inline PTX for the cp.async instructions:
asm volatile("cp.async.cg.shared.global [%0], [%1], %2;" :: "r"(smem_ptr), "l"(gmem_ptr), "n"(16));
asm volatile("cp.async.commit_group;");
asm volatile("cp.async.wait_group %0;" :: "n"(stages - 2));
5. Epilogue: use vectorized 128-bit stores (float4) to write the result
tile back to global memory with full memory coalescing.
Run with the instruction file:
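For example (the problem file name is a placeholder):

```shell
makora generate --file matmul_problem.py --device H100 --language cuda \
  --instr h100-matmul-hints.txt
```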
Instead of exploring broadly, the agent will focus on implementing the specific warp-specialized persistent kernel approach you described — and it can often get there much faster than discovering this strategy on its own.
Tips¶
Getting the most out of instruction files
- Be specific. "Make it faster" doesn't help. "Use 128x128 thread block tiles with 8 pipeline stages" does.
- Include code snippets. If you know the exact PTX or intrinsic call, paste it in. The agent can incorporate it directly.
- Combine with `--language`. If your instructions reference CUDA intrinsics, make sure you're running with `--language cuda`. If they reference Triton `tl.dot` tuning, use `--language triton`.
- Iterate. Check results with `makora kernels`, then refine your instructions and run generate again.
Examples¶
# Basic run on H100
makora generate --file problem.py --device H100
# Generate with Triton on H100
makora generate --file problem.py --device H100 --language triton
# Generate on AMD MI300X
makora generate --file problem.py --device MI300X
# Generate with a label and fix suggestions
makora generate --file problem.py --device H100 --label "matmul-v2" --fix
# Generate with custom tolerances
makora generate --file problem.py --device H100 --atol 1e-3 --rtol 1e-3
# Generate with instruction context
makora generate --file problem.py --device H100 --instr optimization-hints.txt
Output¶
Device: H100
Language: cuda
✓ Validation passed
Compilation: passed
Preparation: passed
Benchmarking: passed (1.234 ms)
Session created!
Session ID: a1b2c3d4
Problem ID: e5f6a7b8
Monitor progress with: makora jobs
Jobs & Sessions¶
makora jobs¶
List all your optimization sessions and their current status.
Usage¶
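Run it with no arguments, optionally adding `--fast`:

```shell
makora jobs [--fast]
```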
Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| `--fast` | flag | `false` | Skip fetching extra data (device, speedup) for faster output |
Output Columns¶
| Column | Description |
|---|---|
| Session ID | First 8 characters of the session UUID |
| Status | Current status (running, completed, failed, stopped, etc.) |
| Label | Session label (set with --label when running makora generate), truncated to 20 characters |
| Device | Target device (omitted with --fast) |
| vs torch.compile | Best speedup vs torch.compile baseline (omitted with --fast) |
| Started | Relative time since session started |
Examples¶
# List all jobs with full details
makora jobs
# Quick listing (skip device/speedup lookups)
makora jobs --fast
Output¶
Jobs
Session ID Status Label Device vs torch.compile Started
a1b2c3d4 ● running matmul-v2 H100 1.94x 5m ago
e5f6a7b8 ● completed conv-test L40S 2.31x 1h ago
c9d0e1f2 ● failed - MI300X - 3h ago
makora stop¶
Stop a running optimization session.
Usage¶
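Pass the session UUID (or a unique prefix):

```shell
makora stop <job_uuid>
```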
Arguments¶
| Argument | Description |
|---|---|
| `job_uuid` | The UUID (or UUID prefix) of the session to stop |
Tip
UUID prefix matching is supported — you only need enough characters to uniquely identify the session. In most cases the first 4-8 characters are enough.
Examples¶
# Stop using full UUID
makora stop a1b2c3d4-e5f6-7890-abcd-ef1234567890
# Stop using prefix (must be unique)
makora stop a1b2c3d4
# Stop using short prefix
makora stop a1b2
Output¶
Kernels & Results¶
makora kernels¶
View the optimized kernels generated by an optimization session.
Usage¶
# List all kernels for a session
makora kernels <session_id>
# View a specific kernel's code and performance
makora kernels <session_id> <kernel_id>
# Save kernel code to a file
makora kernels <session_id> <kernel_id> -o <output_file>
Arguments¶
| Argument | Description |
|---|---|
| `session_id` | Session ID or prefix |
| `kernel_id` | (Optional) Kernel ID or prefix — shows code and performance details |
Options¶
| Option | Type | Description |
|---|---|---|
| `-o, --output` | path | Save kernel code to a file instead of printing it |
Prefix matching
Both session_id and kernel_id support prefix matching — you only need enough characters to uniquely identify the target. The first 4-8 characters usually suffice.
Listing Kernels¶
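Pass just a session ID to get the kernel list:

```shell
makora kernels <session_id>
```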
Output Columns¶
| Column | Description |
|---|---|
| Attempt | Which optimization attempt generated this kernel |
| Kernel ID | First 8 characters of the kernel UUID |
| Name | Kernel name (truncated to 15 characters) |
| Status | Evaluation status (completed, failed, or close-miss with tolerance info) |
| Time | Execution time (with unit) |
| vs torch.compile | Speedup compared to torch.compile baseline |
Example Output¶
Kernels for a1b2c3d4 (matmul-v2)
Attempt Kernel ID Name Status Time vs torch.compile
1 f1e2d3c4 kernel_v1 ● completed 0.523 ms 1.82x
2 b5a6c7d8 kernel_v2 ● completed 0.491 ms 1.94x
3 a9b0c1d2 kernel_v3 ● failed - -
Viewing Kernel Code¶
Displays the full kernel source code with syntax highlighting, followed by performance metrics:
── kernel_v2 (b5a6c7d8) ──
Kernel time: 0.491 ms
Reference eager: 1.234 ms
torch.compile: 0.952 ms
vs eager: 2.51x
vs torch.compile: 1.94x
Saving Kernel Code¶
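Use `-o` to write the kernel source to a file instead of printing it:

```shell
makora kernels <session_id> <kernel_id> -o solution.py
```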
Examples¶
# List all kernels from a session
makora kernels a1b2c3d4
# View best kernel's code
makora kernels a1b2c3d4 b5a6c7d8
# Save kernel for evaluation
makora kernels a1b2c3d4 b5a6c7d8 -o solution.py
makora refcode¶
View the original reference code (problem file) that was used for a session.
Usage¶
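Pass the session ID, optionally with an output path:

```shell
makora refcode <session_id> [-o <output_file>]
```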
Arguments¶
| Argument | Description |
|---|---|
| `session_id` | Session ID or prefix |
Options¶
| Option | Type | Description |
|---|---|---|
| `-o, --output` | path | Save reference code to a file |
Examples¶
# View the original problem code
makora refcode a1b2c3d4
# Save it to a file
makora refcode a1b2c3d4 -o original_problem.py
Evaluate¶
makora evaluate¶
Benchmark an optimized kernel against a reference implementation on remote hardware. Returns execution times and speedup.
Usage¶
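Pass the reference file followed by the optimized file:

```shell
makora evaluate <reference_file> <optimized_file> [--device <DEVICE>]
```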
Arguments¶
| Argument | Description |
|---|---|
| `reference_file` | Path to the reference/problem file |
| `optimized_file` | Path to the optimized solution file |
Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| `-d, --device` | string | `L40S` | See all devices in Supported Hardware |
| `--url` | string | none | Override the Makora API URL |
Device names are case-insensitive.
Output¶
Evaluating code...
✓ Evaluation successful!
Benchmark Results:
Reference time: 1.234567 ms
Solution time: 0.491234 ms
Speedup: 2.51x
Examples¶
# Evaluate on default device (L40S)
makora evaluate problem.py solution.py
# Evaluate on H100
makora evaluate problem.py solution.py --device H100
# Evaluate on AMD MI300X
makora evaluate problem.py solution.py --device MI300x
makora check¶
Tip
Use makora check to validate your problem file before committing to a full generate. It catches errors quickly without creating an optimization session.
Validate a problem file without starting an optimization session. Runs compilation, preparation, and benchmarking checks.
Usage¶
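Pass the problem file, optionally with a target device:

```shell
makora check <file> [--device <DEVICE>]
```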
Arguments¶
| Argument | Description |
|---|---|
| `file` | Path to the problem file to validate |
Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| `--device` | enum | `H100` | Target device for validation (H100, H200, B200, L40S, MI300X, etc.) |
Examples¶
# Validate on default device (H100)
makora check problem.py
# Validate for a specific device
makora check problem.py --device L40S
makora check problem.py --device MI300X
Output¶
Shows validation results including compilation, preparation, and benchmarking status. If validation fails, error logs are displayed.
Profile¶
makora profile¶
Profile an optimized kernel on remote hardware. While makora evaluate tells you how fast your kernel is, makora profile tells you why — returning hardware counters, occupancy data, Nsight Systems and Nsight Compute traces, and even the generated SASS assembly so you can see exactly what the GPU is doing.
Use profiling when you need to diagnose performance bottlenecks, verify that your optimization strategy is working at the hardware level, or gather data to inform your next round of makora generate ... --instr hints.
Currently, only the NVIDIA H100 is supported by the profiler.
Usage¶
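Same argument shape as `makora evaluate`:

```shell
makora profile <reference_file> <optimized_file>
```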
Arguments¶
| Argument | Description |
|---|---|
| `reference_file` | Path to the reference/problem file |
| `optimized_file` | Path to the optimized solution file |
Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| `-d, --device` | string | `H100` | Currently, only H100 is supported |
| `--url` | string | none | Override the Makora API URL |
What You Get Back¶
Profiling runs in full mode, which returns the most comprehensive data available. For each GPU kernel launched by your code, the output can include:
Raw Metrics¶
Hardware performance counters and execution statistics:
Metrics:
duration_ns: 491234
registers_per_thread: 32
shared_memory_bytes: 8192
grid_size: [128, 1, 1]
block_size: [256, 1, 1]
occupancy: 0.75
Interpreting raw metrics
These are the numbers you need to diagnose performance. For example:
- High register count (e.g., 128+ per thread) → low occupancy, consider reducing register pressure
- Low occupancy → not enough warps to hide memory latency, try smaller tile sizes or less shared memory per block
- Short duration but many kernel launches → launch overhead is significant, consider fusing kernels
Details Page¶
Detailed kernel execution breakdown from the profiler, including timing per operation, memory throughput, and compute utilization.
Nsight Systems (nsys) Report¶
The full Nsight Systems trace output showing the timeline of GPU activity — kernel launches, memory transfers, synchronization points, and idle gaps. This is the same data you'd get from running nsys profile locally, but executed on remote hardware.
Additional Data in the API Response¶
Note
The data below is available through the API response even if not all fields are printed by the CLI. Use the API directly if you need access to SASS assembly or annotated source.
The profiling API also captures:
| Data | Description |
|---|---|
| CUDA source | The compiled CUDA source code as seen by the profiler |
| SASS assembly | The actual GPU assembly (SASS) that ran on the hardware — the ground truth of what your kernel compiled to |
| Annotated source | Source code annotated with profiling data (hotspots, stall reasons) |
| Torch trace | PyTorch execution trace for understanding the operator-level breakdown |
Example Output¶
Profiling code...
Profiling successful!
Profiled 2 kernel(s):
--- Kernel 1 ---
Metrics:
duration_ns: 491234
registers_per_thread: 32
shared_memory_bytes: 49152
grid_size: [128, 1, 1]
block_size: [256, 1, 1]
Details:
Compute Throughput: 78.3%
Memory Throughput: 45.2%
Achieved Occupancy: 75.0%
Warp Execution Eff: 98.4%
Nsys Report:
Time(%) Total Time (ns) Instances Avg (ns) Kernel Name
------- --------------- --------- --------- -----------
85.2% 491234 1 491234 matmul_kernel
14.8% 85432 1 85432 elementwise_add
--- Kernel 2 ---
...
When to Use Profile vs Evaluate¶
| | `makora evaluate` | `makora profile` |
|---|---|---|
| Purpose | Get speedup number | Understand why it's fast/slow |
| Speed | Fast | Slower (runs profiling tools) |
| Output | Reference time, solution time, speedup | Hardware counters, nsys trace, SASS, source annotations |
| Use when | Checking if your kernel is faster | Diagnosing bottlenecks, planning next optimization |
Recommended workflow
- `evaluate` first to see the speedup
- If the speedup isn't what you expected, `profile` to find out why
- Use profiling data to write better `--instr` hints for your next run
Examples¶
# Profile on default device (H100)
makora profile problem.py solution.py
# Profile on H100
makora profile problem.py solution.py --device H100
Expert Generate¶
makora expert-generate¶
Generate a single optimized GPU kernel using AI-powered expert optimization patterns. Unlike makora generate which runs a full optimization loop, this command generates a single optimized kernel and prints the code to stdout.
Usage¶
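Pass the kernel file to optimize:

```shell
makora expert-generate <file> [options]
```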
Arguments¶
| Argument | Description |
|---|---|
| `file` | Path to the kernel file to optimize |
Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| `-p, --problem` | path | none | Path to the reference/problem file for additional context |
| `-d, --device` | string | `L40S` | See full device list in Supported Hardware |
| `-l, --language` | string | `cuda` | Target language (cuda, triton, cutedsl, hip, opencl, ripple). Must be compatible with the selected device. |
| `--speedup` | float | none | Current speedup vs baseline (provides context for further optimization) |
| `--url` | string | none | Override the Makora API URL |
Output¶
Piping to a file
Kernel code goes to stdout and status messages go to stderr, so you can pipe directly to a file with > solution.py without capturing log noise.
The generated kernel code is printed to stdout. Status messages and summaries go to stderr. This makes it easy to pipe the output to a file:
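For example:

```shell
# Kernel code goes to solution.py; status messages stay on the terminal
makora expert-generate kernel.py --problem problem.py > solution.py
```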
Examples¶
# Generate optimized CUDA kernel for L40S (default)
makora expert-generate kernel.py
# Generate with problem file context
makora expert-generate kernel.py --problem problem.py
# Generate Triton kernel for H100
makora expert-generate kernel.py --device H100 --language triton
# Generate HIP kernel for MI300X
makora expert-generate kernel.py --device MI300X --language hip
# Provide current speedup for context
makora expert-generate kernel.py --problem problem.py --speedup 1.5
# Pipe output directly to a file
makora expert-generate kernel.py --problem problem.py > solution.py
Output Example¶
# stderr:
Generating optimized kernel...
Summary: Applied tiling and shared memory optimization for matrix multiplication
# stdout:
import torch
import torch.nn as nn
from torch.utils.cpp_extension import load_inline
cuda_source = """
// ... optimized CUDA kernel code
"""
class ModelNew(nn.Module):
...
Search¶
Search tools for finding GPU documentation, optimization snippets, and technical references.
makora document-search¶
Search Makora's document database for GPU programming references.
Usage¶
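Pass the query as a quoted string:

```shell
makora document-search "<query>" [-n <max_entries>]
```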
Arguments¶
| Argument | Description |
|---|---|
| `query` | Search query string |
Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| `-n, --max-entries` | int | `5` | Maximum number of documents to return (1–49) |
| `--url` | string | none | Override the Makora API URL |
Examples¶
# Search for shared memory documentation
makora document-search "CUDA shared memory bank conflicts"
# Get more results
makora document-search "matrix multiplication optimization" --max-entries 10
Output¶
Searching documents...
Found 3 document(s):
--- Document 1 ---
id: abc123
score: 0.92
meta: {"source": "cuda_guide", "section": "shared_memory"}
content:
[Document content...]
--- Document 2 ---
...
Companion CLI: makora-skills¶
Note
The commands below require the makora-skills package, which is separate from the main makora CLI. Install it with pip install makora-skills.
The makora-skills package provides additional search commands. Install it separately:
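Using pip:

```shell
pip install makora-skills
```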
makora search-snippets¶
Search for GPU code optimization snippets and techniques.
Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| `-n, --max-entries` | int | `5` | Maximum number of results |
| `-l, --language` | string | `cuda` | Programming language filter |
| `-a, --architecture` | string | none | GPU architecture filter (e.g., H100, MI300X) |
Examples¶
# Search for CUDA optimization snippets
makora search-snippets "matrix multiplication tiling"
# Search for Triton snippets for H100
makora search-snippets "fused attention kernel" --language triton --architecture H100
# Get more results
makora search-snippets "memory coalescing" --max-entries 10
makora search-docs¶
Search for GPU documentation and API references.
Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| `-n, --max-entries` | int | `5` | Maximum number of results |
| `-l, --language` | string | none | Programming language filter |
| `-a, --architecture` | string | none | GPU architecture filter (e.g., H100, MI300X) |
Examples¶
# Search for documentation
makora search-docs "warp shuffle instructions"
# Filter by architecture
makora search-docs "memory hierarchy" --architecture MI300X
# Filter by language
makora search-docs "kernel launch configuration" --language cuda
Plugin Install¶
makora install¶
Install the Makora plugin for supported platforms.
Usage¶
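Currently the only valid target is `claude`:

```shell
makora install claude
```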
Arguments¶
| Argument | Description |
|---|---|
| `target` | Platform to install for. Currently only `claude` is supported. |
makora install claude¶
Installs the Makora plugin into Claude Code, giving Claude access to GPU optimization tools directly in your coding sessions.
Login required
You must be logged in (makora login) before running this command. The installer needs your credentials to configure the plugin.
What It Does¶
- Removes any previously cached Makora plugin
- Installs `makora-plugin` as a Claude Code MCP server
- Registers available Makora skills for Claude to use
Available Plugin Commands After Install¶
Once installed, Claude Code gains access to:
- Evaluate — Benchmark optimized code against reference implementations
- Generate — Generate optimized GPU kernels from problem descriptions
- Optimize — Iteratively optimize CUDA/Triton kernels
- Search docs — Search GPU documentation and API references
- Search snippets — Find GPU optimization code snippets