Profile Your AI Agent

Discover your agent's capability shape across code quality and error resilience.
Optional instruction-resilience diagnostics are available via janus-labs diagnose.
Like 3DMark for AI coding assistants.

📊

Profile

Run standardized tests that map to 2 active capability axes. Get a radar chart, not just a number.

🔍

Compare

Overlay your results against bundled baselines across Claude, GPT, Gemini, and Copilot.

🏆

Compete

Submit your score, see your capability shape on the leaderboard, and share your profile.

Quick Start

1. Install
pip install janus-labs
janus-labs --version

Windows: If janus-labs is not in PATH, use python -m janus_labs instead.

2. Run the full suite
# Offline smoke run with deterministic mock scoring
janus-labs run --suite refactor-storm --mock -o result.json

# Backend-hosted judging (no API key needed)
janus-labs run --suite refactor-storm -o result.json

# Suite alias
janus-labs refactor-storm -o result.json

run is the primary workflow. It produces one result file for the full 4-behavior suite.
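The saved result file can then be inspected programmatically. A minimal sketch follows; the field names below (suite, behaviors, axes) are illustrative assumptions, not the documented result.json schema, so check a real result file for the actual layout.

```python
import json

# Illustrative only: the real result.json schema may differ from this sketch.
sample_result = {
    "suite": "refactor-storm",
    "behaviors": ["BHV-001", "BHV-002", "BHV-003", "BHV-004"],
    "axes": {"code_quality": 81.5, "error_resilience": 76.0},
}

# Round-trip through JSON, as you would when reading a saved result file.
result = json.loads(json.dumps(sample_result))
print(result["suite"], len(result["behaviors"]))  # refactor-storm 4
```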

3. Submit to the leaderboard
janus-labs submit result.json --github your-handle

Alternative: single-behavior manual workflow
janus-labs init --suite refactor-storm --output ./janus-task
cd janus-task/BHV-001-test-cheating
janus-labs status --workspace .
janus-labs score --workspace . --output result.json

Use this when you want to hand one workspace to an external coding agent and inspect the repo diff yourself.

Why capability profiling?

A single score hides real differences. Janus Labs profiles your agent across Code Quality and Error Resilience. Use janus-labs diagnose when you want an additional instruction-resilience readout against vanilla baselines.

Capability Axes

Code Quality

Does your agent fix the right code? Does it preserve test integrity and reduce complexity?

BHV-001 Test Cheating + BHV-002 Refactor Complexity

Error Resilience

Can your agent handle errors gracefully and fix bugs without thrashing in loops?

BHV-003 Error Handling + BHV-004 Loop Detection

Measured via the Refactor Storm suite. The suite ships 4 built-in behaviors in v2.0.0; the public composite averages Code Quality and Error Resilience.
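Since the public composite is the average of the two axis scores, the arithmetic is a simple mean. A sketch with made-up scores:

```python
def public_composite(code_quality: float, error_resilience: float) -> float:
    # Composite = mean of the two axis scores, per the suite description.
    return (code_quality + error_resilience) / 2

# Hypothetical axis scores, for illustration only:
print(public_composite(82.0, 74.0))  # 78.0
```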

How It Works

1. Install

pip install janus-labs works with any AI coding agent

2. Run Suite

janus-labs run --suite refactor-storm executes the 4-behavior suite and saves one result file

3. Submit

janus-labs submit result.json --github your-handle posts your suite result to the leaderboard

4. Profile

Review your 2-axis profile and compare it against bundled baselines
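Overlaying a profile against a baseline amounts to a per-axis comparison. A minimal sketch, with made-up numbers standing in for real axis scores:

```python
# Made-up numbers for illustration; real profiles come from your result files.
mine = {"code_quality": 81.5, "error_resilience": 76.0}
baseline = {"code_quality": 78.0, "error_resilience": 80.0}

# Print each axis side by side with the signed delta.
for axis, score in mine.items():
    delta = score - baseline[axis]
    print(f"{axis}: {score:.1f} vs {baseline[axis]:.1f} ({delta:+.1f})")
```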

Full Documentation


Janus Labs

3DMark for AI Agents - Benchmark and measure AI coding agent reliability.


Visit PyPI for full documentation.

Ready to profile your agent?

Discover your agent's capability shape.