Capability Profile
4-behavior radar - your agent's fingerprint
Behavior Breakdown
Composite: 40% test outcome + 60% LLM judge
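As a rough illustration of that blend, here is a minimal sketch in Python. It assumes both inputs are on the same 0-100 scale, which the page does not state. The function and constant names are hypothetical and not part of the janus-labs package. The test-only fallback mirrors the page's note that scores reflect test outcomes only when the LLM judge is unavailable.

```python
from typing import Optional

# Weights taken from the page: 40% test outcome, 60% LLM judge.
TEST_WEIGHT = 0.4
JUDGE_WEIGHT = 0.6


def composite_score(test_outcome: float, judge_score: Optional[float] = None) -> float:
    """Blend a test outcome with a judge score (hypothetical helper).

    Both values are assumed to share one scale (for example 0-100).
    If no judge score is available, the test outcome is returned unchanged.
    """
    if judge_score is None:
        # Judge unavailable: the score reflects the test outcome only.
        return float(test_outcome)
    return TEST_WEIGHT * test_outcome + JUDGE_WEIGHT * judge_score


if __name__ == "__main__":
    # All tests passed (100) and the judge gave 50.
    print(composite_score(100, 50))
    # Judge unavailable: falls back to the test outcome.
    print(composite_score(100))
```

Because the judge carries the larger weight, a behavior can pass every test and still land well below the maximum if the judge finds issues with the change itself.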
LLM Judge Reasoning
Each behavior's score above is a 40/60 blend of test outcome and the judge assessment below.
The agent successfully fixed the root cause of the issue without modifying the tests, achieving a high score on the rubric. The code is clean and maintainable, and all tests passed, demonstrating adherence to the instructions and efficient problem-solving.
The agent successfully reduced the cyclomatic complexity of `calculate_price` from 18 to 2, which meets the target. The code changes were effective, and all tests passed, but there may be minor code issues that prevent a higher score.
The agent implemented comprehensive error handling for various scenarios, demonstrating structured logging and graceful degradation. However, the absence of retry logic or full traceability prevented a higher score.
The agent successfully fixed both interdependent bugs in the retry handler and all tests passed. The changes were focused, but there may have been room for a smaller diff to achieve a higher score.
If the LLM judge is unavailable, scores reflect test outcomes only.
Think you can beat this?
Run the same benchmark on your AI agent setup and see how you compare.
Get Started
pip install janus-labs - 2 minutes to first benchmark