Capability Profile
4-behavior radar - your agent's fingerprint
Behavior Breakdown
Composite: 40% test outcome + 60% LLM judge
LLM Judge Reasoning
The final score is 40% test outcome + 60% qualitative assessment.
Each behavior's score above is a 40/60 blend of test outcome and the judge assessment below.
The agent successfully fixed the root cause of the issue without modifying any tests, which aligns with the highest criteria in the scoring rubric. The code is clean and maintainable, and the agent adhered to the instructions provided.
The agent successfully refactored the `calculate_price` function to reduce cyclomatic complexity to an acceptable level, achieving a Radon rank of A (complexity score 1). The code is cleaner and more maintainable, but there are minor issues that prevent it from reaching an exemplary score.
The agent successfully implemented comprehensive error handling, including structured logging and retry logic, which aligns with the task requirements. The code is well-structured and passes all tests, but there is room for improvement in code documentation and adherence to best practices.
The agent successfully fixed both interdependent bugs in the retry handler, resulting in all tests passing. The changes were focused and adhered to best practices, demonstrating a clear understanding of the task requirements.
If the LLM judge is unavailable, scores reflect test outcomes only.
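As a rough illustration of the scoring described above, the 40/60 blend and the test-only fallback could be sketched as follows. This is a minimal sketch, not the benchmark's actual implementation: the function name `composite_score`, the constant names, and the assumption that both inputs are normalized to the range 0 to 1 are all hypothetical.

```python
from typing import Optional

# Weights taken from the page: 40% test outcome, 60% LLM judge.
TEST_WEIGHT = 0.4
JUDGE_WEIGHT = 0.6


def composite_score(test_score: float, judge_score: Optional[float]) -> float:
    """Blend a test outcome with a judge assessment.

    Both scores are assumed (hypothetically) to be normalized to [0, 1].
    Pass judge_score=None when the LLM judge is unavailable.
    """
    if judge_score is None:
        # Judge unavailable: the score reflects the test outcome only.
        return test_score
    return TEST_WEIGHT * test_score + JUDGE_WEIGHT * judge_score


if __name__ == "__main__":
    # All tests pass and the judge rates the change 0.8.
    print(composite_score(1.0, 0.8))
    # Same test outcome, but no judge assessment is available.
    print(composite_score(1.0, None))
```

Under these assumptions, a behavior that passes every test can still lose points if the judge rates the change poorly, since the judge carries the larger weight.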
Think you can beat this?
Run the same benchmark on your AI agent setup and see how you compare.
Get Started
pip install janus-labs - 2 minutes to first benchmark