About IsItNerfed

We test LLMs and AI coding agents to track whether they get worse over time. We use community voting and automated coding benchmarks to measure performance.

Vibe Check

Users vote on whether models feel “Smarter”, “Same”, or “Nerfed” compared to their usual experience. We track votes for AI coding agents and LLMs over time to show how the community feels about model quality.
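For illustration, here is a minimal sketch of how daily vote tallies could be aggregated. The vote record layout (a timestamp plus a label) and the per-day bucketing are assumptions made for the example, not the actual IsItNerfed schema.

```python
# Minimal sketch of daily Vibe Check tallies.
# The (timestamp, label) vote shape and daily buckets are illustrative assumptions.
from collections import Counter, defaultdict
from datetime import date, datetime

VOTE_LABELS = ("smarter", "same", "nerfed")

def daily_vote_shares(votes: list[tuple[datetime, str]]) -> dict[date, dict[str, float]]:
    """Group votes by day and return each label's share of that day's total."""
    by_day: dict[date, Counter] = defaultdict(Counter)
    for ts, label in votes:
        if label in VOTE_LABELS:
            by_day[ts.date()][label] += 1

    shares: dict[date, dict[str, float]] = {}
    for day, counts in sorted(by_day.items()):
        total = sum(counts.values())
        shares[day] = {label: counts[label] / total for label in VOTE_LABELS}
    return shares
```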

IsItNerfed Dataset

A private set of coding challenges at easy, medium, and hard difficulty levels. Problems cover algorithms, data structures, dynamic programming, graphs, math, and more. Similar to public benchmarks like HumanEval, APPS, and LiveCodeBench, but kept private to prevent data contamination in model training.

Problems span multiple modalities, and models must produce correct solutions, which are validated against unit tests.
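As a rough sketch of what "validated against unit tests" can look like, the snippet below writes a generated solution and the problem's tests into a temporary directory and runs pytest against them. The file names and the pytest invocation are illustrative assumptions, not our actual harness.

```python
# Simplified unit-test validation: write the model's solution and the
# problem's tests to a temp directory, then run pytest and check the exit code.
# File names and invocation are illustrative, not the real harness.
import subprocess
import tempfile
from pathlib import Path

def passes_unit_tests(solution_code: str, test_code: str, timeout: int = 30) -> bool:
    """Return True if the generated solution passes the problem's tests."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "solution.py").write_text(solution_code)
        (workdir / "test_solution.py").write_text(test_code)
        try:
            result = subprocess.run(
                ["python", "-m", "pytest", "-q", "test_solution.py"],
                cwd=workdir,
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # treat hangs as failures
        return result.returncode == 0
```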

Aider Polyglot Benchmark

We run a subset of the Aider Polyglot benchmark, a public benchmark that tests LLMs and AI coding agents on Exercism coding exercises across multiple programming languages. Models solve these exercises through the Aider framework in isolated containers.

Our subset is Python only: all 34 Python Exercism exercises (the full benchmark spans 225 exercises across 6 languages), run single-threaded with one attempt per test case.
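Conceptually, the run loop looks something like the sketch below: each exercise gets a single attempt, exercises run one at a time, and each attempt is isolated in its own container. The directory layout, Docker image name, and container entrypoint are placeholders for illustration, not the real configuration.

```python
# Rough sketch of the benchmark loop: one attempt per exercise, single-threaded,
# each attempt isolated in its own container.
# The exercises path, image name, and entrypoint below are placeholders.
import subprocess
from pathlib import Path

EXERCISES_DIR = Path("polyglot-benchmark/python/exercises")  # assumed layout

def run_python_subset(model: str) -> float:
    """Run each Python Exercism exercise once and return the success rate."""
    exercises = sorted(p for p in EXERCISES_DIR.iterdir() if p.is_dir())
    solved = 0
    for exercise in exercises:
        result = subprocess.run(
            [
                "docker", "run", "--rm",
                "-v", f"{exercise.resolve()}:/work",
                "isitnerfed/aider-bench",          # placeholder image name
                "run-exercise", "--model", model,  # placeholder entrypoint
            ],
            capture_output=True,
        )
        solved += result.returncode == 0
    return solved / len(exercises)
```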

How to Read the Charts

Charts show a success rate for each model over time. Higher is better — 85% means the model solved 85 out of 100 test cases.

You can change the time range from 2 days to 60 days. The SMA (Simple Moving Average) smooths out daily changes to show the overall trend.
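For reference, the chart math reduces to two small functions, sketched below: a per-run success rate and a simple moving average over daily rates. The 7-day window shown is an arbitrary example, not necessarily the window the charts use.

```python
# Sketch of the chart math: a success rate per run and an N-day simple
# moving average. The default window size is an arbitrary example.
def success_rate(solved: int, total: int) -> float:
    """E.g. 85 solved out of 100 test cases -> 0.85 (shown as 85%)."""
    return solved / total

def simple_moving_average(daily_rates: list[float], window: int = 7) -> list[float]:
    """Average each day's rate with the preceding days inside the window."""
    sma = []
    for i in range(len(daily_rates)):
        start = max(0, i - window + 1)
        chunk = daily_rates[start : i + 1]
        sma.append(sum(chunk) / len(chunk))
    return sma
```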