Why we grade our own AI-generated tests (and you should too)
There is a dirty secret in AI-powered testing: most AI-generated tests are garbage.
Not all of them. But enough of them that the QA community has noticed. One widely-discussed case involved an AI tool that generated 188 test cases and claimed a 100% pass rate on code that could not even compile. The tests looked professional. They had steps, expected results, preconditions. They were also completely worthless.
This is not an edge case. It is the default behavior of every AI test generator on the market, including ours.
The problem: AI optimizes for looking good
When you ask an LLM to generate test cases, it optimizes for structure and completeness. It produces tests that look like tests: numbered steps, expected results, proper formatting. What it does not optimize for is whether those tests actually catch bugs.
Here is what "assertion-thin" looks like in practice:
| What the AI generates | What it actually verifies |
|---|---|
| "Verify the page loads successfully" | That the HTTP response is not a 500 |
| "Check that the form submits" | That clicking Submit does not crash |
| "Ensure the dashboard displays correctly" | That something renders |
These tests will pass on a broken application. A login form that accepts any password? The test says "verify login succeeds" and it does. A payment flow that charges the wrong amount? The test says "verify payment completes" and it does.
The fundamental issue: the AI that wrote the test cannot objectively evaluate whether the test is good. It generated the output. Asking it to judge its own work produces the same optimistic bias that created the problem.
What we built: LLM-as-Judge
At BetterQA, we have a saying: the chef should not certify his own dish. That applies to our AI too.
BugBoard already generates test cases using AI. You describe a feature, upload a screenshot, or paste requirements, and the AI produces 15-20 test cases in under 30 seconds. That part works well. The problem was that we had no way to tell you which of those 15 tests were strong and which were filler.
So we added a second AI pass. After generation completes, a separate LLM call (different prompt, temperature 0.0, acting as an independent reviewer) grades every test case on four dimensions:
The four quality dimensions
| Dimension | What it checks | Score range |
|---|---|---|
| Assertion depth | Does the test verify meaningful outcomes, not just "page loads"? Are there checks for specific values, state changes, and error messages? | 0-100 |
| Business logic | Does the test cover actual business rules? Would catching a failure here prevent real user impact? | 0-100 |
| Edge coverage | Are there boundary conditions, error paths, or negative scenarios? | 0-100 |
| Reproducibility | Can this test run independently without timing dependencies, specific data states, or external service availability? | 0-100 |
Each test case gets an overall score (average of all four) and a verdict:
- Strong (avg >= 70, at most 1 flag): This test is ready to use
- Needs Review (everything between the Strong and Weak thresholds): A human should look at this before trusting it
- Weak (avg < 40 or 3+ flags): This test probably tests nothing useful
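The verdict rules above can be sketched as a small function. The function and field names are illustrative assumptions, not BugBoard's internal code:

```python
from statistics import mean

# Sketch of the verdict rules described above; names are assumptions.
def verdict(scores: dict[str, int], flags: list[str]) -> str:
    """scores: the four dimension scores (0-100); flags: judge-raised flags."""
    avg = mean(scores.values())
    if avg < 40 or len(flags) >= 3:
        return "Weak"          # probably tests nothing useful
    if avg >= 70 and len(flags) <= 1:
        return "Strong"        # ready to use
    return "Needs Review"      # a human should look first

print(verdict({"assertion_depth": 85, "business_logic": 78,
               "edge_coverage": 72, "reproducibility": 90}, []))
# -> Strong
```

Note that the Weak check runs first: a high-scoring test with three or more flags is still demoted, since the flags indicate structural problems the average can hide.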
What the flags tell you
The judge does not just score. It flags specific problems:
| Flag | What it means |
|---|---|
| assertion_thin | Single assertion, or assertions that only check truthiness |
| happy_path_only | No negative scenarios, no error handling tested |
| no_edge_cases | Missing boundary conditions (empty input, max length, special characters) |
| not_reproducible | Depends on timing, specific data state, or external services |
| vague_expected | Expected results are imprecise ("should work correctly") |
| missing_preconditions | No setup described, test assumes state without establishing it |
A test flagged as "assertion_thin" and "happy_path_only" tells you exactly what to fix: add assertions that check specific values and add a negative test case. That is actionable. A generic "this test needs improvement" is not.
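One way to see why flags are actionable: each one maps to a concrete fix. The mapping below is an illustrative sketch; the remediation wording is an assumption, not BugBoard's actual copy:

```python
# Illustrative mapping from judge flags to concrete fixes;
# the remediation text is an assumption, not BugBoard's actual copy.
FIXES = {
    "assertion_thin": "Add assertions that check specific values, not just truthiness.",
    "happy_path_only": "Add at least one negative scenario that exercises error handling.",
    "no_edge_cases": "Cover boundaries: empty input, max length, special characters.",
    "not_reproducible": "Remove timing/data/external-service dependencies, or stub them.",
    "vague_expected": "Replace 'should work correctly' with a measurable expected result.",
    "missing_preconditions": "State the setup the test assumes and establish it explicitly.",
}

def remediation(flags: list[str]) -> list[str]:
    """Turn the judge's flags into a concrete to-do list for the tester."""
    return [FIXES[f] for f in flags if f in FIXES]

for fix in remediation(["assertion_thin", "happy_path_only"]):
    print("-", fix)
```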
Why this matters more than you think
The QA community is right to be skeptical of AI-generated tests. The criticism is not that AI cannot write tests. It is that AI writes tests that look like they work without actually working. And the tools that generate them have no mechanism to tell you which ones are real and which ones are theater.
This is the exact same problem that exists with manual test writing, just accelerated. A junior QA engineer writing "verify it works" is the same failure mode as an AI writing "verify the page loads successfully." The difference is that the AI produces 20 of these in 30 seconds, so the bad ones get buried in volume.
The solution is not to stop using AI for test generation. The solution is to grade the output with the same rigor you would apply to a human tester's work.
How it works in BugBoard
- Generate: Describe your feature, upload a screenshot, or paste requirements
- Wait: AI produces test cases (15-20 in under 30 seconds)
- Judge: A second AI pass scores every test case (appears within 10-15 seconds)
- Act: Strong tests go straight to your test suite. Weak tests get flagged. You decide what to do with the rest.
The quality badges appear inline on each test case card. You do not need to go to a separate dashboard or run a report. When you regenerate or refine a test case, the old scores are cleared and the judge runs again on the new content.
The honest part
We could have shipped the quality scoring without showing you the weak results. Filter them out, only show the strong ones, claim a higher quality rate. Several competitors do exactly this.
We chose not to. If the AI generated a weak test, you should know. You might still want it. You might want to fix it. You might want to delete it. But hiding it from you is the same behavior that produced the "188 passing tests on broken code" incident.
Independent QA means grading yourself honestly. That applies to the tools just as much as the teams.