
We Stopped Fixing Flaky Tests and Started Defining Good Ones


Our test suite looked healthy from the outside. Coverage was extensive. Features were well represented. And yet, when a test failed, nobody trusted it.

I remember the exact moment I realized we had a problem. I was in a standup, and someone mentioned a failed test. Without even looking up, another engineer said, “Just rerun it.” No investigation. No curiosity. Just…rerun it.

That became our reflex. Failures were shrugged off. Reruns were normal. CI noise was just part of the morning routine. That’s when it hit me:

The problem wasn’t flakiness. The problem was that we never defined what a good test actually is.

Coverage Without Trust Is Just Noise

Like most mature platforms, our flakiness came from everywhere: hard waits, brittle selectors, copy-pasted setup, unmaintained plugins, and tests that shortcut the real user flow.

Each issue alone felt manageable. Together, they created a suite that technically “worked” but emotionally didn’t. That’s where we were.

Instead of Fixing Tests, We Redefined Them

Someone asked the question that changed everything:

“Why are we even trying to fix these one by one?”

We had been asking:

“How do we fix this flaky test?”

We should have been asking:

“What qualities must a test have to be trusted long-term?”

That question gave us permission to step back. To stop patching and start thinking about what “good” actually meant for us.

From that conversation, we created something we called the uplifted test.

What Is an “Uplifted Test”?

An uplifted test isn’t just a passing test. It’s not even just a stable test.

It’s a test that consistently meets a high bar across nine quality characteristics. Not “most of them.” All nine.

These tests became our north star: the ones we point to in reviews and model new tests on.

They represent how we want tests to be written, not just how they happen to work today.

The 9 Characteristics of an Uplifted Test

1. Readable

Clear structure, descriptive names, logical flow.

If someone has to ask you what a test does, it’s already failing. I started using the “explain it to your manager” test—if you can’t describe what’s being tested in one sentence, neither can your code.
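Here’s the shape we aim for, as a rough sketch (the route, selectors, and copy are made up, not from our suite): the title reads like a sentence, and the body follows an arrange/act/assert flow.

```js
describe('Checkout', () => {
  it('shows an order confirmation after paying with a saved card', () => {
    // Arrange: start from a cart that already has an item in it
    cy.visit('/cart');

    // Act: walk through payment the way a user would describe it
    cy.get('[data-testid="proceed-to-payment"]').click();
    cy.get('[data-testid="use-saved-card"]').check();
    cy.get('[data-testid="pay-now"]').click();

    // Assert: one outcome, readable as a sentence
    cy.get('[data-testid="order-confirmation"]')
      .should('be.visible')
      .and('contain', 'Order confirmed');
  });
});
```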

2. Maintainable

Modular, uses custom commands where intent matters, built to expect change.

A test that breaks every time someone adjusts padding isn’t strict—it’s brittle. There’s a difference.
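Custom commands are one way we keep intent separate from mechanics. A minimal sketch, with a hypothetical addItemToCart helper and made-up selectors:

```js
// cypress/support/commands.js
Cypress.Commands.add('addItemToCart', (itemName) => {
  // The command names the intent; the selectors are the implementation
  // detail that can change without rewriting every spec.
  cy.contains('[data-testid="product-card"]', itemName)
    .find('[data-testid="add-to-cart"]')
    .click();
  cy.get('[data-testid="cart-count"]').should('not.have.text', '0');
});

// In a spec, the test now reads at the level the feature is described at:
// cy.addItemToCart('USB-C cable');
```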

3. DRY (Don’t Repeat Yourself)

Shared setup and reusable patterns.

We had tests with the same 15-line login sequence copy-pasted 40 times. When the login flow changed, we updated… 37 of them. Guess what happened with the other three.
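The fix, conceptually (the code below is illustrative, not our actual helper), was moving login behind one command. With cy.session, the cached session also makes it fast:

```js
// cypress/support/commands.js
Cypress.Commands.add('login', (email, password) => {
  // cy.session caches the logged-in state, so the full flow runs once
  // per unique set of credentials instead of before every test.
  cy.session([email, password], () => {
    cy.visit('/login');
    cy.get('[data-testid="email"]').type(email);
    cy.get('[data-testid="password"]').type(password, { log: false });
    cy.get('[data-testid="submit"]').click();
    cy.url().should('not.include', '/login');
  });
});

// One place to update when the login flow changes:
beforeEach(() => {
  cy.login('qa-user@example.com', Cypress.env('QA_PASSWORD'));
});
```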

4. Uses Modern, Maintained Libraries

Relies on supported APIs and actively maintained tools.

That plugin you found on npm with 47 stars and last updated in 2021? Yeah, that’s your next source of flakiness.

5. Follows Best Practices

No hard waits. Proper assertions. Respects async behavior.
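The most common fix here, sketched with a made-up route and alias: replace the guess (cy.wait(5000)) with a wait on the thing you actually care about.

```js
// Brittle: a hard-coded wait that guesses how long the request takes
cy.get('[data-testid="save"]').click();
cy.wait(5000);
cy.get('[data-testid="toast"]').should('contain', 'Saved');

// Better: intercept the request, wait on it by alias, then assert
cy.intercept('POST', '/api/settings').as('saveSettings');
cy.get('[data-testid="save"]').click();
cy.wait('@saveSettings').its('response.statusCode').should('eq', 200);
cy.get('[data-testid="toast"]').should('contain', 'Saved');
```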

6. Minimal API Calls

Uses APIs intentionally, not as a shortcut to avoid UI instability.

Look, I get it. API calls are faster and more reliable. But an end-to-end test that spends 90% of its time hitting APIs isn’t testing the flow end to end. It’s testing your API client.
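The balance we settled on, roughly (endpoints and data below are hypothetical): use the API for setup you’re not testing, and the UI for the flow the test actually claims to cover.

```js
it('lets a user archive a project from the dashboard', () => {
  // Setup through the API: fast, reliable, and not what this test is about
  cy.request('POST', '/api/test-data/projects', { name: 'Q3 Launch' });

  // The behaviour under test happens through the UI, end to end
  cy.visit('/dashboard');
  cy.contains('[data-testid="project-row"]', 'Q3 Launch')
    .find('[data-testid="archive"]')
    .click();
  cy.contains('[data-testid="project-row"]', 'Q3 Launch').should('not.exist');
});
```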

7. Good Locators

Prefers data-testid and semantic selectors.

cy.get('.css-1s2u09g-control') is not a locator. It’s a time bomb. Selectors are contracts between your tests and your UI, not wild guesses.
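The same button, located three ways (the names are illustrative):

```js
// Time bomb: a generated class that changes whenever the styling library rebuilds
cy.get('.css-1s2u09g-control');

// Better: a semantic query tied to what the user actually sees
cy.contains('button', 'Add to cart');

// Most stable: an explicit contract between the test and the UI
cy.get('[data-testid="add-to-cart"]');
```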

8. Balanced Sensitivity

Not so strict that harmless changes break it. Not so loose that regressions slip through.

Good tests fail for the right reasons. If a button stops working and nothing fails, your tests are underfit.
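A sketch of what we mean, with made-up selectors and values: assert the outcome a user would notice, not the incidental markup around it.

```js
// Overfit: fails when someone renames a class or reorders children,
// even though the feature still works
cy.get('[data-testid="cart"]')
  .should('have.class', 'cart--v2')
  .children()
  .should('have.length', 3);

// Balanced: fails when the behaviour the user cares about breaks
cy.get('[data-testid="cart-item"]').should('contain', 'USB-C cable');
cy.get('[data-testid="cart-total"]').should('contain', '$12.99');
```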

9. Tests Real User Flows

Real navigation, real interactions, real outcomes.

We had a test that verified a modal by directly manipulating the DOM to make it visible. Passed every time. Found exactly zero real bugs.
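The rewrite looked roughly like this (selectors and copy are hypothetical): no DOM surgery, just the click a real user makes.

```js
// What we had: force the modal into existence, then "verify" it
// cy.get('[data-testid="upgrade-modal"]').invoke('show');

// What we want: the path a user actually takes has to work
cy.visit('/account');
cy.get('[data-testid="upgrade-plan"]').click();
cy.get('[data-testid="upgrade-modal"]')
  .should('be.visible')
  .and('contain', 'Choose your plan');
```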

What Changed Once We Set the Bar

Once the nine characteristics were explicit, something subtle happened.

Code reviews got… easier? Less personal?

Instead of:

“I just don’t like how this test is written.”

We started saying:

“This doesn’t meet the bar on readability,” or “These aren’t locators we’d trust long-term.”

Quality stopped being about feelings. It became about alignment with something we’d all agreed on.

The arguments didn’t go away, but they got more productive.

Why Flakiness Stopped Feeling Random

Before, flaky tests felt like bad luck. Like we’d angered the CI gods somehow.

After, patterns were obvious.

The tests that failed most often weren’t unlucky. They were the ones leaning on hard waits, fragile selectors, copy-pasted setup, and shortcuts around the real user flow.

Once you define what good looks like, bad tests stop hiding.

Flakiness wasn’t the mystery. Undefined quality was.

This Was Never About Perfection

Here’s what we’re not saying: every test must be uplifted.

Some tests exist to cover weird edge cases. Some are temporary debugging tools. Some are just fine being okay—they don’t justify the investment to make them great.

The shift was this:

Tests were no longer accidentally bad. They were consciously not uplifted.

That distinction matters more than any metric could.

We knew which tests were solid. We knew which ones weren’t. And most importantly, we knew why.

The Unexpected Benefit: Trust Came Back

This is the part I didn’t expect.

Failures started meaning something again.

A red build used to mean “eh, probably flaky.” Now it means “something actually broke.”

Engineers stopped ignoring CI. They started investigating. QA stopped babysitting reruns and started improving coverage. Our Slack #test-failures channel went from constant noise to genuine signal.

We didn’t reduce flakiness by adding more cy.wait() calls.

We reduced it by tightening our thinking.

Final Thought

If your test suite feels flaky, you can keep stabilizing individual tests.

Or you can step back and ask:

“Does this test deserve to be trusted?”

If you can’t answer that clearly, if your team doesn’t share a common definition of what “good” even means, then the flakiness isn’t a bug.

It’s feedback.

And maybe it’s time to listen ✌️