Quick Summary 

A test suite optimization engagement for a SaaS platform’s Playwright regression suite: runtime cut from 44 minutes to under 6 in ten weeks, total tests reduced by 50%, flake rate down from 14% to 3%. Below: the diagnostic, the three-layer architecture, AI-assisted flake clustering, and lessons learned. 

If your regression suite has crept past forty minutes, if engineers have stopped running it before pushing, and if your SDETs spend more time triaging flakes than writing new tests, you are looking at the classic symptoms of a slow test suite – and the problem is almost certainly not your framework. It is suite drift, and it compounds until every release feels like a coin flip. Test suite optimization is the structural fix, and it starts with admitting the framework is not to blame. 

These are the exact symptoms our team diagnosed last year for a mid-sized SaaS platform. Their Playwright suite had grown from 9 to 44 minutes over three years, SDETs spent 30-40% of each sprint on flake triage, and the production hotfix rate had crept from 1% to 7%. Engineering had stopped trusting green builds. 

Why Is My Test Suite So Slow?  

The Diagnosis 

Every test suite optimization project we take on starts with the same structural audit, and this one surfaced the classic profile: 2,412 tests, with the top 40 catching over 95% of production-grade regressions. The rest was redundant coverage (the same login flow validated seven ways), dead paths (tests for features removed 18 months earlier), and UI-heavy scenarios that belonged in unit tests. Three patterns dominated. 

[Figure: Test suite composition before vs. after, showing reduced redundant tests and increased signal coverage]

  • Overgrowth. Every bug ticket had spawned a test. Every feature had added several. Nobody had deleted anything. The suite was, effectively, a write-only log.  
  • Poor prioritization. The login flow – which breaks the business if it fails – sat in the same nightly bucket as a cosmetic check on a settings page no paying customer had visited in two months.
  • Flakiness. 14% of tests failed on retry. That is not a test suite; it is a coin flip. It had trained engineers to mark failures as “probably flaky” before even opening the log.

Regression runtime and test reliability consistently rank among the top QA productivity concerns in industry surveys, and both degrade sharply with suite age. This is a structural problem, not a team-skill one. 

Why Most Test Suite Optimization Efforts Stall  

The client had tried the obvious moves to speed up their slow test suite, and they were not wrong to try.  

Manual duplicate culling deleted ~180 tests over two sprints; three months later the suite was larger than before. Parallelization on 8 Playwright workers dropped wall-clock time from 44 to 22 minutes – a real win that masked the underlying bloat, and 18 months later those 22 minutes had drifted back to 38. Scaling the SDET team added more engineers, more tests, same trust problem.  

These moves buy a quarter. Then the suite grows back, because the process that created the bloat never changed. Engineering teams that defer structural intervention often face the same technical-debt compounding pattern we documented in our legacy app modernization case study – each additional quarter of drift multiplies the rebuild cost.

Our Test Suite Optimization Approach: Telemetry First, AI Second  

Test suite optimization done right starts with telemetry, not tooling. Before writing a line of code to fix the slow test suite, we spent two weeks instrumenting. We captured per-test duration, pass/fail history, flake rate, last-failure date, and ownership for every test. This diagnostic phase – the one most teams skip – shaped every architectural decision that followed.  

We anchored on three choices. First, Playwright test sharding combined with impact analysis as the primary runtime lever – not simply adding more workers. Second, a segmented suite design (Smoke, Regression, Edge) with a strict contract for what belonged in each layer. Third, Allure Report for historical trend data, not just pass/fail snapshots – the trend view shows which tests add signal and which add noise.  

We deliberately did not start with AI tooling. AI works best on mature suites with clean telemetry. On a chaotic suite it produces confident-sounding noise. 

Three-Layer Test Suite Architecture: Smoke, Regression, Edge  

The three-layer split is the operational centerpiece of any meaningful test suite optimization – not because the pattern is fashionable, but because it directly addressed the all-or-nothing run that everyone had learned to ignore.

[Figure: Three-layer test suite architecture showing smoke, regression, and edge tests with triggers, runtime, and responsibilities]
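In Playwright, this kind of split can be expressed as tag-scoped projects. A minimal sketch – the tag names, timeouts, and retry counts here are illustrative assumptions, not the client’s actual configuration:

```typescript
// playwright.config.ts — hypothetical sketch of a three-layer split via tags.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  projects: [
    // Smoke: runs on every PR, must stay fast, blocks merges.
    { name: 'smoke', grep: /@smoke/, retries: 0, timeout: 15_000 },
    // Regression: impact-based subset on PRs, full run nightly and pre-release.
    { name: 'regression', grep: /@regression/, retries: 1 },
    // Edge: weekly cadence, deliberately slow, never in the PR flow.
    { name: 'edge', grep: /@edge/, timeout: 120_000 },
  ],
});
```

The strict contract lives in the `grep` filters: a test joins a layer by carrying the layer’s tag in its title, and CI selects a layer with `npx playwright test --project=smoke`.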

Smoke – The Four-Minute Trust Signal  

Contains only tests that validate business-critical flows: login, core navigation, form submission, payment. In this case, 84 tests. Runs on every PR, completes in under four minutes, and is the only layer that blocks merges. Every test here has to earn its slot by protecting revenue or user-facing functionality. 

[Figure: Playwright smoke test verifying login flow and dashboard load via critical-path assertions]
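A sketch of the smoke-layer pattern described here; the selectors, labels, route, and credentials are hypothetical placeholders rather than the client’s code:

```typescript
import { test, expect } from '@playwright/test';

// Tagged @smoke so the PR-gate project selects it.
test('login reaches the dashboard @smoke', async ({ page }) => {
  await page.goto('/login');
  await page.getByLabel('Email').fill('user@example.com');
  await page.getByLabel('Password').fill('correct-horse-battery');
  await page.getByRole('button', { name: 'Sign in' }).click();

  // Assert the business outcome, not UI details: the user landed on the
  // dashboard and the primary navigation rendered.
  await expect(page).toHaveURL(/\/dashboard/);
  await expect(page.getByRole('navigation')).toBeVisible();
});
```

Note this file runs under the Playwright test runner (`npx playwright test`), not as a standalone script.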

Why this works 

The @smoke tag lets the runner select these tests for the PR gate. The assertions check the business outcome – did the user reach the dashboard with primary navigation loaded? – rather than UI details that change without breaking functionality. Within two sprints, engineers resumed running smoke locally before pushing.

Regression – Impact-Based, Not Exhaustive  

Covers feature-level validations and integration paths. Historically ran in full on every PR; we switched it to impact-based test selection – a code-file-to-test dependency map that runs only the tests relevant to the diff. For 80% of PRs, the subset runs in 4-6 minutes. Full regression runs nightly and pre-release.  
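The core of impact-based selection is a dependency map from source files to the specs that exercise them. A deliberately minimal sketch – the paths and map contents are invented for illustration, not taken from the client’s codebase:

```typescript
// Hypothetical file-to-test dependency map for impact-based selection.
type DepMap = Record<string, string[]>;

const exampleMap: DepMap = {
  'src/auth/login.ts': ['tests/auth.spec.ts', 'tests/smoke.spec.ts'],
  'src/billing/invoice.ts': ['tests/billing.spec.ts'],
  'src/ui/theme.ts': [], // cosmetic module: no regression specs depend on it
};

// Given the files changed in a PR diff, return the unique set of impacted specs.
function selectImpactedTests(changedFiles: string[], map: DepMap): string[] {
  const impacted = new Set<string>();
  for (const file of changedFiles) {
    for (const spec of map[file] ?? []) impacted.add(spec);
  }
  return [...impacted].sort();
}
```

In practice the map is generated from coverage or build-graph data rather than maintained by hand, and the selected subset is passed to the runner as a file list.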

Edge – Exploratory, Scheduled, Deliberately Slow  

Long-tail tests: uncommon permission combinations, timezone edge cases, data-migration scenarios. Weekly cadence, dedicated triage owner. Moving Edge out of the PR flow made its failures readable – when 20 slow tests fail inside a run of 200, each one gets attention. 

Don’t have a sprint to spare for this?

The architecture is not the hard part; instrumenting a slow test suite is. Our team at ScriptsHub Technologies packages test suite optimization as a standalone two-week engagement – diagnostic telemetry plus a prioritized modernization roadmap. Work is delivered by our QA and testing team alongside senior SDETs. Learn more at scriptshub.net.

With the architecture settled, the next question was the flakes.

How to Fix Flaky Tests With Pattern-Based Clustering  

AI tools have been oversold in QA. Used carefully, they earned a slot in two specific workflows. 

[Figure: AI-assisted QA workflow showing redundancy detection, human review, and flake clustering]

Redundancy detection. We ran semantic clustering on test titles, assertions, and setup code to flag tests that were the same check written differently by different engineers over time. The model surfaced 412 candidate duplicates. A senior SDET reviewed each cluster – AI does not know which of two near-identical tests is worth keeping – and confirmed 318 for deletion.  
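A drastically simplified sketch of the idea: the engagement used semantic clustering, but token-overlap (Jaccard) similarity on test titles illustrates the mechanism. All titles and the threshold below are invented for illustration:

```typescript
// Jaccard similarity between two test titles, on lowercase word tokens.
function jaccard(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\W+/).filter(Boolean));
  const tb = new Set(b.toLowerCase().split(/\W+/).filter(Boolean));
  const inter = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : inter / union;
}

// Pair up titles whose similarity exceeds a threshold, for human review.
function duplicateCandidates(titles: string[], threshold = 0.6): [string, string][] {
  const pairs: [string, string][] = [];
  for (let i = 0; i < titles.length; i++)
    for (let j = i + 1; j < titles.length; j++)
      if (jaccard(titles[i], titles[j]) >= threshold) pairs.push([titles[i], titles[j]]);
  return pairs;
}
```

The output is only a candidate list – as the article stresses, a human decides which of two near-identical tests survives.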

Flake clustering. Failure-pattern analysis on twelve months of Allure history grouped flaky tests by likely root cause: network timing, stale selectors, order-dependent state, async race conditions. Four clusters accounted for 71% of all flakes. Fixing the top cluster – async state reset in the test harness – eliminated 38% of flakes in a single sprint.  
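Pattern-based clustering can be sketched as a classifier over failure messages. The regexes below are illustrative stand-ins for the production classifier, matched against typical Playwright error text:

```typescript
// Hypothetical root-cause buckets keyed by failure-message patterns.
const clusters: [string, RegExp][] = [
  ['network-timing', /timeout.*waiting for (load state|response)/i],
  ['stale-selector', /element (is )?not attached|resolved to hidden/i],
  ['async-race', /navigation interrupted|execution context was destroyed/i],
];

function classifyFailure(message: string): string {
  for (const [name, pattern] of clusters) {
    if (pattern.test(message)) return name;
  }
  return 'unclassified';
}

// Tally a run history into per-cluster counts to find the biggest offender.
function clusterFailures(messages: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const m of messages) {
    const c = classifyFailure(m);
    counts[c] = (counts[c] ?? 0) + 1;
  }
  return counts;
}
```

Sorting the resulting counts is what surfaces the “four clusters cause 71% of flakes” shape and tells you which fix to ship first.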

The anti-pattern that dominated the top cluster was an async race from waiting on global network state in an Angular app – flaky on ~6% of runs. 

[Figure: Flaky Playwright test using a networkidle wait and UI selectors for the complaint submission flow]

[Figure: Stable Playwright test waiting for element visibility and an enabled button before submitting the complaint]
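The before/after pattern can be sketched in one test; the selectors, labels, and route are hypothetical – the wait strategy is the point:

```typescript
import { test, expect } from '@playwright/test';

test('complaint submission @regression', async ({ page }) => {
  await page.goto('/complaints/new');

  // Flaky version (removed): long polling and analytics beacons keep the
  // network perpetually busy on Angular apps, so this times out intermittently.
  // await page.waitForLoadState('networkidle');

  // Stable version: scope the wait to the exact element the test needs.
  const form = page.locator('form.complaint-form');
  await form.waitFor({ state: 'visible' });

  await form.getByLabel('Description').fill('Late delivery');

  // Gate the click on Angular's validation pass enabling the button.
  const submit = page.locator('button[type="submit"]:not([disabled])');
  await submit.click();

  await expect(page.getByText('Complaint received')).toBeVisible();
});
```

As above, this file is run by the Playwright test runner, not as a standalone script.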

Why this works 

waitForLoadState('networkidle') is unreliable on Angular apps because long-polling endpoints and analytics beacons keep the network perpetually busy. Swapping it for an element-level waitFor({ state: 'visible' }) scopes the wait to exactly what the test needs. The :not([disabled]) selector gates on Angular’s validation pass completing. Applied across the 47 tests in this cluster, this single change eliminated roughly 38% of flakes across the entire suite.

What AI never did. It did not write the fixes. It did not decide which tests to keep. The test suite optimization architecture itself was human work, guided by telemetry. AI did the pattern matching humans are slow at, and handed the result back to engineers who made the calls. AI for structure, humans for judgment. Teams that want to apply this pattern-matching approach to their own suite without building the embeddings infrastructure in-house can run it through an AI consulting engagement – we bring the models, they keep the deletion authority.

Test Suite Optimization Results: 44 Minutes to Under 6  

Ten weeks of incremental test suite optimization – no big-bang rewrite – produced measurable change against the diagnostic-phase baseline. 

[Figure: QA results table showing reduced test count, faster runtimes, lower flake rate, and improved release stability]

The smoke + impact-based regression combination was the highest-leverage change. Deleting half the suite was the emotionally hardest. Fixing flakes rebuilt credibility with engineering. Within a month, the QA team had stopped apologizing in standups – a shift that mattered more than any runtime number on the dashboard. 

The Four-Question Test Suite Health Check  

We have since distilled this test suite optimization engagement into a rubric any team can apply monthly to spot drift early. 

[Figure: Four-question test suite health check covering runtime, impact coverage, flake causes, and test deletion practices]

  1. What is my smoke suite’s runtime, and how often does it run locally? If it is over five minutes, or engineers are not running it before pushing, the layer is not doing its job. 
  2. What percentage of my test runs are impact-based, not full-suite? Below 50% is an opportunity.  
  3. What is my flake rate, and do I know which four patterns cause 70% of it? If you cannot name them, your suite is producing noise, not signal.  
  4. When was the last time I deleted a test on purpose? If the answer is “I cannot remember,” your suite is writing checks your team cannot cash. 

Is Your Test Suite Slowing Your Team Down?  

Smarter testing beats more testing. A lean, well-prioritized, well-observed suite catches more real bugs in eight minutes than a slow test suite does in forty. Your test suite should be a confidence engine, not a Friday-afternoon ritual. If it is not, the answer is not more tests – it is better ones.  

At ScriptsHub Technologies, we deliver test suite optimization across SaaS, healthcare, and education platforms as part of our full services. Each engagement is tailored to the framework, domain, and team structure. If your regression runtime has drifted past 20 minutes, your flake rate is above 5%, or engineering has quietly stopped trusting green builds, we should talk.  

Request a complimentary Test Suite Health Assessment – a 45-minute diagnostic call plus a written report covering your suite’s runtime distribution across smoke/regression/edge layers, the top five flake root-cause patterns in your current suite, redundancy candidates from a semantic-clustering pass, and a prioritized quick-wins roadmap. Delivered within 10 business days. No commitment. Reach us at info@scriptshub.net or explore more engineering case studies on our blog.

Frequently Asked Questions

1. What is test suite optimization?

Test suite optimization is the systematic process of reducing test runtime, eliminating flaky tests, and pruning redundancy without losing coverage. It combines telemetry analysis, architectural layering, impact-based selection, and disciplined deletion of obsolete tests.

2. Why is my test suite so slow?

Most slow test suites are not slow because of the framework. They are slow because of suite drift: additive growth without pruning, a flat structure, and flake rates that train engineers to ignore failures. The fix is structural.

3. How do I fix flaky tests in Playwright?

Cluster flakes by root cause – network timing, stale selectors, order dependency, async races — then fix by pattern. In Playwright, replace waitForLoadState('networkidle') with element-level waitFor({ state: 'visible' }) and gate clicks on :not([disabled]).

4. What is a good flake rate for a Playwright test suite?

Below 2% is the target for a healthy Playwright suite. Above 5% means engineers stop trusting green builds. Above 10% means the suite is actively training the team to ignore real failures.

5. What is test impact analysis?

Test impact analysis maps code files to the tests that exercise them, so a pull request only triggers tests relevant to its diff. Playwright supports this natively via --only-changed. Well-tuned, it covers 80% of PRs in 4-6 minutes.
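A minimal invocation of that native support; the `main` ref is an assumption standing in for your default branch:

```shell
# Run only the test files affected by changes relative to main
npx playwright test --only-changed=main
```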

6. What percentage of tests should be in smoke vs regression?

A useful heuristic: smoke should be 3-8% of total tests and complete in under five minutes. Smoke blocks merges. Regression runs impact-based on PRs. Edge runs weekly or pre-release.

This post got you thinking? Share it and spark a conversation!