Quick Summary
A fintech client’s Azure Synapse Analytics pipeline was taking over four hours to process daily transaction aggregations, well past a three-hour SLA. Our data engineering team traced the root cause to severe Spark data skew: one partition was handling 78% of rows, forcing the entire cluster to wait on a single executor. By applying salting and repartitioning, we cut the pipeline runtime from 4.2 hours to under 90 minutes without any infrastructure changes.
Our data engineering team at ScriptsHub Technologies was brought in to stabilize a failing data pipeline for a fintech client processing daily transaction records in Azure Synapse Analytics. The pipeline had worked reliably at modest volumes, but as transaction counts scaled past 50 million rows per day, daily aggregation jobs began running past their three-hour SLA window. The failure pattern at that scale pointed to something deeper than a resource shortage.
The client’s operations team had already tried standard remedies: upgrading the Synapse dedicated SQL pool, increasing node count, and rewriting the SQL aggregation queries. None of it made a meaningful difference. Every morning, the pipeline delayed the downstream reporting layer and held up finance teams waiting on daily summaries. That pattern of hardware upgrades producing no improvement is almost always a sign of data skew, not insufficient compute. While this engagement was on Azure Synapse, the salting technique we applied works identically on Databricks, EMR, or any standalone PySpark cluster.
The Problem: A Pipeline Breaching Its SLA Every Morning
The pipeline processed a single large daily transaction file, joined it against a merchant reference table, and produced a daily aggregation grouped by transaction_date and merchant_category. On paper, the logic was straightforward. In practice, the job was missing its SLA by more than an hour every single day.
Synapse dedicated SQL pool metrics looked normal across the board. CPU utilization was reasonable, memory stayed within limits, and network I/O was unremarkable. The partition imbalance was not visible at the infrastructure layer. Instead, the bottleneck was buried inside the Spark execution plan, invisible to standard monitoring dashboards.
The Diagnosis: Understanding Spark Data Skew in Synapse
Spark data skew is the uneven distribution of rows across partitions in a distributed Spark workload, where one executor handles a disproportionate share of data while others sit idle. A Spark partition is a logical chunk of data that a single executor thread processes independently. When partitions are deeply imbalanced, most of the cluster idles while one overloaded task runs to completion. The job takes as long as its slowest partition, regardless of how many nodes are provisioned.
Our team pulled the Synapse Spark monitoring logs and found the tell-tale sign within minutes. One executor task in the aggregation stage was consuming 87% of the total job duration. The remaining 47 tasks finished in under four minutes. The pipeline was not universally slow. One task was choking, and the rest of the cluster waited for it.
The root cause was the composite grouping key: transaction_date combined with merchant_category. The client’s transaction data had a heavy concentration in one merchant category (e-commerce), so every daily run funneled the majority of rows into a single executor. The shuffle operation during the GROUP BY aggregation amplified this imbalance further, writing most of its shuffle data to one reducer.
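A quick way to confirm this kind of concentration is to count rows per grouping key before the aggregation runs. A minimal PySpark sketch, assuming a DataFrame named transactions with the columns described above (names are illustrative, not the client's schema):

```python
from pyspark.sql import functions as F

# Count rows per merchant_category and express each category's share of the total.
total_rows = transactions.count()

key_distribution = (
    transactions
    .groupBy("merchant_category")
    .count()
    .withColumn("pct_of_total",
                F.round(F.col("count") / F.lit(total_rows) * 100, 2))
    .orderBy(F.desc("count"))
)

key_distribution.show(truncate=False)
# A single category owning ~78% of the rows is the skew signature described above.
```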
Evaluating the Fix: Repartition vs Salting for Spark Data Skew
Two primary approaches exist for correcting skewed data distribution, and each suits a different severity level. Repartitioning redistributes rows evenly across partitions via a full shuffle and works well for moderate imbalance. Key salting splits a single dominant key into multiple temporary sub-keys and is the stronger remedy when one key accounts for the bulk of the data.

With Spark 3.2 and later, Adaptive Query Execution (AQE) is enabled by default and handles moderate join skew automatically by splitting oversized shuffle partitions, but it does not rebalance a single dominant GROUP BY key. For severe cases like 78% of rows concentrated in one key, manual salting remains the most reliable fix.
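For reference, AQE and its skewed-join handling are controlled by standard Spark configuration flags. A brief sketch using the stock default thresholds (not the client's settings):

```python
# Adaptive Query Execution and skew-join handling are on by default from Spark 3.2.
# They split oversized shuffle partitions during sort-merge joins, but they do not
# rebalance a single dominant GROUP BY key.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# A partition is treated as skewed when it is both skewedPartitionFactor times larger
# than the median partition size and above the byte threshold.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```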
Our diagnostic showed 78% of rows landing in the e-commerce merchant_category bucket. With Spark data skew that severe, simple redistribution alone would not resolve the imbalance. We needed salting for the GROUP BY aggregation, combined with a rebalancing step before the join. Salting is the standard remedy when a single key accounts for a disproportionate share of the data in a wide transformation.
The Fix
We restructured the pipeline in two stages: first rebalancing before the join, then salting the skewed aggregation key.
Stage 1: Rebalance before the join
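A minimal sketch of this stage. The DataFrame names, Delta paths, join key (merchant_id), and partition count are assumptions drawn from the pipeline description rather than the client's code; the broadcast hint assumes the merchant reference table is small:

```python
from pyspark.sql import functions as F

# Read the daily transaction file and the merchant reference table (Delta format).
# Paths are placeholders.
transactions = spark.read.format("delta").load("/delta/transactions/daily")
merchants = spark.read.format("delta").load("/delta/reference/merchants")

# Repartition on a high-cardinality column BEFORE the join so rows are spread
# evenly across executors instead of inheriting the skewed merchant_category
# distribution. 200 partitions is an illustrative value.
transactions_balanced = transactions.repartition(200, "transaction_id")

# Broadcasting the small reference table keeps the join from shuffling the
# large side again.
enriched = transactions_balanced.join(
    F.broadcast(merchants), on="merchant_id", how="left"
)
```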
Why this works: Repartitioning on transaction_id distributes rows by a high-cardinality column with millions of unique values. This prevents the subsequent join from inheriting the skewed merchant_category distribution, ensuring shuffle data is spread evenly across all executors before the expensive join operation begins. Both input and output datasets use Delta Lake format for ACID transactions and time-travel capability.
Stage 2: Salt the skewed aggregation key
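Continuing from the Stage 1 sketch, a minimal version of the two-pass salted aggregation. SALT_FACTOR, the transaction_amount column, and the output path are assumptions consistent with the explanation that follows:

```python
from pyspark.sql import functions as F

SALT_FACTOR = 10  # number of sub-keys the dominant merchant_category is split into

# Pass 1: append a random 0-9 suffix to the skewed key (e.g. ecom_0 ... ecom_9)
# and compute partial aggregates on the salted key, spreading the e-commerce rows
# across SALT_FACTOR independent tasks.
salted = enriched.withColumn(
    "salted_category",
    F.concat_ws("_", F.col("merchant_category"),
                (F.rand() * SALT_FACTOR).cast("int").cast("string")),
)

partials = (
    salted
    .groupBy("transaction_date", "salted_category")
    .agg(
        F.sum("transaction_amount").alias("partial_amount"),
        F.count("*").alias("partial_count"),
    )
)

# Pass 2: strip the salt suffix and re-aggregate the partial results. Because
# addition is associative, the totals match a single-pass GROUP BY.
final = (
    partials
    .withColumn("merchant_category",
                F.regexp_replace("salted_category", "_\\d+$", ""))
    .groupBy("transaction_date", "merchant_category")
    .agg(
        F.sum("partial_amount").alias("total_amount"),
        F.sum("partial_count").alias("transaction_count"),
    )
)

# coalesce(8) keeps the output file count manageable for the downstream SQL pool
# without triggering another full shuffle.
(final.coalesce(8)
      .write.format("delta")
      .mode("overwrite")
      .save("/delta/aggregations/daily_by_category"))
```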
Why this works: The salt suffix splits the dominant e-commerce key into 10 independent partitions (ecom_0 through ecom_9), each handling roughly 7.8% of total rows instead of 78%. The first GROUP BY aggregates these salted keys in parallel across the full cluster. The second pass strips the suffix and sums the partial results back together. Because addition is associative, the final totals are mathematically identical to a single-pass GROUP BY.
The coalesce(8) at the write stage keeps the output file count manageable for the downstream SQL pool reading these Delta files, without triggering another expensive shuffle.
Inside the Fix: How Two-Pass Salting Rebalances Skewed Partitions
Two-pass salting works by breaking a skewed partition key into multiple temporary sub-keys, aggregating each sub-key independently, and then recombining the partial results into correct group-level totals. The technique eliminates the bottleneck without changing the final output. Understanding the data flow through each stage clarifies why this approach succeeds where repartition alone cannot.
The diagram below traces how transaction rows move through the pipeline after salting is applied. In the first stage, a random integer suffix (0 through 9) is appended to the skewed merchant_category key, distributing the 78% e-commerce concentration across 10 separate partitions. The first aggregation pass computes partial sums and counts on these salted keys. In the second pass, the salt suffix is stripped and the partial results are summed again to produce the true group totals. The final output matches a standard GROUP BY exactly, but the work is spread across the full cluster instead of bottlenecking on a single executor.
Data flow through the two-pass salting pipeline – from skewed input (78% in one partition) through salted aggregation to balanced output (82-minute runtime, zero data loss).
The critical insight is that salting does not change what the aggregation computes. The second pass reaggregation produces identical totals to a single-pass GROUP BY because addition is associative. Splitting a sum into partial sums across salted buckets and then summing those partials yields the same result. The only difference is that no single executor is forced to process 78% of the dataset alone, which is precisely the Spark data skew condition that caused the pipeline bottleneck in the first place.
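A tiny numeric illustration of that argument, with made-up values:

```python
# Splitting a sum into salted partial sums and summing the partials again
# yields the same total as summing everything in one pass.
rows = [120.0, 75.5, 310.2, 48.3, 99.0, 210.7]

single_pass_total = sum(rows)

# Spread the same rows across three "salt buckets" and aggregate each first.
buckets = [rows[0:2], rows[2:4], rows[4:6]]
partial_sums = [sum(bucket) for bucket in buckets]
two_pass_total = sum(partial_sums)

# Identical up to floating-point rounding, which is why the validation below
# uses a tiny percentage tolerance rather than exact equality.
assert abs(single_pass_total - two_pass_total) < 1e-9
```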
Validation
Our team ran the rewritten pipeline against three historical dates with known output values to confirm correctness. The pipeline produced total transaction amounts and row counts within a 0.001% tolerance of the original output, a difference attributable to floating-point rounding from summing partial results in a different order. We also tested it on the highest-volume trading day to confirm the salting logic performed reliably under peak load.
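A sketch of the kind of correctness check we ran against a historical date, assuming the original and rewritten pipelines each wrote a Delta output that can be joined on the grouping keys (paths and column names are illustrative):

```python
from pyspark.sql import functions as F

# Output of the original single-pass pipeline for a known historical date.
baseline = spark.read.format("delta").load("/delta/validation/baseline_output")
# Output of the rewritten, salted pipeline for the same date.
candidate = spark.read.format("delta").load("/delta/aggregations/daily_by_category")

comparison = (
    baseline.alias("b")
    .join(candidate.alias("c"), on=["transaction_date", "merchant_category"])
    .withColumn(
        "amount_diff_pct",
        F.abs(F.col("b.total_amount") - F.col("c.total_amount"))
        / F.col("b.total_amount") * 100,
    )
)

# Every group should agree within the 0.001% tolerance described above.
assert comparison.filter(F.col("amount_diff_pct") > 0.001).count() == 0
```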
Synapse Spark monitoring confirmed that the previously dominant executor task had dropped from 87% of total stage duration to 11%. Load was now spread across 40+ tasks finishing within a six-minute window of each other. Shuffle write volume dropped by 34%, because evenly distributed data produces more uniform shuffle blocks.
If your Synapse or Databricks pipelines are showing similar symptoms – hardware upgrades producing no improvement, one stage dominating the entire job – our data engineering team can diagnose the root cause. We also offer AI-powered pipeline diagnostics for teams looking to automate bottleneck detection. Book a free pipeline diagnostic at scriptshub.net and we’ll pinpoint whether data skew, distribution strategy, or something else is holding you back.
The Outcome
The daily aggregation pipeline runtime fell from an average of 4.2 hours to 82 minutes, a reduction of roughly 67%. Resolving Spark data skew helped the pipeline consistently meet the three-hour SLA, with most runs finishing in under 90 minutes. The finance team eliminated morning reporting delays, and the client reduced Synapse compute costs by avoiding unnecessary cluster runtime.
In our experience across multiple client engagements, uneven data distribution in Spark workloads accounts for more than half of all Azure Synapse performance complaints where hardware upgrades failed to produce improvement. The key takeaway is that throwing more nodes at a skewed workload does not solve the problem. Additional compute gives you more idle cores waiting for one bad task to finish.
Conclusion
The key takeaway is that Azure Synapse slow query issues rooted in Spark data skew are largely invisible to infrastructure-level diagnostics. Adding nodes, upgrading pools, and rewriting SQL does nothing when one overloaded task holds back the entire cluster. The outcome was unambiguous: diagnosing through Spark monitoring logs, quantifying skew with row counts, and applying two-pass salted aggregation cut a 4.2-hour pipeline down to 82 minutes with zero infrastructure changes. In short, fixing the data distribution solved what no amount of hardware could.
ScriptsHub Technologies works with data engineering teams across the US, UK, and India to stabilize and optimize Synapse, Databricks, and PySpark pipelines. If your daily pipelines are consistently breaching SLAs or your hardware upgrades are not moving the needle, book a free consultation at scriptshub.net and our team will run a diagnostic on your current pipeline architecture. Follow ScriptsHub Technologies on LinkedIn for weekly practical content on data engineering, Power BI, and cloud analytics.
Frequently Asked Questions
Q. What causes Azure Synapse Analytics to run slow even with more nodes?
Spark data skew forces one overloaded executor to bottleneck the entire job. Additional nodes only add idle resources that cannot assist the single slow task.
Q. How do I check if my PySpark job has data skew?
In the Spark UI, open the slow stage and sort tasks by duration. If one task takes significantly longer than the rest, run df.rdd.glom().map(len).collect() to confirm the row-count imbalance across partitions.
Q. What is Spark data skew and how do I identify it?
Spark data skew is uneven row distribution across partitions. In Synapse Spark monitoring, one task consuming most of a stage’s runtime confirms skew.
Q. How does key salting fix data skew in PySpark aggregations?
Key salting appends a random integer to the skewed GROUP BY key, spreading rows across multiple partitions. A second pass removes the salt and combines partial results.
Q. When should I use repartition vs coalesce in PySpark?
Use repartition before wide transformations to redistribute data evenly via a full shuffle. Use coalesce after transformations to reduce output file count without reshuffling.
Q. Does two-pass salting change the aggregation results?
No. The second aggregation pass removes the salt suffix and sums the partial results, producing totals identical to a standard GROUP BY within negligible floating-point rounding tolerance.
Q. How many salt buckets should I use for severe data skew?
Start with a SALT_FACTOR of 10. Increase it if the skewed task still dominates after the first pass; decrease it if the second-pass reaggregation adds unnecessary overhead.