Commit c675ef1

Create rule S7470: PySpark's \"RDD.groupByKey\", when used in conjunction with \"RDD.mapValues\" with a commutative and associative operation, should be replaced by \"RDD.reduceByKey\"
1 parent b36c03c commit c675ef1

2 files changed (+57 −23 lines)


Diff for: rules/S7470/python/metadata.json

+5 −5

@@ -1,12 +1,14 @@
 {
-  "title": "FIXME",
+  "title": "PySpark's \"RDD.groupByKey\", when used in conjunction with \"RDD.mapValues\" with a commutative and associative operation, should be replaced by \"RDD.reduceByKey\"",
   "type": "CODE_SMELL",
   "status": "ready",
   "remediation": {
     "func": "Constant\/Issue",
     "constantCost": "5min"
   },
   "tags": [
+    "data-science",
+    "pyspark"
   ],
   "defaultSeverity": "Major",
   "ruleSpecification": "RSPEC-7470",
@@ -16,10 +18,8 @@
   "quickfix": "unknown",
   "code": {
     "impacts": {
-      "MAINTAINABILITY": "HIGH",
-      "RELIABILITY": "MEDIUM",
-      "SECURITY": "LOW"
+      "MAINTAINABILITY": "MEDIUM"
     },
-    "attribute": "CONVENTIONAL"
+    "attribute": "EFFICIENT"
   }
 }

Diff for: rules/S7470/python/rule.adoc

+52 −18

@@ -1,44 +1,78 @@
-FIXME: add a description
+This rule raises an issue when `RDD.groupByKey` is used in conjunction with `RDD.mapValues`
+and a commutative and associative function instead of `RDD.reduceByKey`.
 
-// If you want to factorize the description uncomment the following line and create the file.
-//include::../description.adoc[]
 
 == Why is this an issue?
 
-FIXME: remove the unused optional headers (that are commented out)
+The PySpark API offers multiple ways of performing aggregation.
+When performing aggregations, data is usually shuffled between partitions;
+this shuffling and its associated cost are needed to compute the result correctly.
+
+There are, however, cases where some aggregation methods are more efficient than others.
+For example, when `RDD.groupByKey` is used in conjunction with `RDD.mapValues` and the function passed to `RDD.mapValues`
+is commutative and associative, it is preferable to use `RDD.reduceByKey` instead.
+The performance gain from `RDD.reduceByKey` comes from the reduced amount of data that needs to be moved between PySpark tasks.
+`RDD.reduceByKey` will reduce the number of rows in a partition before sending the data over the network for further reduction.
+On the other hand, when using `RDD.groupByKey` with `RDD.mapValues`, the reduction is only done
+after the data has been moved around the cluster, slowing down
+the computation by transferring a larger amount of data over the network.
 
-//=== What is the potential impact?
 
 == How to fix it
-//== How to fix it in FRAMEWORK NAME
+
+To fix this issue, replace the calls to `RDD.groupByKey` and `RDD.mapValues` with a single call to `RDD.reduceByKey`.
 
 === Code examples
 
 ==== Noncompliant code example
 
 [source,python,diff-id=1,diff-type=noncompliant]
 ----
-FIXME
+from pyspark import SparkContext
+
+sc = SparkContext("local", "Example")
+rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2), ("b", 3)])
+result = rdd.groupByKey().mapValues(lambda values: sum(values)).collect() # Noncompliant: an associative and commutative operation is used with `groupByKey` and `mapValues`
 ----
 
 ==== Compliant solution
 
 [source,python,diff-id=1,diff-type=compliant]
 ----
-FIXME
+from pyspark import SparkContext
+
+sc = SparkContext("local", "Example")
+rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2), ("b", 3)])
+result = rdd.reduceByKey(lambda x, y: x + y).collect() # Compliant
 ----
 
-//=== How does this work?
+== Resources
+=== Documentation
+
+* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.reduceByKey.html#pyspark.RDD.reduceByKey[pyspark.RDD.reduceByKey]
+* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.groupByKey.html#pyspark.RDD.groupByKey[pyspark.RDD.groupByKey]
+* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.mapValues.html#pyspark.RDD.mapValues[pyspark.RDD.mapValues]
+
+=== Articles & blog posts
+
+* Spark By Examples - https://sparkbyexamples.com/spark/spark-groupbykey-vs-reducebykey/[Spark groupByKey() vs reduceByKey()]
+
+ifdef::env-github,rspecator-view[]
+=== Implementation Specification
+
+As a first implementation, we should focus on simple operations: sum and math.prod.
+
+=== Message
+
+Replace the usage of "RDD.groupByKey" and "RDD.mapValues" with "RDD.reduceByKey".
+
+=== Highlighting
+
+The main location is the `groupByKey` call and the secondary location is the `mapValues` call.
 
-//=== Pitfalls
+=== Quickfix
 
-//=== Going the extra mile
+N/A, as we cannot easily convert the function passed to mapValues into a function passed to reduceByKey.
 
+endif::env-github,rspecator-view[]
 
-//== Resources
-//=== Documentation
-//=== Articles & blog posts
-//=== Conference presentations
-//=== Standards
-//=== External coding guidelines
-//=== Benchmarks
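
The implementation specification above names sum and math.prod as the first operations to detect. A minimal sketch of the same rewrite for a product aggregation, assuming pyspark is installed, no SparkContext is already running, and with illustrative data and variable names:

[source,python]
----
import math
from operator import mul

from pyspark import SparkContext

sc = SparkContext("local", "Example")
rdd = sc.parallelize([("a", 2), ("b", 3), ("a", 4), ("b", 5)])

# Pattern the rule would flag: every value for a key is shuffled to a single
# task before math.prod reduces the grouped values.
grouped = rdd.groupByKey().mapValues(lambda values: math.prod(values)).collect()

# Preferred pattern: multiplication is commutative and associative, so
# reduceByKey can combine values inside each partition before the shuffle.
reduced = rdd.reduceByKey(mul).collect()

assert sorted(grouped) == sorted(reduced)  # both yield [("a", 8), ("b", 15)]
----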
