This rule raises an issue when `RDD.groupByKey` is used in conjunction with `RDD.mapValues`
and a commutative and associative function instead of `RDD.reduceByKey`.
== Why is this an issue?
The PySpark API offers multiple ways of performing aggregations.
When aggregating, data is usually shuffled between partitions. This shuffling
and its associated cost are, of course, needed to compute the result correctly.

There are, however, cases where some aggregation methods are more efficient than others.
For example, when `RDD.groupByKey` is used in conjunction with `RDD.mapValues` and the function passed to `RDD.mapValues`
is commutative and associative, it is preferable to use `RDD.reduceByKey` instead.
The performance gain of `RDD.reduceByKey` comes from the amount of data that needs to be moved between PySpark tasks:
`RDD.reduceByKey` reduces the number of rows in each partition before sending the data over the network for further reduction.
On the other hand, when `RDD.groupByKey` is used with `RDD.mapValues`, the reduction only happens
after the data has been moved around the cluster, which slows down
the computation by transferring a larger amount of data over the network.
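
The following is a minimal, illustrative sketch (the application name, sample data, and partition count are arbitrary choices, not part of the rule) showing that both approaches compute the same totals, while only `RDD.reduceByKey` combines values inside each partition before the shuffle:

[source,python]
----
from pyspark import SparkContext

sc = SparkContext("local", "ShuffleComparison")
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2), ("b", 3)], numSlices=2)

# groupByKey ships every (key, value) pair across the network;
# the summing happens only after the shuffle.
grouped = pairs.groupByKey().mapValues(sum).collect()

# reduceByKey first sums the values locally within each partition,
# so only one partial sum per key and partition is shuffled.
reduced = pairs.reduceByKey(lambda x, y: x + y).collect()

assert sorted(grouped) == sorted(reduced)  # same result, less data moved
----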
== How to fix it
To fix this issue, replace the calls to `RDD.groupByKey` and `RDD.mapValues` with a single call to `RDD.reduceByKey`.

=== Code examples

==== Noncompliant code example

[source,python,diff-id=1,diff-type=noncompliant]
----
from pyspark import SparkContext

sc = SparkContext("local", "Example")
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2), ("b", 3)])
result = rdd.groupByKey().mapValues(lambda values: sum(values)).collect() # Noncompliant: an associative and commutative operation is used with `groupByKey` and `mapValues`
----

==== Compliant solution

[source,python,diff-id=1,diff-type=compliant]
----
from pyspark import SparkContext

sc = SparkContext("local", "Example")
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2), ("b", 3)])
result = rdd.reduceByKey(lambda x, y: x + y).collect() # Compliant
----
== Resources

=== Documentation

* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.reduceByKey.html#pyspark.RDD.reduceByKey[pyspark.RDD.reduceByKey]
* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.groupByKey.html#pyspark.RDD.groupByKey[pyspark.RDD.groupByKey]
* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.mapValues.html#pyspark.RDD.mapValues[pyspark.RDD.mapValues]

=== Articles & blog posts

* Spark By Examples - https://sparkbyexamples.com/spark/spark-groupbykey-vs-reducebykey/[Spark groupByKey() vs reduceByKey()]

ifdef::env-github,rspecator-view[]
=== Implementation Specification

As a first implementation, we should focus on simple operations: `sum` and `math.prod`.
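
A sketch of the cases this first implementation would flag (the RDD contents and application name are arbitrary sample data):

[source,python]
----
import math
from pyspark import SparkContext

sc = SparkContext("local", "Patterns")
rdd = sc.parallelize([("a", 2), ("a", 3), ("b", 4)])

rdd.groupByKey().mapValues(sum).collect()        # flag: sum is commutative and associative
rdd.groupByKey().mapValues(math.prod).collect()  # flag: math.prod is commutative and associative
----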

=== Message

Replace the usage of "RDD.groupByKey" and "RDD.mapValues" with "RDD.reduceByKey".

=== Highlighting

The main location is the `groupByKey` call; the secondary location is the `mapValues` call.

=== Quickfix

N/A, as we cannot easily convert the function passed to `mapValues` into a function suitable for `reduceByKey`.
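
An illustration of why the rewrite is not mechanical (the per-key mean is a made-up example, not from the rule):

[source,python]
----
from pyspark import SparkContext

sc = SparkContext("local", "QuickfixLimits")
rdd = sc.parallelize([("a", 1), ("a", 3)])

# The function passed to mapValues receives the complete iterable of
# values for a key, so it can compute aggregates such as a mean...
means = rdd.groupByKey().mapValues(lambda vs: sum(vs) / len(vs)).collect()

# ...whereas the function passed to reduceByKey only ever combines two
# values at a time, so the mean has no direct drop-in reduction.
----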

endif::env-github,rspecator-view[]