2 changes: 2 additions & 0 deletions rules/S7472/metadata.json
@@ -0,0 +1,2 @@
{
}
25 changes: 25 additions & 0 deletions rules/S7472/python/metadata.json
@@ -0,0 +1,25 @@
{
"title": "Avoid the usage of \"unionAll()\" in favor of \"union()\"",
"type": "CODE_SMELL",
"status": "ready",
"remediation": {
"func": "Constant\/Issue",
"constantCost": "5min"
},
"tags": [
"data-science",
"pyspark"
],
"defaultSeverity": "Major",
"ruleSpecification": "RSPEC-7472",
"sqKey": "S7472",
"scope": "All",
"defaultQualityProfiles": ["Sonar way"],
"quickfix": "unknown",
Contributor:
I guess I can add a quick fix but I don't know how to do it 😢 Do you know what I should write here?

Contributor:
Depending on how we can implement it, you can use one of the keywords listed here.

"code": {
"impacts": {
"MAINTAINABILITY": "MEDIUM",
Contributor:
I think you have one extra , here...

},
"attribute": "CONVENTIONAL"
}
}
51 changes: 51 additions & 0 deletions rules/S7472/python/rule.adoc
@@ -0,0 +1,51 @@
This rule raises an issue when `pyspark.sql.DataFrame.unionAll` is used instead of `pyspark.sql.DataFrame.union`.


== Why is this an issue?

When using PySpark, it is recommended to avoid the `unionAll()` method and instead use `union()` when combining two DataFrames. The `unionAll()` method is deprecated and has been replaced by `union()`, which provides the same functionality. Using `union()` ensures compatibility with future versions of PySpark and aligns with current best practices.
Contributor:
For the sake of consistency, I would maybe put all occurrences of `unionAll()` and `union()` between backticks. WDYT?


=== Code examples

==== Noncompliant code example

[source,python,diff-id=1,diff-type=noncompliant]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()

data1 = [(1, "Alice"), (2, "Bob")]
data2 = [(3, "Cathy"), (4, "David")]

df1 = spark.createDataFrame(data1, ["id", "name"])
df2 = spark.createDataFrame(data2, ["id", "name"])

# Noncompliant: unionAll() is deprecated
Contributor:
Here I think it would be best to put it on the same line as the issue raised. This way the diff view contains less changes on SQS.

combined_df = df1.unionAll(df2)
----

==== Compliant solution

[source,python,diff-id=1,diff-type=compliant]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()

data1 = [(1, "Alice"), (2, "Bob")]
data2 = [(3, "Cathy"), (4, "David")]

df1 = spark.createDataFrame(data1, ["id", "name"])
df2 = spark.createDataFrame(data2, ["id", "name"])

combined_df = df1.union(df2)
----
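One subtlety that may be worth calling out in the rule description (this note is an editorial addition, not part of the committed text): despite its name, `union()` keeps duplicate rows, exactly as the deprecated `unionAll()` did; it matches SQL `UNION ALL`, not SQL `UNION`. A minimal sketch, reusing the session setup from the examples above:

[source,python]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()

df = spark.createDataFrame([(1, "Alice")], ["id", "name"])

# union() preserves duplicates, just like the deprecated unionAll()
combined = df.union(df)
assert combined.count() == 2

# apply distinct() afterwards to get SQL UNION (deduplicated) semantics
assert combined.distinct().count() == 1
----

So switching from `unionAll()` to `union()` never changes the result: the rename is purely cosmetic, which is why the quick fix is a safe drop-in replacement.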

== Resources
=== Documentation

* Mungingdata post - https://www.mungingdata.com/pyspark/union-unionbyname-merge-dataframes/[Combining PySpark DataFrames]
* Spark by examples - https://sparkbyexamples.com/pyspark/pyspark-union-and-unionall/[union and unionAll]
* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.unionAll.html[pyspark.sql.DataFrame.unionAll]
