2 changes: 2 additions & 0 deletions rules/S7472/metadata.json
@@ -0,0 +1,2 @@
{
}
25 changes: 25 additions & 0 deletions rules/S7472/python/metadata.json
@@ -0,0 +1,25 @@
{
"title": "Avoid the usage of \"unionAll()\" in favor of \"union()\"",
"type": "CODE_SMELL",
"status": "ready",
"remediation": {
"func": "Constant\/Issue",
"constantCost": "5min"
},
"tags": [
"data-science",
"pyspark"
],
"defaultSeverity": "Major",
"ruleSpecification": "RSPEC-7472",
"sqKey": "S7472",
"scope": "All",
"defaultQualityProfiles": ["Sonar way"],
"quickfix": "unknown",
Contributor:
I guess I can add a quick fix, but I don't know how to do it 😢 Do you know what I should write here?

Contributor:
So, depending on how we implement it, you can put one of the keywords listed here.
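For illustration only: assuming the schema accepts a keyword such as `covered` (the actual allowed values should be taken from the list linked above), a resolved field could look like this hypothetical fragment:

```json
"quickfix": "covered"
```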

"code": {
"impacts": {
"MAINTAINABILITY": "MEDIUM",
Contributor:
I think you have one extra `,` here...

},
"attribute": "CONVENTIONAL"
}
}
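Assuming the extra comma flagged above is the one after `"MEDIUM"`, the corrected `code` block would read as follows (JSON does not allow trailing commas):

```json
"code": {
  "impacts": {
    "MAINTAINABILITY": "MEDIUM"
  },
  "attribute": "CONVENTIONAL"
}
```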
52 changes: 52 additions & 0 deletions rules/S7472/python/rule.adoc
@@ -0,0 +1,52 @@
This rule raises an issue when `pyspark.sql.DataFrame.unionAll` is used instead of `pyspark.sql.DataFrame.union`.


== Why is this an issue?

When using PySpark, avoid the `unionAll()` method in favor of the `union()` method when combining two DataFrames.
The `unionAll()` method is an alias for `union()` and provides exactly the same functionality.
Using `union()` ensures compatibility with future versions of PySpark and aligns with current best practices.

=== Code examples

==== Noncompliant code example

[source,python,diff-id=1,diff-type=noncompliant]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()

data1 = [(1, "Alice"), (2, "Bob")]
data2 = [(3, "Cathy"), (4, "David")]

df1 = spark.createDataFrame(data1, ["id", "name"])
df2 = spark.createDataFrame(data2, ["id", "name"])

combined_df = df1.unionAll(df2) # Noncompliant: unionAll() should be replaced by union()
----

==== Compliant solution

[source,python,diff-id=1,diff-type=compliant]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()

data1 = [(1, "Alice"), (2, "Bob")]
data2 = [(3, "Cathy"), (4, "David")]

df1 = spark.createDataFrame(data1, ["id", "name"])
df2 = spark.createDataFrame(data2, ["id", "name"])

combined_df = df1.union(df2)
----
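
Note that `union()` has the same semantics as `unionAll()`: like SQL `UNION ALL`, it does not remove duplicate rows. The sketch below (illustrative only, not part of the rule examples) shows how to deduplicate explicitly when that is the intent:

[source,python]
----
# union() concatenates rows without deduplication (SQL UNION ALL semantics).
duplicated_df = df1.union(df1)              # 4 rows: every row of df1 appears twice
deduplicated_df = duplicated_df.distinct()  # back to the 2 distinct rows
----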

== Resources
=== Documentation

* MungingData post - https://www.mungingdata.com/pyspark/union-unionbyname-merge-dataframes/[Combining PySpark DataFrames]
* Spark By Examples - https://sparkbyexamples.com/pyspark/pyspark-union-and-unionall/[union and unionAll]
* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.unionAll.html[pyspark.sql.DataFrame.unionAll]
