-
Notifications
You must be signed in to change notification settings - Fork 31
SONARPY-2696 Create rule S7469: Do not use duplicated column names #4874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
github-actions
wants to merge
3
commits into
master
Choose a base branch
from
rule/add-RSPEC-S7469
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
{ | ||
} |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
{ | ||
"title": "PySpark's \"DataFrame\" column names should be unique", | ||
"type": "CODE_SMELL", | ||
"status": "ready", | ||
"remediation": { | ||
"func": "Constant\/Issue", | ||
"constantCost": "5min" | ||
}, | ||
"tags": [ | ||
"data-science", | ||
"pyspark" | ||
], | ||
"defaultSeverity": "Major", | ||
"ruleSpecification": "RSPEC-7469", | ||
"sqKey": "S7469", | ||
"scope": "All", | ||
"defaultQualityProfiles": ["Sonar way"], | ||
"quickfix": "unknown", | ||
"code": { | ||
"impacts": { | ||
"MAINTAINABILITY": "MEDIUM", | ||
"RELIABILITY": "HIGH" | ||
}, | ||
"attribute": "CONVENTIONAL" | ||
} | ||
} |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,119 @@ | ||
This rule raises an issue if PySpark's data frames have duplicate column names. Both case-sensitive and case-insensitive duplicates are considered. | ||
|
||
== Why is this an issue? | ||
|
||
In PySpark, a `DataFrame` with duplicate column names can cause ambiguous and unexpected results with join, transformation, and data retrieval operations, all while making the code more confusing. For example: | ||
|
||
* Column selection becomes unpredictable: `df.select("name")` will raise an exception | ||
* Joins with other DataFrames may produce unexpected results or errors | ||
* Saving to external data sources may fail | ||
|
||
Case-insensitive duplicates, for example a column named "name" and "Name", are also flagged. This is because having column names that differ only in casing creates confusion when referencing columns and makes code harder to understand and maintain leading to subtle bugs that are difficult to detect and fix. | ||
|
||
== How to fix it | ||
To fix this issue, remove or rename the duplicate columns. | ||
|
||
=== Code examples | ||
|
||
==== Noncompliant code example | ||
|
||
[source,python,diff-id=1,diff-type=noncompliant] | ||
---- | ||
from pyspark.sql import SparkSession | ||
|
||
spark = SparkSession.builder.appName("Example").getOrCreate() | ||
|
||
data = [(1, "Alice", 28), (2, "Bob", 25)] | ||
df = spark.createDataFrame(data, ["id", "name", "name"]) # Noncompliant: "name" is duplicated | ||
---- | ||
|
||
==== Compliant solution | ||
|
||
[source,python,diff-id=1,diff-type=compliant] | ||
---- | ||
from pyspark.sql import SparkSession | ||
|
||
spark = SparkSession.builder.appName("Example").getOrCreate() | ||
|
||
data = [(1, "Alice", 28), (2, "Bob", 25)] | ||
df = spark.createDataFrame(data, ["id", "name", "age"]) # Compliant | ||
---- | ||
|
||
== Resources | ||
=== Documentation | ||
* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.createDataFrame.html[SparkSession.createDataFrame] | ||
|
||
ifdef::env-github,rspecator-view[] | ||
=== Implementation Specification | ||
|
||
At a minimum, this rule should raise when `SparkSession.createDataFrame(...)` is used with an array with duplicate string literals. | ||
The rule should check both if columns are case-sensitive duplicates (e.g. "name" and "name") and case-insensitive duplicates (e.g. "name" and "Name"), in order to report a different issue message. | ||
|
||
There are a few cases, where the rule can be expanded. | ||
|
||
* `SparkSession.createDataFrame(...)` is quite complex and there are a lot of ways to create a DataFrame with it | ||
** `SparkSession.createDataFrame(...)` with a dictionary (e.g. `SparkSession.createDataFrame([{"id": 2, "name": "Alice"}, {"id": 2, "name": "Bob"}])`) | ||
|
||
** `SparkSession.createDataFrame(...)` can be given a string definition of the schema (e.g. `SparkSession.createDataFrame([('Alice', 1)], "name: string, name: int")`) | ||
|
||
** `SparkSession.createDataFrame(...)` can be used with row objects (see below) | ||
** `SparkSession.createDataFrame(...)` can be used with a schema (see below) | ||
|
||
[source,python] | ||
---- | ||
Person = Row("name", "name") | ||
spark.createDataFrame([Person("Alice", 1), Person("Bob", 2)]) | ||
---- | ||
|
||
[source,python] | ||
---- | ||
data = ... | ||
schema = StructType([ | ||
StructField("name", StringType(), True), | ||
StructField("name", StringType(), True), | ||
StructField("age", IntegerType(), True)]) | ||
spark.createDataFrame(data, schema).show() | ||
---- | ||
|
||
Schemas can also be nested | ||
[source,python] | ||
---- | ||
nested_schema = StructType([ | ||
StructField("id", IntegerType(), True), | ||
StructField("nested", StructType([ | ||
StructField("field1", StringType(), True), | ||
StructField("field2", StringType(), True) | ||
]), True) | ||
]) | ||
---- | ||
|
||
In addition to that, parts can be passed as variable, instead of literals. This seems to be especially common for schemas. | ||
|
||
This rule could also apply to pandas `DataFrame`s, as well as the `DataFrame`s from Pandas API on Spark. However, this would increase the scope of an already big rule. Depending if the implementation raises on pandas (or pandas on spark) `DataFrame`s or not, the rule description should be updated to reflect this. Below are examples of how to construct a pandas `DataFrame` with duplicate column names. | ||
|
||
[source,python] | ||
---- | ||
import pandas as pd | ||
# the example below also work with pyspark.pandas | ||
import pyspark.pandas as ps | ||
|
||
pd.DataFrame(data=[1, 2], columns=["name", "name"]) # Noncompliant | ||
pd.DataFrame.from_dict(data={"row_1": [1, 2], "row_2": [3,4]}, orient="index", columns=["name", "name"]) # Noncompliant | ||
pd.DataFrame.from_records(data=[(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')], columns=['col', 'col']) # Noncompliant | ||
---- | ||
|
||
Documentation for best practices for pandas on spark: https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/best_practices.html#do-not-use-duplicated-column-names[Best Practices] | ||
|
||
=== Message | ||
|
||
If the a column is case-sensitive duplicate (e.g. exactly the same), the message should be the following: | ||
_Rename or remove the duplicate columns in the data frame._ | ||
|
||
If the a column is case-insensitive duplicate (e.g. the casing might differ), the message should be the following: | ||
|
||
_Rename or remove the case-insensitive duplicate columns in the data frame._ | ||
|
||
=== Highlighting | ||
|
||
The main location is the `createDataFrame` and the secondary location is the duplicate column names. | ||
endif::env-github,rspecator-view[] |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we maybe also raise as well on capitalized duplicates? Not sure if we can easily detect if the configuration
builder.config("spark.sql.caseSensitive", "true")
and not raise too many FPs.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could treat that as a separate rule. Having the same name, but with different capitalization, is confusing either way. But I guess for this rule, we probably should not raise in that caseI've updated the rule text slightly to reflect that we raise in both cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the case insensitive part I think you are right even with the configuration it is a bad practice.