SONARPY-2696 Create rule S7469: Do not use duplicated column names #4874
base: master
Conversation
Force-pushed from 209c385 to 6e397d6
Force-pushed from 6e397d6 to be53d3a
Great first rule! The description is clear and concise. I added a few comments, as I think we can make a few improvements to make it clearer.
rules/S7469/python/metadata.json (outdated)
    },
    "tags": [
        "pyspark",
        "dataframe",
Are you sure about these tags? Did you check with Jean already? Normally we should only have something about the library name and the domain.
TBH, no. I copied them from the rule proposal ticket. I've changed them to just the domain and library for now
rules/S7469/python/metadata.json (outdated)
@@ -0,0 +1,28 @@
{
    "title": "Column names should be unique",
I think it would be a good idea to specify in which context this rule applies. Here I would not know what "Column" refers to. Maybe adding "PySpark's DataFrame Column names" would be clearer.
rules/S7469/python/rule.adoc (outdated)
@@ -0,0 +1,99 @@
This rule raises an issue if a data frame has duplicate column names.
Same comment as for the title of the rule: maybe add "PySpark's DataFrame" so there is no confusion about what the rule will do (as it will not raise on pandas DataFrames, for example).
rules/S7469/python/rule.adoc (outdated)
In PySpark, a `DataFrame` with duplicate column names can cause ambiguous and unexpected results with join, transformation, and data retrieval operations. For example:

* Column selection becomes unpredictable: `df.select("name")` will raise an exception
* Joins with other DataFrames may produce unexpected results or errors
* Saving to external data sources may fail

This rule ensures that a `DataFrame` is not constructed with duplicated column names.
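The failure mode described above comes down to name-based lookup becoming ambiguous once two columns share a name. A minimal plain-Python sketch of the kind of duplicate check such a rule would perform (the helper name is hypothetical, not part of the actual implementation, which would inspect `createDataFrame()` call sites):

```python
from collections import Counter

def duplicated_columns(columns):
    """Return the column names that appear more than once.

    Hypothetical helper sketching the check this rule performs on
    the column names passed to a DataFrame constructor.
    """
    counts = Counter(columns)
    return sorted(name for name, count in counts.items() if count > 1)

# A DataFrame built with these columns would trigger the rule:
print(duplicated_columns(["id", "name", "name", "age"]))  # ['name']
print(duplicated_columns(["id", "name", "age"]))          # []
```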
This is really good! Concise and clear.
rules/S7469/python/rule.adoc (outdated)

== Resources
=== Documentation
- PySpark Documentation - https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/best_practices.html#do-not-use-duplicated-column-names[Best Practices]
This might get a bit confusing to show documentation about pandas on spark, which is not mentioned in the rule at all. Maybe we could add a small sentence in the description saying that we will raise on ps.createDataFrame and on spark.createDataFrame. WDYT?
Oh, I didn't realize that the examples import pandas. But we probably should raise on both methods anyway.
On second thought, this would blow up the rule implementation quite a bit and also complicate the rule description. Unfortunately, I didn't find any other documentation that mentions this specifically. Because of this, I would be tempted to leave it there, since the general sentiment holds true regardless of pandas or PySpark.
So the Best Practices article says: "It is disallowed to use duplicated column names because Spark SQL does not allow this in general. Pandas API on Spark inherits this behavior."
And for me it means we can create a rule that is sort of generic, stating that a DataFrame should not have duplicated names, and just specify that we will raise on pandas-on-Spark DataFrames as well. I don't think we need to go into too much detail about pandas, but we will raise on spark.createDataFrame and on spark.pandas.createDataFrame, because the latter inherits the behavior of the former. So for me this resource makes total sense here.
rules/S7469/python/rule.adoc (outdated)

There are a few cases where the rule can be expanded.

* `createDataFrame(...)` is quite complex and there are a lot of ways to create a DataFrame with it
Here, similar to the comment above, I think we should mention both ps and spark DataFrame creation.
* Saving to external data sources may fail

This rule ensures that a `DataFrame` is not constructed with duplicated column names.
Should we maybe also raise on capitalized duplicates? Not sure if we can easily detect the configuration builder.config("spark.sql.caseSensitive", "true") and not raise too many FPs.
We could treat that as a separate rule. Having the same name but with different capitalization is confusing either way. But I guess for this rule, we probably should not raise in that case.
I've updated the rule text slightly to reflect that we raise in both cases.
I like the case-insensitive part. I think you are right: even with the configuration, it is a bad practice.
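Since Spark SQL treats column names case-insensitively by default (the spark.sql.caseSensitive setting is false unless changed), a check along the lines discussed here would fold names to lower case before counting. A hypothetical plain-Python sketch of that variant:

```python
from collections import Counter

def duplicated_columns_case_insensitive(columns):
    """Return lower-cased column names that collide when case is ignored.

    Hypothetical sketch: with Spark SQL's default case-insensitive
    resolution, "Name" and "name" refer to the same column, so both
    count as one duplicated name here.
    """
    counts = Counter(name.lower() for name in columns)
    return sorted(name for name, count in counts.items() if count > 1)

# "Name" and "name" collide under the default configuration:
print(duplicated_columns_case_insensitive(["id", "Name", "name"]))  # ['name']
print(duplicated_columns_case_insensitive(["id", "name", "age"]))   # []
```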
Force-pushed from a226d74 to ae23ade
I have just added a small comment regarding spark.pandas, which should be mentioned in the description of the rule.
Force-pushed from 9b92182 to 3c56e8f
I think I would go the other way around: leave the rule a bit more specialized to PySpark and add text saying this rule will also raise on pandas-on-Spark DataFrames. If we ever need to make it more generic because we want to extend it to other libraries, we can rework the rule at that time. It feels a bit like premature optimization here.
Force-pushed from d4aa59d to 9c1f239
As per our discussion, I've reverted the last commit to make the rule specific to PySpark. Additionally, I've left a note in the implementation specification which mentions pandas, what to check, and that the rule description should be updated accordingly if pandas is supported by the rule.
Good job! LGTM!
Force-pushed from 9c1f239 to 8e7e833
You can preview this rule here (updated a few minutes after each push).
Review
A dedicated reviewer checked the rule description successfully for: