SONARPY-2696 Create rule S7469: Do not use duplicated column names #4874


Draft: wants to merge 3 commits into base: master from rule/add-RSPEC-S7469

Conversation


@github-actions github-actions bot commented Apr 3, 2025

You can preview this rule here (updated a few minutes after each push).

Review

A dedicated reviewer checked the rule description successfully for:

  • logical errors and incorrect information
  • information gaps and missing content
  • text style and tone
  • PR summary and labels follow the guidelines

@github-actions github-actions bot added the python label Apr 3, 2025
@Seppli11 Seppli11 changed the title Create rule S7469 SONARPY-2696 Create rule S7469: Do not use duplicated column names Apr 3, 2025
@Seppli11 Seppli11 force-pushed the rule/add-RSPEC-S7469 branch 3 times, most recently from 209c385 to 6e397d6 Compare April 4, 2025 09:02
@Seppli11 Seppli11 requested a review from joke1196 April 4, 2025 09:03
@Seppli11 Seppli11 force-pushed the rule/add-RSPEC-S7469 branch from 6e397d6 to be53d3a Compare April 4, 2025 09:11

@joke1196 joke1196 left a comment


Great first rule! The description is clear and concise. I added a few comments, as I think we can make a few improvements to make it clearer.

},
"tags": [
"pyspark",
"dataframe",

Are you sure about these tags? Did you check with Jean already? Normally we should only have something about the library name and the domain.


TBH, no. I copied them from the rule proposal ticket. I've changed them to just the domain and library for now.

@@ -0,0 +1,28 @@
{
"title": "Column names should be unique",

I think it would be a good idea to specify in which context this rule applies. Here I would not know what Column refers to. Maybe adding PySpark's DataFrame Column names would be clearer.

@@ -0,0 +1,99 @@
This rule raises an issue if a data frame has duplicate column names.

Same comment as for the title of the rule: maybe add PySpark's DataFrame, so there is no confusion about what the rule will do (as it will not raise on pandas DataFrames, for example).

Comment on lines 5 to 11
In PySpark, a `DataFrame` with duplicate column names can cause ambiguous and unexpected results with join, transformation, and data retrieval operations. For example:

* Column selection becomes unpredictable: `df.select("name")` will raise an exception
* Joins with other DataFrames may produce unexpected results or errors
* Saving to external data sources may fail

This rule ensures that a `DataFrame` is not constructed with duplicated column names.
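
For illustration, a minimal sketch of the ambiguity described above (not part of the PR's diff; assumes a local SparkSession, and the exact exception message may vary between Spark versions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Constructing a DataFrame with a duplicated column name is accepted...
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["name", "name"])

# ...but selecting by that name is ambiguous and raises an AnalysisException
df.select("name")
```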

This is really good! Concise and clear


== Resources
=== Documentation
- PySpark Documentation - https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/best_practices.html#do-not-use-duplicated-column-names[Best Practices]

This might be a bit confusing, showing documentation about pandas on Spark, which is not mentioned in the rule at all. Maybe we could add a small sentence in the description saying that we will raise on ps.createDataFrame and on spark.createDataFrame. WDYT?

@Seppli11 Seppli11 Apr 4, 2025

Oh, I didn't realize that the examples import pandas. But we probably should raise on both methods anyway.
On second thought, this would blow up the rule implementation and also complicate the rule description quite a bit. Unfortunately, I didn't find any other documentation that mentions this specifically. Because of this, I would be tempted to leave it there, since the general sentiment holds true regardless of pandas or PySpark.


So the Best Practices article says: "It is disallowed to use duplicated column names because Spark SQL does not allow this in general. Pandas API on Spark inherits this behavior." For me that means we can create a rule that is somewhat generic, stating that a DataFrame should not have duplicated names, and just specify that we will raise on pandas-on-Spark DataFrames as well. I don't think we need to go into too much detail about pandas, but we will raise on spark.createDataFrame and on spark.pandas.createDataFrame, because the latter inherits the behavior of the former. So for me this resource makes total sense here.
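
As a rough illustration of that inherited behavior (loosely following the example on the linked Best Practices page; the exact error wording may differ by version):

```python
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2], "b": [3, 4]})

# Pandas API on Spark inherits Spark SQL's restriction: renaming to
# duplicated column names fails with an ambiguous-reference error.
psdf.columns = ["a", "a"]
```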


There are a few cases, where the rule can be expanded.

* `createDataFrame(...)` is quite complex and there are a lot of ways to create a DataFrame with it
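
For context, a sketch of a few of the createDataFrame variants an implementation would have to cover (illustrative only; the column names here are arbitrary):

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# List of tuples plus a list of column names (can directly carry duplicates)
df1 = spark.createDataFrame([(1, 2)], ["a", "a"])

# List of Row objects (column names come from the Row fields)
df2 = spark.createDataFrame([Row(a=1, b=2)])

# List of dicts (column names come from the dict keys)
df3 = spark.createDataFrame([{"a": 1, "b": 2}])

# DDL-style schema string
df4 = spark.createDataFrame([(1, 2)], "a INT, b INT")
```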

Here, similar to the comment above, I think we should mention ps and spark DataFrame creation.

* Saving to external data sources may fail

This rule ensures that a `DataFrame` is not constructed with duplicated column names.


Should we maybe also raise on capitalized duplicates? Not sure if we can easily detect whether the configuration builder.config("spark.sql.caseSensitive", "true") is set and avoid raising too many FPs.
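
For reference, this is the setting the comment refers to; a minimal sketch assuming a local session (by default Spark SQL resolves column names case-insensitively):

```python
from pyspark.sql import SparkSession

# Enable case-sensitive column-name resolution for this session
spark = (
    SparkSession.builder
    .config("spark.sql.caseSensitive", "true")
    .getOrCreate()
)
```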

@Seppli11 Seppli11 Apr 4, 2025

We could treat that as a separate rule. Having the same name, but with different capitalization, is confusing either way. But I guess for this rule, we probably should not raise in that case.
I've updated the rule text slightly to reflect that we raise in both cases.


I like the case-insensitive part. I think you are right: even with the configuration, it is a bad practice.

@Seppli11 Seppli11 force-pushed the rule/add-RSPEC-S7469 branch 5 times, most recently from a226d74 to ae23ade Compare April 4, 2025 13:48
@Seppli11 Seppli11 requested a review from joke1196 April 4, 2025 13:48

@joke1196 joke1196 left a comment


I have just added a small comment regarding spark.pandas, which should be mentioned in the description of the rule.


@Seppli11 Seppli11 force-pushed the rule/add-RSPEC-S7469 branch from 9b92182 to 3c56e8f Compare April 7, 2025 07:31
@Seppli11 Seppli11 requested a review from joke1196 April 7, 2025 07:31

@joke1196 joke1196 left a comment


I think I would go the other way around: leave the rule a bit more specialized for PySpark and add text saying this rule will also raise on pandas-on-Spark DataFrames. If we ever need to make it more generic because we want to extend it to other libraries, we could rework the rule at that time. It feels like a bit of a premature optimization here.

@Seppli11 Seppli11 force-pushed the rule/add-RSPEC-S7469 branch 4 times, most recently from d4aa59d to 9c1f239 Compare April 7, 2025 09:24
@Seppli11 Seppli11 requested a review from joke1196 April 7, 2025 09:24

Seppli11 commented Apr 7, 2025

I think I would go the other way around: leave the rule a bit more specialized for PySpark and add text saying this rule will also raise on pandas-on-Spark DataFrames. If we ever need to make it more generic because we want to extend it to other libraries, we could rework the rule at that time. It feels like a bit of a premature optimization here.

As per our discussion, I've reverted the last commit to make the rule specific to PySpark. Additionally, I've left a note in the implementation specification which mentions pandas, what to check, and that the rule description should be updated accordingly if pandas is supported by the rule.


@joke1196 joke1196 left a comment


Good job! LGTM!

@Seppli11 Seppli11 force-pushed the rule/add-RSPEC-S7469 branch from 9c1f239 to 8e7e833 Compare April 7, 2025 09:51

sonarqube-next bot commented Apr 7, 2025

Quality Gate passed for 'rspec-tools'

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube


sonarqube-next bot commented Apr 7, 2025

Quality Gate passed for 'rspec-frontend'

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube
