SONARPY-2696 Create rule S7469: Do not use duplicated column names #4874
base: master
Conversation
Force-pushed from 209c385 to 6e397d6
Force-pushed from 6e397d6 to be53d3a
Great first rule! The description is clear and concise. I added a few comments, as I think we can make a few improvements to make it clearer.
rules/S7469/python/metadata.json (outdated)
    },
    "tags": [
        "pyspark",
        "dataframe",
Are you sure about these tags? Did you check with Jean already? Normally we should only have something about the library name and the domain.
TBH, no. I copied them from the rule proposal ticket. I've changed them to just the domain and library for now
rules/S7469/python/metadata.json (outdated)
@@ -0,0 +1,28 @@
{
    "title": "Column names should be unique",
I think it would be a good idea to specify in which context this rule applies. Here I would not know what "Column" refers to. Maybe adding "PySpark's DataFrame Column names" would be clearer.
rules/S7469/python/rule.adoc (outdated)
@@ -0,0 +1,99 @@
This rule raises an issue if a data frame has duplicate column names.
Same comment as for the title of the rule: maybe add "PySpark's DataFrame" so there is no confusion about what the rule will do (as it will not raise on pandas DataFrames, for example).
rules/S7469/python/rule.adoc (outdated)
In PySpark, a `DataFrame` with duplicate column names can cause ambiguous and unexpected results with join, transformation, and data retrieval operations. For example:

* Column selection becomes unpredictable: `df.select("name")` will raise an exception
* Joins with other DataFrames may produce unexpected results or errors
* Saving to external data sources may fail

This rule ensures that a `DataFrame` is not constructed with duplicated column names.
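The failure mode described above comes down to name-based lookup becoming ambiguous once two columns share a name. A minimal plain-Python sketch of the kind of duplicate check such a rule would perform (the helper name is hypothetical, not part of the actual implementation, which would inspect `createDataFrame()` call sites):

```python
from collections import Counter

def duplicated_columns(columns):
    """Return the column names that appear more than once.

    Hypothetical helper sketching the check this rule performs on
    the column names passed to a DataFrame constructor.
    """
    counts = Counter(columns)
    return sorted(name for name, count in counts.items() if count > 1)

# A DataFrame built with these columns would trigger the rule:
print(duplicated_columns(["id", "name", "name", "age"]))  # ['name']
print(duplicated_columns(["id", "name", "age"]))          # []
```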
This is really good! Concise and clear.
rules/S7469/python/rule.adoc (outdated)

== Resources
=== Documentation
- PySpark Documentation - https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/best_practices.html#do-not-use-duplicated-column-names[Best Practices]
This might get a bit confusing to show documentation about pandas on spark, which is not mentioned in the rule at all. Maybe we could add a small sentence in the description saying that we will raise on ps.createDataFrame and on spark.createDataFrame. WDYT?
Oh, I didn't realize that the examples import pandas. But we probably should raise on both methods anyway.
On second thought, this would blow up the rule implementation quite a bit and also complicate the rule description. Unfortunately, I didn't find any other documentation that mentions this specifically. Because of this, I would be tempted to leave it there, since the general sentiment holds true regardless of pandas or PySpark.
So the Best Practices article says: "It is disallowed to use duplicated column names because Spark SQL does not allow this in general. Pandas API on Spark inherits this behavior."
And for me it means we can create a rule that is sort of generic, stating that a DataFrame should not have duplicated names, and just specify that we will raise on pandas-on-Spark DataFrames as well. I don't think we need to go into too much detail about pandas, but we will raise on spark.createDataFrame and on spark.pandas.createDataFrame, because the latter inherits the behavior of the former. So for me this resource makes total sense here.
rules/S7469/python/rule.adoc (outdated)

There are a few cases where the rule can be expanded.

* `createDataFrame(...)` is quite complex and there are a lot of ways to create a DataFrame with it
Here, similar to the comment above, I think we should mention both ps and spark DataFrame creation.
* Saving to external data sources may fail

This rule ensures that a `DataFrame` is not constructed with duplicated column names.
Should we maybe also raise on capitalized duplicates? Not sure if we can easily detect the configuration builder.config("spark.sql.caseSensitive", "true") and not raise too many FPs.
We could treat that as a separate rule. Having the same name but with different capitalization is confusing either way. But I guess for this rule, we probably should not raise in that case.
I've updated the rule text slightly to reflect that we raise in both cases.
I like the case-insensitive part. I think you are right: even with the configuration, it is a bad practice.
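Since Spark SQL treats column names case-insensitively by default (the spark.sql.caseSensitive setting is false unless changed), a check along the lines discussed here would fold names to lower case before counting. A hypothetical plain-Python sketch of that variant:

```python
from collections import Counter

def duplicated_columns_case_insensitive(columns):
    """Return lower-cased column names that collide when case is ignored.

    Hypothetical sketch: with Spark SQL's default case-insensitive
    resolution, "Name" and "name" refer to the same column, so both
    count as one duplicated name here.
    """
    counts = Counter(name.lower() for name in columns)
    return sorted(name for name, count in counts.items() if count > 1)

# "Name" and "name" collide under the default configuration:
print(duplicated_columns_case_insensitive(["id", "Name", "name"]))  # ['name']
print(duplicated_columns_case_insensitive(["id", "name", "age"]))   # []
```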
Force-pushed from a226d74 to ae23ade
I have just added a small comment regarding spark.pandas, which should be mentioned in the description of the rule.
Force-pushed from 9b92182 to 3c56e8f
I think I would go the other way around: leave the rule a bit more specialized to PySpark and add text saying this rule will also raise on pandas-on-Spark DataFrames. If we ever need to make it more generic because we want to extend it to other libraries, we can rework the rule at that time. It feels a bit like premature optimization here.
Force-pushed from d4aa59d to 9c1f239
As per our discussion, I've reverted the last commit to make the rule specific to PySpark. Additionally, I've left a note in the implementation specification which mentions pandas, what to check, and that the rule description should be updated accordingly if pandas is supported by the rule.
Good job! LGTM!
Force-pushed from 9c1f239 to 8e7e833
You can preview this rule here (updated a few minutes after each push).
Review
A dedicated reviewer checked the rule description successfully for: