Commit 8e7e833

fixes after second review

1 parent e1877bb

1 file changed: +16 -2

rules/S7469/python/rule.adoc

Lines changed: 16 additions & 2 deletions
@@ -8,7 +8,7 @@ In PySpark, a `DataFrame` with duplicate column names can cause ambiguous and un
 * Joins with other DataFrames may produce unexpected results or errors
 * Saving to external data sources may fail
 
-Case-insensitive duplicates, for example a column named "name" and "Name", are also flagged. This is because having column names that differ only in casing creates confusion when referencing columns and makes code harder to understand and maintain. This can lead to subtle bugs that are difficult to detect and fix.
+Case-insensitive duplicates, for example a column named "name" and "Name", are also flagged. This is because having column names that differ only in casing creates confusion when referencing columns and makes code harder to understand and maintain, leading to subtle bugs that are difficult to detect and fix.
 
 == How to fix it
 To fix this issue, remove or rename the duplicate columns.
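
For reference, a minimal sketch of the pattern the rule targets and the fix it describes, assuming a local `spark` session and example data (illustrative only, not part of this commit's diff):

[source,python]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, "Alice", 28), (2, "Bob", 35)]

# Duplicate "name" columns: later references to "name" are ambiguous
df = spark.createDataFrame(data, ["id", "name", "name"])  # Noncompliant

# Renaming the duplicate gives every column a unique name
df = spark.createDataFrame(data, ["id", "name", "age"])  # Compliant
----
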
@@ -41,7 +41,7 @@ df = spark.createDataFrame(data, ["id", "name", "age"]) # Compliant
 
 == Resources
 === Documentation
-- PySpark Documentation - https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/best_practices.html#do-not-use-duplicated-column-names[Best Practices]
+* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.createDataFrame.html[SparkSession.createDataFrame]
 
 ifdef::env-github,rspecator-view[]
 === Implementation Specification
@@ -89,6 +89,20 @@ nested_schema = StructType([
 
 In addition to that, parts can be passed as variables instead of literals. This seems to be especially common for schemas.
 
+This rule could also apply to pandas `DataFrame`s, as well as the `DataFrame`s from the Pandas API on Spark. However, this would increase the scope of an already big rule. Depending on whether the implementation raises on pandas (or Pandas API on Spark) `DataFrame`s or not, the rule description should be updated to reflect this. Below are examples of how to construct a pandas `DataFrame` with duplicate column names.
+
+[source,python]
+----
+import pandas as pd
+# the examples below also work with pyspark.pandas
+import pyspark.pandas as ps
+
+pd.DataFrame(data=[[1, 2]], columns=["name", "name"]) # Noncompliant
+pd.DataFrame.from_dict(data={"row_1": [1, 2], "row_2": [3, 4]}, orient="index", columns=["name", "name"]) # Noncompliant
+pd.DataFrame.from_records(data=[(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')], columns=['col', 'col']) # Noncompliant
+----
+
+Documentation for best practices for pandas on Spark: https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/best_practices.html#do-not-use-duplicated-column-names[Best Practices]
 
 === Message
 
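The implementation note above mentions that column names and schemas are often passed as variables rather than literals; below is a minimal sketch of what that looks like, with hypothetical variable names and data chosen for illustration:

[source,python]
----
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# The column names arrive through a variable, so the duplicate only
# becomes visible once `columns` is resolved
columns = ["id", "name", "name"]
df = spark.createDataFrame([(1, "a", "b")], columns)  # Noncompliant

# The duplicate comes from a schema object built separately
schema = StructType([
    StructField("name", StringType()),
    StructField("name", StringType()),
])
df = spark.createDataFrame([("a", "b")], schema)  # Noncompliant
----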