This rule raises an issue if a data frame has duplicate column names.

== Why is this an issue?

In PySpark, a `DataFrame` with duplicate column names can cause ambiguous and unexpected results in join, transformation, and data retrieval operations. For example:

* Column selection becomes unpredictable: `df.select("name")` will raise an exception
* Joins with other DataFrames may produce unexpected results or errors
* Saving to external data sources may fail

This rule ensures that a `DataFrame` is not constructed with duplicate column names.
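The ambiguity arises because a column name alone no longer identifies a single column. As a minimal, Spark-free sketch, duplicate names can be detected before the `DataFrame` is even constructed (the `duplicate_columns` helper below is hypothetical, not part of PySpark):

[source,python]
----
from collections import Counter

def duplicate_columns(columns):
    """Return the column names that occur more than once (hypothetical helper)."""
    return [name for name, count in Counter(columns).items() if count > 1]

print(duplicate_columns(["id", "name", "name"]))  # prints ['name']
----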

== How to fix it

To fix this issue, remove or rename the duplicate columns.
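When the column list is generated dynamically and duplicates cannot simply be removed, one option is to deduplicate the names before calling `createDataFrame`. The sketch below is an assumption, not a PySpark API: both the `dedupe_columns` helper and its `_<n>` suffix scheme are hypothetical.

[source,python]
----
def dedupe_columns(columns):
    """Append a numeric suffix to repeated names, e.g. ["id", "name", "name"]
    becomes ["id", "name", "name_1"] (hypothetical helper)."""
    counts = {}
    result = []
    for name in columns:
        if name in counts:
            counts[name] += 1
            result.append(f"{name}_{counts[name]}")
        else:
            counts[name] = 0
            result.append(name)
    return result

print(dedupe_columns(["id", "name", "name"]))  # prints ['id', 'name', 'name_1']
----

The deduplicated list can then be passed as the schema argument, keeping every column while making each name unique.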

=== Code examples

==== Noncompliant code example

[source,python,diff-id=1,diff-type=noncompliant]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()

data = [(1, "Alice", 28), (2, "Bob", 25)]
df = spark.createDataFrame(data, ["id", "name", "name"])
----

==== Compliant solution

[source,python,diff-id=1,diff-type=compliant]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()

data = [(1, "Alice", 28), (2, "Bob", 25)]
df = spark.createDataFrame(data, ["id", "name", "age"])
----

== Resources

=== Documentation

* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/best_practices.html#do-not-use-duplicated-column-names[Best Practices]