
Commit 209c385

Create rule S7469
1 parent ddb24ae commit 209c385


3 files changed, +74 -0 lines changed


rules/S7469/metadata.json

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
{
}

rules/S7469/python/metadata.json

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
{
  "title": "Column names should be unique",
  "type": "CODE_SMELL",
  "status": "ready",
  "remediation": {
    "func": "Constant\/Issue",
    "constantCost": "5min"
  },
  "tags": [
    "pyspark",
    "dataframe",
    "best-practice",
    "data-integrity"
  ],
  "defaultSeverity": "Major",
  "ruleSpecification": "RSPEC-7469",
  "sqKey": "S7469",
  "scope": "All",
  "defaultQualityProfiles": ["Sonar way"],
  "quickfix": "unknown",
  "code": {
    "impacts": {
      "MAINTAINABILITY": "MEDIUM",
      "RELIABILITY": "HIGH"
    },
    "attribute": "CONVENTIONAL"
  }
}

rules/S7469/python/rule.adoc

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
This rule raises an issue if a data frame has duplicate column names.

== Why is this an issue?

In PySpark, a `DataFrame` with duplicate column names can cause ambiguous and unexpected results with join, transformation, and data retrieval operations. For example:

* Column selection becomes unpredictable: `df.select("name")` will raise an exception, as shown in the sketch below
* Joins with other DataFrames may produce unexpected results or errors
* Saving to external data sources may fail

This rule ensures that a `DataFrame` is not constructed with duplicated column names.
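
For illustration, a minimal sketch of the ambiguous-reference failure. It assumes a local Spark session; the exception class and the exact message vary between Spark versions, so treat them as illustrative rather than guaranteed:

[source,python]
----
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.appName("Example").getOrCreate()

# Spark happily builds a DataFrame with two columns labelled "name".
df = spark.createDataFrame([(1, "Alice", 28), (2, "Bob", 25)], ["id", "name", "name"])

try:
    # The reference "name" matches two columns, so analysis fails here.
    df.select("name").show()
except AnalysisException as error:
    # Spark rejects the ambiguous reference; the exact wording of the
    # message depends on the Spark version.
    print(error)
----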

== How to fix it
To fix this issue, remove or rename the duplicate columns.

=== Code examples

==== Noncompliant code example

[source,python,diff-id=1,diff-type=noncompliant]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()

data = [(1, "Alice", 28), (2, "Bob", 25)]
df = spark.createDataFrame(data, ["id", "name", "name"])
----

==== Compliant solution

[source,python,diff-id=1,diff-type=compliant]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()

data = [(1, "Alice", 28), (2, "Bob", 25)]
df = spark.createDataFrame(data, ["id", "name", "age"])
----
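
If the duplicated labels come from an existing `DataFrame` rather than from the constructor call, one possible fix is to re-label every column positionally with `DataFrame.toDF`. This is only a sketch and assumes the column order, and the intended new names, are known:

[source,python]
----
# Sketch only: assumes `df` currently carries the labels ["id", "name", "name"]
# and that the third column actually holds ages.
df = df.toDF("id", "name", "age")  # re-labels all columns by position
----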

== Resources
=== Documentation
- PySpark Documentation - https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/best_practices.html#do-not-use-duplicated-column-names[Best Practices]
