This rule raises an issue if a data frame has duplicate column names.

== Why is this an issue?

In PySpark, a `DataFrame` with duplicate column names can cause ambiguous and unexpected results in join, transformation, and data retrieval operations. For example:

* Column selection becomes unpredictable: `df.select("name")` will raise an exception
* Joins with other DataFrames may produce unexpected results or errors
* Saving to external data sources may fail

This rule ensures that a `DataFrame` is not constructed with duplicate column names.
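The ambiguity arises because a column name alone no longer identifies a single column. As a minimal, Spark-free sketch, duplicate names can be detected before the `DataFrame` is even constructed (the `duplicate_columns` helper below is hypothetical, not part of PySpark):

[source,python]
----
from collections import Counter

def duplicate_columns(columns):
    """Return the column names that occur more than once (hypothetical helper)."""
    return [name for name, count in Counter(columns).items() if count > 1]

print(duplicate_columns(["id", "name", "name"]))  # prints ['name']
----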

== How to fix it

To fix this issue, remove or rename the duplicate columns.
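When the column list is generated dynamically and duplicates cannot simply be removed, one option is to deduplicate the names before calling `createDataFrame`. The sketch below is an assumption, not a PySpark API: both the `dedupe_columns` helper and its `_<n>` suffix scheme are hypothetical.

[source,python]
----
def dedupe_columns(columns):
    """Append a numeric suffix to repeated names, e.g. ["id", "name", "name"]
    becomes ["id", "name", "name_1"] (hypothetical helper)."""
    counts = {}
    result = []
    for name in columns:
        if name in counts:
            counts[name] += 1
            result.append(f"{name}_{counts[name]}")
        else:
            counts[name] = 0
            result.append(name)
    return result

print(dedupe_columns(["id", "name", "name"]))  # prints ['id', 'name', 'name_1']
----

The deduplicated list can then be passed as the schema argument, keeping every column while making each name unique.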

=== Code examples

==== Noncompliant code example

[source,python,diff-id=1,diff-type=noncompliant]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()

data = [(1, "Alice", 28), (2, "Bob", 25)]
df = spark.createDataFrame(data, ["id", "name", "name"])
----

==== Compliant solution

[source,python,diff-id=1,diff-type=compliant]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()

data = [(1, "Alice", 28), (2, "Bob", 25)]
df = spark.createDataFrame(data, ["id", "name", "age"])
----

== Resources

=== Documentation

* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/best_practices.html#do-not-use-duplicated-column-names[Best Practices]