Fixes #220: Add dataframe comparison without order #228
base: main
Conversation
Codecov Report
```
@@           Coverage Diff            @@
##           master     #228   +/-   ##
=======================================
  Coverage   86.36%   86.36%
=======================================
  Files          46       46
  Lines        1005     1005
  Branches       86       89     +3
=======================================
  Hits          868      868
  Misses        116      116
  Partials       21       21
```
```scala
 * finding elements in one DataFrame not in the other. The resulting DataFrame
 * should be empty, implying the two DataFrames have the same elements.
 */
def assertDataFrameNoOrderEquals(expected: DataFrame, result: DataFrame) {
```
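(The function body is elided in the excerpt above; as a minimal sketch, and purely an assumption about the elided code rather than the PR's exact implementation, the except-based approach under discussion might read:)

```scala
import org.apache.spark.sql.DataFrame

// Assumed reconstruction of the elided body, not the PR's exact code:
// diff in both directions with except and require both results to be
// empty before treating the two DataFrames as equal.
def assertDataFrameNoOrderEquals(expected: DataFrame, result: DataFrame): Unit = {
  assert(expected.except(result).count() == 0 &&
         result.except(expected).count() == 0)
}
```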
Love this functionality, however I think it won't handle duplicate elements super well. The documentation for except on Datasets (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@except(other:org.apache.spark.sql.Dataset[T]):org.apache.spark.sql.Dataset[T]) is a little vague, but we can test this quickly:
```scala
scala> case class A(a: Int)
defined class A

scala> val boop1 = List(A(1), A(1), A(2))
boop1: List[A] = List(A(1), A(1), A(2))

scala> val boop2 = List(A(1), A(2))
boop2: List[A] = List(A(1), A(2))

scala> val df1 = spark.createDataFrame(boop1)
df1: org.apache.spark.sql.DataFrame = [a: int]

scala> val df2 = spark.createDataFrame(boop2)
df2: org.apache.spark.sql.DataFrame = [a: int]

scala> df1.except(df2).collect()
res3: Array[org.apache.spark.sql.Row] = Array()

scala> df2.except(df1).collect()
res4: Array[org.apache.spark.sql.Row] = Array()
```
Thanks for your contribution, and sorry it's taken me so long to get around to reviewing it. I left a comment; I think we could do something with a groupBy and counts for each element, maybe. What do you think?
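A minimal sketch of that groupBy-and-count idea, assuming a hypothetical helper name and a plain assert (neither is the library's actual API):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch: attach a multiplicity to every distinct row before diffing, so
// except can no longer collapse duplicates. With this version, comparing
// List(A(1), A(1), A(2)) against List(A(1), A(2)) fails as expected.
def assertDataFrameNoOrderEqualsWithCounts(expected: DataFrame, result: DataFrame): Unit = {
  // Group by every column; the appended "count" column records how many
  // times each distinct row occurs in its DataFrame.
  val expectedCounts = expected.groupBy(expected.columns.map(col): _*).count()
  val resultCounts   = result.groupBy(result.columns.map(col): _*).count()
  assert(expectedCounts.except(resultCounts).count() == 0 &&
         resultCounts.except(expectedCounts).count() == 0)
}
```

Because the multiplicity column is part of the diffed schema, two rows only cancel out in except when they occur the same number of times on both sides.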
Gentle ping?
Hello,
Sorry I have been really busy at work. I can try to tackle it this weekend.
@holdenk Really sorry for the delay. I hope this looks good. If you have more suggestions for changes, I will be faster in making them now that my workload has eased.
@smadarasmi happy to try to help get this one working if you give me edit access?