Fixes #220 : Add dataframe comparison without order #228

Open
wants to merge 4 commits into
base: main
from

Conversation

smadarasmi
Contributor

No description provided.

@codecov-io

codecov-io commented Feb 25, 2018

Codecov Report

Merging #228 into master will not change coverage.
The diff coverage is 0%.

@@           Coverage Diff           @@
##           master     #228   +/-   ##
=======================================
  Coverage   86.36%   86.36%           
=======================================
  Files          46       46           
  Lines        1005     1005           
  Branches       86       89    +3     
=======================================
  Hits          868      868           
  Misses        116      116           
  Partials       21       21

| Flag | Coverage Δ |
|---|---|
| #python | 85.87% <ø> (ø) ⬆️ |
| #scala | 70.67% <0%> (-0.27%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| ...holdenkarau/spark/testing/DataFrameSuiteBase.scala | 0% <0%> (ø) ⬆️ |

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 39ad896...a577673. Read the comment docs.

* finding elements in one DataFrame not in the other. The resulting DataFrame
* should be empty inferring the two DataFrames have the same elements.
*/
def assertDataFrameNoOrderEquals(expected: DataFrame, result: DataFrame) {
Owner

Love this functionality, however I think it won't handle duplicate elements super well. If we look at the documentation for `except` on Datasets, http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@except(other:org.apache.spark.sql.Dataset[T]):org.apache.spark.sql.Dataset[T] , it is a little vague, but we can test this quickly with:

```scala
scala> case class A(a: Int)
defined class A

scala> val boop1 = List(A(1), A(1), A(2))
boop1: List[A] = List(A(1), A(1), A(2))

scala> val boop2 = List(A(1), A(2))
boop2: List[A] = List(A(1), A(2))

scala> session
:24: error: not found: value session
session
^

scala> spark
:24: error: value \ is not a member of org.apache.spark.sql.SparkSession
spark
^

scala> spark
res2: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@65698020

scala> val df1 = spark.createDataFrame(boop1)
2018-04-13 11:59:46 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
df1: org.apache.spark.sql.DataFrame = [a: int]

scala>

scala> val df2 = spark.createDataFrame(boop2)
df2: org.apache.spark.sql.DataFrame = [a: int]

scala> df1.except(df2).collect()
res3: Array[org.apache.spark.sql.Row] = Array()

scala> df2.except(df1).collect()
res4: Array[org.apache.spark.sql.Row] = Array()
```
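
The empty result in both directions is what a set difference would give: later Spark documentation describes `except` as equivalent to `EXCEPT DISTINCT` in SQL, so the extra `A(1)` in `boop1` is never reported. A minimal sketch of the same effect with plain Scala collections, no Spark involved; `SetVsMultiset`, `setDiff`, and `counts` are hypothetical names used only for this illustration:

```scala
object SetVsMultiset {
  final case class A(a: Int)

  // Set difference: distinct elements of `left` that never occur in `right`.
  // Duplicates are collapsed, which models the behaviour seen in the session above.
  def setDiff[T](left: Seq[T], right: Seq[T]): Set[T] =
    left.filterNot(x => right.contains(x)).toSet

  // Multiset view: how many times each distinct element occurs.
  def counts[T](xs: Seq[T]): Map[T, Int] =
    xs.groupBy(identity).map { case (k, v) => k -> v.size }

  def main(args: Array[String]): Unit = {
    val boop1 = List(A(1), A(1), A(2))
    val boop2 = List(A(1), A(2))

    // Both differences are empty even though the lists differ.
    println(setDiff(boop1, boop2).isEmpty) // true
    println(setDiff(boop2, boop1).isEmpty) // true

    // Comparing occurrence counts exposes the extra A(1).
    println(counts(boop1) == counts(boop2)) // false
  }
}
```

Comparing occurrence counts instead of distinct elements is what distinguishes the two lists here.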

@holdenk
Owner

holdenk commented Apr 13, 2018

Thanks for your contribution, sorry it's taken me so long to get around to reviewing it. Left a comment; I think we could maybe do something with groupBy and counts for each element. What do you think?
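
One way the groupBy-and-count idea could look, as a rough sketch rather than the code in this PR: count how often each distinct row occurs on both sides, then run `except` on the counted DataFrames so that a differing multiplicity shows up as a differing row. `NoOrderCompare`, `withRowCounts`, `sameRowsIgnoringOrder`, and the `__row_count` column name are made up for illustration. The sketch assumes Spark SQL is on the classpath, that both DataFrames have the same columns in the same order, and that every column type can be used as a grouping key.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, lit}

object NoOrderCompare {
  // Hypothetical name for the multiplicity column; assumed not to clash
  // with any real column of the DataFrames being compared.
  private val CountCol = "__row_count"

  // Collapse duplicate rows into (row values..., number of occurrences).
  def withRowCounts(df: DataFrame): DataFrame = {
    // Backticks keep column names containing dots from being parsed as paths.
    val allCols = df.columns.map(c => col(s"`$c`"))
    df.groupBy(allCols: _*).agg(count(lit(1)).as(CountCol))
  }

  // True when both DataFrames hold the same rows with the same
  // multiplicities, regardless of row order.
  def sameRowsIgnoringOrder(expected: DataFrame, result: DataFrame): Boolean = {
    val e = withRowCounts(expected)
    val r = withRowCounts(result)
    // A row that differs only in its count survives `except`.
    e.except(r).head(1).isEmpty && r.except(e).head(1).isEmpty
  }
}
```

An assertion helper could compare the schemas first and then fail the test when `sameRowsIgnoringOrder` returns false; whether that matches the approach eventually taken in this PR is not shown in the conversation.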

@holdenk
Owner

holdenk commented Jul 1, 2018

Gentle ping?

@smadarasmi
Contributor Author

smadarasmi commented Jul 10, 2018 via email

@smadarasmi
Contributor Author

@holdenk Really sorry for the delay. I hope this looks good. If you have more suggestions for changes, I will be quicker to make them now that my workload has eased.

@nsutcliffe

@smadarasmi happy to try to help get this one working if you give me edit access?

@cvaliente cvaliente mentioned this pull request Dec 4, 2019
5 participants