Fixes #220: Add dataframe comparison without order #228
base: main
Conversation
Codecov Report
```
@@           Coverage Diff            @@
##           master     #228   +/-   ##
=======================================
  Coverage   86.36%   86.36%
=======================================
  Files          46       46
  Lines        1005     1005
  Branches       86       89     +3
=======================================
  Hits          868      868
  Misses        116      116
  Partials       21       21
```
```scala
 * finding elements in one DataFrame not in the other. The resulting DataFrame
 * should be empty, implying the two DataFrames have the same elements.
 */
def assertDataFrameNoOrderEquals(expected: DataFrame, result: DataFrame) {
```
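(The function body is elided in the excerpt above; as a minimal sketch, and purely an assumption about the elided code rather than the PR's exact implementation, the except-based approach under discussion might read:)

```scala
import org.apache.spark.sql.DataFrame

// Assumed reconstruction of the elided body, not the PR's exact code:
// diff in both directions with except and require both results to be
// empty before treating the two DataFrames as equal.
def assertDataFrameNoOrderEquals(expected: DataFrame, result: DataFrame): Unit = {
  assert(expected.except(result).count() == 0 &&
         result.except(expected).count() == 0)
}
```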
Love this functionality, however I think it won't handle duplicate elements super well. The documentation for except on Datasets (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@except(other:org.apache.spark.sql.Dataset[T]):org.apache.spark.sql.Dataset[T]) is a little vague, but we can test this quickly:
```scala
scala> case class A(a: Int)
defined class A

scala> val boop1 = List(A(1), A(1), A(2))
boop1: List[A] = List(A(1), A(1), A(2))

scala> val boop2 = List(A(1), A(2))
boop2: List[A] = List(A(1), A(2))

scala> val df1 = spark.createDataFrame(boop1)
df1: org.apache.spark.sql.DataFrame = [a: int]

scala> val df2 = spark.createDataFrame(boop2)
df2: org.apache.spark.sql.DataFrame = [a: int]

scala> df1.except(df2).collect()
res3: Array[org.apache.spark.sql.Row] = Array()

scala> df2.except(df1).collect()
res4: Array[org.apache.spark.sql.Row] = Array()
```
Thanks for your contribution, and sorry it's taken me so long to get around to reviewing it. I left a comment; I think we could do something with a groupBy and counts for each element, maybe. What do you think?
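A minimal sketch of that groupBy-and-count idea, assuming a hypothetical helper name and a plain assert (neither is the library's actual API):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch: attach a multiplicity to every distinct row before diffing, so
// except can no longer collapse duplicates. With this version, comparing
// List(A(1), A(1), A(2)) against List(A(1), A(2)) fails as expected.
def assertDataFrameNoOrderEqualsWithCounts(expected: DataFrame, result: DataFrame): Unit = {
  // Group by every column; the appended "count" column records how many
  // times each distinct row occurs in its DataFrame.
  val expectedCounts = expected.groupBy(expected.columns.map(col): _*).count()
  val resultCounts   = result.groupBy(result.columns.map(col): _*).count()
  assert(expectedCounts.except(resultCounts).count() == 0 &&
         resultCounts.except(expectedCounts).count() == 0)
}
```

Because the multiplicity column is part of the diffed schema, two rows only cancel out in except when they occur the same number of times on both sides.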
Gentle ping?
Hello,
Sorry I have been really busy at work. I can try to tackle it this weekend.
@holdenk Really sorry for the delay. I hope this looks good. If you have more suggestions for changes, I will be faster in making them now that my workload has eased.
@smadarasmi happy to try to help get this one working if you give me edit access?