Inconsistent results with r package mdlp #12

Open
ashtonwebster opened this issue Aug 7, 2017 · 4 comments
@ashtonwebster

First, I want to say your work in this library looks great and it is perfect for what I am trying to use it for. So thank you :)

I am getting very different results from what you obtained here: https://gist.github.com/hlin117/dee443d98af5ac3413c5 . When I print conv_X after running your example from the README, it looks like this:

[[  4.  18.   0.   0.]
 [  1.  11.   0.   0.]
 [  0.  13.   0.   0.]
 [  0.  13.   0.   0.]
 [  2.  19.   0.   0.]
 [  7.  21.   0.   0.]
 [  0.  17.   0.   0.]
...

Meanwhile, the R package and your previous results give categories in the range [1, 3]. I would feel more confident in these results if they matched those of the R package.

I created a new environment and set versions for my packages to match the ones you specify in requirements.txt.

Any idea why I might be obtaining these inconsistent results?

@hlin117
Owner

hlin117 commented Aug 8, 2017

Hi Ashton! Thanks for your comments.

Hmm, the gist I posted is quite old. It looks like I made that script while I was trying to get the package into scikit-learn, but as of a few months ago, that effort has been discontinued.

#7 seems to have introduced some breaking changes, which have caused a unit test in this package to fail. (I need to get Travis working for this project sometime.)

With all this said, I'll certainly look into your issue when I have the time.

@ashtonwebster
Author

Thanks @hlin117. I am not sure whether this is a related error, but I am also failing the unit test (in fact, the output discretized values are floats instead of integers).

As a nasty hack, I ended up using rpy2 to simply call the R mdlp function. Perhaps when I am done with my current project I can help take a look. It's too bad this wasn't accepted into scikit-learn; IMO it would be a useful discretization method.

@ChingChuan-Chen

The inconsistency is caused by a difference in the two implementations. The R code is here.

R:

for (i in 1:(n-1)){
	if(x[i+1]!=x[i]) {  # they check x not y
		ct <- (x[i]+x[i+1])/2
		wx <- which(x<=ct)
		wn <- length(wx)/n
		e1 <- wn*ent(y[wx])
		e2 <- (1-wn)*ent(y[-wx])
		val <- e1+e2
		if(val<entropy) {
			entropy <- val
			ci <- i
		}
	}
}

Yours:

for ind in range(start + 1, end):

	# I choose not to use a `min` function here for this optimization.
	if y[ind-1] == y[ind]:  # you check y instead of x
		continue

	# Find the partition entropy, and check if this entropy is minimum
	first_half = (<float> (ind - start)) / length \
						* slice_entropy(y, start, ind)[0]
	second_half = (<float> (end - ind)) / length \
						 * slice_entropy(y, ind, end)[0]
	curr_entropy = first_half + second_half

	if prev_entropy > curr_entropy:
		prev_entropy = curr_entropy
		k = ind

I think that checking y is weird, because the algorithm is supposed to discretize x.
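The difference between the two candidate-cut conditions can be seen in isolation. Below is a small pure-Python sketch (the function `candidate_cuts` and the sample data are mine, not from either codebase) showing that the two rules evaluate different split positions on the same sorted input:

```python
def candidate_cuts(x, y, check):
    """Return indices i where a cut between position i and i+1 is evaluated.

    check='x' mimics the R rule (x[i] != x[i+1]);
    check='y' mimics the rule in this package (y[i] != y[i+1]).
    Assumes x is sorted and y holds the corresponding class labels.
    """
    if check == 'x':
        return [i for i in range(len(x) - 1) if x[i] != x[i + 1]]
    return [i for i in range(len(y) - 1) if y[i] != y[i + 1]]

# Hypothetical sample: repeated x values with a class change inside a tie.
x = [1.0, 1.0, 2.0, 3.0, 3.0]
y = [0,   1,   1,   1,   0]

print(candidate_cuts(x, y, 'x'))  # [1, 2] -- only where x changes
print(candidate_cuts(x, y, 'y'))  # [0, 3] -- only where y changes
```

With tied x values whose labels differ, the two rules consider disjoint sets of candidate cuts, so they can settle on different minimum-entropy partitions.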

@ChingChuan-Chen

ChingChuan-Chen commented Jun 18, 2019

The original paper also states that only boundary points of the attribute are considered; boundary points are defined in Section 2.2.
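Informally, a boundary point sits between two adjacent distinct attribute values whose examples carry different class labels. A simplified pure-Python sketch of extracting such candidate cut points (the helper `boundary_points` and the sample data are hypothetical, and the corner case of tied x values with mixed labels is not handled):

```python
def boundary_points(x, y):
    """Midpoints between adjacent distinct x values whose examples
    have different class labels -- a simplified take on the
    boundary-point notion from the MDLP paper's Section 2.2."""
    pairs = sorted(zip(x, y))
    cuts = []
    for (x1, y1), (x2, y2) in zip(pairs, pairs[1:]):
        if x1 != x2 and y1 != y2:
            cuts.append((x1 + x2) / 2)
    return cuts

print(boundary_points([1.0, 2.0, 2.5, 3.0], [0, 0, 1, 1]))  # [2.25]
```

The paper's result is that the minimum-entropy cut is always at such a boundary, which is why restricting the search to them (as the R code effectively does by checking x) is safe.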
