Inconsistent results with r package mdlp #12

Open
ashtonwebster opened this issue Aug 7, 2017 · 4 comments
@ashtonwebster

First, I want to say your work in this library looks great and it is perfect for what I am trying to use it for. So thank you :)

I am getting very different results from what you obtained here: https://gist.github.com/hlin117/dee443d98af5ac3413c5 . When I print conv_X after running your example from the README, it looks like this:

[[  4.  18.   0.   0.]
 [  1.  11.   0.   0.]
 [  0.  13.   0.   0.]
 [  0.  13.   0.   0.]
 [  2.  19.   0.   0.]
 [  7.  21.   0.   0.]
 [  0.  17.   0.   0.]
...

Meanwhile, the R package and your previous results give categories in the range [1, 3]. I would feel more confident in these results if they matched those of the R package.

I created a new environment and set versions for my packages to match the ones you specify in requirements.txt.

Any idea why I might be obtaining these inconsistent results?

@hlin117
Owner

hlin117 commented Aug 8, 2017

Hi Ashton! Thanks for your comments.

Hmm, the gist I posted is quite old. It looks like I made that script while I was trying to get the package into scikit-learn, but as of a few months ago, that effort has been discontinued.

#7 seems to have introduced some breaking changes, which have caused a unit test in this package to fail. (I need to get Travis working for this project sometime.)

With all this said, I'll certainly look into your issue when I have the time.

@ashtonwebster
Author

Thanks @hlin117. I am not sure whether this is a related error, but I am also failing the unit test (in fact, the output discretized values are floats instead of integers).

As a nasty hack, I ended up using rpy2 to simply call the R mdlp function. Perhaps when I am done with my current project I can help take a look. It's too bad this wasn't accepted into scikit-learn; IMO it would be a useful discretization method.

@ChingChuan-Chen

The inconsistency is caused by a difference in the two implementations. The R code is here.

R:

for (i in 1:(n-1)){
	if(x[i+1]!=x[i]) {  # they check x not y
		ct <- (x[i]+x[i+1])/2
		wx <- which(x<=ct)
		wn <- length(wx)/n
		e1 <- wn*ent(y[wx])
		e2 <- (1-wn)*ent(y[-wx])
		val <- e1+e2
		if(val<entropy) {
			entropy <- val
			ci <- i
		}
	}
}

Yours:

for ind in range(start + 1, end):

	# I choose not to use a `min` function here for this optimization.
	if y[ind-1] == y[ind]:  # you check y instead of x
		continue

	# Find the partition entropy, and check if this entropy is minimum
	first_half = (<float> (ind - start)) / length \
						* slice_entropy(y, start, ind)[0]
	second_half = (<float> (end - ind)) / length \
						 * slice_entropy(y, ind, end)[0]
	curr_entropy = first_half + second_half

	if prev_entropy > curr_entropy:
		prev_entropy = curr_entropy
		k = ind

I think that checking y is weird, because the algorithm is supposed to discretize x.
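The difference between the two candidate-cut conditions can be seen in isolation. Below is a small pure-Python sketch (the function `candidate_cuts` and the sample data are mine, not from either codebase) showing that the two rules evaluate different split positions on the same sorted input:

```python
def candidate_cuts(x, y, check):
    """Return indices i where a cut between position i and i+1 is evaluated.

    check='x' mimics the R rule (x[i] != x[i+1]);
    check='y' mimics the rule in this package (y[i] != y[i+1]).
    Assumes x is sorted and y holds the corresponding class labels.
    """
    if check == 'x':
        return [i for i in range(len(x) - 1) if x[i] != x[i + 1]]
    return [i for i in range(len(y) - 1) if y[i] != y[i + 1]]

# Hypothetical sample: repeated x values with a class change inside a tie.
x = [1.0, 1.0, 2.0, 3.0, 3.0]
y = [0,   1,   1,   1,   0]

print(candidate_cuts(x, y, 'x'))  # [1, 2] -- only where x changes
print(candidate_cuts(x, y, 'y'))  # [0, 3] -- only where y changes
```

With tied x values whose labels differ, the two rules consider disjoint sets of candidate cuts, so they can settle on different minimum-entropy partitions.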

@ChingChuan-Chen

ChingChuan-Chen commented Jun 18, 2019

The original paper also states that only boundary points of the attribute are considered; boundary points are defined in Section 2.2.
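Informally, a boundary point sits between two adjacent distinct attribute values whose examples carry different class labels. A simplified pure-Python sketch of extracting such candidate cut points (the helper `boundary_points` and the sample data are hypothetical, and the corner case of tied x values with mixed labels is not handled):

```python
def boundary_points(x, y):
    """Midpoints between adjacent distinct x values whose examples
    have different class labels -- a simplified take on the
    boundary-point notion from the MDLP paper's Section 2.2."""
    pairs = sorted(zip(x, y))
    cuts = []
    for (x1, y1), (x2, y2) in zip(pairs, pairs[1:]):
        if x1 != x2 and y1 != y2:
            cuts.append((x1 + x2) / 2)
    return cuts

print(boundary_points([1.0, 2.0, 2.5, 3.0], [0, 0, 1, 1]))  # [2.25]
```

The paper's result is that the minimum-entropy cut is always at such a boundary, which is why restricting the search to them (as the R code effectively does by checking x) is safe.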
