Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The wrong skew measure? #90

Open
mcorrell opened this issue Dec 21, 2017 · 4 comments
Open

The wrong skew measure? #90

mcorrell opened this issue Dec 21, 2017 · 4 comments

Comments

@mcorrell
Copy link
Contributor

datalib currently calculates what is supposed to be Pearson's mode skewness, but does so as:
(avg - med) / std

This is non-parametric skewness, rather than:

Pearson's mode skewness which is
(avg - mode) / std

Or Pearson's median skewness, which is
3 * (avg - med) / std

So this is only giving correct values when med == mode.

Should I:

  1. Rename the function to reflect that it's nonparametric skew.
  2. Add a mode function, and then fix the modeskew function by calling mode.
  3. Add a factor of 3 and rename the function to reflect that it's median skew.
  4. Some combination of those things.
  5. None of those things.
@domoritz
Copy link
Member

I'd say mode skewness should compute the correct thing so 2. But we could keep the existing function around but with a different name, which is 1. And while you're at it, why not add Pearson's median skewness (3.)?

@mcorrell
Copy link
Contributor Author

mode is a little tricky, and I think the current datalib map function might be more useful in 99% of cases. For instance, mode([1,2]) should, in principle, return the unordered set {1,2}, since the distribution has two modes. But it seems silly for mode([1,1]) to return {1} instead of 1. And then there's the question of nulls and undefineds and so on. I think it's useful for somebody to know that the mode of [1,null,null] is null, but that's a judgment call. In either event, we'd have a lot of cases where we would not have a single numerical mode, and so mode skewness would be undefined.

I'm now leaning towards an option not on my list, computing skewness using the third standardized moment, rather than having to mess with a mode function. But I can submit a PR with a mode function (that would just call map and skim off the elements with the highest count) and see what people think.

@jheer
Copy link
Member

jheer commented Dec 28, 2017

I would recommend renaming the function to something more appropriate (npskew?), but leave a deprecated reference to the existing function name modeskew to support backwards compatibility. I can add appropriate warnings / caveats to the documentation. (FWIW, this method was removed from Vega in version 3 onward, though obviously persists here in datalib.)

I don't think inclusion of mode is needed, given existing functionality. A new skew method for the third standardized moment seems reasonable to add.

@mcorrell
Copy link
Contributor Author

Okay, I made a PR #91 that adds sample skew. I wasn't sure how to sunset modeskew exactly so I just punted on that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants