
KL divergence test for missing data  #2

@vcabeli

I’ll probably have to fix this one, just taking notes here

We use the KL divergence to check that the joint distribution of (X,Y) on the samples for which the contributor Z is not NA, P(X,Y)|Z_notNA, is not too different from the original P(X,Y).

If it is very different, then the result of I(X;Y|Z) does not really give information about Z as a contributor; see the extreme example below:
[figure: extreme example]
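For reference, here is a minimal sketch of the quantity being compared, assuming X and Y are already discretized and integer-coded; the function names are illustrative and this is not the actual implementation:

```python
import numpy as np

def empirical_joint(x, y, kx, ky):
    """Empirical joint P(X, Y) as a kx-by-ky probability table (x, y integer-coded)."""
    table = np.zeros((kx, ky))
    np.add.at(table, (x, y), 1.0)
    return table / len(x)

def kl_subsample_vs_full(x, y, z_observed, eps=1e-12):
    """KL( P(X,Y)|Z_notNA || P(X,Y) ) for discrete, integer-coded X and Y.

    x, y       : int arrays with codes in {0..kx-1} / {0..ky-1}
    z_observed : boolean mask, True where the contributor Z is not NA
    """
    kx, ky = int(x.max()) + 1, int(y.max()) + 1
    p_full = empirical_joint(x, y, kx, ky)                         # P(X,Y) on all samples
    p_sub = empirical_joint(x[z_observed], y[z_observed], kx, ky)  # restricted to Z-observed rows
    nz = p_sub > 0   # cells seen in the subsample necessarily appear in the full sample
    return float(np.sum(p_sub[nz] * np.log(p_sub[nz] / (p_full[nz] + eps))))
```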

For now, the value KL(P(X,Y)|Z_notNA || P(X,Y)) is compared to log(N_nonNA), which probably catches the worst cases of selection bias but may not be what we want.

One obvious flaw is that log(N_nonNA) increases with N_nonNA, whereas we expect it to be harder to create a strong selection bias when adding more samples to the subsample.
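As a back-of-the-envelope argument (treating the Z-observed subsample as an i.i.d. draw from P(X,Y) under the null, and ignoring the dependence introduced by subsampling without replacement), the usual G-test approximation gives

$$
2\,N_{\mathrm{nonNA}}\;\mathrm{KL}\big(\hat P(X,Y)|Z_{\mathrm{notNA}} \,\big\|\, \hat P(X,Y)\big)\;\approx\;\chi^2_{K-1},
$$

where K is the number of cells of the (X,Y) contingency table. So the null expectation of the KL term shrinks roughly like (K-1)/(2 N_nonNA), while the current threshold log(N_nonNA) keeps growing.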
[figure: empirical null distributions of KL divergences (blue) vs. the log(N_nonNA) threshold (red)]

In this figure the blue distributions are empirical distributions of 10K KL divergences for random subsampling (the null hypothesis), and the red line is log(N_nonNA) (grey number on the right), along with its empirical p-value.

The threshold should be defined as a p-value against the null (what can we expect from the null distribution, i.e. if the data were really missing at random?), probably relative to either I(X;Y) or H(X,Y): if I(X;Y) is already very low, it may be a good idea to be very strict about the value of KL.
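A sketch of how such an empirical p-value could be computed by random subsampling, reusing kl_subsample_vs_full from the sketch above (the number of draws and the add-one p-value convention are arbitrary choices, not a specification):

```python
import numpy as np

def kl_null_pvalue(x, y, z_observed, n_draws=10_000, seed=None):
    """Empirical p-value of the observed KL against a random-subsampling null.

    Under the null (Z missing completely at random), subsamples of the same
    size drawn uniformly at random should give comparable KL values.
    """
    rng = np.random.default_rng(seed)
    n, n_non_na = len(x), int(np.sum(z_observed))
    kl_obs = kl_subsample_vs_full(x, y, z_observed)   # observed statistic
    kl_null = np.empty(n_draws)
    for b in range(n_draws):
        mask = np.zeros(n, dtype=bool)
        mask[rng.choice(n, size=n_non_na, replace=False)] = True   # random subsample, same size
        kl_null[b] = kl_subsample_vs_full(x, y, mask)
    # add-one empirical p-value: fraction of null draws at least as large as the observed KL
    pval = (1 + np.sum(kl_null >= kl_obs)) / (1 + n_draws)
    return kl_obs, kl_null, pval
```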

It may be considered a special case of the two-sample test (though many two-sample tests require the two samples to be independent, which is not the case here since one sample is a subset of the other).
