[Proposal] Automatic Algorithm Recommendation #16

Open
kunwuz opened this issue Apr 6, 2025 · 0 comments
Labels: enhancement (New feature or request)

kunwuz commented Apr 6, 2025

Candidate 1: Automatically recommend methods based on the data analysis.

To avoid additional effort/choices on the user side, we may consider just the following factors:

Data types

  • Continuous
  • Discrete
  • Mixed

Whether a unique DAG is required, or some undetermined edge directions are acceptable

  • If some undirected edges are acceptable, suggest PC, FCI, or GES.

Missing values

  • If there are missing values, use MV-PC

Sample size & number of variables

This is mainly for KCI and Generalized Score

  • If < 3000 samples, use KCI and Generalized Score
  • If >= 3000 samples, use fastKCI or RCIT (PR link) to replace KCI
    • For Generalized Score, currently there is no scaled-up version, so we could just suggest GRaSP or BOSS
  • In general, if we have very large datasets, say >100 variables, suggest GRaSP or BOSS
    • They are both score-based methods, which can be combined with different score functions
    • They are scalable
  • We could always suggest GRaSP or BOSS as a faster-running option.

Whether the data are IID

  • If not, use VAR-LiNGAM, CD-NOD, or Granger causality
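The factor-to-algorithm mapping above could be sketched as a simple rule function. This is a minimal sketch only: the function name, signature, and return format are hypothetical, while the thresholds (3000 samples, 100 variables) and algorithm names come from the bullets above. The data-type factor is omitted because the bullets do not yet map it to specific algorithms.

```python
def recommend_algorithms(allow_undirected, has_missing,
                         n_samples, n_vars, is_iid):
    """Hypothetical rule-based recommender following the factors above."""
    if not is_iid:
        # Non-IID data: time-series / heterogeneous-data methods.
        return ["VAR-LiNGAM", "CD-NOD", "Granger causality"]
    if has_missing:
        return ["MV-PC"]

    suggestions = []
    if allow_undirected:
        # Undetermined edge directions are acceptable.
        suggestions += ["PC", "FCI", "GES"]
    if n_samples < 3000:
        suggestions += ["KCI", "Generalized Score"]
    else:
        # Scaled-up replacements for KCI; no scaled-up Generalized
        # Score exists, so fall back to GRaSP or BOSS.
        suggestions += ["fastKCI", "RCIT", "GRaSP", "BOSS"]
    if n_vars > 100:
        # Score-based and scalable.
        suggestions += ["GRaSP", "BOSS"]
    # Deduplicate while preserving order.
    return list(dict.fromkeys(suggestions))
```

A real implementation would infer `n_samples`, `n_vars`, data types, and missingness directly from the loaded dataset rather than taking them as arguments.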

We do not consider assumptions on the data distribution for this tab, since doing so may raise unnecessary concerns.

For example, we may recommend PC with FisherZ as the top choice by default, given that linear methods are usually good at balancing between accuracy and complexity.

We do not want users to feel that parametric methods are unreliable. If we explicitly require the algorithm to match the distribution, users may always choose nonparametric methods, such as PC with KCI or GES with the Generalized Score, which usually scale poorly.


Candidate 2: Recommendation based on questions

The recommendation is based on the two flowcharts (attached below), turned into three questions asked in order:

  1. Are there hidden variables?
  2. Can we treat discrete variables as continuous?
    • If not, recommend methods that work only for discrete data
  3. What do you believe the data distribution is?
    • Follow the flowcharts and recommend based on the answer (e.g., linear Gaussian, linear non-Gaussian, etc.)

Also, give our recommendation based on LLM analysis of the data:

“It seems that you are working on … data. Together with your previous answers, we recommend …”

Give a list of algorithms, and add notes on them

  • E.g., KCI tests may take a long time, consider fastKCI or RCIT if needed...
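The question flow plus per-algorithm notes could be sketched as below. The answer-to-algorithm mapping here is illustrative only (the actual flowcharts would define the real mapping); the KCI note is the example given above, and the function name and signature are hypothetical.

```python
# Advisory notes attached to recommendations; only the KCI example
# from above is filled in, others would be added similarly.
NOTES = {
    "PC with KCI": "KCI tests may take a long time; "
                   "consider fastKCI or RCIT if needed.",
}

def recommend_by_questions(hidden_vars, treat_discrete_as_continuous, belief):
    """Hypothetical recommender for the three questions asked in order."""
    if hidden_vars:
        # FCI is designed for settings with latent confounders.
        algos = ["FCI"]
    elif not treat_discrete_as_continuous:
        # Placeholder for methods that work only on discrete data.
        algos = ["discrete-data methods"]
    elif belief == "linear non-Gaussian":
        algos = ["LiNGAM-family methods"]
    else:
        # Default top choice per the proposal text.
        algos = ["PC with FisherZ"]
    # Pair each algorithm with its advisory note, if any.
    return [(a, NOTES.get(a, "")) for a in algos]
```

The final output to the user would interleave these pairs with the LLM-generated analysis of the data described above.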

(Two flowchart images attached.)

@v-shaal v-shaal self-assigned this Apr 6, 2025
@MantejGill MantejGill added the enhancement New feature or request label Apr 6, 2025