Brainstorming: API for meta-data/side-information on the features #480

@nignatiadis

This follows a discussion on Slack and @ablaom's suggestion to open an issue to brainstorm ideas.

The problem
Consider a machine learning model with p-dimensional features X. Assume that for each feature j the analyst has access to external information Z(j), and that machine learning models could potentially use this information to improve predictive performance. A concrete example is a categorical Z(j), which partitions the features into groups of related features.
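
To make the categorical case concrete, here is a minimal sketch (in Python, purely for illustration; it is not a proposal for the MLJ API) of how one label per feature induces a partition of the feature indices. The function name `partition_features` and the labels `"gene"` and `"clinical"` are hypothetical.

```python
from collections import defaultdict


def partition_features(z):
    """Group feature indices j by their categorical side information z[j]."""
    groups = defaultdict(list)
    for j, label in enumerate(z):
        groups[label].append(j)
    return dict(groups)


# Hypothetical side information Z(j) for p = 5 features.
z = ["gene", "gene", "clinical", "gene", "clinical"]
groups = partition_features(z)
print(groups)
```

Any model that wants to exploit the grouping only needs this mapping from group label to feature indices, alongside X.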

How could the MLJ API account for that?

Example use cases for grouping structure

  • Such information is quite frequently used in the context of supervised methods with linear predictors, with the most common method being the (sparse) Group Lasso. This is also the use case I am interested in.
  • It comes up less often in the supervised non-linear setting, but one exception is https://link.springer.com/article/10.1186/s12859-017-1993-1, where the authors study Random Forests: the probability that a feature is included in the candidate set for a split is modulated by the group it belongs to (so features from some groups are prioritized).
  • Grouping structure comes up frequently in the unsupervised setting; for example, https://www.embopress.org/doi/full/10.15252/msb.20178124 describes a method that is popular in the computational biology community. In the machine learning community such problems sometimes appear under the name multi-view learning (e.g., https://arxiv.org/pdf/1604.04939.pdf).
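
For the Group Lasso use case in the first bullet, the sketch below shows where the grouping enters the computation. It assumes the common formulation in which the penalty is lam * sum over groups g of sqrt(p_g) * ||beta_g||_2, with p_g the group size, and it implements the matching proximal step (block soft-thresholding). The function names and the toy groups are hypothetical, and the code is plain Python for illustration rather than anything taken from MLJ.

```python
import math


def group_lasso_penalty(beta, groups, lam):
    """lam * sum_g sqrt(p_g) * ||beta_g||_2, where p_g is the size of group g."""
    total = 0.0
    for idx in groups.values():
        norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        total += math.sqrt(len(idx)) * norm
    return lam * total


def group_soft_threshold(beta, groups, threshold):
    """Proximal operator of the penalty above: shrink each group as a block."""
    out = list(beta)
    for idx in groups.values():
        norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        t = threshold * math.sqrt(len(idx))
        # A group whose norm is at most t is set to zero as a whole.
        scale = max(0.0, 1.0 - t / norm) if norm > 0 else 0.0
        for j in idx:
            out[j] = scale * beta[j]
    return out


# Hypothetical grouping of three coefficients into two groups.
groups = {"a": [0, 1], "b": [2]}
beta = [3.0, 4.0, 0.5]
penalty = group_lasso_penalty(beta, groups, lam=1.0)
shrunk = group_soft_threshold(beta, groups, threshold=1.0)
```

Both functions take the grouping as an argument separate from the coefficients, which is the kind of per-feature side information the model would need to receive in addition to X and y.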
