Skip to content

Subset Selection Integration #542

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 35 commits into from
Apr 14, 2025

Conversation

eshwarprasadS
Copy link
Contributor

@eshwarprasadS eshwarprasadS commented Feb 4, 2025

Minimal Implementation of Subset Selection

This resolves #541

Important Incoming Changes

  • Added h5py>=3.12.1 to requirements
  • Added submodlib source to requirements
  • Added subset_selection.py
  • Added encoders dir housing bge and arctic encoders
  • Added subset_selection_utils.py under src/instructlab/sdg/utils/
  • Support for snowflake-arctic-embed-l-v2.0 enabled

@mergify mergify bot added ci-failure dependencies Pull requests that update a dependency file labels Feb 4, 2025
Signed-off-by: eshwarprasadS <[email protected]>
@mergify mergify bot added ci-failure and removed ci-failure labels Feb 4, 2025
@mergify mergify bot removed the ci-failure label Feb 4, 2025
@shivchander shivchander self-requested a review February 5, 2025 20:01
requirements.txt Outdated
httpx>=0.25.0,<1.0.0
instructlab-schema>=0.4.0
jinja2>=3.0.0
langchain-text-splitters
openai>=1.13.3,<2.0.0
sentencepiece>=0.2.0
# Note: this dependency has to be built from source
submodlib @ git+https://github.com/decile-team/submodlib.git
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what we were doing previously with GPTDolomite and Dolomite Engine. We will likely need to move this to live in its own repo as we did for GPTDolomite so we can do builds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have contacted the authors and we expect submodlib to have a public release soon, so, I think we can fall back to moving it into its own repo, if that does not go through?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RobotSail is it necessary to move it to its own repo? We're already resorting to forking over some code from another repo in this PR, so I'd hesitate to bring code in from another source that we're also committing to maintain

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khaledsulayman We would likely maintain it to the same extent that we do for GPTDolomite -- i.e., it's a temporary workaround that we can get rid of once there is a more official and steady package to depend on. Otherwise, if that never happens, we can always just copy the new code over into the repo if we wanted to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyPI does not accept wheels that have a dependency on a package with a link. You will not be able to upload SDG wheels with this change.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are planning to add a check to the downstream builds that would prevent a dependency with a git URL like this from building, because we can't ensure that it is safe to pull code from an arbitrary git branch. Please work with the submodlib authors to produce a real release of that code so we do not have to express the dependency using a git URL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this been resolved and can we replace this dependency with a real release yet? If not, is there progress on this in that upstream repository?

Signed-off-by: eshwarprasadS <[email protected]>
@abhi1092
Copy link
Member

abhi1092 commented Feb 7, 2025

@eshwarprasadS can we also add a functional test? Something like this.

#!/usr/bin/env python3
from instructlab.sdg.subset_selection import subset_datasets


if __name__ == "__main__":
    dataset_files = ["<dataset_path>"]
    subset_config = {
    "instruction": "conversation",
    "query_description": "conversation",
    "templates": {
      "message": "{% for msg in messages if msg.role != 'system' %}{{ msg.role }}: {{ msg.content }}\n{% endfor %}",
        "conversation": "{% for conv in conversation %}{{ conv.from }}: {{ conv.value }}\n{% endfor %}",
    },
    "batch_size": 100000,
    "num_folds": 25
    ,
    "subset_sizes": [0.97],
    "seed": 42,
    "template_name": "text",
    "combine_files": False,
    "encoder_type": "bge",
    "encoder_model": "BAAI/bge-base-en"
  }
    subset_datasets(input_files=dataset_files, **subset_config)

Also, are you planning on adding the artic snowflake model?

@eshwarprasadS
Copy link
Contributor Author

@eshwarprasadS can we also add a functional test? Something like this.

#!/usr/bin/env python3
from instructlab.sdg.subset_selection import subset_datasets


if __name__ == "__main__":
    dataset_files = ["<dataset_path>"]
    subset_config = {
    "instruction": "conversation",
    "query_description": "conversation",
    "templates": {
      "message": "{% for msg in messages if msg.role != 'system' %}{{ msg.role }}: {{ msg.content }}\n{% endfor %}",
        "conversation": "{% for conv in conversation %}{{ conv.from }}: {{ conv.value }}\n{% endfor %}",
    },
    "batch_size": 100000,
    "num_folds": 25
    ,
    "subset_sizes": [0.97],
    "seed": 42,
    "template_name": "text",
    "combine_files": False,
    "encoder_type": "bge",
    "encoder_model": "BAAI/bge-base-en"
  }
    subset_datasets(input_files=dataset_files, **subset_config)

Also, are you planning on adding the artic snowflake model?

Thanks @abhi1092 for the comments. Yes, I am planning to add a host of unit and functional tests in the coming week for the feature, thanks for the suggestion.

And yes, I am planning to incorporate the arctic-snowflake encoder class in encoders.py and change default encoder to that one. I seem to have gotten hold of the implementation for that from the author, so we should have it in here soon.

Copy link
Member

@RobotSail RobotSail left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll need to go through in further detail but here are some comments so far

@eshwarprasadS
Copy link
Contributor Author

I'll need to go through in further detail but here are some comments so far

Thanks a lot for looking at this! Good parts of this implementation was adapted directly from the upstream repository, so this really helps to have extra eyes on it.

@mergify mergify bot added the one-approval label Feb 12, 2025
@mergify mergify bot added the ci-failure label Feb 13, 2025
Signed-off-by: eshwarprasadS <[email protected]>
@mergify mergify bot added ci-failure and removed ci-failure labels Mar 6, 2025
@bbrowning
Copy link
Contributor

Note that submodlib does not build properly - see the e2e-medium job logs, with a snippet copied here:

  writing manifest file '/home/tmp/pip-modern-metadata-ljrr7qrp/submodlib_py.egg-info/SOURCES.txt'
  creating '/home/tmp/pip-modern-metadata-ljrr7qrp/submodlib_py-0.0.1.dist-info'
  Preparing metadata (pyproject.toml): finished with status 'done'
  WARNING: Generating metadata for package submodlib produced metadata for project name submodlib-py. Fix your #egg=submodlib fragments.
Discarding git+https://github.com/decile-team/submodlib.git: Requested submodlib-py from git+https://github.com/decile-team/submodlib.git (from instructlab-sdg==0.7.1.dev124) has inconsistent name: expected 'submodlib', but metadata has 'submodlib-py'

...

ERROR: Could not find a version that satisfies the requirement submodlib (unavailable) (from instructlab-sdg) (from versions: none)
ERROR: No matching distribution found for submodlib (unavailable)

…t for subset sizes parameter

Signed-off-by: Oleg Silkin <[email protected]>
@mergify mergify bot added ci-failure and removed ci-failure labels Apr 10, 2025
@mergify mergify bot added ci-failure and removed ci-failure labels Apr 10, 2025
Signed-off-by: eshwarprasadS <[email protected]>
@mergify mergify bot added ci-failure and removed ci-failure labels Apr 11, 2025
@bbrowning
Copy link
Contributor

I briefly looked at the failing Mac tests, and it looks like submodlib didn't release anything that works with Macs. Their only release is binary-only and I only see wheels that will work for some x86_64 linux systems. I'm not sure how they created or uploaded this release, but it doesn't appear to be done in the typical way where pypi can build from source for systems that don't have precompiled wheels. We either need to change this back to building from source OR adjust this requirement to only pull in submodlib for linux systems.

@eshwarprasadS
Copy link
Contributor Author

I briefly looked at the failing Mac tests, and it looks like submodlib didn't release anything that works with Macs. Their only release is binary-only and I only see wheels that will work for some x86_64 linux systems. I'm not sure how they created or uploaded this release, but it doesn't appear to be done in the typical way where pypi can build from source for systems that don't have precompiled wheels. We either need to change this back to building from source OR adjust this requirement to only pull in submodlib for linux systems.

Yes @bbrowning, I just realized that the failures are localized to MacOS. I was thinking along the lines of the python versions, since we had an issue with the wheels only working for 3.10 last week.

@mergify mergify bot added ci-failure and removed ci-failure labels Apr 12, 2025
Subset selection only works on Linux and CUDA devices for now, so
don't add it as a requirement except on Linux machines.

Signed-off-by: Ben Browning <[email protected]>
@mergify mergify bot added ci-failure and removed ci-failure labels Apr 14, 2025
@bbrowning
Copy link
Contributor

I adjusted the submodlib requirement to only get pulled in for Linux, since that's the only OS its pypi binary is built for. And, I marked the functional test that uses it as one that requires GPU, since the test uses cuda stuff directly. That got CI green, so things should be good to merge if the others involved are comfortable with this.

Copy link
Member

@aakankshaduggal aakankshaduggal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, lgtm!

@mergify mergify bot merged commit b5e2ae6 into instructlab:main Apr 14, 2025
28 checks passed
@mergify mergify bot removed the one-approval label Apr 14, 2025
@eshwarprasadS
Copy link
Contributor Author

Thanks @bbrowning for adjusting the requirements. Thanks everyone for the reviews along the way as well.

@RobotSail
Copy link
Member

Congrats on getting this in @eshwarprasadS !!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Pull requests that update a dependency file testing Relates to testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Minimal Integration of Subset Selection into SDG
9 participants