-
Notifications
You must be signed in to change notification settings - Fork 116
Fix HuggingFace dataset column name handling for invalid identifiers #1226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Co-authored-by: ivan <[email protected]>
Co-authored-by: ivan <[email protected]>
Reviewer's GuideEnhance HuggingFace dataset integration to normalize invalid Python identifier column names by passing original_names to the data model generator and reinforce with comprehensive unit and functional tests. Class diagram for handling HuggingFace dataset column names with invalid Python identifiersclassDiagram
class read_hf {
+dict_to_data_model(model_name, output, original_names)
}
class dict_to_data_model {
+normalize_col_names()
+validation_alias=AliasChoices()
}
class _feature_to_chain_type {
+dict_to_data_model(name, sequence_dict, original_names)
}
read_hf --> dict_to_data_model : uses
_feature_to_chain_type --> dict_to_data_model : uses
dict_to_data_model --> normalize_col_names : uses
dict_to_data_model --> AliasChoices : uses
Flow diagram for column name normalization and mapping in HuggingFace datasetsflowchart TD
A[User loads HuggingFace dataset] --> B[Extract original column names]
B --> C[Normalize column names to valid Python identifiers]
C --> D[Pass original and normalized names to dict_to_data_model]
D --> E[Create data model with alias mapping]
E --> F[User accesses data using normalized names]
F --> G[Original data preserved and accessible]
File-Level Changes
Assessment against linked issues
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
for more information, see https://pre-commit.ci
Deploying datachain-documentation with
|
Latest commit: |
3a9f5f8
|
Status: | ✅ Deploy successful! |
Preview URL: | https://b0583011.datachain-documentation.pages.dev |
Branch Preview URL: | https://cursor-connect-and-proceed-w-rpfu.datachain-documentation.pages.dev |
- Updated unit tests to expect original column names rather than normalized names - Fixed mock paths in all tests to use correct import path - Updated assertions to match actual behavior where original field names are preserved - All unit and functional tests now pass
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1226 +/- ##
=======================================
Coverage 88.66% 88.66%
=======================================
Files 152 152
Lines 13606 13608 +2
Branches 1893 1893
=======================================
+ Hits 12064 12066 +2
Misses 1095 1095
Partials 447 447
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Closing in favor of #1241 |
Fix HuggingFace dataset column name handling for invalid identifiers
🐛 Issue Fixed
Fixes #1204 - Reading from huggingface dataset fails with
KeyError
when column names contain invalid Python identifiers.📋 Problem Description
Some datasets on the HuggingFace Hub have column names that are not valid Python identifiers (e.g., containing
?
,-
, spaces, or starting with numbers). When usingdc.read_hf()
with such datasets, the function would fail with aKeyError
because the system was trying to access these invalid column names directly as Python attribute names.Example that failed before this fix:
🔧 Solution
The fix leverages the existing
normalize_col_names()
function anddict_to_data_model()
support for handling invalid Python identifiers by passing theoriginal_names
parameter.Changes Made:
src/datachain/lib/dc/hf.py
- Main fix inread_hf()
function:dict_to_data_model()
functionsrc/datachain/lib/hf.py
- Fix in_feature_to_chain_type()
function:Comprehensive test coverage - Added tests in both unit and functional test files
🧪 Testing
Unit Tests (
tests/unit/lib/test_hf.py
):test_hf_invalid_column_names()
- Tests basic functionality with invalid column namestest_hf_invalid_column_names_with_read_hf()
- Tests theread_hf()
function directlytest_hf_sequence_dict_with_invalid_names()
- Tests nested dictionary features with invalid namesoriginal_names
parameterFunctional Tests (
tests/func/test_hf_invalid_column_names.py
):test_hf_invalid_column_names_functional()
- Comprehensive test with various invalid column name patternstest_toxigen_dataset_simulation()
- Simulates the exact issue from the GitHub issueTest Coverage:
?
,-
,.
,/
)📊 Column Name Transformations
The fix automatically transforms invalid column names to valid Python identifiers:
factual?
→factual_
user-name
→user_name
123column
→c0_123column
has spaces
→has_spaces
with.dots
→with_dots
with/slashes
→with_slashes
✅ Benefits
🔍 How It Works
The
dict_to_data_model()
function has built-in support for handling invalid Python identifiers:normalize_col_names()
to create valid Python identifiersvalidation_alias=AliasChoices()
to map between normalized names and original names📁 Files Modified
src/datachain/lib/dc/hf.py
- Main fix implementationsrc/datachain/lib/hf.py
- Nested dictionary fixtests/unit/lib/test_hf.py
- Unit teststests/func/test_hf_invalid_column_names.py
- Functional testsBUGFIX_SUMMARY.md
- Detailed documentation🧪 Manual Testing
The fix can be verified with:
📚 References
test_hf.py
Summary by Sourcery
Fix HuggingFace dataset column name handling for invalid Python identifiers by passing original column names through model generation
Bug Fixes:
Documentation:
Tests: