@khaeru
Owner

@khaeru khaeru commented Sep 24, 2025

Building on #243:

  1. Allow passing to_pandas(..., labels="both") directly, instead of wrapping in FormatOptions.
  2. Simplify .convert.pandas.
  3. Improve BaseDataStructureDefinition.make_key(): if a dimension is enumerated (by a Codelist), replace a string code ID with a reference to the actual Code.
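For readers following item 1, here is a minimal sketch of how a keyword shortcut like `labels="both"` could be normalized into an options object. The names `Labels`, `FormatOptions`, and `resolve_options` are hypothetical stand-ins for illustration only; they are not the library's actual classes or signatures.

```python
from dataclasses import dataclass
from enum import Enum


class Labels(Enum):
    """Hypothetical stand-in for the three 'labels' values named in this PR."""

    id = "id"
    both = "both"
    name = "name"


@dataclass
class FormatOptions:
    """Hypothetical options container; not the library's actual class."""

    labels: Labels = Labels.id


def resolve_options(labels=None, format_options=None):
    """Accept either a bare labels= value or a prepared options object."""
    if labels is not None and format_options is not None:
        raise ValueError("Pass labels= or format_options=, not both")
    if format_options is not None:
        return format_options
    if labels is None:
        return FormatOptions()
    # Labels("both") looks the member up by value; a Labels member passes through.
    return FormatOptions(labels=Labels(labels))


print(resolve_options(labels="both"))
```

Rejecting the combination of both arguments is one possible design choice; it avoids having to decide which one wins.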

PR checklist

  • Checks all ✅
  • Update documentation
  • Update doc/whatsnew.rst

@khaeru khaeru self-assigned this Sep 24, 2025
@khaeru khaeru added the bug, enh (Enhancements & new features), xml (SDMX-ML format), and reader (Read file formats defined by the SDMX standards) labels Sep 24, 2025
khaeru added a commit that referenced this pull request Sep 24, 2025
@khaeru khaeru force-pushed the enh/to_pandas-labels branch from e469e36 to 10f3358 Compare September 24, 2025 15:11
@khaeru
Owner Author

khaeru commented Sep 24, 2025

@aboddie would you mind giving this branch a try?

Some context: in #243 I did most of the 'grunt work' I referred to in #242 (comment) and the comment before that.

In this PR I've specifically targeted your snippet:

import sdmx

dm = sdmx.Client("ESTAT").data(
    "UNE_RT_A", key={"geo": "EL+ES+IE"}, params={"startPeriod": "2014"},
)
data = sdmx.to_pandas(dm, labels="both")  # Note "id", "both", "name" per the standard
print(data.head(5))

This revealed some further issues:

  • The specific usage of key= here triggers a pre-request for the DSD, in order to construct/validate the key for the actual data request.
  • This DSD is passed on when the data message is read.
  • However, BaseDataStructureDefinition.make_key() was not making complete use of this information. For example, a key/value pair like geo="EL" from the message was stored as Python str ("EL") instead of as a reference to the Code (EL: Greece) from the codelist for the "geo" dimension—even though this latter was already available (attached to the DSD).

So I've corrected this issue. Now, when the data message/data set is read, those Code references (technically, CodedKeyValue) are established right away. Thus, when to_pandas()/PandasConverter receives them, it only needs to format them correctly, and doesn't need to traverse the DSD itself to look up the codes. I see now this is what your code in #242 was doing; but I think I prefer this fix because it generates "more correct" data structures at the moment of reading the message.
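The idea of resolving code IDs at read time can be sketched with a toy model. Everything here (`Code`, `Dimension`, and this `make_key`) is a simplified, hypothetical stand-in and not the library's implementation; it only illustrates replacing a string ID with a reference to the code when the dimension is enumerated.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Code:
    """Toy code with an ID and a human-readable name."""

    id: str
    name: str


@dataclass
class Dimension:
    """Toy dimension; codelist is None when the dimension is not enumerated."""

    id: str
    codelist: Optional[Dict[str, Code]] = None


def make_key(dimensions, **values):
    """Return {dimension ID: Code or str}, resolving IDs where a codelist exists."""
    by_id = {d.id: d for d in dimensions}
    key = {}
    for dim_id, value in values.items():
        dim = by_id[dim_id]  # KeyError for an unknown dimension
        if dim.codelist is not None and isinstance(value, str):
            # Enumerated dimension: store a reference to the Code itself.
            key[dim_id] = dim.codelist[value]
        else:
            # Non-enumerated dimension (e.g. time): keep the raw value.
            key[dim_id] = value
    return key


geo = Dimension("geo", {"EL": Code("EL", "Greece"), "ES": Code("ES", "Spain")})
time_period = Dimension("TIME_PERIOD")
key = make_key([geo, time_period], geo="EL", TIME_PERIOD="2014")
print(key["geo"].name)  # Greece
print(key["TIME_PERIOD"])  # 2014
```

With the reference stored in the key, a converter only has to format `key["geo"]`; it no longer needs access to the structure definition to look up the name.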

The above snippet now gives:

freq  Time frequency  age     Age class            unit    Unit of measure                               sex  Sex      geo  Geopolitical entity (reporting)  TIME_PERIOD  Time
A     Annual          Y15-24  From 15 to 24 years  PC_ACT  Percentage of population in the labour force  F    Females  EL   Greece                           2014         2014    58.5
                                                                                                                                                             2015         2015    54.8
                                                                                                                                                             2016         2016    52.1
                                                                                                                                                             2017         2017    49.0
                                                                                                                                                             2018         2018    45.4
Name: value, dtype: float64

This is a bit different from your example, which aligns more with labels="name" per the SDMX-CSV 2.0.0 standard. I can try to add that in a later PR or maybe this one, but in the meantime, if you can try out the branch and report whether it gives roughly the behaviour you expect, that would be much appreciated.

khaeru added a commit that referenced this pull request Sep 24, 2025
@khaeru khaeru force-pushed the enh/to_pandas-labels branch from 10f3358 to 9878118 Compare September 24, 2025 15:57
khaeru added a commit that referenced this pull request Sep 24, 2025
@khaeru khaeru force-pushed the enh/to_pandas-labels branch from 9878118 to 0aedcf8 Compare September 24, 2025 16:05
- Rely only on .format_options.
- Use base/abstract .csv.common.CSVFormatOptions to indicate "no
  particular CSV format".
- Add ._strict bool attribute.
Replace repeated code with function calls.
khaeru added a commit that referenced this pull request Sep 24, 2025
@khaeru khaeru force-pushed the enh/to_pandas-labels branch from 0aedcf8 to 3d12dff Compare September 24, 2025 16:14
@codecov

codecov bot commented Sep 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.25%. Comparing base (07683cc) to head (1eca925).
⚠️ Report is 15 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #244      +/-   ##
==========================================
- Coverage   99.02%   98.25%   -0.77%     
==========================================
  Files         113      114       +1     
  Lines        9297     9326      +29     
==========================================
- Hits         9206     9163      -43     
- Misses         91      163      +72     
Files with missing lines Coverage Δ
sdmx/convert/pandas.py 99.70% <100.00%> (-0.04%) ⬇️
sdmx/format/csv/common.py 100.00% <100.00%> (ø)
sdmx/format/csv/v1.py 100.00% <100.00%> (ø)
sdmx/model/common.py 99.71% <100.00%> (+<0.01%) ⬆️
sdmx/model/internationalstring.py 100.00% <100.00%> (ø)
sdmx/tests/convert/test_pandas.py 100.00% <100.00%> (ø)
sdmx/tests/format/test_csv.py 100.00% <100.00%> (ø)
sdmx/tests/format/test_xml.py 100.00% <ø> (ø)
sdmx/tests/reader/test_csv.py 100.00% <ø> (ø)
sdmx/tests/reader/test_json.py 100.00% <ø> (ø)
... and 7 more

... and 20 files with indirect coverage changes


@aboddie

aboddie commented Sep 25, 2025

Took a brief look at the code, agree this looks like a better approach. Some thoughts:

  1. My understanding, although maybe I misread, is that your output above is actually labels=name. labels=both should put the ID and name in the same column, i.e. "A: Annual", so this line doesn't look right: (csv_v2, Labels.both): [KeyValueID, KeyValueName].
  2. Might want to go on and include formatting options for keys (none, obs, series, and both) for CSV 2.0 even if they are not implemented for now.
  3. The measure(s) should also have headers depending on the label format.

I can do an in-depth test in two weeks after the global conference, if you want.

@khaeru
Owner Author

khaeru commented Sep 25, 2025

  1. My understanding, although maybe I misread, is that your output above is actually labels=name. labels=both should put the ID and name in the same column, i.e. "A: Annual", so this line doesn't look right: (csv_v2, Labels.both): [KeyValueID, KeyValueName].

You're right! I misread and thought that labels=both was different in SDMX CSV 1.0 ("ID: Name") versus 2.x ("ID", "Name" in 2 columns). The latter is indeed labels=name as you say, and the former is the same across versions.
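The distinction agreed on here can be written down as a small sketch. `Labels` and `format_value` are hypothetical names for illustration, not the converter's actual code; the behaviour follows the description in this thread: labels=both gives "ID: Name" in one column, and labels=name gives the ID and the name in two columns.

```python
from enum import Enum, auto


class Labels(Enum):
    """Hypothetical stand-in for the 'labels' options discussed in this thread."""

    id = auto()
    both = auto()
    name = auto()


def format_value(code_id, name, labels):
    """Return the list of cells produced for one key value."""
    if labels is Labels.id:
        return [code_id]
    if labels is Labels.both:
        # One column holding "ID: Name".
        return [f"{code_id}: {name}"]
    if labels is Labels.name:
        # Two columns: the ID, then the name.
        return [code_id, name]
    raise ValueError(f"Unsupported labels option: {labels!r}")


print(format_value("A", "Annual", Labels.both))  # ['A: Annual']
print(format_value("A", "Annual", Labels.name))  # ['A', 'Annual']
```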

Thanks for that brief feedback—it's easy to make these small oversights when doing bigger refactoring as in #243. I'll expand the PR to address these points, merge, and then release. There are several other improvements on deck that I'd like to get out the door.

If there are further bugs found, those can be fixed in a point release.

@khaeru
Owner Author

khaeru commented Sep 26, 2025

  2. Might want to go on and include formatting options for keys (none, obs, series, and both) for CSV 2.0 even if they are not implemented for now.

These are already on main per the last PR:

class Keys(Enum):
    """SDMX-CSV 2.x 'keys' parameter."""

    #: No related columns.
    none = auto()
    #: Both :attr:`obs` and :attr:`series`.
    both = auto()
    #: Include ``OBS_KEY`` column with key values for all dimension(s).
    obs = auto()
    #: Include ``SERIES_KEY`` column with key values for all dimension(s) *except* the
    #: one(s) attached to each observation.
    series = auto()

But indeed I can (a) mention in the docs that only keys=none is currently supported, (b) validate, and (c) test these. Will do this.
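A minimal sketch of what step (b) might look like, assuming a standalone helper. `validate_keys` is a hypothetical name and the enum is repeated only to keep the example self-contained; the actual validation in the library may be placed and named differently.

```python
from enum import Enum, auto


class Keys(Enum):
    """SDMX-CSV 2.x 'keys' parameter (repeated here for a self-contained example)."""

    none = auto()
    both = auto()
    obs = auto()
    series = auto()


def validate_keys(value):
    """Coerce a member name to Keys and reject options that are not supported yet."""
    keys = value if isinstance(value, Keys) else Keys[value]
    if keys is not Keys.none:
        raise NotImplementedError(
            f"keys={keys.name} is not supported; only keys=none is"
        )
    return keys


print(validate_keys("none"))  # Keys.none
```

Raising `NotImplementedError` rather than silently ignoring the option keeps the enum complete while making the current limitation explicit to callers.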

  3. The measure(s) should also have headers depending on the label format.

This is something I am sure differs between SDMX-CSV v1.0 and v2.x. In the former, it is only ever "OBS_VALUE", even if labels=both or the primary measure has an ID other than "OBS_VALUE". See:

So I'll have to put in logic that does this only for SDMX-CSV 2.x.
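The version-dependent rule could be sketched roughly as below. This is a hedged illustration: `measure_header` and its version-string argument are hypothetical, and the 2.x branch assumes the header follows the same "ID: Name" pattern as key values for labels=both, which this thread does not spell out. Other label options are deliberately left out of the sketch.

```python
def measure_header(csv_version, labels, measure_id, measure_name):
    """Return the column header for a measure (illustrative only).

    csv_version: a string such as "1.0" or "2.0.0" (assumed form).
    labels: "id" or "both"; other options are not covered by this sketch.
    """
    if csv_version.startswith("1."):
        # SDMX-CSV 1.0: always "OBS_VALUE", whatever the labels option
        # or the ID of the primary measure.
        return "OBS_VALUE"
    if labels == "id":
        return measure_id
    if labels == "both":
        # Assumption: same "ID: Name" pattern as for key values.
        return f"{measure_id}: {measure_name}"
    raise ValueError(f"labels={labels!r} is not covered by this sketch")


print(measure_header("1.0", "both", "VAL", "Value"))
print(measure_header("2.0.0", "both", "VAL", "Value"))
```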

khaeru added a commit that referenced this pull request Sep 30, 2025
@khaeru khaeru force-pushed the enh/to_pandas-labels branch from 3d12dff to 72fc370 Compare September 30, 2025 19:45
khaeru added a commit that referenced this pull request Sep 30, 2025
@khaeru khaeru force-pushed the enh/to_pandas-labels branch from 72fc370 to b974f21 Compare September 30, 2025 19:55
- Update docs.
@khaeru khaeru force-pushed the enh/to_pandas-labels branch from b974f21 to 1eca925 Compare September 30, 2025 20:10
@khaeru khaeru merged commit 59f003d into main Sep 30, 2025
20 checks passed
@khaeru khaeru deleted the enh/to_pandas-labels branch September 30, 2025 20:24
Labels

bug, enh (Enhancements & new features), reader (Read file formats defined by the SDMX standards), xml (SDMX-ML format)

3 participants