Give Pandas writer option to write labels in place of codes #242

aboddie · 2025-09-08T20:54:55Z

The goal is to provide support for labels in the simple case where the data message includes the structure.

I added a labels parameter to write_dataset. If it is True, write_dataset checks whether a DSD is present (similar to _maybe_convert_datetime).

I would like to get feedback on the general approach before updating doc, checking against additional sources, error handling, etc.

Example usage:

import sdmx

estat = sdmx.Client("ESTAT") 

dm = estat.data(
    "UNE_RT_A",
    key={"geo": "EL+ES+IE"},
    params={"startPeriod": "2014"},
)
data = sdmx.to_pandas(dm, labels=True)
print(data.head(5))

output:

Time frequency  Age class            Unit of measure                               Sex      Geopolitical entity (reporting)  Time
Annual          From 15 to 24 years  Percentage of population in the labour force  Females  Greece                           2014    58.5
                                                                                                                             2015    54.8
                                                                                                                             2016    52.1
                                                                                                                             2017    49.0
                                                                                                                             2018    45.4
Name: value, dtype: float64
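
For readers who want to see the idea in isolation, here is a minimal sketch in plain pandas of swapping IDs for names on an index. It is not the code in this PR: the `CODE_NAMES` and `DIM_NAMES` lookups and the `relabel` helper are hypothetical stand-ins for what a DSD and its codelists would provide.

```python
import pandas as pd

# Hypothetical code -> name lookups, standing in for the codelists a DSD supplies.
CODE_NAMES = {
    "freq": {"A": "Annual"},
    "sex": {"F": "Females", "M": "Males"},
    "geo": {"EL": "Greece", "ES": "Spain"},
}
# Hypothetical dimension ID -> name lookups, standing in for concept names.
DIM_NAMES = {
    "freq": "Time frequency",
    "sex": "Sex",
    "geo": "Geopolitical entity (reporting)",
}


def relabel(series: pd.Series) -> pd.Series:
    """Return a copy of `series` whose index uses names instead of IDs."""
    frame = series.index.to_frame(index=False)
    for dim, names in CODE_NAMES.items():
        if dim in frame.columns:
            # Codes without a known name are left unchanged.
            frame[dim] = frame[dim].map(lambda code: names.get(code, code))
    # Dimensions without a known name (here TIME_PERIOD) keep their ID.
    frame = frame.rename(columns=DIM_NAMES)
    result = series.copy()
    result.index = pd.MultiIndex.from_frame(frame)
    return result


codes = pd.MultiIndex.from_tuples(
    [("A", "F", "EL", "2014"), ("A", "F", "EL", "2015")],
    names=["freq", "sex", "geo", "TIME_PERIOD"],
)
coded = pd.Series([58.5, 54.8], index=codes, name="value")
labelled = relabel(coded)
print(labelled)
```

Falling back to the ID when no name is known keeps the conversion lossless for partially described structures; whether the real writer should do that or raise is an open design question.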


codecov · 2025-09-08T20:56:52Z

Codecov Report

❌ Patch coverage is 8.33333% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.98%. Comparing base (59a90ec) to head (168a6ba).
⚠️ Report is 45 commits behind head on main.

Files with missing lines	Patch %	Lines
sdmx/writer/pandas.py	8.33%	11 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #242      +/-   ##
==========================================
- Coverage   98.91%   97.98%   -0.93%     
==========================================
  Files         105      105              
  Lines        8910     8922      +12     
==========================================
- Hits         8813     8742      -71     
- Misses         97      180      +83

Files with missing lines	Coverage Δ
sdmx/writer/pandas.py	`89.38% <8.33%> (-4.18%)`	⬇️

... and 20 files with indirect coverage changes


khaeru · 2025-09-09T17:20:29Z

Howdy, thanks—this is a welcome direction, but I am not sure what is the right implementation.

A few bits of info about the implementation:

1. The submodules writer.csv, writer.xml, and writer.pandas are slightly awkward in how they're grouped:
   i. writer.csv and writer.xml correspond to the official/standard formats SDMX-CSV and SDMX-ML. We could think of a future writer.json (Add writer for SDMX-JSON #35) the same way.
   ii. On the other hand, there is no "standard" Pandas representation of SDMX objects. The standards are totally language-agnostic. So writer.pandas is not exactly "writing SDMX"; it's converting SDMX to Pandas objects.
2. writer.pandas also includes a kind of eclectic mix of stuff:
   i. For data sets, writer.csv and writer.pandas produce 2-dimensional tables of very similar, but slightly different, layout.
   ii. For everything else (StructureMessage and its contents), there is no official SDMX-CSV representation, and the objects returned by writer.pandas are kind of arbitrary, chosen by contributors to pandaSDMX or previous.
3. Some changes that I've mulled but (sorry) had not yet written down, that might make this less awkward, are like so:
   i. Use common code to convert DataSet objects → pandas.DataFrame, and then pandas.DataFrame → CSV. Thus if the user wants Pandas objects, only the first stage of conversion is done; if they want SDMX-CSV, then both.
      This would:
      - Absorb any existing capabilities of writer.pandas (with respect to data sets) into .writer.csv
      - Prefer the option names (like "labels=True") set out in the SDMX-CSV standards. (See here in the docs where it says "The two optional parameters are exactly as described in the specification." I would probably expand this to a dataclass that includes all the SDMX-CSV format options/variants described in the standards.)
      - Avoid that there are two chunks of code with similar behaviour but slightly different output format.
      - Maybe offload the actual serialization of pd.DataFrame → str, bytes, or file to pandas or even PyArrow, which could be more performant than the code currently in writer.csv.
   ii. Move the remaining bits of writer.pandas, dealing with non-standard items like data sets, to somewhere else (not sure where).

That being said, I don't want to force 3(i) on you, since it would be a bunch of work. So I think it is worth expanding this PR with tests (using specimens, instead of network calls), docs, etc., and then using that as a stepping stone to that larger improvement.
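
The two-stage conversion described in 3(i) can be sketched roughly as follows. This is an illustration only, not existing sdmx code: `CSVOptions`, `to_frame`, and `to_csv` are hypothetical names, the rows and name lookups are made up, and the `id: name` rendering for labels="both" is this sketch's assumption about the SDMX-CSV convention.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class CSVOptions:
    """Hypothetical container for SDMX-CSV style format options."""

    labels: str = "id"  # one of "id", "name", "both"


def _render(code: str, name: str, labels: str) -> str:
    """Render one code according to the `labels` option."""
    if labels == "id":
        return code
    if labels == "name":
        return name
    if labels == "both":
        return f"{code}: {name}"  # assumed rendering, see lead-in
    raise ValueError(f"unsupported labels option: {labels!r}")


def to_frame(rows, names, options: CSVOptions) -> pd.DataFrame:
    """Stage 1: observations -> DataFrame. Pandas users stop here."""
    frame = pd.DataFrame(rows)
    for column, lookup in names.items():
        frame[column] = [
            _render(code, lookup.get(code, code), options.labels)
            for code in frame[column]
        ]
    return frame


def to_csv(rows, names, options: CSVOptions) -> str:
    """Stage 2: reuse stage 1, then let pandas do the serialization."""
    return to_frame(rows, names, options).to_csv(index=False)


rows = [{"FREQ": "A", "GEO": "EL", "OBS_VALUE": 58.5}]
names = {"FREQ": {"A": "Annual"}, "GEO": {"EL": "Greece"}}
print(to_csv(rows, names, CSVOptions(labels="both")))
```

Because both outputs pass through `to_frame`, an option such as `labels` is implemented once and cannot drift between the Pandas and CSV results.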

aboddie · 2025-09-10T13:24:26Z

I can certainly support labels with the values id (the default), name (an addition to what your docs mention, but it is in the standard), and both. I can align my implementation with the CSV output and remove the "not implemented" restriction on labels from the CSV writer.

What I'm not sure about is how you'll best support multiple versions of the writers for example 1.0.0 and 2.0.0 for SDMX-CSV formats. I think it would be important to have some idea around this before fully trying to do 3(i).

khaeru · 2025-09-12T20:34:29Z

> What I'm not sure about is how you'll best support multiple versions of the writers for example 1.0.0 and 2.0.0 for SDMX-CSV formats. I think it would be important to have some idea around this before fully trying to do 3(i).

The reader.xml implementation is probably the most sophisticated in terms of support for multiple versions of a given format (SDMX-ML 2.1 and 3.0). The general way this is accomplished is:

1. There are submodules reader.xml.v21 and reader.xml.v30.
2. The code in reader.xml.v21 handles SDMX-ML v2.1. In some cases (where the two versions of the format differ slightly, but not by much), it handles both versions (i.e. the same implementations defined in this file are used in .reader.xml.v30 without any override).
3. reader.xml.v30 contains code for (a) things that are in SDMX-ML 3.0 but not present at all in 2.1, and (b) overrides for things that are very different from SDMX-ML 2.1, such that having a dual-purpose function in .xml.v21 would be needlessly complex.

So the idea would be to have a roughly similar approach for other reader and writer submodules, whether for SDMX-CSV, JSON, or other.
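
As a rough illustration of that layering (not the actual reader.xml code, whose internals are not shown in this thread), the pattern can be sketched with plain class inheritance. The `ReaderV21` and `ReaderV30` classes, the `READERS` lookup, and the dict-shaped inputs are all hypothetical.

```python
class ReaderV21:
    """Handlers for version 2.1; anything not overridden also serves 3.0."""

    version = "2.1"

    def read_name(self, element: dict) -> str:
        # Identical in both versions, so it is defined only once, here.
        return element["name"]

    def read_codelist_ref(self, element: dict) -> str:
        # In this toy format, 2.1 stores a bare codelist ID.
        return element["codelist"]

    def read(self, element: dict) -> dict:
        return {
            "name": self.read_name(element),
            "codelist": self.read_codelist_ref(element),
        }


class ReaderV30(ReaderV21):
    """Only the parts that differ substantially from 2.1 are overridden."""

    version = "3.0"

    def read_codelist_ref(self, element: dict) -> str:
        # In this toy format, 3.0 stores a URN; keep the part after "=".
        return element["codelist_urn"].rsplit("=", 1)[-1]


# Pick an implementation by format version.
READERS = {cls.version: cls for cls in (ReaderV21, ReaderV30)}

v21 = READERS["2.1"]().read({"name": "Frequency", "codelist": "CL_FREQ"})
v30 = READERS["3.0"]().read(
    {
        "name": "Frequency",
        "codelist_urn": "urn:sdmx:org.sdmx.infomodel.codelist.Codelist=ESTAT:CL_FREQ(1.0)",
    }
)
print(v21, v30)
```

The same shape would carry over to SDMX-CSV 1.0.0 and 2.0.0: shared behaviour lives in the older version's module, and the newer one adds or overrides only what differs.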

However, again, I feel bad that this architecture is not thoroughly documented, such that I have to type this explanation here instead of linking to some existing developer docs. Please, if you don't mind, I'll try to start a separate branch that creates a natural (and hopefully easier to understand) place for this feature to land. Then we can incorporate the changes from your branch here.

In the meantime, if you could go ahead with creating tests/specimens that express the behaviour you have in mind for this new feature, we can incorporate those in the new branch along with the code you've already written.

aboddie and others added 22 commits March 14, 2025 14:21

Update sources.json

2f777d7

Update sources.rst

de3be47

Update sources.yaml

8c36a41

Update test_sources.py

ad7ad0d

Update sources.json

ed98bdd

Update test_sources.py

9d20a1e

Update sources.json

4689142

Update sources.json

85d2f44

Update sources.rst

d9e2c6f

Update test_sources.py

eeb0edb

Update test_sources.py

a36ac7e

Add tests for other artifacts

11df860

Add Commas

93fc67d

Update whatsnew.rst

83bdd60

Add .source.imf_data{,3}

a828bef

- Add headers for all requests. - Drop header entries for these sources in sources.json.

Update documentation and tests for imf_data and imf_data3

b67b4f6

Remove unused import from sdmx/source/imf_data

6f925fb

Fix formating in doc/sources.rst

c463d78

Fix handling of <str:Codelist>URN… in SDMX-ML 3.0

ab8ee6b

Copyedit IMF_DATA{,3} documentation

c95ed03

Merge branch 'khaeru:main' into main

338ab50

Add labels parameter when writing a dataset to pandas

168a6ba

aboddie temporarily deployed to publish September 8, 2025 20:55 — with GitHub Actions Inactive

aboddie marked this pull request as draft September 8, 2025 20:55

This was referenced Sep 22, 2025

Convert/write SDMX-CSV 2.x #243

Merged

Refine handling of labels="both" #244

Merged

aboddie closed this Oct 7, 2025
