@aboddie aboddie commented Sep 8, 2025

The goal is to provide support for labels in the simple case where the data message includes the structure.

I added a labels parameter to write_dataset. If it is True, write_dataset checks whether a DSD is present (similar to _maybe_convert_datetime).

I would like feedback on the general approach before updating the docs, checking against additional sources, adding error handling, etc.

Example usage:

import sdmx

estat = sdmx.Client("ESTAT") 

dm = estat.data(
    "UNE_RT_A",
    key={"geo": "EL+ES+IE"},
    params={"startPeriod": "2014"},
)
data = sdmx.to_pandas(dm, labels=True)
print(data.head(5))

output:

Time frequency  Age class            Unit of measure                               Sex      Geopolitical entity (reporting)  Time
Annual          From 15 to 24 years  Percentage of population in the labour force  Females  Greece                           2014    58.5
                                                                                                                             2015    54.8
                                                                                                                             2016    52.1
                                                                                                                             2017    49.0
                                                                                                                             2018    45.4
Name: value, dtype: float64
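
For readers wondering what the relabelling amounts to, here is a minimal pandas-only sketch. It does not use sdmx and is not the code in this PR: the `names` and `codes` dictionaries are hypothetical stand-ins for what would be looked up in the DSD, and `relabel` is an illustrative helper. When no lookup is available it returns the IDs untouched, which mirrors the "check whether a DSD is present" behaviour described above.

```python
import pandas as pd

# Hypothetical lookups standing in for information read from the DSD.
names = {"freq": "Time frequency", "geo": "Geopolitical entity (reporting)"}
codes = {"freq": {"A": "Annual"}, "geo": {"EL": "Greece", "ES": "Spain"}}


def relabel(series, names=None, codes=None):
    """Swap IDs in a MultiIndex for names; keep the ID where no name is known."""
    if names is None or codes is None:
        return series  # no structure available: leave the IDs as they are

    frame = series.index.to_frame(index=False)
    for column in frame.columns:
        mapping = codes.get(column, {})
        # Unknown values (e.g. time periods) fall through unchanged.
        frame[column] = frame[column].map(lambda v: mapping.get(v, v))
    frame = frame.rename(columns=names)

    return pd.Series(
        series.to_numpy(),
        index=pd.MultiIndex.from_frame(frame),
        name=series.name,
    )


index = pd.MultiIndex.from_tuples(
    [("A", "EL", "2014"), ("A", "ES", "2014")],
    names=["freq", "geo", "TIME_PERIOD"],
)
data = pd.Series([58.5, 20.1], index=index, name="value")
print(relabel(data, names, codes))
```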

@aboddie aboddie marked this pull request as draft September 8, 2025 20:55

codecov bot commented Sep 8, 2025

Codecov Report

❌ Patch coverage is 8.33333% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.98%. Comparing base (59a90ec) to head (168a6ba).
⚠️ Report is 45 commits behind head on main.

Files with missing lines Patch % Lines
sdmx/writer/pandas.py 8.33% 11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #242      +/-   ##
==========================================
- Coverage   98.91%   97.98%   -0.93%     
==========================================
  Files         105      105              
  Lines        8910     8922      +12     
==========================================
- Hits         8813     8742      -71     
- Misses         97      180      +83     
Files with missing lines Coverage Δ
sdmx/writer/pandas.py 89.38% <8.33%> (-4.18%) ⬇️

... and 20 files with indirect coverage changes

Owner

khaeru commented Sep 9, 2025

Howdy, thanks. This is a welcome direction, but I am not sure what the right implementation is.

A few bits of info about the implementation:

  1. The submodules writer.csv, writer.xml, and writer.pandas are slightly awkward in how they're grouped:
    1. writer.csv and writer.xml correspond to the official/standard formats SDMX-CSV and SDMX-ML. We could think of a future writer.json (Add writer for SDMX-JSON #35) the same way.
    2. On the other hand, there is no "standard" Pandas representation of SDMX objects. The standards are totally language-agnostic. So writer.pandas is not exactly "writing SDMX"; it's converting SDMX to Pandas objects.
  2. writer.pandas also includes a kind of eclectic mix of stuff:
    1. For data sets, writer.csv and writer.pandas produce 2-dimensional tables of very similar, but slightly different, layout.
    2. For everything else (StructureMessage and its contents), there is no official SDMX-CSV representation, and the objects returned by writer.pandas are kind of arbitrary, chosen by contributors to pandaSDMX or previous.
  3. Some changes that I've mulled but (sorry) had not yet written down, and that might make this less awkward, are as follows:
    1. Use common code to convert DataSet objects → pandas.DataFrame, and then pandas.DataFrame → CSV. Thus if the user wants Pandas objects, only the first stage of conversion is done; if they want SDMX-CSV, then both.
      This would:
      • Absorb any existing capabilities of writer.pandas (with respect to data sets) into .writer.csv
      • Prefer the option names (like "labels=True") set out in the SDMX-CSV standards. (See here in the docs where it says "The two optional parameters are exactly as described in the specification." I would probably expand this to a dataclass that includes all the SDMX-CSV format options/variants described in the standards.)
      • Avoid that there are two chunks of code with similar behaviour but slightly different output format.
      • Maybe offload the actual serialization of pd.DataFrame → str, bytes, or file to pandas or even PyArrow, which could be more performant than the code currently in writer.csv.
    2. Move the remaining bits of writer.pandas, dealing with non-standard conversions of items other than data sets, to somewhere else; not sure where.
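
The two-stage idea in 3(i) can be sketched roughly as follows. This is only an illustration under stated assumptions, not existing sdmx code: `CSVOptions`, `to_frame`, and `to_csv` are hypothetical names, the input is a plain list of dicts rather than a DataSet, and only the `labels` option is modelled (a fuller dataclass would carry the other SDMX-CSV options too).

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class CSVOptions:
    """Hypothetical container for SDMX-CSV style options."""

    labels: str = "id"  # one of "id", "name", "both"


def to_frame(observations, options):
    """Stage 1: records -> pandas.DataFrame. Pandas users stop here."""
    if options.labels not in {"id", "name", "both"}:
        raise ValueError(f"unsupported labels={options.labels!r}")
    # A real implementation would walk a DataSet and apply options.labels;
    # this sketch only validates the option and builds the table.
    return pd.DataFrame(observations)


def to_csv(observations, options):
    """Stage 2: reuse stage 1, then let pandas do the serialization."""
    return to_frame(observations, options).to_csv(index=False)


observations = [
    {"FREQ": "A", "GEO": "EL", "TIME_PERIOD": "2014", "OBS_VALUE": 58.5},
    {"FREQ": "A", "GEO": "EL", "TIME_PERIOD": "2015", "OBS_VALUE": 54.8},
]
print(to_frame(observations, CSVOptions()))
print(to_csv(observations, CSVOptions()))
```

Because `to_csv` calls `to_frame`, there is a single code path that decides the table layout, which is the point of the proposal.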

That being said, I don't want to force 3(i) on you, since it would be a bunch of work. So I think it is worth expanding this PR with tests (using specimens, instead of network calls), docs, etc., and then using that as a stepping stone to that larger improvement.

Author

aboddie commented Sep 10, 2025

I can certainly have labels accept the values id (the default), name (an addition to what is mentioned in your docs, but it is in the standard), and both. I can align my implementation with the output of the CSV writer and remove the "not implemented" restriction on labels there.
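
A small sketch of the three modes, to make the intended behaviour concrete. `render_label` is a hypothetical helper, not part of sdmx, and the "ID: Name" form used for both is an assumption about the SDMX-CSV convention that should be checked against the specification.

```python
def render_label(item_id, name, labels="id"):
    """Return the text shown for one code or column header."""
    if labels == "id":
        return item_id
    if labels == "name":
        return name
    if labels == "both":
        # Assumed rendering; verify against the SDMX-CSV specification.
        return f"{item_id}: {name}"
    raise ValueError(f"labels must be 'id', 'name' or 'both', got {labels!r}")


for mode in ("id", "name", "both"):
    print(mode, "->", render_label("A", "Annual", labels=mode))
```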

What I'm not sure about is how best to support multiple versions of the writers, for example 1.0.0 and 2.0.0 of the SDMX-CSV format. I think it would be important to have some idea around this before fully trying to do 3(i).

Owner

khaeru commented Sep 12, 2025

> What I'm not sure about is how you'll best support multiple versions of the writers for example 1.0.0 and 2.0.0 for SDMX-CSV formats. I think it would be important to have some idea around this before fully trying to do 3(i).

The reader.xml implementation is probably the most sophisticated in terms of support for multiple versions of a given format (SDMX-ML 2.1 and 3.0). The general way this is accomplished is:

  • There are submodules reader.xml.v21 and reader.xml.v30.
  • The code in reader.xml.v21 handles SDMX-ML v2.1. In some cases (where the two versions of the format differ slightly, but not by much), it handles both versions (i.e. the same implementations defined in this file are used in .reader.xml.v30 without any override).
  • reader.xml.v30 contains code for (a) things that are in SDMX-ML 3.0 but not present at all in 2.1, and (b) overrides for things that are very different from SDMX-ML 2.1, such that having a dual-purpose function in .xml.v21 would be needlessly complex.

So the idea would be to have a roughly similar approach for other reader and writer submodules, whether for SDMX-CSV, JSON, or other.
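
As a rough picture of that pattern, here is a generic inheritance sketch. It is not how reader.xml is actually written: the class and method names (`ReaderV1`, `ReaderV2`, `read_shared`, `read_changed`, `read_new`) are hypothetical, and the "formats" are plain dicts. It only shows the three cases: shared code reused as-is, an override where versions differ a lot, and code that exists only in the newer version.

```python
class ReaderV1:
    """Hypothetical reader for the older format version."""

    version = "1.0.0"

    def read_shared(self, raw):
        # Same in both versions, so the newer reader never overrides it.
        return {"id": raw["id"], "version": self.version}

    def read_changed(self, raw):
        return {"id": raw["id"]}


class ReaderV2(ReaderV1):
    """Hypothetical reader for the newer version; reuses V1 where possible."""

    version = "2.0.0"

    def read_changed(self, raw):
        # Override: assume, for illustration, that the newer version
        # carries an extra field here.
        result = super().read_changed(raw)
        result["extra"] = raw.get("extra")
        return result

    def read_new(self, raw):
        # Only exists in the newer version.
        return {"new": raw["new"]}


READERS = {cls.version: cls for cls in (ReaderV1, ReaderV2)}


def get_reader(version):
    """Pick the reader class for a format version and instantiate it."""
    return READERS[version]()


reader = get_reader("2.0.0")
print(reader.read_shared({"id": "X"}))
print(reader.read_changed({"id": "X", "extra": 1}))
```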

However, again, I feel bad that this architecture is not thoroughly documented, such that I have to type this explanation here instead of linking to some existing developer docs. Please, if you don't mind, I'll try to start a separate branch that creates a natural (and hopefully easier to understand) place for this feature to land. Then we can incorporate the changes from your branch here.

In the meantime, if you could go ahead with creating tests/specimens that express the behaviour you have in mind for this new feature, we can incorporate those in the new branch along with the code you've already written.

@aboddie aboddie closed this Oct 7, 2025