Optimize performance of _process_statistics_timeseries #1277

Open
lewisblake opened this issue Jul 24, 2024 · 10 comments

@lewisblake
Member

lewisblake commented Jul 24, 2024

Describe the bug

There is a performance bottleneck in the computation of the regional timeseries. The culprit is _process_statistics_timeseries in coldatatojson_helpers.py.

  • Pyaerocom version: All
  • Computing platform: All

To Reproduce
Steps to reproduce the behavior:

  1. Create a test of the aforementioned function.
  2. Profile the test (see the sketch below).
  3. Optimize performance.
  4. Profile again.
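
A profiling sketch for step 2 (the test path is the one named later in this thread; cProfile and pstats are from the standard library):

import cProfile
import pstats

import pytest

# run the targeted test under cProfile, dump the stats to a file, and
# print the 20 hottest calls by cumulative time
cProfile.run(
    'pytest.main(["tests/aeroval/test_coldatatojson_helpers2.py::test__process_statistics_timeseries"])',
    "profile.out",
)
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)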

Expected behavior
We hope to significantly reduce the runtime of this function.

Additional context
Pretty much all projects would benefit from this.

@lewisblake lewisblake added the bug 🐛 Something isn't working label Jul 24, 2024
@lewisblake lewisblake added this to the m2024-09 milestone Jul 24, 2024
@lewisblake lewisblake self-assigned this Jul 24, 2024
@lewisblake lewisblake added the CAMS2_83 Issues related to the CAMS2_83 contract label Jul 24, 2024
@lewisblake
Member Author

Results from profiling one instance of the test:
[attached profiling graph: combined]

calc_statistics takes only about 11% of the time, whereas load_region_mask_xr takes about 60%. A lot of this is in built-in numpy methods, but ideally we still wouldn't be calling a function that loads a dataset inside a for-loop.
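
A sketch of the kind of fix, assuming the mask loader is pure per region (the import path and wrapper name here are assumptions, not the actual pyaerocom API):

from functools import lru_cache

from pyaerocom.helpers_landsea_masks import load_region_mask_xr


@lru_cache(maxsize=None)
def get_region_mask(region_id: str):
    # memoize the mask so the dataset is read from disk once per region
    # instead of once per loop iteration
    return load_region_mask_xr(region_id)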

@heikoklein
Member

Has this test been run with a long time-series with more than one year of hourly data (8760 timesteps at least)? We don't see a bottleneck for short timeseries.

@heikoklein
Member

It might also be the writing of the ts-json files.
The processing of regional statistics of timeseries for a 3-year dataset of hourly data takes ~46 min and writes out 1686 json files (one per station), totalling 5.3 GB.

@lewisblake
Member Author

> Has this test been run with a long time-series with more than one year of hourly data (8760 timesteps at least)? We don't see a bottleneck for short timeseries.

No, this was just a quick first look at profiling tests/aeroval/test_coldatatojson_helpers2.py::test__process_statistics_timeseries

> It might also be the writing of the ts-json files.
> The processing of regional statistics of timeseries for a 3-year dataset of hourly data takes ~46 min and writes out 1686 json files (one per station), totalling 5.3 GB.

Agreed, but I'd need to take a deeper look into the logs.

@heikoklein
Member

A short IO test with one of the output files:

import timeit

from pyaerocom.aeroval.json_utils import read_json, write_json

# file with 46156617 bytes size
data = read_json("ALL-EEA-UTD-concno2-Surface.json")

i = 0


def write():
    # write to a fresh file on every call so nothing is cached
    global i
    i += 1
    write_json(
        data_dict=data, file_path=f"out{i}.json", ignore_nan=True, round_floats=True
    )


number = 5
elapsed = timeit.timeit("write()", globals=globals(), number=number)

print(
    f"write of {number} files took {elapsed}s -> {number*46156617/1024/1024/elapsed} MB/s"
)

Running on the same machine and queue, I get an output speed of:

  • 2.8MB/s when round_floats is True
  • 49.0MB/s when round_floats is False

We are writing 7.7 GB to /lustre/storeB/users/charlien/cams2-83-tests/data/cams2-83/test-forecast-long-hourlystats/hm/ts/. At 2.8 MB/s this takes 2816 s; at 49.0 MB/s it is reduced to 160 s.

The data format of this file does not allow for efficient numpy-based processing (dict of dicts). The data in .../test-forecast-long-hourlystats/ts/ is 5.3 GB and is written much faster, since all processing there is numpy based (dict of numpy arrays), even though it still uses float conversion.

Questions to @AugustinMortier :

  1. Can we drop rounding of this file? (short term)
  2. Can we change the file format? Current:
{
    "concno2": {
        "EEA-UTD": {
            "Surface": {
                "ENSEMBLE": {
                    "concno2": {
                        "Germany": {
                            "1622507400000": {
                                "totnum": 245.0,
                                "weighted": 0.0,
                                "num_valid": 0.0,
                                "refdata_mean": null,
                                "refdata_std": null,
                                "data_mean": null,

to

{
    "concno2": {
        "EEA-UTD": {
            "Surface": {
                "ENSEMBLE": {
                    "concno2": {
                        "Germany": {
                            "timesteps": [1622507400000, ...],
                            "totnum": [245.0, ...],
                            "weighted": [0.0, ...],
                            "num_valid": [0.0, ...],
                            "refdata_mean": [null, ...],
                            "refdata_std": [null, ...],
                            "data_mean": [null, ...],

This would allow us to process the statistics for all timesteps in a single numpy operation (using axis=X) rather than processing timesteps one after another.
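
A sketch of what the new layout buys us (values made up): with array-valued fields, one numpy call over the station axis yields the statistic for every timestep at once.

import numpy as np

# hypothetical raw values with shape (timesteps, stations)
vals = np.array([[245.0, np.nan, 250.0],
                 [240.0, 238.0, np.nan]])

# one call per statistic, covering all timesteps; the results map directly
# onto the proposed array-valued JSON fields
data_mean = np.nanmean(vals, axis=1)
num_valid = np.isfinite(vals).sum(axis=1).astype(float)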

@heikoklein
Member

Why are stats already rounded by default?

result[name] = round(stat(data, ref_data, weights), 10)

(Introduced as a fix for a FAIRMODE rounding issue? @thorbjoernl Also, this is Python rounding of numpy values; why not np.around, avoiding the numpy-to-float conversion?)

Currently, we are rounding twice.
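
For comparison, a small sketch of the numpy-side alternative (np.around is an alias of np.round):

import numpy as np

stats = np.array([0.12345678901234, 0.98765432109876])

# np.round is element-wise and keeps numpy types ...
rounded = np.round(stats, 10)

# ... while the builtin round() handles one scalar at a time and returns
# a plain Python float
single = round(float(stats[0]), 10)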

@thorbjoernl
Collaborator

It fixed an issue where R values were out of bounds, which caused a test to fail: #1142

There is probably a better way to fix this.
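
One possibly better fix, sketched here, is to clamp only the correlation coefficient into its valid range instead of rounding every statistic:

import numpy as np

r = 1.0000000000000002  # just outside [-1, 1] due to floating point error
r = float(np.clip(r, -1.0, 1.0))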

@heikoklein
Member

Next step might be to vectorize statistics:

import numpy as np

data = np.random.random((10000, 100))


def flattenmean():
    # current approach: one numpy call per (flattened) row
    for i in np.arange(data.shape[0]):
        df = data[i, :].flatten()
        np.nanmean(df)


def axismean():
    # vectorized approach: a single numpy call over all rows
    np.nanmean(data, axis=1)
Currently we are working purely on the flattened arrays. If we manage to vectorize this as in the axismean version, it will be about 50x faster (just benchmarked).
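
For reference, the comparison can be reproduced with timeit on the two functions above (absolute numbers will vary by machine):

import timeit

print(timeit.timeit(flattenmean, number=10))
print(timeit.timeit(axismean, number=10))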

@lewisblake
Member Author

Part of the reason we compute the statistics on the flattened arrays is (or at least used to be) that the filtering is applied at the station time series level. So we would like a way, with numpy, to apply the 75% temporal coverage requirement to the array directly, or more generally some way to apply the filtering before computing the statistics. We should also check that we are not doing the temporal filtering twice.
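
A numpy sketch of applying the coverage requirement before the vectorized statistics (shapes, data, and threshold handling are assumptions):

import numpy as np

# fake data with shape (stations, timesteps), including some gappy stations
data = np.random.random((100, 8760))
data[:5, ::2] = np.nan  # five stations with only 50% temporal coverage

# per-station temporal coverage; blank out stations below the 75% requirement
coverage = np.isfinite(data).mean(axis=1)
data[coverage < 0.75, :] = np.nan

# vectorized per-timestep statistics over the surviving stations
means = np.nanmean(data, axis=0)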

@lewisblake
Member Author

Idea: pytest has a plugin called timeout. Once optimized, we can write a test that checks the code does not take more than a certain amount of time. If changes are introduced that cause the test to take longer, the test fails.
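
A minimal sketch, assuming the pytest-timeout plugin is installed (test name and limit are placeholders):

import pytest


@pytest.mark.timeout(60)  # fail if the test takes longer than 60 seconds
def test__process_statistics_timeseries_runtime():
    ...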

@thorbjoernl thorbjoernl self-assigned this Sep 6, 2024
@heikoklein heikoklein modified the milestones: m2024-09, m2024-10 Sep 10, 2024
@thorbjoernl thorbjoernl removed their assignment Sep 11, 2024
@heikoklein heikoklein modified the milestones: m2024-10, m2024-11 Sep 17, 2024
@lewisblake lewisblake added enhancement ✨ New feature or request and removed bug 🐛 Something isn't working labels Oct 28, 2024
@heikoklein heikoklein modified the milestones: m2024-11, m2025-01 Nov 8, 2024