Optimize performance of _process_statistics_timeseries #1277

Open
lewisblake opened this issue Jul 24, 2024 · 10 comments

@lewisblake
Member

lewisblake commented Jul 24, 2024

Describe the bug

There is a performance bottleneck in the computation of the regional timeseries. The culprit is _process_statistics_timeseries in coldatatojson_helpers.py.

  • Pyaerocom version: All
  • Computing platform: All

To Reproduce
Steps to reproduce the behavior:

  1. Create a test of the aforementioned function.
  2. Profile the test (see the sketch below).
  3. Optimize performance.
  4. Profile again.
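
A profiling sketch for step 2 (the test path is the one named later in this thread; cProfile and pstats are from the standard library):

import cProfile
import pstats

import pytest

# run the targeted test under cProfile, dump the stats to a file, and
# print the 20 hottest calls by cumulative time
cProfile.run(
    'pytest.main(["tests/aeroval/test_coldatatojson_helpers2.py::test__process_statistics_timeseries"])',
    "profile.out",
)
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)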

Expected behavior
We hope to significantly reduce the runtime of this function.

Additional context
Pretty much all projects would benefit from this.

@lewisblake lewisblake added the bug 🐛 Something isn't working label Jul 24, 2024
@lewisblake lewisblake added this to the m2024-09 milestone Jul 24, 2024
@lewisblake lewisblake self-assigned this Jul 24, 2024
@lewisblake lewisblake added the CAMS2_83 Issues related to the CAMS2_83 contract label Jul 24, 2024
@lewisblake
Member Author

Results from profiling one instance of the test:
[attached profiling graph: combined]

calc_statistics takes only about 11% of the time, whereas load_region_mask_xr takes about 60%. A lot of this is in built-in numpy methods, but ideally we still wouldn't be calling a function that loads a dataset inside a for-loop.
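
A sketch of the kind of fix, assuming the mask loader is pure per region (the import path and wrapper name here are assumptions, not the actual pyaerocom API):

from functools import lru_cache

from pyaerocom.helpers_landsea_masks import load_region_mask_xr


@lru_cache(maxsize=None)
def get_region_mask(region_id: str):
    # memoize the mask so the dataset is read from disk once per region
    # instead of once per loop iteration
    return load_region_mask_xr(region_id)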

@heikoklein
Member

Has this test been run with a long time-series with more than one year of hourly data (8760 timesteps at least)? We don't see a bottleneck for short timeseries.

@heikoklein
Member

It might also be the writing of the ts-json files.
The processing of regional statistics of timeseries for a 3-year dataset of hourly data takes ~46 min and writes out 1686 json files (one per station), totalling 5.3 GB.

@lewisblake
Member Author

> Has this test been run with a long time-series with more than one year of hourly data (8760 timesteps at least)? We don't see a bottleneck for short timeseries.

No, this was just a quick first look at profiling tests/aeroval/test_coldatatojson_helpers2.py::test__process_statistics_timeseries

> It might also be the writing of the ts-json files.
> The processing of regional statistics of timeseries for a 3-year dataset of hourly data takes ~46 min and writes out 1686 json files (one per station), totalling 5.3 GB.

Agreed, but I'd need to take a deeper look into the logs.

@heikoklein
Member

A short IO test with one of the output files:

import timeit

from pyaerocom.aeroval.json_utils import read_json, write_json

# file with 46156617 bytes size
data = read_json("ALL-EEA-UTD-concno2-Surface.json")

i = 0


def write():
    # write to a fresh file on every call so nothing is cached
    global i
    i += 1
    write_json(
        data_dict=data, file_path=f"out{i}.json", ignore_nan=True, round_floats=True
    )


number = 5
elapsed = timeit.timeit("write()", globals=globals(), number=number)

print(
    f"write of {number} files took {elapsed}s -> {number*46156617/1024/1024/elapsed} MB/s"
)

Running on the same machine and queue, I get an output speed of:

  • 2.8MB/s when round_floats is True
  • 49.0MB/s when round_floats is False

We are writing 7.7 GB to /lustre/storeB/users/charlien/cams2-83-tests/data/cams2-83/test-forecast-long-hourlystats/hm/ts/. At 2.8 MB/s this takes 2816 s; at 49.0 MB/s it is reduced to 160 s.

The data format of this file does not allow for efficient numpy-based processing (dict of dicts). The data in .../test-forecast-long-hourlystats/ts/ is 5.3 GB and is written much faster, since all processing there is numpy based (dict of numpy arrays), even though it still uses float conversion.

Questions to @AugustinMortier :

  1. Can we drop rounding of this file? (short term)
  2. Can we change the file format? Current:
{
    "concno2": {
        "EEA-UTD": {
            "Surface": {
                "ENSEMBLE": {
                    "concno2": {
                        "Germany": {
                            "1622507400000": {
                                "totnum": 245.0,
                                "weighted": 0.0,
                                "num_valid": 0.0,
                                "refdata_mean": null,
                                "refdata_std": null,
                                "data_mean": null,

to

{
    "concno2": {
        "EEA-UTD": {
            "Surface": {
                "ENSEMBLE": {
                    "concno2": {
                        "Germany": {
                            "timesteps": [1622507400000, ...],
                            "totnum": [245.0, ...],
                            "weighted": [0.0, ...],
                            "num_valid": [0.0, ...],
                            "refdata_mean": [null, ...],
                            "refdata_std": [null, ...],
                            "data_mean": [null, ...],

This would allow us to process the statistics for all timesteps in a single numpy operation (using axis=X) rather than processing timesteps one after another.
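
A sketch of what the new layout buys us (values made up): with array-valued fields, one numpy call over the station axis yields the statistic for every timestep at once.

import numpy as np

# hypothetical raw values with shape (timesteps, stations)
vals = np.array([[245.0, np.nan, 250.0],
                 [240.0, 238.0, np.nan]])

# one call per statistic, covering all timesteps; the results map directly
# onto the proposed array-valued JSON fields
data_mean = np.nanmean(vals, axis=1)
num_valid = np.isfinite(vals).sum(axis=1).astype(float)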

@heikoklein
Member

Why are stats already rounded by default?

result[name] = round(stat(data, ref_data, weights), 10)

(Introduced as a fix for a FAIRMODE rounding issue? @thorbjoernl Also, this is Python rounding of numpy values; why not np.around, avoiding the numpy-to-float conversion?)

Currently, we are rounding twice.
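
For comparison, a small sketch of the numpy-side alternative (np.around is an alias of np.round):

import numpy as np

stats = np.array([0.12345678901234, 0.98765432109876])

# np.round is element-wise and keeps numpy types ...
rounded = np.round(stats, 10)

# ... while the builtin round() handles one scalar at a time and returns
# a plain Python float
single = round(float(stats[0]), 10)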

@thorbjoernl
Collaborator

It fixed an issue where R values were out of bounds, which caused a test to fail: #1142

There is probably a better way to fix this.
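
One possibly better fix, sketched here, is to clamp only the correlation coefficient into its valid range instead of rounding every statistic:

import numpy as np

r = 1.0000000000000002  # just outside [-1, 1] due to floating point error
r = float(np.clip(r, -1.0, 1.0))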

@heikoklein
Member

Next step might be to vectorize statistics:

import numpy as np

data = np.random.random((10000, 100))


def flattenmean():
    # current approach: one numpy call per (flattened) row
    for i in np.arange(data.shape[0]):
        df = data[i, :].flatten()
        np.nanmean(df)


def axismean():
    # vectorized approach: a single numpy call over all rows
    np.nanmean(data, axis=1)
Currently we are working purely on the flattened arrays. If we manage to vectorize this as in the axismean version, it will be about 50x faster (just benchmarked).
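
For reference, the comparison can be reproduced with timeit on the two functions above (absolute numbers will vary by machine):

import timeit

print(timeit.timeit(flattenmean, number=10))
print(timeit.timeit(axismean, number=10))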

@lewisblake
Member Author

Part of the reason we compute the statistics on the flattened arrays is (or at least used to be) that the filtering is applied at the station time series level. So we would like a way, with numpy, to apply the 75% temporal coverage requirement to the array directly, or more generally some way to apply the filtering before computing the statistics. We should also check that we are not doing the temporal filtering twice.
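
A numpy sketch of applying the coverage requirement before the vectorized statistics (shapes, data, and threshold handling are assumptions):

import numpy as np

# fake data with shape (stations, timesteps), including some gappy stations
data = np.random.random((100, 8760))
data[:5, ::2] = np.nan  # five stations with only 50% temporal coverage

# per-station temporal coverage; blank out stations below the 75% requirement
coverage = np.isfinite(data).mean(axis=1)
data[coverage < 0.75, :] = np.nan

# vectorized per-timestep statistics over the surviving stations
means = np.nanmean(data, axis=0)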

@lewisblake
Member Author

Idea: pytest has a plugin called timeout. Once optimized, we can write a test that checks the code does not take more than a certain amount of time. If changes are introduced that cause the test to take longer, the test fails.
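
A minimal sketch, assuming the pytest-timeout plugin is installed (test name and limit are placeholders):

import pytest


@pytest.mark.timeout(60)  # fail if the test takes longer than 60 seconds
def test__process_statistics_timeseries_runtime():
    ...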

@thorbjoernl thorbjoernl self-assigned this Sep 6, 2024
@heikoklein heikoklein modified the milestones: m2024-09, m2024-10 Sep 10, 2024
@thorbjoernl thorbjoernl removed their assignment Sep 11, 2024
@heikoklein heikoklein modified the milestones: m2024-10, m2024-11 Sep 17, 2024
@lewisblake lewisblake added enhancement ✨ New feature or request and removed bug 🐛 Something isn't working labels Oct 28, 2024
@heikoklein heikoklein modified the milestones: m2024-11, m2025-01 Nov 8, 2024