Optimize performance of _process_statistics_timeseries #1277
Comments
Has this test been run with a long time-series with more than one year of hourly data (8760 timesteps at least)? We don't see a bottleneck for short timeseries.
It might also be the writing of the ts-json files.
No, this was just a quick first look at profiling.
Agreed, but I'd need to take a deeper look into the logs.
A short IO test with one of the output files:

Running on the same machine and queue, I get an output speed of:

We are writing 7.7 GB to `/lustre/storeB/users/charlien/cams2-83-tests/data/cams2-83/test-forecast-long-hourlystats/hm/ts/`; at 2.8 MB/s this takes 2816 s, at 49.0 MB/s it is reduced to 160 s.

The data format of these files does not allow for efficient numpy-based processing (dict of dicts). Questions to @AugustinMortier: could the data be changed from

```json
{
  "concno2": {
    "EEA-UTD": {
      "Surface": {
        "ENSEMBLE": {
          "concno2": {
            "Germany": {
              "1622507400000": {
                "totnum": 245.0,
                "weighted": 0.0,
                "num_valid": 0.0,
                "refdata_mean": null,
                "refdata_std": null,
                "data_mean": null,
```

to

```json
{
  "concno2": {
    "EEA-UTD": {
      "Surface": {
        "ENSEMBLE": {
          "concno2": {
            "Germany": {
              "timesteps": [1622507400000, ...],
              "totnum": [245.0, ...],
              "weighted": [0.0, ...],
              "num_valid": [0.0, ...],
              "refdata_mean": [null, ...],
              "refdata_std": [null, ...],
              "data_mean": [null, ...],
```

This would allow us to process the statistics for all timesteps in the same numpy operation (using …).
Why are stats already rounded by default?

pyaerocom/pyaerocom/stats/stats.py, line 83 in 8913284

(Introduced as a fix for the fairmode rounding issue? @thorbjoernl. And it is Python rounding of numpy values; why not np.around and avoid the numpy/float conversion?) Currently, we are rounding twice.
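For what it's worth, a small sketch of the alternative (the values and digit count are made up, not taken from the code referenced above): `np.around` rounds whole arrays while staying in numpy types, and an explicit clip would address out-of-range correlation values directly.

```python
import numpy as np

# Illustrative values only, not taken from pyaerocom.
stats = {"R": np.float64(1.0000000000000002), "nmb": np.float64(-0.03456789)}

# Current style (per the discussion above): Python round() on each value.
rounded_py = {k: round(v, 5) for k, v in stats.items()}

# np.around rounds a whole array in one call and keeps numpy dtypes.
values = np.array(list(stats.values()))
rounded_np = dict(zip(stats, np.around(values, 5)))

# Out-of-bounds R from floating-point noise could instead be clipped explicitly.
r_clipped = np.clip(stats["R"], -1.0, 1.0)
```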
It fixed an issue where R values were out of bounds, which caused a test to fail: #1142. There is probably a better way to fix this.
Next step might be to vectorize the statistics:

```python
import numpy as np

data = np.random.random((10000, 100))

def flattenmean():
    # current approach: one nanmean call per flattened row
    for i in np.arange(data.shape[0]):
        df = data[i, :].flatten()
        np.nanmean(df)

def axismean():
    # vectorized: a single nanmean call over the whole array
    np.nanmean(data, axis=1)
```

Currently we are working purely on the flattened arrays. If we manage to vectorize that as in the `axismean` version, it will be 50x faster (just benchmarked).
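For reference, the 50x figure can be checked with a quick `timeit` comparison (this assumes `data`, `flattenmean` and `axismean` as defined above; absolute timings are machine-dependent).

```python
import timeit

# Run each variant 10 times; only the relative numbers matter.
loop_time = timeit.timeit(flattenmean, number=10)
vectorized_time = timeit.timeit(axismean, number=10)
print(f"loop: {loop_time:.2f}s  vectorized: {vectorized_time:.2f}s  "
      f"speedup: {loop_time / vectorized_time:.0f}x")
```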
Part of the reason we compute the statistics on the flattened arrays is (or used to be) that the filtering is applied at the station-time-series level. So we would like a way, with numpy, to apply the 75% temporal-coverage requirement to the array directly, or really just some way to apply the filtering before computing the statistics. We should also check that we are not doing the temporal filtering twice.
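One possible numpy-only way to do that (just a sketch; the station-by-time layout and the toy data are assumptions, only the 75% threshold comes from the comment above): compute the valid fraction per station and blank out stations below the threshold before the vectorized statistics run.

```python
import numpy as np

# Toy data: rows are stations, columns are timesteps; NaN marks missing values.
rng = np.random.default_rng(0)
data = rng.random((5, 100))
data[data < 0.2] = np.nan

# Fraction of valid (non-NaN) timesteps per station.
coverage = np.isfinite(data).mean(axis=1)

# Apply the 75% temporal-coverage requirement by masking whole stations,
# so the vectorized statistics below ignore them.
filtered = np.where(coverage[:, None] >= 0.75, data, np.nan)

# Vectorized statistics over the remaining stations
# (fully masked stations come out as NaN, with a RuntimeWarning).
station_means = np.nanmean(filtered, axis=1)
```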
Idea: pytest has a plugin called pytest-timeout. Once optimized, we can write a test to check that the code does not take more than a certain amount of time. If changes are introduced which cause the test to take longer, the test fails.
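For example (this uses the pytest-timeout marker; the test body and the 60-second budget are placeholders, not a real pyaerocom test):

```python
import time

import pytest

# Requires the pytest-timeout plugin: pip install pytest-timeout
@pytest.mark.timeout(60)  # fail the test if it runs longer than 60 seconds
def test_regional_timeseries_runtime():
    # Placeholder for the real call, e.g. running the optimized
    # _process_statistics_timeseries on a representative dataset.
    time.sleep(0.1)
```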
Describe the bug
There is a performance bug in the computation of the regional timeseries. The culprit is `_process_statistics_timeseries` in `coldatatojson_helpers.py`.
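As a starting point, the bottleneck can be confirmed with the standard-library profiler (a generic sketch only; `run_experiment` is a placeholder for whatever drives `coldatatojson_helpers` in a given setup):

```python
import cProfile
import pstats

def run_experiment():
    # Placeholder: run the coldata-to-json processing here for a long
    # (>= 1 year of hourly data) time series.
    ...

cProfile.run("run_experiment()", "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats("_process_statistics_timeseries")
```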
To Reproduce
Steps to reproduce the behavior:
Expected behavior
We hope to significantly reduce the runtime of this function.
Additional context
Pretty much all projects would benefit from this.