Duplicate samples during inference due to different length assumptions in MultiStreamDataReader #1438

### What happened?

### Setup
Run inference on a model trained with $n_{fstep}$ forecast steps, with arguments such that $n_{samples} \cdot dt \geq t_{end} - t_{start}$,
where $n_{fstep}=$`--forecast_steps`, $n_{samples}=$`--samples`, $dt=$`--step_hours`, $t_{start}=$`--start`, $t_{end}=$`--end`.

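
To make the triggering condition concrete, here is a minimal sketch that evaluates $n_{samples} \cdot dt \geq t_{end} - t_{start}$ for a given set of arguments. The helper `may_trigger_duplicates` and the example dates are hypothetical and are not part of the code base or of any real run; the sketch also assumes `--step_hours` is a whole number of hours.

```python
from datetime import datetime, timedelta


def may_trigger_duplicates(
    n_samples: int, step_hours: int, start: datetime, end: datetime
) -> bool:
    """Return True if n_samples * dt >= t_end - t_start (the setup condition)."""
    dt = timedelta(hours=step_hours)
    return n_samples * dt >= end - start


# Hypothetical values for illustration only, not taken from a real run.
start = datetime(2020, 1, 1)
end = datetime(2020, 1, 11)  # 240 hours after start

print(may_trigger_duplicates(244, 1, start, end))  # 244 h requested over a 240 h range
print(may_trigger_duplicates(100, 1, start, end))  # 100 h requested over a 240 h range
```

When the function returns `True`, the requested samples span at least the whole time range, which is the situation described in this setup.
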
### Expected result:
No duplicate samples are generated.

### Actual result:
`n+1` duplicate samples are generated.

### Hedgedoc link to logs and more information. This ticket is public, do not attach files directly.

As an example, inference can be run on [w6khbe9g](https://dbc-080c3210-c159.cloud.databricks.com/ml/experiments/384213844828345/runs/047c6529805942668d28ed437cdc42c9?o=3342786197435636) using:

`uv run inference --samples 244` => will contain 4 duplicate samples

For the history of this issue and how to avoid triggering it, refer to #1085.

This code can be used to check whether the duplicate samples persist:
```python
from pathlib import Path
import numpy as np
from weathergen.common.io import ZarrIO

RUN_ID = "<MY_INFERENCE_RUN_ID>"
results = Path(f"results/{RUN_ID}/validation_chkpt00000_rank0000.zarr")

with ZarrIO(results) as zio:
  samples = zio.samples
  forecast_step = 0
  times = []
  for sample in samples:
    # one unique datetime per sample
    data = zio.get_data(sample, "ERA5", forecast_step).prediction.as_xarray().valid_time
    print("|", end="")
    times.append(np.unique(data).squeeze())

print()
times = np.array(times)
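unique_times, counts = np.unique(times, return_counts=True)
n_duplication = times.size - unique_times.size
print(times.size, unique_times.size, n_duplication)
# indices of every sample whose valid time occurs more than once,
# regardless of where the duplicated times fall in the range
duplicate_indexes = np.flatnonzero(np.isin(times, unique_times[counts > 1]))
print(f"duplicate indices: {duplicate_indexes}") # randomly distributed
print(f"duplicate times: {times[duplicate_indexes]}")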
print(f"min: {times.min()}, max: {times.max()}")
```
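
To see which samples collide with each other, the flat list of duplicate indices can be grouped by valid time. The following is a minimal sketch operating on a plain NumPy array of times, such as the `times` array built above; the helper name `group_duplicate_times` and the synthetic input are hypothetical and only serve as illustration.

```python
import numpy as np


def group_duplicate_times(times):
    """Map each duplicated time (as a string) to the sample indices sharing it."""
    times = np.asarray(times)
    unique_times, inverse, counts = np.unique(
        times, return_inverse=True, return_counts=True
    )
    inverse = inverse.ravel()
    return {
        str(unique_times[k]): np.flatnonzero(inverse == k).tolist()
        for k in np.flatnonzero(counts > 1)
    }


# Synthetic example: samples 0 and 2 share the same valid time.
example = np.array(
    ["2020-01-01T00", "2020-01-01T06", "2020-01-01T00", "2020-01-01T12"],
    dtype="datetime64[h]",
)
print(group_duplicate_times(example))
```

An empty dictionary means that every sample has a distinct valid time.
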
