Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible bug in impact calculation #933

Open
aleeciu opened this issue Aug 5, 2024 · 17 comments
Open

Possible bug in impact calculation #933

aleeciu opened this issue Aug 5, 2024 · 17 comments
Labels
accepting pull request Contribute by raising a pull request to resolve this issue! bug

Comments

@aleeciu
Copy link
Collaborator

aleeciu commented Aug 5, 2024

Describe the bug
I noticed a weird behavior in the impact calculation. Inputting a valid exposure, hazard and impact function set leads to:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.

the error occurs at this line, i.e. when looping the impact matrix generator. After spending few hours, I still couldn't figure out what conditions lead to the error.

The code below reproduces it using one exposure point, one centroid and two hazard events. Interestingly enough, using an exposure with twice the same point makes it run just fine.

To Reproduce
Run the code below.

Code example:

import numpy as np
import pandas as pd
from scipy import sparse
import geopandas as gpd
from shapely.geometry import Point

from climada.entity import Exposures
from climada.entity.impact_funcs import ImpactFuncSet, ImpfTropCyclone
from climada.engine import ImpactCalc
from climada.hazard import Centroids, Hazard

intensity = np.array([[31.5], [19.0]])
centroids = Centroids.from_lat_lon([-26.16], [28.20])

tc = Hazard(
    intensity=sparse.csr_matrix(intensity),
    event_id=np.arange(2),
    event_name=[0,1],
    frequency=np.ones(2) / 2,
    fraction=sparse.csr_matrix(np.zeros((2,1))),
    date=np.array([736018, 736019]),
    centroids=centroids,
    haz_type='TC'
)

impf_tc = ImpfTropCyclone.from_emanuel_usa()
impf_set = ImpactFuncSet([impf_tc])
impf_set.check()

exposure = gpd.GeoDataFrame(
    {'value': 1., 'geometry': [Point(28.22, -26.17)]}, crs="EPSG:4326"
)
exposure['latitude'] = exposure.geometry.y
exposure['longitude'] = exposure.geometry.x
exposure['impf_TC'] = 1

exp = Exposures(exposure)
#exp = Exposures(pd.concat([exposure, exposure]).reset_index()) # Use this and it works

impact = ImpactCalc(exp, impf_set, tc).impact(save_mat=True)

Expected behavior
The expected behaviour would be to get an impact object, the actual result is:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.

Climada Version: 4.1.2.dev0

System Information (please complete the following information):

  • Operating system and version: macOS Sonoma 14.6
  • Python version: 3.11.6

Any idea/clue?

@aleeciu aleeciu added the bug label Aug 5, 2024
@chahank
Copy link
Member

chahank commented Aug 6, 2024

Given the error, I would first try to have a correctly made Hazard and maybe this solves the problem. event_name should be a list, not a numpy array. Otherwise, I do not see directly what would be the cause.

@aleeciu
Copy link
Collaborator Author

aleeciu commented Aug 6, 2024

Thanks - I now specified event_name as a list and get the same, as expected.

@chahank
Copy link
Member

chahank commented Aug 6, 2024

Ok, thanks for checking. Could you please try with an exposure with at least 2 locations?

@aleeciu
Copy link
Collaborator Author

aleeciu commented Aug 6, 2024

That's a bit the whole point - with two exposure points (essentially duplicating the single exposure point) it then works. See bug description and commented line in the example code.

@chahank
Copy link
Member

chahank commented Aug 6, 2024

Can you please check that your exposures are correctly initialized? Geopandas does not like it that much if one uses a single data point and that has for me led to failures in the past (but this is maybe not the case anymore).

@aleeciu
Copy link
Collaborator Author

aleeciu commented Aug 6, 2024

What do you mean by correctly initialized? Anyway, it is correct as far as I can tell. Also, this is a code that replicates the error but I got it in different contexts where the exposure was much larger. I don't think it's a geopandas issue, I think it's really about the looping of mat objects that takes place within np.hstack at the above mentioned line that creates the issue. For some reason, dimensions don't fit. I think it has to do with having a unique exposure point being analysed in the impact calculation (the same error is thrown if one has a multi-points exposure, but where only one point has centr_TC != -1, thereby only one is considered in the mininal exp_gdf) but cannot really figure out why.

@peanutfun
Copy link
Member

peanutfun commented Aug 7, 2024

@aleeciu @chahank I debugged the issue and found the culprit digging through the source code: csr_matrix.multiply behaves differently when multiplying with a scalar. For scalar multiplication (which also applies for arrays of shape (1, 1) and (1, ), as in @aleeciu's case), the data attribute is simply multiplied with the scalar. In your particular example, this leaves one value of the resulting impact matrix as zero. This zero is explicitly stored. The explicit zero leads to a mismatch between the shape of mat.data and mat.nonzero().

To be clear, this issue only occurs if BOTH

  • The exposure only has a single value
  • The MDR and/or fraction is zero somewhere

For sparse array multiplication, this will not happen, as a new matrix constructor is called and explicit zeros are removed.

One remedy is to explicitly call eliminate_zeros on the impact matrix in the imp_mat_gen. However, this incurs a performance penalty of O(n), where n are the stored values, because each single one has to be checked for a value of zero.

@aleeciu
Copy link
Collaborator Author

aleeciu commented Aug 8, 2024

Thanks a lot @peanutfun for investing the bug. Your explanation makes a lot of sense. Regarding the solution, I don't have strong opinions tbh. This is definetely an edge case so probably it does not make sense if the fix deteriorates the overall performance, although the impact quantification in my experience has never been a bottleneck in performance.

@peanutfun
Copy link
Member

I think it should be fine to add a check whether the size of the exposure is 1, and only then call eliminate_zeros.

@aleeciu
Copy link
Collaborator Author

aleeciu commented Aug 8, 2024

Cool - at what place in the code you think this check should be placed?

@peanutfun
Copy link
Member

peanutfun commented Aug 8, 2024

@aleeciu Inside impact_matrix:

def impact_matrix(self, exp_values, cent_idx, impf):

The check has to be if the shape of exp_values_csr is (1, 1). Instead of immediately returning the multiplication result, one then has to apply eliminate_zeros on it before returning.

Edit: Check can also be if n_exp_pnt == 1

@peanutfun peanutfun added the accepting pull request Contribute by raising a pull request to resolve this issue! label Aug 8, 2024
@chahank
Copy link
Member

chahank commented Aug 8, 2024

Thanks a lot @peanutfun for investing the bug. Your explanation makes a lot of sense. Regarding the solution, I don't have strong opinions tbh. This is definetely an edge case so probably it does not make sense if the fix deteriorates the overall performance, although the impact quantification in my experience has never been a bottleneck in performance.

Just to comment on the last point: for most non-trivial applications such as uncertainty quantification, calibration or cost-benefit analysis the impact quantification is the main bottleneck in performance, so it is very central to not loose time there.

@peanutfun
Copy link
Member

@chahank Correct, but if n_exp_pnt == 1 should be acceptable, I think

@peanutfun
Copy link
Member

I thought of an even better solution, I hope:

We should call eliminate_zeros on the matrices created in Hazard.get_mdr and Hazard._get_fraction before returning them. This avoids multiplications with zeros, which later on would be eliminated anyway. Then we don't need an additional check in ImpactCalc. In theory, this should actually improve performance a bit.

@peanutfun
Copy link
Member

@aleeciu would you have the time to create a PR?

@aleeciu
Copy link
Collaborator Author

aleeciu commented Aug 14, 2024

Hi @peanutfun , yes we will address this within the next two weeks. If you know from the top of your head where this call should likely happen in the code, please share it - otherwise I will make a first guess and we iterate over the PR.

@peanutfun
Copy link
Member

peanutfun commented Aug 14, 2024

@aleeciu Excellent, looking forward to it!

Probably within here:

def get_mdr(self, cent_idx, impf):

and here:

def _get_fraction(self, cent_idx=None):

Edit: It's not required in get_fraction, because the fraction matrix is only sliced, and by calling Hazard.check_matrices from ImpactCalc, it should not contain explicit zeros anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepting pull request Contribute by raising a pull request to resolve this issue! bug
Projects
None yet
Development

No branches or pull requests

3 participants