[WIP] add spatial correlogram function #259
@@ -0,0 +1,153 @@
# Spatial Correlograms

import geopandas as gpd
import pandas as pd
from joblib import Parallel, delayed
from libpysal.cg.kdtree import KDTree
from libpysal.weights import KNN, DistanceBand
from libpysal.weights.util import get_points_array

from .geary import Geary
from .getisord import G
from .moran import Moran
def _get_autocorrelation_stat(inputs):
    """Helper function for computing autocorrelation statistics in parallel.

    Parameters
    ----------
    inputs : tuple
        tuple of (y, tree, W, STATISTIC, dist, weights_kwargs, stat_kwargs)

    Returns
    -------
    pandas.Series
        a pandas Series with the computed autocorrelation statistic and its
        simulated p-value
    """
    (
        y,  # y variable
        tree,  # kd tree
        W,  # weights class, DistanceBand or KNN
        STATISTIC,  # class of statistic (Moran, Geary, etc.)
        dist,  # threshold/k parameter for the weights
        weights_kwargs,  # additional args for the weights class
        stat_kwargs,  # additional args for the statistic class
    ) = inputs

    w = W(tree, dist, silence_warnings=True, **weights_kwargs)
    autocorr = STATISTIC(y, w, **stat_kwargs)
    attrs = []
    all_attrs = list(dict(vars(autocorr)).keys())
    for attribute in all_attrs:
        attrs.append(getattr(autocorr, str(attribute)))
    return pd.Series(attrs, index=all_attrs, name=dist)
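The helper above collects every instance attribute of the fitted statistic via `vars()`. A minimal sketch of that extraction pattern, using a hypothetical dummy class in place of an esda statistic (the real classes expose results such as `I` and `p_sim` as instance attributes):

```python
import pandas as pd


# hypothetical stand-in for an esda statistic object (e.g. Moran)
class _DummyStat:
    def __init__(self):
        self.I = 0.42
        self.p_sim = 0.01


autocorr = _DummyStat()

# same extraction pattern as the helper: pull every instance attribute
# into a Series indexed by attribute name, named by the distance
all_attrs = list(dict(vars(autocorr)).keys())
attrs = [getattr(autocorr, str(a)) for a in all_attrs]
row = pd.Series(attrs, index=all_attrs, name=500)
print(row)
```

Each distance then yields one such row, and the rows stack into the correlogram table.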
def correlogram(
    gdf: gpd.GeoDataFrame,
    variable: str,
    distances: list,
    distance_type: str = "band",
    statistic: str = "I",
    weights_kwargs: dict = None,
    stat_kwargs: dict = None,
    n_jobs: int = -1,
    backend: str = "loky",
):
    """Generate a spatial correlogram.

    A spatial correlogram is a set of spatial autocorrelation statistics
    calculated for a set of increasing distances. It is a useful exploratory
    tool for examining how the relationship between spatial units changes over
    different notions of scale.

    Parameters
    ----------
    gdf : gpd.GeoDataFrame
        geodataframe holding spatial and attribute data
    variable : str
        column on the geodataframe used to compute the autocorrelation statistic
    distances : list
        list of distances at which to compute the autocorrelation statistic
    distance_type : str, optional
        which concept of distance to increment. Options are {`band`, `knn`};
        by default 'band' (for `libpysal.weights.DistanceBand` weights)
    statistic : str, optional
        which spatial autocorrelation statistic to compute. Options are
        {`I`, `G`, `C`}; by default 'I'
    weights_kwargs : dict
        additional keyword arguments passed to the `libpysal.weights.W` class
    stat_kwargs : dict
        additional keyword arguments passed to the `esda` autocorrelation
        statistic class. For example, for faster results with no statistical
        inference, set the number of permutations to zero with
        {"permutations": 0}
    n_jobs : int
        number of jobs to pass to joblib. If -1 (default), all cores will be used
    backend : str
        backend parameter passed to joblib
    Returns
    -------
    outputs : pandas.DataFrame
        table of autocorrelation statistics at increasing distance bandwidths
    """
    if stat_kwargs is None:
        stat_kwargs = dict()
    if weights_kwargs is None:
        weights_kwargs = dict()
    if statistic == "I":
        STATISTIC = Moran
    elif statistic == "G":
        STATISTIC = G
    elif statistic == "C":
        STATISTIC = Geary
    else:
[Review comment] The extraction functions look sufficiently general to allow us to admit any callable. I'd love to allow users to send a callable, since this gives us pretty major extensibility for free...

[Reply] Cool, that makes sense. I think the first version I wrote actually took the Moran class as the first argument, but I thought the dispatching actually helped clarify what the function was doing. The internals ended up a lot more generic than I figured they'd need to be, so the most abstract version returns the output of a callable for a range of W/Graph specifications (in table form), and there are probably plenty of uses for that. But then do we want to call the function something more generic, since it no longer necessarily computes some measure of correlation?

[Reply] Totally, and then define a specific... For the general function, @TaylorOshan and I have been calling these "profile" functions (bottom of page 302), so I'm partial to...

[Reply] Well, I guess even like... I have a nascent implementation of a nonparametric robust Moran/Local Moran statistic that would benefit from being able to plug in directly? I think it's OK to call the function...

[Reply] "Profile" is the same nomenclature over in segregation, but maybe we should centralize some of this logic then.
        raise NotImplementedError(
            "Only I, G, and C statistics are currently implemented"
        )

    if distance_type == "band":
        W = DistanceBand
    elif distance_type == "knn":
        if max(distances) > gdf.shape[0] - 1:
            raise ValueError(
                "max number of neighbors must be less than or equal to n-1"
            )
        W = KNN
    else:
        raise NotImplementedError("distance_type must be either `band` or `knn`")
    # should be able to build the tree once and reuse it?
    # but in practice, I'm not seeing any real difference from starting a new W
    # from scratch each time

[Review comment] Not even for large data? I suppose that repeated creation of the tree could make a dent on this.

[Review comment] I'm a bit confused by @knaaptime's comment, and also this response. This should not be building a KDTree repeatedly, but it is instantiating a new... However, the corresponding...

[Reply] I was "testing" on a dataset with ~65k points, but just not terribly thoroughly. What I meant was, I'd also done... What I was hoping for was something like @ljwolf said, where I'd do the expensive call once for the largest distance, then repeatedly use the same structure for repeated calls (sorta like how pandana works with...).

[Reply] (a) I think it's fine to keep this current implementation for now.

[Reply] Gotcha. Same way I did it for the isochrones with pandana: get the adjacency once and trim it down for each distance. So here it would be something like

    maxtree = KDTree(max(distances))
    mintree = KDTree(min(distances))
    distmat = maxtree.sparse_distance_matrix(mintree)
    for dist in distances:
        w = Graph.build_knn(distmat, k=k)

or something?

[Reply] I guess what I'm asking is, what's the most efficient way to do this with the new Graph? If we have a Graph based on the max distance, then it's already got all the information we need for every lesser distance, and all we need to do is filter on the adjlist where wij < distance. Presumably that's what something like... For this problem it's easy enough to just build the adjlist once, literally filter it down for each distance, and instantiate a new Graph/W from that adjlist, but is that the best way?

[Reply] I suppose this would need to be tested. My sense is that a filter of adjacency based on a distance will be significantly slower than a tree query, but I might be wrong.

[Reply] So, I'm playing with this, and it would be straightforward to filter the adjlist, which is fast, since it's just... But once we subset the W, now we have to subset the original dataframe to align it, which is something we said we wouldn't do. Is there a good way to do this?

[Reply] To keep the structure, you can just assign weight=0 to everything above the set distance threshold.
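The "query once at the largest distance, then filter down" idea from this thread can be sketched with `scipy.spatial.cKDTree`. The data here are hypothetical random points; in the real code the surviving (i, j) pairs would seed a libpysal `W`/`Graph` rather than just being counted:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
pts = rng.random((200, 2)) * 2000  # hypothetical point coordinates in meters
distances = [500, 1000, 1500, 2000]

tree = cKDTree(pts)
# one expensive query at the largest threshold...
dmat = tree.sparse_distance_matrix(tree, max(distances), output_type="coo_matrix")

# ...then filter the same adjacency down for each smaller distance
counts = {}
for dist in distances:
    keep = (dmat.data <= dist) & (dmat.data > 0)  # drop self-pairs
    i, j = dmat.row[keep], dmat.col[keep]
    counts[dist] = len(i)  # (i, j) could seed a W/Graph adjacency list here
print(counts)
```

Whether this beats repeated tree queries in practice would need benchmarking, as the thread notes.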
    pts = get_points_array(gdf[gdf.geometry.name])
    tree = KDTree(pts)
    y = gdf[variable].values

    inputs = [
        tuple(
            [
                y,
                tree,
                W,
                STATISTIC,
                dist,
                weights_kwargs,
                stat_kwargs,
            ]
        )
        for dist in distances
    ]

    outputs = Parallel(n_jobs=n_jobs, backend=backend)(
        delayed(_get_autocorrelation_stat)(i) for i in inputs
    )
    return (
        pd.DataFrame(outputs)
        .select_dtypes(["number"])
        .drop(columns=["permutations", "n"])
    )

[Review comment] So, in the abstract version, this function becomes... In that case, should we get rid of this...?

[Reply] I think just keep it as correlogram. Allowing a user-defined function already means we can't assume/ensure the output is "correlation", but it's a relatively advanced way to work with the function. And yes, the output type of the callable has to be a namespace/namedtuple/object kind of thing, but I think that's an OK requirement.
## Note: To be implemented:

## non-parametric version used in geoda https://geodacenter.github.io/workbook/5a_global_auto/lab5a.html#spatial-correlogram
## as given in https://link.springer.com/article/10.1023/A:1009601932481
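As a rough, hedged sketch of what that non-parametric correlogram estimates: binned averages of cross-products of standardized values against pairwise distance. The data below are made up, and GeoDa fits a local regression rather than these simple bin means:

```python
import numpy as np

rng = np.random.default_rng(1)
pts = rng.random((80, 2)) * 1000  # hypothetical coordinates
y = rng.random(80)                # hypothetical attribute

z = (y - y.mean()) / y.std()  # standardized values

# pairwise distances and cross-products z_i * z_j
diff = pts[:, None, :] - pts[None, :, :]
d = np.sqrt((diff ** 2).sum(axis=-1))
iu = np.triu_indices_from(d, k=1)  # unique pairs only
pair_d, pair_zz = d[iu], (z[:, None] * z[None, :])[iu]

# average the cross-products within distance bins
bins = np.linspace(0, pair_d.max(), 6)
which = np.digitize(pair_d, bins)
estimates = [pair_zz[which == b].mean() for b in range(1, len(bins))]
print(np.round(estimates, 3))
```

Each bin mean plays the role the global statistic plays at one bandwidth in the parametric version above.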
@@ -0,0 +1,22 @@
from libpysal import examples
from esda import correlogram
import numpy as np
from numpy.testing import assert_array_almost_equal

import geopandas as gpd


sac = gpd.read_file(examples.load_example("Sacramento1").get_path("sacramentot2.shp"))
sac = sac.to_crs(sac.estimate_utm_crs())  # now in meters

distances = [i + 500 for i in range(0, 2000, 500)]


def test_correlogram():
[Review comment] Do we have some ecosystem-wide standards on what is the minimal acceptable test suite?

[Reply] No... We've talked about separating "correctness" tests (does the statistic recover the same result every time?) and option tests/user tests (does the function handle all combinations of arguments/input types correctly?), but we've never agreed on what is necessary for any contribution...

[Reply] We should put this on the agenda for the monthly call and/or steering committee. TBH, I don't think an ecosystem-wide policy is realistic. We don't have the resources. There's a lot of heterogeneity in the maturity of different subpackages, so a lot of variation in how the lead maintainers want to manage their stack. Most often, I'm happy to accept new functionality once the test shows that it gets the "right answer", even if I don't have the capacity to write a fully parameterized test suite that hits every line. That's how I manage tobler, segregation, geosnap, etc., otherwise they won't move forward. I know which lines are untested, so I'm OK with that until (a) an issue is raised, (b) I get some time to focus on testing, or (c) the total coverage drops too low (which sometimes triggers b).

[Reply] Yup, makes sense. I asked because I am not sure when I should be fine with the state of tests when doing a review, so an agreed guideline would be nice.

[Reply] Very much agree with Martin on this. Seems like our general rule of thumb has been that somewhere between 70-90% is the acceptable coverage. I always try to get as close to 100% as possible to (attempt to) hit any weird edge cases, but I know that isn't required a lot of the time. If we want to decide on an actual number, this is surely something for at least the steering committee to vote on. Moreover, (1) as a step towards catching edge cases we may want to consider seeing about implementing...

[Reply] Again, I don't think an ecosystem-wide policy is realistic, but it is definitely something we should discuss. The question is about cost-benefit. If we're too strict about test coverage, it will discourage contributions, and I think for many packages that's a greater risk than having some untested code. I don't have time to get absorbed in edge cases, so I'd prefer to let those issues surface before letting them block new functionality getting merged. Case in point: please feel free to write out the rest of the test suite here :)

[Reply] FWIW @knaaptime, I am "on your side" here (though I think we're all on the same side). In my opinion, getting new functionality merged should take precedence over a strict coverage number for individual PRs, so long as that new functionality is "mostly" covered (where "mostly" can mean either more lines covered or the majority of cases covered). More testing can be added later as time and energy permit. I hope my comments did not come off as combative.

[Reply] 😅 Sorry, my style is always blunt! No offense taken or intended. We're all on the same page. We need some guidelines on merging, definitely. I just want to avoid a situation where we let stuff sit here, because it's usually easier for us core maintainers to take incremental passes at improving test coverage instead of imposing a big burden on the PR opener. (For this one, I certainly don't need this PR merged, but while it's sitting here it makes for a good example. I already have the function myself, but if other folks want to use it, I'd prefer not to let corner cases prevent that, since I know I don't have a ton of time to write out the test suite. I can usually commit to responding to bug reports, though.)

[Reply] My question was primarily coming from my GeoPandas experience and understanding that what we aim for there is not applicable in PySAL. So I was just wondering what level people feel is enough. Totally agree with all you said above; there's no need to block stuff just because tests are not at 100%. Happy to have a further chat about it during the next call. And happy to merge this with the test suite as is :).

[Reply] Once again I agree with Martin, but maybe after addressing #259 (comment)?
    corr = correlogram(sac, "HH_INC", distances)

    test_data = np.array(
        [
            0.05822723177762817,
            0.49206877942505206,
            0.45494217612839183,
            0.5625914469490942,
        ]
    )

    assert_array_almost_equal(corr.I, test_data)
[Review comment] Could you add a key to what I, G, and C mean? Saying that it is Moran's I, Geary's C, Getis-Ord G?

[Reply] This is a nice suggestion.
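A hypothetical sketch folding together two suggestions from the review (a key documenting what I, G, and C mean, and accepting an arbitrary callable for extensibility), with stand-in classes in place of the real esda ones:

```python
# stand-ins for the esda statistic classes; the mapping doubles as the key:
# "I" -> Moran's I, "G" -> Getis-Ord G, "C" -> Geary's C
class Moran: ...
class G: ...
class Geary: ...

_STATS = {"I": Moran, "G": G, "C": Geary}


def resolve_statistic(statistic):
    """Return a statistic class from a string key, or pass a callable through."""
    if callable(statistic):
        return statistic  # user-supplied class, for extensibility
    try:
        return _STATS[statistic]
    except KeyError:
        raise NotImplementedError(
            "statistic must be one of I, G, C, or a callable"
        ) from None


print(resolve_statistic("I").__name__)
print(resolve_statistic(Geary).__name__)
```

This keeps the string shorthand self-documenting while leaving the door open for user-defined statistics, as discussed in the review thread.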