
Implementation sanity check #3

@martinfleis


Hi @TaylorOshan,

As promised, here's some material on the current state of gwlearn. At the moment we have some classification models we wanted to play with, but the idea is very much extensible. I need to refactor some bits to make it more modular, but that is a detail. What I'd like to get now is a sanity check to make sure that what we're doing here is sensible. See the overview below.

An overview of the implementation of GW classification in gwlearn

  • The BaseClassifier creates a libpysal Graph encoding the kernel and bandwidth - this is quite fast but it is to be noted that for large datasets and larger bandwidths, this can eat quite some memory.
  • The fitting process is, in principle, a groupby over focal IDs in the graph adjacency. The fitting runs in parallel across groups (focal geometries).
  • The models are currently restricted to the binary classification case only. With multi-class, there's a likelihood that each local model will see a different set of classes due to their geographic distribution. The performance metrics are then incomparable across models and prediction becomes complicated (though we could inject 0 for unseen classes). In theory, we can eventually enable this, but it requires a deeper understanding of the whole process from the user perspective.
  • Even in binary cases, we can get locations where the whole bandwidth contains only a single class (all 0s or all 1s) - we currently skip those and do not report anything in such locations. That yields outputs like here. To be precise, we skip all focals where the proportion of the minority class is below a set threshold ([here](https://github.com/uscuni/gwlearn/blob/5af948905293e852d2cda4668d97904205ba5806/gwlearn/base.py#L422)).
  • In case of imbalanced data, we have an option to undersample majority class.
  • When fitting, we use a leave-one-out approach: the values from the focal location are not included in the training, only the neighbors. We could easily include the focal by assigning self-weights in the Graph, but that is not an option at the moment.
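To make the fitting process above concrete, here is a minimal sketch of the groupby-over-adjacency idea with the minority-proportion skip and the leave-one-out neighborhood. The data, names, and threshold value below are illustrative, not gwlearn's actual internals:

```python
import pandas as pd

# Toy kernel-graph adjacency: (focal, neighbor) pairs with kernel weights,
# mirroring the long-form structure of a libpysal Graph adjacency.
adjacency = pd.DataFrame(
    {
        "focal":    [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3],
        "neighbor": [1, 2, 3, 0, 2, 3, 0, 1, 3, 0, 2],
        "weight":   [0.9, 0.5, 0.2, 0.9, 0.7, 0.4, 0.5, 0.7, 0.3, 0.6, 0.4],
    }
)
y = pd.Series([0, 1, 0, 1], name="label")  # binary target per observation

MIN_MINORITY_PROPORTION = 0.2  # illustrative skip threshold

results = {}
for focal, group in adjacency.groupby("focal"):
    # leave-one-out: the focal itself is not part of its own neighbor set
    neighbor_labels = y.loc[group["neighbor"]]
    minority_share = (
        neighbor_labels.value_counts(normalize=True)
        .reindex([0, 1], fill_value=0.0)  # a missing class counts as 0
        .min()
    )
    if minority_share < MIN_MINORITY_PROPORTION:
        results[focal] = None  # skipped: (nearly) single-class neighborhood
        continue
    # a real implementation would fit a local classifier on neighbor
    # features here, weighted by the kernel weights in group["weight"];
    # the neighborhood label mean stands in for that fitted model
    results[focal] = neighbor_labels.mean()
```

Focal 3 sees only class 0 among its neighbors, so it is skipped, matching the behavior described above.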

Performance is measured in multiple ways, depending on the model. Right now we have logistic regression, random forest, and gradient-boosted trees.

  1. Focal prediction - thanks to the leave-one-out approach, we can use the model to predict the class at the focal location, since its values were not used in the training (apart from spatial dependency leakage). We use this to measure score, precision, recall... (see here)
  2. Out-of-bag score - for RF, we can get OOB predictions, which are fetched from all individual models, concatenated, and used as a single array of predictions against the relevant true values. (here). Given this is pulled from the individual models, we can report global metrics as well as local metrics. (here)
  3. Feature importances - Each local model (if RF or GB) has an array of feature importances
  4. For logistic regression, we pull all the arrays of modelled predictions from the individual models and measure performance based on those, in a similar way to what we do for OOB data. (here and then here). So both global (concatenated data) and local.
  5. LR also returns all local coefficients and intercepts.
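As a rough illustration of points 2 and 4 - pooling per-model predictions into one global array while also keeping local scores - here is a sketch with made-up numbers (plain Python, not gwlearn's code):

```python
# Hypothetical per-model predictions: focal id -> (true labels, predicted
# labels) as seen by that local model's OOB or LR prediction arrays.
local_preds = {
    0: ([0, 1, 1], [0, 1, 0]),
    1: ([1, 0], [1, 0]),
    2: ([0, 0, 1], [0, 1, 1]),
}

y_true, y_pred = [], []
local_accuracy = {}
for focal, (t, p) in local_preds.items():
    # local metric: computed per model, reported per location
    local_accuracy[focal] = sum(ti == pi for ti, pi in zip(t, p)) / len(t)
    # global metric: concatenate everything into one array first
    y_true.extend(t)
    y_pred.extend(p)

global_accuracy = sum(ti == pi for ti, pi in zip(y_true, y_pred)) / len(y_true)
```

The same concatenation would feed precision, recall, or any other global metric against the pooled true values.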

Prediction is implemented both as predict_proba and predict, though there might be some changes to how it works. Right now, we take a location, build a kernel to identify all the models within the bandwidth and their weights, then do the prediction using all of these models and report a weighted average. (here). Georganos et al. also combine this with a prediction from a global model, which we don't do at the moment.
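The weighted-average prediction described above can be sketched like this, with hypothetical stand-in models and kernel weights (none of these names come from gwlearn):

```python
# Stand-ins for fitted local classifiers: each maps features of a new
# location to a probability of class 1.
models = {
    "m_a": lambda x: 0.8,
    "m_b": lambda x: 0.4,
    "m_c": lambda x: 0.1,
}
# Kernel weights derived from the distance between the new location and
# each model's focal geometry (illustrative values).
kernel_weights = {"m_a": 0.5, "m_b": 0.3, "m_c": 0.2}

x_new = None  # features of the new location (unused by the stand-ins)

# predict_proba analogue: kernel-weighted average of local probabilities
weighted_sum = sum(kernel_weights[m] * f(x_new) for m, f in models.items())
proba = weighted_sum / sum(kernel_weights.values())

# predict analogue: threshold the averaged probability
label = int(proba >= 0.5)
```

A global-model term, as in Georganos et al., would enter here as one extra weighted component in the same average.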

@TaylorOshan do you think this makes sense, or are we doing something we should not here? :D It is perfectly possible.

Also, I have not found anything on using classification models within the GW context, are you aware of anything? Thanks!
