This project predicts housing prices for block groups across California using regression models built on the California Housing Dataset (1990). The intended method is linear regression, chosen for the sake of statistical modelling under the restrictions of the ordinary least squares (OLS) algorithm. Several other regression algorithms, such as decision tree regression, were used for comparison, and finally a best-performing model was obtained without the restrictions of OLS. The dataset contains sociodemographic, real estate and geographical data. Models were assessed in terms of error metrics and mean differences from actual values.
This dataset is a modified version of the California Housing dataset available from Luís Torgo's page (University of Porto). Luís Torgo obtained it from the StatLib repository (which is closed now). The dataset may also be downloaded from StatLib mirrors.
This dataset appeared in a 1997 paper titled "Sparse Spatial Autoregressions" by R. Kelley Pace and Ronald Barry, published in the journal Statistics & Probability Letters. They built it using the 1990 California census data. It contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
Data understanding
- EDA
- Geographic visualisation
- Data visualisation
Data understanding II
- EDA
- Feature engineering
- Data visualisation
- Data cleaning
Data preparation (pre-processing)
- Feature engineering
- Feature selection
Data modelling
- Checking Ordinary Least Squares assumptions
- Choosing the best model
- Metrics
- Baseline linear regression model
- Data train validate test split
- Linear regression model
- Other models
- Testing best model
- Conclusions
Data modelling variation
- Models
- Conclusions
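
The modelling steps in the outline above (train/validate/test split, baseline model, linear regression, error metrics, final test of the chosen model) can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the project's actual code: the data is synthetic, with one made-up feature standing in for a real predictor, and the helper names (`split_indices`, `fit_simple_ols`, `rmse`, `mae`) and the 60/20/20 split are assumptions chosen for the example. A real analysis would more likely use a library such as scikit-learn with all available features.

```python
import math
import random


def split_indices(n, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle indices 0..n-1 and cut them into train/validation/test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]


def fit_simple_ols(xs, ys):
    """Closed-form least squares fit of y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Synthetic stand-in data (an assumption, NOT the real dataset):
# one feature and a noisy, linearly related target.
rng = random.Random(42)
x = [rng.uniform(0.5, 15.0) for _ in range(1000)]
y = [45_000 + 40_000 * xi + rng.gauss(0, 30_000) for xi in x]

train, val, test = split_indices(len(x))
x_train, y_train = [x[i] for i in train], [y[i] for i in train]
x_val, y_val = [x[i] for i in val], [y[i] for i in val]
x_test, y_test = [x[i] for i in test], [y[i] for i in test]

# Baseline: always predict the mean target of the training set.
baseline = sum(y_train) / len(y_train)
baseline_rmse = rmse(y_val, [baseline] * len(y_val))

# Linear regression fitted on the training set, compared on the validation set.
intercept, slope = fit_simple_ols(x_train, y_train)
ols_val_pred = [intercept + slope * xi for xi in x_val]
ols_rmse = rmse(y_val, ols_val_pred)
ols_mae = mae(y_val, ols_val_pred)

# The test set is touched only once, for the model chosen on validation data.
test_rmse = rmse(y_test, [intercept + slope * xi for xi in x_test])

print(f"baseline RMSE (validation): {baseline_rmse:,.0f}")
print(f"OLS RMSE / MAE (validation): {ols_rmse:,.0f} / {ols_mae:,.0f}")
print(f"OLS RMSE (test): {test_rmse:,.0f}")
```

Comparing every candidate against the mean-only baseline on the validation set, and reserving the test set for a single final check, keeps the reported test error from being influenced by model selection.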