RealValue is a machine learning project for predicting home prices in Toronto. It combines TensorFlow convolutional neural networks with a dense network component: homeowners take a few pictures of their home, enter a few simple details, and receive an estimated price range for what their home is worth. This ease of use helps homeowners make residential decisions with more confidence and stay better informed about the real estate market.
For more details about our project, please take a look at our Medium article [here](insert link here).
Also, check out our website at real-value.ca to try out some of our algorithms!
In real estate, interest in predicting the "right price" for a property is growing rapidly. Most current algorithms rely solely on statistical information about a property as input to predict its price. However, these algorithms leave out a notable form of data that often influences a buyer's perception: visual data of the house. Recently, convolutional neural networks (CNNs) have gained prominence for their ability to extract strong feature representations from images and use those representations to map visual inputs to scalar or vector outputs.
Our goal was to create a custom convolutional neural network to accurately predict Toronto housing prices with less than 20% error.
- Combined CNN and dense network model
- Easy to swap CNN model architectures
- Easy to change dense network size
- Transfer learning using California and Toronto housing datasets
  - California Dataset
    - Also modified dataset to use latitude/longitude values in place of postal codes
  - Custom Toronto Dataset we collected in February 2021 (157 houses)
- Image data augmentation (crop, rotate, mirroring, saturation, brightness)
- Inputs to Network:
  - 2x2 Mosaic image (bedroom, bathroom, kitchen, frontal view)
  - Price, Number of bedrooms, bathrooms, square feet, and postal code
- Configurable training (hyperparameters, model architecture) with `config.yaml`
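As a rough illustration of the 2x2 mosaic input listed above, the sketch below tiles four equally sized room images into one array with NumPy. The function name `make_mosaic`, the tile ordering, and the assumption that all four images already share the same shape are ours for illustration only; the project's own preprocessing may differ.

```python
import numpy as np


def make_mosaic(bedroom, bathroom, kitchen, frontal):
    """Tile four HxWxC images into one 2Hx2WxC mosaic.

    Assumed layout (hypothetical, not taken from the project code):
        bedroom | bathroom
        kitchen | frontal
    """
    tiles = [bedroom, bathroom, kitchen, frontal]
    shape = tiles[0].shape
    if any(t.shape != shape for t in tiles):
        raise ValueError("all four images must have the same shape")
    top = np.concatenate([bedroom, bathroom], axis=1)   # side by side
    bottom = np.concatenate([kitchen, frontal], axis=1)
    return np.concatenate([top, bottom], axis=0)        # stack the two rows


# Usage with dummy 32x32 RGB tiles, each filled with a distinct value.
tiles = [np.full((32, 32, 3), v, dtype=np.uint8) for v in (10, 20, 30, 40)]
mosaic = make_mosaic(*tiles)
```

In practice the four photos would first need to be resized to a common shape (for example with `opencv-python`, which is already a dependency) before being tiled.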
To download our code:
git clone “url”
Dependent Packages:
tensorflow 2.3.0, matplotlib, opencv-python, numpy, pandas, keras, sklearn
To install the dependent packages, run:
pip install -r requirements.txt
We initially trained our network on a dataset of California houses, created by Ahmed and Moustafa, consisting of both structured data (statistical property information in tabular form) and unstructured data (images). This dataset contains information for over 500 houses, each with 4 images of a bathroom, bedroom, kitchen, and frontal view. Statistical information for each house includes the number of bedrooms, bathrooms, square footage, postal code, and price.
We created our own Toronto real estate dataset by compiling the images, prices, number of rooms, surface areas, and postal codes of houses listed on the Toronto Regional Real Estate Board (TRREB) website. A Python script was used to calculate the area of each house from the provided measurements of each room. Like the California dataset, our Toronto dataset has four images for each house: frontal view, bedroom, bathroom, and kitchen.
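The area calculation can be pictured as summing length times width over the rooms of a listing. The sketch below assumes every room is rectangular and given as a (length, width) pair in feet; the function name and input format are hypothetical and are not taken from the original script.

```python
def total_area(rooms):
    """Sum the floor area of rectangular rooms.

    `rooms` maps a room name to (length, width) in the same unit,
    e.g. feet, so the result is in square feet.
    """
    area = 0.0
    for name, (length, width) in rooms.items():
        if length <= 0 or width <= 0:
            raise ValueError(f"invalid dimensions for {name!r}")
        area += length * width
    return area


# Made-up measurements, for illustration only.
example_rooms = {
    "bedroom": (12.0, 10.0),
    "kitchen": (10.0, 8.0),
    "bathroom": (6.0, 5.0),
}
example_area = total_area(example_rooms)
```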
Our model's hyperparameters are stored in a `config.yaml` file. To start training, modify `config.yaml` if needed and run the following command:
python pipeline.py
Since data augmentation can take considerable time, you can set `import_mode` in `config.yaml` to skip augmentation and start training immediately.
On the first run, set `import_mode: False` in `config.yaml` to perform data augmentation. On future runs, you can set `import_mode: True` to skip data augmentation and reuse the previously augmented data. You can always use `import_mode: False` without issues; it just might be slower.
Note: If you switch or modify the dataset or the augmentation multiplier, make sure to use `import_mode: False` for the first run afterwards so that the augmented data is regenerated.
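The behaviour of `import_mode` can be thought of as a simple cache check. The sketch below is an assumption about how such a switch could work, not the project's actual pipeline code: the helper name `load_or_augment`, the pickle cache file, and the error raised when no cache exists are all hypothetical.

```python
import pickle
from pathlib import Path


def load_or_augment(raw_samples, augment_fn, cache_path, import_mode):
    """Return augmented samples, reusing a cache when import_mode is True.

    Hypothetical helper: the real pipeline may store augmented data
    in a different format or location.
    """
    cache_path = Path(cache_path)
    if import_mode:
        if not cache_path.exists():
            raise FileNotFoundError(
                "no cached augmented data; run once with import_mode: False"
            )
        with cache_path.open("rb") as f:
            return pickle.load(f)

    # import_mode is False: redo the (slow) augmentation and refresh the cache.
    augmented = [augment_fn(sample) for sample in raw_samples]
    with cache_path.open("wb") as f:
        pickle.dump(augmented, f)
    return augmented
```

Under this model, a run with `import_mode: True` returns whatever was cached last, even if the raw data has changed since, which is why a run with `import_mode: False` is needed after changing the dataset or augmentation settings.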
To change hyperparameters such as the learning rate or optimizer, edit the corresponding lines in `config.yaml`.
In particular, the CNN model and dense model layers are set by the following lines:
# Train using RegNet as CNN and a 2 layer dense network (8 units in first layer, 4 units in second layer)
CNN_model: 'RegNet'
dense_model:
- 8
- 4
The number of dense layers and their sizes can be changed in `config.yaml`.
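To see how the `dense_model` list affects network size, the sketch below counts the weights and biases of a fully connected stack built from such a list. The input width of 4 and the helper name are assumptions for illustration, and the project's actual dense component may add further layers (for example an output layer).

```python
def dense_param_counts(input_dim, units):
    """Return (in_dim, out_dim, n_params) for each fully connected layer.

    Each layer has in_dim * out_dim weights plus out_dim biases.
    """
    layers = []
    in_dim = input_dim
    for out_dim in units:
        layers.append((in_dim, out_dim, in_dim * out_dim + out_dim))
        in_dim = out_dim
    return layers


# dense_model: [8, 4] as in the config, with an assumed input width of 4.
layers = dense_param_counts(4, [8, 4])
total_params = sum(n for _, _, n in layers)
```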
Changing the CNN is more involved, but still straightforward. If you want to add your own `CustomNet`, follow the instructions below. As a basic working example, check out how we defined `LeNet` as a CNN in `models/CNN_models/lenet.py` and then used it in `get_network()` in `models/__init__.py`.
1. Define a function that returns your custom CNN as a `tf.keras.Model` in a new file at `models/CNN_models/CustomNet.py`
2. Modify `get_network()` in `models/__init__.py` to call your new function with your custom CNN
3. Change your `config.yaml` to have `CNN_model: 'CustomNet'`
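The steps above might look roughly like the sketch below. It is not the project's actual `get_network()`: the `CNN_BUILDERS` dictionary, the `build_custom_net` name, the input shape, and the layer choices are all assumptions. TensorFlow is imported lazily so the dispatch logic can be read and run on its own.

```python
def build_custom_net(input_shape=(64, 64, 3)):
    """Hypothetical contents of models/CNN_models/CustomNet.py."""
    import tensorflow as tf  # imported lazily; only needed when building

    inputs = tf.keras.Input(shape=input_shape)
    x = tf.keras.layers.Conv2D(16, 3, activation="relu")(inputs)
    x = tf.keras.layers.MaxPooling2D()(x)
    x = tf.keras.layers.Conv2D(32, 3, activation="relu")(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    return tf.keras.Model(inputs, x, name="CustomNet")


# Assumed shape of the dispatch in models/__init__.py: config name -> builder.
CNN_BUILDERS = {"CustomNet": build_custom_net}


def get_network(cnn_name, builders=CNN_BUILDERS, **kwargs):
    """Look up the builder named by CNN_model in config.yaml and call it."""
    try:
        builder = builders[cnn_name]
    except KeyError:
        raise ValueError(f"unknown CNN_model: {cnn_name!r}") from None
    return builder(**kwargs)
```

With a mapping like this, adding another architecture only requires one new builder function and one new dictionary entry.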
To train on the California dataset, specify `directory: 'raw_dataset'` in the first line of the `config.yaml` file. The California dataset is located in the `raw_dataset` directory.
To apply transfer learning on the Toronto dataset, specify `directory: 'toronto_raw_dataset'` in the first line of the `config.yaml` file. The Toronto dataset is located in the `toronto_raw_dataset` directory.
Remember to set `import_mode: False` when switching between datasets.
We achieved a test error of 21% using a Zip Code approach on the California dataset, and a test error of 15% using a Latitude and Longitude approach.
The Zip Code result is nearly 6% better than contemporary approaches such as https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/.
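One common way to express a percentage test error like the figures above is the mean absolute percentage error (MAPE). The sketch below assumes that metric, which may differ from the exact computation behind the reported numbers; the prices used are made up.

```python
def mean_absolute_percentage_error(true_prices, predicted_prices):
    """Average of |predicted - true| / |true|, expressed in percent."""
    if not true_prices or len(true_prices) != len(predicted_prices):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for true, pred in zip(true_prices, predicted_prices):
        if true == 0:
            raise ValueError("true price must be non-zero")
        total += abs(pred - true) / abs(true)
    return 100.0 * total / len(true_prices)


# Made-up prices: each prediction is off by 10% of the true price.
true_prices = [500_000, 800_000]
predicted_prices = [550_000, 720_000]
mape = mean_absolute_percentage_error(true_prices, predicted_prices)
```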