
project_four

  • Objectives
    • Develop a predictive model for accurately forecasting housing prices.
      • Investigate: Can we create a reliable model to predict future home prices with precision?
    • Provide visualizations of the training data, showing real estate market trends and outcomes over time, using Matplotlib, Seaborn, and Tableau.
    • Compile an extensive set of features to strengthen our Home Sales Price Prediction Model. This project leverages Redfin real estate data and Freddie Mac mortgage rates to analyze and anticipate home prices.

Data Sources

  • Data provided by Redfin, a national real estate brokerage. Redfin Data-Center

  • Freddie Mac, 30-Year Fixed Rate Mortgage Average in the United States [MORTGAGE30US], retrieved from FRED, Federal Reserve Bank of St. Louis;
    Freddie Mac 30 year fixed rates

Pre-Processing

  • Used Jupyter Notebook with standard imports of Pandas and NumPy for data preprocessing. Started with five data files; through cleaning and merging, eliminated the national set because the state-level set held the most data, approximately 250 data points per date.
  • Updated dates, data formats, merged the price index with the month over month index and reordered columns.
  • Formatted the MORTGAGE30US data and converted the date column from weekly to monthly in rates_df_copy, producing one rate duplicated for each month.
  • Created copies of all DataFrames and merged state_df_copy with hpi_index_copy to make combined_df. With the first 3 DataFrames done, merged combined_df with rates_df_copy to have complete_df.
  • Applied .fillna(0) to the percentage-based features where zero was an appropriate default.
  • Imputed the remaining missing row data with the column .mean().
  • Saved the result as a new CSV (complete_data_csv) and provided it for SQL database creation and further cleaning and exploration. A minimal sketch of this pipeline follows.
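The sketch below illustrates the merge flow described above under stated assumptions: the file names, the shared date and state merge keys, the MORTGAGE30US column name, and the percentage-column detection are all placeholders rather than the team's exact code, and the weekly rates are collapsed to a monthly grid by averaging within each month.

# Sketch of the pre-processing pipeline; file and column names are assumptions
import pandas as pd

state_df = pd.read_csv("state_data.csv", parse_dates=["date"])
hpi_df = pd.read_csv("hpi_index.csv", parse_dates=["date"])
rates_df = pd.read_csv("MORTGAGE30US.csv", parse_dates=["date"])

# Collapse the weekly mortgage rates to one value per month
rates_df_copy = rates_df.copy()
rates_df_copy["date"] = rates_df_copy["date"].dt.to_period("M").dt.to_timestamp()
rates_df_copy = rates_df_copy.groupby("date", as_index=False)["MORTGAGE30US"].mean()

# Merge state data with the price index, then add the monthly rates
combined_df = state_df.copy().merge(hpi_df.copy(), on=["date", "state"], how="left")
complete_df = combined_df.merge(rates_df_copy, on="date", how="left")

# Fill percentage features with 0, impute the rest with column means
pct_cols = [c for c in complete_df.columns if "pct" in c]
complete_df[pct_cols] = complete_df[pct_cols].fillna(0)
num_cols = complete_df.select_dtypes("number").columns
complete_df[num_cols] = complete_df[num_cols].fillna(complete_df[num_cols].mean())

complete_df.to_csv("complete_data.csv", index=False)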

Database Creation

  • Libraries Used:
    • Pandas: Used for data manipulation and analysis
    • OS: Getting the current working directory to understand file paths
    • PySpark: Creating a SparkSession, reading CSV files into Spark DataFrames, performing data transformations and aggregations.
    • gdown: Downloading the CSV data file from a Google Drive link
  • Example SQL Queries (completed versions are sketched at the end of this section):
    • Max listing price per state:
      SELECT state, MAX(median_list_price) ...
    • Average price, Minnesota vs. other states:
      CASE WHEN state = 'Minnesota' THEN 'Minnesota' ELSE 'Other States' END ...
    • Price category classification:
      CASE WHEN median_sale_price > 750000 THEN 'Luxury' ...
  • Key Metrics Analyzed:
    • median_sale_price, median_list_price
    • price_drops (% of homes with reductions)
    • inventory vs homes sold (supply-demand)
    • median_dom (Days on Market)
    • YoY price change and volatility
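A hedged sketch of the Spark setup and completed versions of the example queries above, assuming the CSV produced in pre-processing and the column names listed under Key Metrics. The Google Drive file ID, the table name, and the 'Standard' ELSE label in the price classification are placeholders, not details from the project.

# Sketch of the Spark-based setup; the Drive file ID and table name are placeholders
import gdown
from pyspark.sql import SparkSession

gdown.download(id="YOUR_FILE_ID", output="complete_data.csv", quiet=False)

spark = SparkSession.builder.appName("HomePrices").getOrCreate()
df = spark.read.csv("complete_data.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("home_sales")

# Max listing price per state
spark.sql("""
    SELECT state, MAX(median_list_price) AS max_list_price
    FROM home_sales
    GROUP BY state
""").show()

# Average sale price: Minnesota vs. all other states
spark.sql("""
    SELECT CASE WHEN state = 'Minnesota' THEN 'Minnesota'
                ELSE 'Other States' END AS state_group,
           AVG(median_sale_price) AS avg_sale_price
    FROM home_sales
    GROUP BY 1
""").show()

# Price category classification ('Standard' is an assumed ELSE label)
spark.sql("""
    SELECT state, median_sale_price,
           CASE WHEN median_sale_price > 750000 THEN 'Luxury'
                ELSE 'Standard' END AS price_category
    FROM home_sales
""").show()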

Tableau

  • Designed and developed an interactive dashboard using the data loaded from the database and the model results; see the links below.

Linear Regression

  • Model: LinearRegression()
    • train_test_split:
      • Training data: 01/01/2014 - 02/28/2023
      • Testing data: 03/01/2023 - 09/30/2024
    • Results:
      • Variance explained: ~76%
      • Mean Absolute Error (MAE): 51756.971365011486 (an average difference of about $51,756.97 between actual and predicted values)
      • R² Score: 0.760306234063732
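A minimal sketch of the chronological split and linear baseline described above. The date cutoffs come from this README, while the DataFrame name, the median_sale_price target, and the feature selection are assumptions.

# Chronological train/test split and linear baseline; names are assumptions
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

train = complete_df[(complete_df["date"] >= "2014-01-01") & (complete_df["date"] <= "2023-02-28")]
test = complete_df[(complete_df["date"] >= "2023-03-01") & (complete_df["date"] <= "2024-09-30")]

target = "median_sale_price"
features = [c for c in complete_df.select_dtypes("number").columns if c != target]
X_train, y_train = train[features], train[target]
X_test, y_test = test[features], test[target]

model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, preds))
print("R2 :", r2_score(y_test, preds))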

XGBoost

  • Model 1: xg_reg = xgb.XGBRegressor(n_estimators=100, random_state=42, n_jobs=-1)

    • train_test_split:
      • Training data: 01/01/2014 - 02/28/2023
      • Testing data: 03/01/2023 - 09/30/2024
    • Results:
      • Variance Explained: ~87%
      • Mean Absolute Error: 31085.416844472165 (an average difference of about $31,085.42 between actual and predicted values)
      • R² Score: 0.8782984667958691
  • Model 2 restricted the model to the Top 25 features: xg_reg_top25 = xgb.XGBRegressor(n_estimators=75, random_state=42, n_jobs=-1)

    • train_test_split: No change
    • Results (reducing n_estimators to 75 traded a slightly higher MAE for a slightly higher R² score):
      • Variance Explained: ~88%
      • Mean Absolute Error: 33659.201000361085 (an average difference of about $33,659.20 between actual and predicted values)
      • R² Score: 0.8825637115322471
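A sketch of the two XGBoost fits, reusing the split from the linear baseline. Selecting the top 25 features via the model's own importances is an assumption about how Model 2 was built, mirroring the ranking approach shown in the Random Forest section below.

# Model 1: all features
import xgboost as xgb
from sklearn.metrics import mean_absolute_error, r2_score

xg_reg = xgb.XGBRegressor(n_estimators=100, random_state=42, n_jobs=-1)
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)
print("MAE:", mean_absolute_error(y_test, preds), "R2:", r2_score(y_test, preds))

# Model 2: restrict to the 25 most important features, with fewer trees
top_25 = X_train.columns[xg_reg.feature_importances_.argsort()[::-1]][:25]
xg_reg_top25 = xgb.XGBRegressor(n_estimators=75, random_state=42, n_jobs=-1)
xg_reg_top25.fit(X_train[top_25], y_train)
preds_25 = xg_reg_top25.predict(X_test[top_25])
print("MAE:", mean_absolute_error(y_test, preds_25), "R2:", r2_score(y_test, preds_25))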

Random Forest Regressor

  • Model 3: Iterated several versions to achieve the highest possible accuracy.
    • The first step was to set a baseline for the number of n_estimators, working up from 50 to 75 to 100.
      • rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
      • Extracted the top contributing features from the importance results, as overall accuracy was mirroring the XGBoost model.
  • Results:
    • Mean Absolute Error: 31887.499789915968
    • R² Score: 0.8714313381885137
# Calculate the top 25 features and sort in descending order of importance
importances = rf_model.feature_importances_
feature_names = X_train.columns
sorted_indices = importances.argsort()[::-1]  # argsort is ascending, so reverse

top_25_features = feature_names[sorted_indices][:25]
print("Top 25 Features:", top_25_features)
  • Added an engineered feature, the product of two existing numeric features, to give the model a direct signal for their interaction.
# Multiply interactions that have higher influence in feature set 
X_train['ppsf_inventory_interaction'] = X_train['median_ppsf_log'] * X_train['inventory']
X_test['ppsf_inventory_interaction'] = X_test['median_ppsf_log'] * X_test['inventory']
  • train_test_split:

    • Training data: 01/01/2014 - 02/28/2023
    • Testing data: 03/01/2023 - 09/30/2024
    • Results:
      • Variance Explained: ~94%
      • Mean Absolute Error: 8629.376680672269 (an average difference of about $8,629.38 between actual and predicted values)
      • R² Score: 0.9473431378488051
  • Further optimization steps:

    • upper_limit_price = 1500000 capped median_list_price at $1.5 million.
    • upper_limit_dom = 730 removed Days on Market outliers by capping at two years.
    • Extended the train_test_split training window to begin 01/01/2013 (through 02/28/2023); no change to the test split.
    • Ran GridSearchCV to determine the optimal values for n_estimators, max_depth, and min_samples. Running with and without the min_samples parameters showed that increasing n_estimators and capping tree depth improved performance (see the sketch after this list).
    • Model: best_rf_model = RandomForestRegressor(max_depth=20, n_estimators=300, random_state=42, n_jobs=-1)
    • Results:
      • Variance Explained: ~94%
      • Optimized Mean Absolute Error: 8234.135995214614 (an average difference of about $8,234.14 between actual and predicted values)
      • Optimized R² Score: 0.9479948988106021
  • Final Model Prediction: added the min_samples values from the optimization results.

    • final_model = RandomForestRegressor(max_depth=20, min_samples_leaf=1, min_samples_split=2, n_estimators=300, random_state=42, n_jobs=-1)
    • train_test_split: Training (03/01/2013 - 06/30/2024) and Testing (07/2024 - 09/2024), holding out the most recent quarter to predict future results.
    • Results:
      • Variance Explained: ~95%
      • Mean Absolute Error (Holdout): 7051.7177115952245 (an average difference of about $7,051.72 between actual and predicted values)
      • R² Score (Holdout): 0.9526112229978853
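A hedged sketch of the GridSearchCV step and the final model referenced above. The hyperparameters in final_model come from this README; the grid values beyond those named are illustrative, not the team's actual search space.

# Grid search over the parameters named above; grid values are illustrative
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}
grid = GridSearchCV(
    RandomForestRegressor(random_state=42, n_jobs=-1),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)

# Final model with the selected hyperparameters
final_model = RandomForestRegressor(
    max_depth=20, min_samples_leaf=1, min_samples_split=2,
    n_estimators=300, random_state=42, n_jobs=-1,
)
final_model.fit(X_train, y_train)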

Tableau and Google Classroom presentation links

  • Project 4 Home Price Team (Tableau dashboard)
  • Google Slides

Team

  • Chad Hillman
  • Mohamoud Jama
  • DJ Dimetros
  • Daniel Simonson

License

This project is licensed under the terms of the GNU General Public License v3.0. For more details, see the LICENSE file.
