Is your feature request related to a problem? Please describe.
It would be really helpful for learners if there were complete workflows that bring all of the steps together into a single finished product. This request is related to issue #167.
Describe the solution you'd like
I like to work along with the book and explore the features, methods, and output of the code I'm running; this helps me solidify my learning and understanding. A single cell (or group of cells) containing a fully functional example would solve this and let learners experiment against a known-working solution.
See cell 68 (also included below) in my fork of the Chapter 2 notebook.
Additional context
As I was pulling this together to make it work outside of the sample notebooks, I could not figure out where the training data was entering the pipeline. I remembered that housing had been assigned the unprocessed raw data pulled from the tarball in cell 4 of the Chapter 2 notebook. What I had forgotten is that around cell 30 it is reassigned so that it contains the stratified training set.
Figuring this out took me far longer than I'd like to admit, and it was frustrating that I couldn't make the code work until a close reading revealed the reassignment.
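For concreteness, here is the pattern in question, paraphrased from the notebook (cell numbers as in the current Chapter 2 notebook):

housing = load_housing_data()    # cell 4: housing holds the raw DataFrame
# ... roughly 25 cells later ...
housing = strat_train_set.drop("median_house_value", axis=1)  # housing now holds the stratified training features

The same name pointing at two different datasets is exactly what a single self-contained cell would make obvious.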
Example
from pathlib import Path
import tarfile
import urllib.request

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
def load_housing_data():
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))
class ClusterSimilarity(BaseEstimator, TransformerMixin):
    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None, sample_weight=None):
        self.kmeans_ = KMeans(self.n_clusters, n_init=10,
                              random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self  # always return self!

    def transform(self, X):
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

    def get_feature_names_out(self, names=None):
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]
def column_ratio(X):
    return X[:, [0]] / X[:, [1]]

def ratio_name(function_transformer, feature_names_in):
    return ["ratio"]  # feature names out

def ratio_pipeline():
    return make_pipeline(
        SimpleImputer(strategy="median"),                                  # (A) impute missing values
        FunctionTransformer(column_ratio, feature_names_out=ratio_name),  # (C) create ratio features
        StandardScaler())                                                  # (F) scale all the values
# load the unprocessed raw data and add an income category for stratified sampling
housing = load_housing_data()
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

# split into stratified train/test sets (a plain random split would be
# train_test_split(housing, test_size=0.2, random_state=42), but only the
# stratified split below is used from here on)
strat_train_set, strat_test_set = train_test_split(
    housing,
    test_size=0.2,
    stratify=housing["income_cat"],
    random_state=42)
# drop the income_cat column now that stratification is done
for set_ in (strat_test_set, strat_train_set):
    set_.drop("income_cat", axis=1, inplace=True)

housing_labels = strat_train_set["median_house_value"].copy()
# NOTE: housing is reassigned here -- from this point on it holds the
# stratified training features, not the raw data loaded above
housing = strat_train_set.drop("median_house_value", axis=1)
cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),   # (A) impute missing values
    OneHotEncoder(handle_unknown="ignore"))    # (B) encode categorical data as binary one-hot columns

# (E) transform "long-tail" features into more gaussian (normal) distributions
log_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),          # (A) impute missing values
    FunctionTransformer(np.log, feature_names_out="one-to-one"),
    StandardScaler())                          # (F) scale all the values

cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)

default_num_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),          # (A) impute missing values
    StandardScaler())                          # (F) scale all the values
preprocessing = ColumnTransformer([
        ("bedrooms", ratio_pipeline(), ["total_bedrooms", "total_rooms"]),
        ("rooms_per_house", ratio_pipeline(), ["total_rooms", "households"]),
        ("people_per_house", ratio_pipeline(), ["population", "households"]),
        ("log", log_pipeline, ["total_bedrooms", "total_rooms", "population",
                               "households", "median_income"]),
        ("geo", cluster_simil, ["latitude", "longitude"]),
        ("cat", cat_pipeline, make_column_selector(dtype_include=object)),
    ],
    remainder=default_num_pipeline)  # one column remaining: housing_median_age
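To make the cell truly end-to-end, a quick smoke test could be appended. The lines below are my addition (they are not part of cell 68) and use only standard scikit-learn calls; the expected shape assumes the default settings above:

housing_prepared = preprocessing.fit_transform(housing)  # fit on the stratified training features
print(housing_prepared.shape)                 # e.g. (16512, 24): 3 ratios + 5 log + 10 cluster + 5 one-hot + 1 remainder
print(preprocessing.get_feature_names_out())  # names contributed by each sub-pipeline

Seeing the transformed shape and the generated feature names right away would have saved me the hunt for where housing gets its training data.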