Est. read time: 3 minutes | Last updated: July 17, 2024 by John Gentile



Here we’ll use the California housing price dataset from the 1990 census, available from the StatLib repository, which contains metrics such as population, income, and housing price for each block group (the smallest geographical unit for which the census publishes sample data, typically 600 to 3,000 people). Our goal is for our ML model to accurately predict the median housing price in any district (block group), given all the other metrics.

Model Approach

Since we are given labeled training examples (census sample data that gives the expected output, the median housing price, for each set of features), this is a supervised learning task. Since we are looking to predict a value, it is a regression task; more specifically, it is a multiple regression problem, since we need to consider multiple features to make the output prediction (e.g. population, median income, etc.). It is also a univariate regression problem, since we only need to predict a single value for each district; conversely, if we needed to predict multiple values per district, it would be a multivariate regression problem. Finally, since data is not continuously streaming into the system and the dataset is small enough to fit in memory, batch learning is fine for this model.

Performance Measure

As is typical for regression problems, we will use the Root Mean Square Error (RMSE) to get an idea of how much error the system typically makes in its predictions, with a higher weight given to large errors. RMSE is calculated by:

\[ RMSE(\boldsymbol{X},h) = \sqrt{ \frac{1}{m} \sum^{m}_{i=1} \left ( h(\boldsymbol{x}^{(i)}) - y^{(i)} \right )^{2} } \]


  • \(m\) is the number of samples in the dataset currently being measured
  • \(\boldsymbol{x}^{(i)}\) is a vector of all feature values (excluding the label, \(y^{(i)}\)) of the i-th instance in the dataset
    • For instance, if the first district in the dataset has a longitude of -118.29°, a latitude of 33.91°, a population of 1416, and a median income of $38,372, with a label (median house value) of $156,400, then the vector and label would look like:
\[ \boldsymbol{x}^{(1)} = \begin{pmatrix} -118.29 \\ 33.91 \\ 1416 \\ 38372 \end{pmatrix}, \quad y^{(1)} = 156400 \]
  • \(\boldsymbol{X}\) is the matrix containing all feature values (excluding labels) for all instances in the dataset, one row per instance, which with the above example values looks like:
\[ \boldsymbol{X} = \begin{bmatrix} \left ( \boldsymbol{x}^{(1)} \right )^{T} \\ \left ( \boldsymbol{x}^{(2)} \right )^{T} \\ \vdots \\ \left ( \boldsymbol{x}^{(m)} \right )^{T} \end{bmatrix} = \begin{pmatrix} -118.29 & 33.91 & 1416 & 38372 \\ \vdots & \vdots & \vdots & \vdots \end{pmatrix} \]
  • \(h\) is the system’s prediction function (aka hypothesis); it gives the system’s output for a feature vector \(\boldsymbol{x}^{(i)}\): \(\hat{y}^{(i)} = h(\boldsymbol{x}^{(i)})\)
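As a minimal sketch, the RMSE formula above could be computed with NumPy as follows. The function name `rmse` and the three label/prediction pairs are made up for illustration; they are not taken from the housing dataset or from a trained model.

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root Mean Square Error between predictions h(x) and labels y."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    # square each error, average over the m samples, then take the square root
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Hypothetical labels and predictions for three districts (made-up numbers)
y_true = np.array([156400.0, 200000.0, 300000.0])
y_pred = np.array([150400.0, 206000.0, 300000.0])

error = rmse(y_pred, y_true)
print("RMSE: %.2f" % error)
```

Here the errors are -6000, 6000, and 0, so the mean squared error is 24,000,000 and the RMSE is its square root, roughly 4899.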

One could also use another performance measure, the Mean Absolute Error (MAE), which averages the absolute differences between the predictions and the target values:

\[ MAE(\boldsymbol{X},h) = \frac{1}{m} \sum^{m}_{i=1} \left | h(\boldsymbol{x}^{(i)}) - y^{(i)} \right | \]
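The MAE can be sketched the same way. Again, the function name `mae` and the sample values are assumptions made for illustration only.

```python
import numpy as np

def mae(y_pred, y_true):
    """Mean Absolute Error between predictions h(x) and labels y."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    # average the absolute value of each error over the m samples
    return np.mean(np.abs(y_pred - y_true))

# Hypothetical labels and predictions for three districts (made-up numbers)
y_true = np.array([156400.0, 200000.0, 300000.0])
y_pred = np.array([150400.0, 206000.0, 300000.0])

abs_error = mae(y_pred, y_true)
print("MAE: %.2f" % abs_error)
```

With absolute errors of 6000, 6000, and 0, the MAE is 4000.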

These various distance measures are also called norms:

  • Computing RMSE corresponds to the Euclidean norm, or the \(\ell_{2}\) norm, denoted colloquially as \(\begin{Vmatrix} \cdot \end{Vmatrix}\) (or \(\begin{Vmatrix} \cdot \end{Vmatrix}_{2}\) more specifically)
  • Computing MAE corresponds to the Manhattan norm (because it measures the distance between two points in a city where you can only travel along orthogonal blocks), or the \(\ell_{1}\) norm, denoted \(\begin{Vmatrix} \cdot \end{Vmatrix}_{1}\)

In general, the \(\ell_{k}\) norm of a vector \(\boldsymbol{v}\) containing \(n\) elements is defined as:

\[ \begin{Vmatrix} \boldsymbol{v} \end{Vmatrix}_{k} = \left ( |v_{1}|^{k} + |v_{2}|^{k} + \dotsb + |v_{n}|^{k} \right )^{1/k} \]

The higher the norm index, the more it focuses on large values and neglects small ones. This is why RMSE is more sensitive to outliers than MAE. However, when outliers are exponentially rare (as in a bell-shaped distribution), RMSE performs very well and is generally preferred.
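To make the outlier sensitivity concrete, here is a small sketch of the general norm. It relies on the fact that RMSE is the \(\ell_{2}\) norm of the error vector divided by \(\sqrt{m}\), and MAE is the \(\ell_{1}\) norm divided by \(m\). The helper names and the error values are hypothetical.

```python
import numpy as np

def lk_norm(v, k):
    """l_k norm: (|v_1|^k + ... + |v_n|^k)^(1/k)."""
    v = np.asarray(v, dtype=float)
    return np.sum(np.abs(v) ** k) ** (1.0 / k)

def rmse_from_errors(errors):
    # RMSE is the l_2 norm of the error vector divided by sqrt(m)
    return lk_norm(errors, 2) / np.sqrt(len(errors))

def mae_from_errors(errors):
    # MAE is the l_1 norm of the error vector divided by m
    return lk_norm(errors, 1) / len(errors)

# Made-up prediction errors: identical except for one outlier
errors_clean = [1.0, 1.0, 1.0, 1.0]
errors_outlier = [1.0, 1.0, 1.0, 10.0]

for name, errors in [("clean", errors_clean), ("outlier", errors_outlier)]:
    print(name, rmse_from_errors(errors), mae_from_errors(errors))
```

With these made-up errors, both measures equal 1 without the outlier. With it, MAE rises to 3.25 while RMSE rises to about 5.07, showing the heavier weight that the higher norm index puts on the large value.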

Dataset Creation

Dataset Download

Here we will download the comma-separated values (CSV) file that contains our housing dataset, and load it into memory using pandas.

import os
import tarfile
import urllib.request
import pandas as pd

# base URL of the host serving the tarball (empty here)
DL_FOLDER    = ""
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL  = DL_FOLDER + "datasets/housing/housing.tgz"

# download the housing dataset tarball and extract it into housing_path
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # the context manager closes the archive once extraction is done
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

# load the extracted CSV file into a pandas DataFrame
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

fetch_housing_data() # download now
housing = load_housing_data()
housing.describe() # show a summary of the numerical attributes

          longitude      latitude  housing_median_age   total_rooms  total_bedrooms    population    households  median_income  median_house_value
count  20640.000000  20640.000000        20640.000000  20640.000000    20433.000000  20640.000000  20640.000000   20640.000000        20640.000000
mean    -119.569704     35.631861           28.639486   2635.763081      537.870553   1425.476744    499.539680       3.870671       206855.816909
std        2.003532      2.135952           12.585558   2181.615252      421.385070   1132.462122    382.329753       1.899822       115395.615874
min     -124.350000     32.540000            1.000000      2.000000        1.000000      3.000000      1.000000       0.499900        14999.000000
25%     -121.800000     33.930000           18.000000   1447.750000      296.000000    787.000000    280.000000       2.563400       119600.000000
50%     -118.490000     34.260000           29.000000   2127.000000      435.000000   1166.000000    409.000000       3.534800       179700.000000
75%     -118.010000     37.710000           37.000000   3148.000000      647.000000   1725.000000    605.000000       4.743250       264725.000000
max     -114.310000     41.950000           52.000000  39320.000000     6445.000000  35682.000000   6082.000000      15.000100       500001.000000
# render plots within notebook itself
%matplotlib inline 
import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(20,15))


There are a few important points to note in the histograms above:

  • median_income values were scaled and capped to lie between roughly 0.5 and 15, in units of about $10,000 (e.g. a value of 3 is equivalent to about $30k). This kind of preprocessing is fine, and common in ML tasks.
  • housing_median_age and median_house_value values are capped (at 52 and 500,001 respectively, per the summary above). The latter may cause an issue, since the house value is our target attribute (label) and you don’t want the ML model to learn that prices never go above that limit.
  • Attributes have very different scales, which we’ll need to tackle with feature scaling.
  • Many of the histograms are tail-heavy (the distribution extends much farther to the right of the median than to the left), which can make it harder for some ML algorithms to detect patterns.
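One rough way to check whether an attribute is capped, sketched below, is to count how many rows sit exactly at the maximum value; a large spike there suggests a ceiling rather than a natural maximum. The `toy` DataFrame is a synthetic stand-in with made-up values, not the real housing data.

```python
import pandas as pd

# Synthetic stand-in for the housing DataFrame (made-up values)
toy = pd.DataFrame({
    "median_house_value": [120000.0, 250000.0, 500001.0,
                           500001.0, 90000.0, 500001.0],
})

# count the rows whose value equals the column maximum
cap = toy["median_house_value"].max()
at_cap = int((toy["median_house_value"] == cap).sum())
share_at_cap = at_cap / len(toy)

print("max value: %.0f" % cap)
print("rows at max: %d (%.0f%% of rows)" % (at_cap, 100 * share_at_cap))
```

The same check could be run on the real `housing` DataFrame by swapping it in for `toy`.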

Creating Test Set

Creating a test set can be as simple as picking a random subset of the dataset (usually around 20%, or less for larger datasets) and setting it aside:

import numpy as np

def split_train_test(data, test_ratio):
    shuffled_idx  = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_idx      = shuffled_idx[:test_set_size]
    train_idx     = shuffled_idx[test_set_size:]
    return data.iloc[train_idx], data.iloc[test_idx]
train_set, test_set = split_train_test(housing, 0.2)
print("Training dataset size: %d" % len(train_set))
print("Testing dataset size: %d" % len(test_set))

Training dataset size: 16512
Testing dataset size: 4128

While the above works, a different test set is generated each time you run the code; over time this means you (and your model) will get to see the whole dataset, which is what you want to avoid. One way to prevent this is to set the random number generator’s seed (e.g. with np.random.seed()) before calling np.random.permutation(), so that the same shuffled indices are produced on every run.
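A minimal sketch of the seeded variant is shown below. The `seed` parameter and its default of 42 are assumptions added for illustration, and the small `toy` DataFrame stands in for the housing data.

```python
import numpy as np
import pandas as pd

def split_train_test(data, test_ratio, seed=42):
    # fixing the seed makes the permutation identical on every run
    # (note: this resets NumPy's global random state as a side effect)
    np.random.seed(seed)
    shuffled_idx  = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_idx      = shuffled_idx[:test_set_size]
    train_idx     = shuffled_idx[test_set_size:]
    return data.iloc[train_idx], data.iloc[test_idx]

# Synthetic stand-in dataset with ten rows (made-up values)
toy = pd.DataFrame({"value": range(10)})

train_a, test_a = split_train_test(toy, 0.2)
train_b, test_b = split_train_test(toy, 0.2)

# both calls select the same rows for the test set
print(test_a.index.equals(test_b.index))
```

Note that a fixed seed only keeps the split stable as long as the dataset itself does not change; if rows are added or reordered, the same seed will select different rows.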