Housing Rental Prices in Canada

Intro

I am a recent graduate of UBC and currently live in Kelowna, a small city in B.C. After graduating, I faced a question common among new graduates (especially international ones): where in Canada should I live? The question matters even more given that small cities offer relatively few job opportunities for skilled workers.

One of the factors in deciding which city to call “home” is the rental price of housing. Unfortunately, most of the analyses on GitHub and elsewhere focus on housing purchase prices. Although the two are closely related, exploring rental prices can reveal interesting patterns of their own.

Objectives

I ask the following questions in this analysis:

  1. Which cities are the cheapest in Canada?
  2. What factors explain rental prices in Canada?
  3. Which cities remain the cheapest after controlling for a wide range of factors that affect rental price, such as size and number of bedrooms?
  4. How well can housing rental prices be forecast with machine learning models?

Tools

I used Python and Jupyter notebooks for the analysis. All of the code is available in the project’s repository on GitHub.

Data collection

Data is scraped from the following two websites:

1. English Canada

2. Montreal

Click here to see the sample code I used to scrape data.
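The linked notebook contains the full scraper; as a rough illustration only, here is a minimal sketch of the approach using requests and BeautifulSoup. The URL, CSS selectors, and field names below are placeholders, not the actual structure of the sites that were scraped.

import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_listing_page(url):
    """Download one page of rental listings and extract basic fields.
    The URL and the CSS selectors below are hypothetical placeholders."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    rows = []
    for card in soup.select("div.listing-card"):  # placeholder selector
        rows.append({
            "title": card.select_one("h2.title").get_text(strip=True),
            "price": card.select_one("span.price").get_text(strip=True),
            "address": card.select_one("span.address").get_text(strip=True),
        })
    return pd.DataFrame(rows)

# Example usage with a placeholder URL
# df_page = scrape_listing_page("https://example.com/rentals?page=1")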

Data cleaning

1. Combining all the collected data

I first collected the data city by city, then merged all of it into one pandas DataFrame. See the Montreal and English Canada folders in the project’s repository on GitHub.
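A minimal sketch of this merging step, assuming the per-city listings were saved as CSV files in a single folder (the path and file layout are assumptions, not the repository’s actual structure):

import glob
import pandas as pd

# Read every per-city CSV and stack them into one DataFrame
city_files = glob.glob("data/*.csv")  # placeholder path
frames = [pd.read_csv(path) for path in city_files]
df = pd.concat(frames, ignore_index=True)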

2. Missing values

I dropped rows with missing values in this analysis.

3. Adding longitude and latitude columns

I used the Google Geocoding API to add longitude and latitude columns.
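A minimal sketch of this geocoding step using the googlemaps client library; the API key, the 'address' column name, and the new column names are assumptions for illustration:

import googlemaps

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder API key

def geocode_address(address):
    """Return (latitude, longitude) for an address, or (None, None) if no match."""
    results = gmaps.geocode(address)
    if not results:
        return None, None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]

df["latitude"], df["longitude"] = zip(*df["address"].map(geocode_address))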

4. Outliers

I dropped outliers by limiting the analysis to listings with a rental price below $7,000 per month, because a small number of very expensive listings would distort the regression analysis later.

5. Exporting the final cleaned data

The final cleaned dataset is saved as “cleaned.csv”.
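Roughly, the cleaning steps above (dropping missing values, removing outliers, and exporting) could look like this, assuming the price column is named 'price' and is in dollars per month:

# Drop rows with missing values
df = df.dropna()

# Keep only listings renting for less than $7,000 per month
df = df[df["price"] < 7000]

# Export the cleaned dataset
df.to_csv("cleaned.csv", index=False)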

Data visualization

See “Data visualization 0.ipynb” and “Data visualization 1.ipynb” for more information.

I used folium.Map to plot the data points geographically on a map of Canada.
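A minimal sketch of such a map with folium, assuming the latitude and longitude columns added earlier (the map centre and output file name are placeholders):

import folium

# Centre the map roughly on Canada
rental_map = folium.Map(location=[56.1, -106.3], zoom_start=4)

# Add a small circle marker for each listing
for _, row in df.iterrows():
    folium.CircleMarker(
        location=[row["latitude"], row["longitude"]],
        radius=2,
        fill=True,
    ).add_to(rental_map)

rental_map.save("rental_map.html")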

Number of rentals by province
Scatter plot of rental size vs. rent per month
Violin plot showing the relationship between province and rental price
Dashboard showing the average rental price in Canada (interactive version linked in the original post)

Machine learning

As I mentioned at the start of this article, my aim is to predict housing rental prices, so this is a supervised learning problem with a numerical target. I chose the following models for the analysis:

  1. Linear regression model
  2. Decision tree
  3. Random forest
  4. Ridge regression
  5. Lasso regression

Feature Engineering
1. I create dummy variables for the number of bedrooms, because the effect of one extra bedroom is unlikely to be the same when going from 1 to 2 bedrooms as when going from 4 to 5 bedrooms.
2. I keep the number of bathrooms as an ordinal (numeric) variable.
3. I create dummy variables for the city and for the housing rental type.
4. I remove province from the analysis because cities are nested within provinces and I already control for city.
5. I also drop the latitude and longitude columns, since location is already captured by the city dummies (a rough sketch of these steps follows below).
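A minimal sketch of these feature-engineering steps; the column names ('bedrooms', 'bathrooms', 'city', 'types', 'province', 'latitude', 'longitude') are assumptions based on the dummy names used later and may differ from the repository’s actual columns:

import pandas as pd

# Dummy variables for bedrooms, city, and rental type;
# the number of bathrooms stays numeric (ordinal)
df_dummies = pd.get_dummies(
    df,
    columns=["bedrooms", "city", "types"],
    prefix=["bedroom", "city", "types"],
)

# Drop columns that are not used as features
df_dummies = df_dummies.drop(columns=["province", "latitude", "longitude"])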

Creating training and testing datasets

from sklearn.model_selection import train_test_split

# 80/20 train/test split into features and target
training_data, testing_data = train_test_split(df_dummies, test_size=0.2, random_state=25)
y_train = training_data['price']
X_train = training_data.drop(columns=['price'])
y_test = testing_data['price']
X_test = testing_data.drop(columns=['price'])
Distribution of price in the training dataset
Distribution of price in the testing dataset

Linear regression

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Drop the reference categories before fitting OLS
X_train_ols = X_train.drop(columns=['bedroom_0', 'types_Apt/Condo', 'city_Abbotsford'])
model_ols = sm.OLS(y_train, X_train_ols).fit()
model_ols.summary()

# Calculate error-bar lengths from the confidence intervals
err_series = model_ols.params - model_ols.conf_int()[0]
coef_df = pd.DataFrame({'coef': model_ols.params.values[1:],
                        'err': err_series.values[1:],
                        'varname': err_series.index.values[1:]
                       })
fig, ax = plt.subplots(figsize=(8, 5))
coef_df.plot(x='varname', y='coef', kind='bar',
             ax=ax, color='none',
             yerr='err', legend=False)
ax.set_ylabel('')
ax.set_xlabel('')
ax.scatter(x=np.arange(coef_df.shape[0]),
           marker='s', s=10,
           y=coef_df['coef'], color='red')
ax.axhline(y=0, linestyle='--', color='black', linewidth=4)
ax.xaxis.set_ticks_position('none')
The impact of all independent variables on rental price in Canada

The above graph shows the impact of each independent variable on the rental price. As we can see, Vancouver is the most expensive rental market in Canada: on average, rentals in Vancouver are $1,303 more expensive than rentals in Abbotsford (the reference group). Rentals in Toronto are cheaper than in Vancouver but more expensive than in Abbotsford.
It is also clear that the number of bedrooms has a non-linear impact on rental price. For example, 1-bedroom and 2-bedroom places are clearly more expensive than 0-bedroom (studio) places, holding other variables constant, yet there is no meaningful difference between 2-bedroom and 3-bedroom places.

from sklearn.metrics import mean_squared_error

# Finding the root mean squared error on the test set and saving it to a dictionary
mse = {}
X_test_ols = X_test.drop(columns=['bedroom_0', 'types_Apt/Condo', 'city_Abbotsford'])  # same reference categories as in training
y_hat = model_ols.predict(X_test_ols)
sqr = mean_squared_error(y_test, y_hat)
ols_mse = np.sqrt(sqr).round(2)
mse['OLS'] = ols_mse

Linear regression with the log transformation

The distribution of the rental price is skewed to the right, so I log-transform the price values to make the distribution look more like a normal one.

import seaborn as sns

y_train_log = np.log1p(y_train)
sns.set_style('whitegrid')
sns.distplot(y_train_log)
plt.show()
The training distribution of log(price+1)
# Fit OLS on log(price + 1) and back-transform predictions to dollars
model_ols_log = sm.OLS(y_train_log, X_train).fit()
y_hat = model_ols_log.predict(X_test)
y_hat_1 = np.exp(y_hat) - 1
sqr = mean_squared_error(y_test, y_hat_1)
ols_mse_log = np.sqrt(sqr).round(2)
mse['OLS with log'] = ols_mse_log

Decision tree

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Compute the cost-complexity pruning path to get candidate alphas
clf = DecisionTreeRegressor(random_state=0)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
test=[]
nodes=[]
for val in ccp_alphas:
        model=DecisionTreeRegressor(random_state = 0,ccp_alpha= val)
        score = cross_val_score(model,X=X_train,y=y_train,scoring="neg_mean_squared_error",cv=5)
        test.append((val, score.mean()))
        model.fit(X_train,y_train)
        nodes.append((val,model.tree_.node_count))
df1 = pd.DataFrame(test, 
             columns=['alpha', 
                      'mse'])
df2 = pd.DataFrame(nodes, 
             columns=['alpha', 
                      'nodes'])
df2['mse']= -df1['mse']
df2.plot(x ='nodes', y='mse', kind = 'line')
plt.ylabel('MSE')
plt.xlabel('Nodes')
plt.title('The relationship between the number of nodes and MSE')
df2.plot(x ='alpha', y='mse', kind = 'line')
The relationship between number of nodes and MSE
The relationship between alpha and MSE
# Refit the tree with the alpha chosen by cross-validation
model_cross = DecisionTreeRegressor(random_state=0, ccp_alpha=11647.355982)
model_cross.fit(X_train, y_train)
y_hat = model_cross.predict(X_test)
sqr = mean_squared_error(y_test, y_hat)
mse_Decision_tree = np.sqrt(sqr).round(2)
mse['Decision Tree'] = mse_Decision_tree  # adding root MSE to the created dictionary

Random forest

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Set parameters for the grid search
param_grid = {'n_estimators': [200, 300, 400, 500, 600],
              'max_features': [0.1, 0.3, 0.6]
             }
# Initialise the random forest model
RandForest = RandomForestRegressor(n_jobs=-1, random_state=0, bootstrap=True)

# Initialise GridSearchCV with 5-fold cross-validation and negative root mean squared error
Tuned_RandForest = GridSearchCV(estimator=RandForest, param_grid=param_grid, scoring='neg_root_mean_squared_error', cv=5)
Tuned_RandForest.fit(X_train, y_train)
Results = pd.DataFrame(Tuned_RandForest.cv_results_)
Results_Best = Results.loc[Results.rank_test_score == 1]

print('Random Forest Regressor')

# Create a plot to show all models 

for col in ['param_max_features']:
    grid = sns.FacetGrid(Results, col=col, hue='rank_test_score', palette="tab20c", legend_out=False,
                         col_wrap=5, height=5)

    # Draw a horizontal line to show the starting point
    grid.map(plt.axhline, y=0, ls=":", c=".5")

    # Draw markers on the plot and choose which parameters to plot
    grid.map(plt.plot, "param_n_estimators", "mean_test_score", marker="o")

    # Adjust the arrangement of the plots
    grid.fig.tight_layout(w_pad=1)

    # Add a legend to the grid
    grid.add_legend()
Random forest tuning

The above graph shows the results of all the models. As the graph makes clear, the best result is obtained with n_estimators=600 and max_features=0.6.

# Refit the random forest with the best hyperparameters
RandForest = RandomForestRegressor(n_jobs=-1, random_state=0, bootstrap=True, n_estimators=600, max_features=0.6)
RandForest.fit(X_train, y_train)
y_hat = RandForest.predict(X_test)
mm = mean_squared_error(y_test, y_hat)
mse_random_forest = np.sqrt(mm).round(2)
mse['Random Forest'] = mse_random_forest  # saving root MSE

Ridge

from sklearn.linear_model import RidgeCV

alphas = 10**np.linspace(10, -2, 100)*0.5  # defining a range of candidate alphas
regr_cv = RidgeCV(alphas=alphas)
reg_model = regr_cv.fit(X_train, y_train)
y_hat=reg_model.predict(X_test)
sqr=mean_squared_error(y_test, y_hat)
mse_ridge=np.sqrt(sqr).round(2)
mse['Ridge']=mse_ridge #saving root MSE of model

Lasso

from sklearn import linear_model

# First define all the candidate alphas, then create GridSearchCV
alphas = 10**np.linspace(10, -2, 100)*0.5  # a wide range of alphas
param_grid = {'alpha': alphas}
lasso = linear_model.Lasso(random_state=42, normalize=True, tol=1e-2)
Tuned_Lasso = GridSearchCV(estimator=lasso, param_grid=param_grid, scoring='neg_root_mean_squared_error', cv=5)
Tuned_Lasso.fit(X_train, y_train)
y_hat = Tuned_Lasso.predict(X_test)
sqr=mean_squared_error(y_test, y_hat)
mse_lasso=np.sqrt(sqr).round(2)
mse['Lasso']=mse_lasso
mse

Model Comparison

The table below compares the root MSE of the different models on the testing dataset.
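As a small sketch, the mse dictionary built above can be turned into that comparison table directly (the series name is an assumption):

# Convert the dictionary of root MSE values into a sorted comparison table
comparison = pd.Series(mse, name="Root MSE").sort_values()
print(comparison)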
Mehdi Mohamadian