Housing Rental Prices in Canada
Intro
I am a recent graduate of UBC, currently living in Kelowna, a small city in B.C. After graduating, I faced a question common among new graduates (especially international ones): where in Canada should I live? The question matters even more given that small cities offer few job opportunities for skilled workers.
One of the factors in deciding which city to call "home" is the housing rental price. Unfortunately, most of the analyses on GitHub and elsewhere focus on housing purchase prices. Although the two are closely related, exploring rental prices can reveal interesting patterns of its own.
Objectives
This analysis addresses the following questions and goals:
- Which cities are the cheapest in Canada?
- What are the factors explaining the rental price in Canada?
- Which cities are the cheapest even after controlling for a wide range of factors that affect rental prices, such as size and number of bedrooms?
- Build forecasting models for housing rental prices.
Tools
I used Python and Jupyter notebooks for the analysis. All of the code is available in the project's repository on GitHub.
Data collection
Data is scraped from the following two websites:
1. English Canada
2. Montreal
Click here to see the sample code I used to scrape data.
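For readers who cannot follow the link, here is a minimal sketch of the kind of scraping loop involved; the URL, CSS selectors, and field names are placeholders, not the real sites':
import requests
import pandas as pd
from bs4 import BeautifulSoup

listings = []
for page in range(1, 6):  # walk through the first few result pages
    url = f"https://example-rentals.ca/listings?page={page}"  # placeholder URL
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for card in soup.select("div.listing-card"):  # placeholder selector
        listings.append({
            "title": card.select_one("h3").get_text(strip=True),
            "price": card.select_one("span.price").get_text(strip=True),
            "address": card.select_one("span.address").get_text(strip=True),
        })
pd.DataFrame(listings).to_csv("raw_listings.csv", index=False)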
Data cleaning
1. Combining all the collected data
I first collected the data city by city, then merged everything into one pandas DataFrame. See the Montreal and English Canada folders in the project's repository on GitHub.
2. Missing values
I dropped rows with missing values in this analysis.
3. Adding longitude and latitude columns
I used the Google Geocoding API to add longitude and latitude columns.
4. Outliers
I dropped outliers, limiting the analysis to listings renting for less than $7,000, because a handful of very expensive listings would distort the regression analysis later.
5. Exporting the final cleaned data
The final cleaned dataset is saved as "cleaned.csv".
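A condensed sketch of the cleaning steps above; the folder layout and the column names (address, price) are assumptions, and the geocoding step assumes the googlemaps client with a valid API key:
import glob
import numpy as np
import pandas as pd
import googlemaps

# 1. Combine the per-city files into one DataFrame
files = glob.glob("data/*.csv")  # illustrative path to the per-city files
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
# 2. Drop rows with missing values
df = df.dropna()
# 3. Add latitude and longitude via the Google Geocoding API (assumed address column)
gmaps = googlemaps.Client(key="YOUR_API_KEY")
def geocode(address):
    result = gmaps.geocode(address)
    loc = result[0]["geometry"]["location"] if result else {"lat": np.nan, "lng": np.nan}
    return pd.Series([loc["lat"], loc["lng"]])
df[["latitude", "longitude"]] = df["address"].apply(geocode)
# 4. Drop outliers: keep only listings renting for less than $7,000
df = df[df["price"] < 7000]
# 5. Export the cleaned data
df.to_csv("cleaned.csv", index=False)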
Data visualization
See "Data visualization 0.ipynb" and "Data visualization 1.ipynb" for more information.
I used folium.Map to display the data points geographically on a map of Canada.
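A minimal sketch of the map, assuming the latitude, longitude and price columns from the cleaned data:
import folium
import pandas as pd

df = pd.read_csv("cleaned.csv")
# Centre the map roughly on Canada and add one small marker per listing
m = folium.Map(location=[56.1, -106.3], zoom_start=4)
for _, row in df.iterrows():
    folium.CircleMarker(location=[row["latitude"], row["longitude"]],
                        radius=2,
                        popup=f"${row['price']}").add_to(m)
m.save("rentals_map.html")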
Machine learning
As I mentioned at the start of this article, the aim is to predict housing rental prices. Since the target is a numerical value, this is a supervised learning (regression) problem. I chose the following models for the analysis:
- Linear regression model
- Decision tree
- Random forest
- Ridge regression
- Lasso regression
Feature Engineering
1. I create dummy variables for the number of bedrooms because the effect of an extra bedroom is unlikely to be constant: going from 1 to 2 bedrooms may change the rent differently than going from 4 to 5 bedrooms.
2. I keep the number of bathrooms as an ordinal variable.
3. I create dummy variables for the city and the housing type.
4. I remove the province from the analysis because cities are nested within provinces and I already control for the city.
5. Finally, I remove the latitude and longitude columns, since the city dummies already capture location. A short sketch of these steps is shown after this list.
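A minimal sketch of the feature engineering, assuming the cleaned data uses the column names bedroom, city, types, province, latitude and longitude (the dummy prefixes bedroom_, city_ and types_ match the columns referenced in the code below):
import pandas as pd

df = pd.read_csv("cleaned.csv")
# Drop columns that are not used as features (province is nested in city; coordinates are redundant)
df = df.drop(columns=["province", "latitude", "longitude"])
# Dummy-code bedrooms, city and housing type; bathrooms stay as an ordinal number
df_dummies = pd.get_dummies(df, columns=["bedroom", "city", "types"])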
Creating training and testing datasets
# Split the dummy-coded data into training and testing sets
from sklearn.model_selection import train_test_split

training_data, testing_data = train_test_split(df_dummies, test_size=0.2, random_state=25)
y_train = training_data['price']
X_train = training_data.drop(columns=['price'])
y_test = testing_data['price']
X_test = testing_data.drop(columns=['price'])
Linear regression
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Drop the reference categories before fitting
X_train_ols = X_train.drop(columns=['bedroom_0', 'types_Apt/Condo', 'city_Abbotsford'])
model_ols = sm.OLS(y_train, X_train_ols).fit()
model_ols.summary()

# Calculate the error-bar length for each coefficient
err_series = model_ols.params - model_ols.conf_int()[0]
coef_df = pd.DataFrame({'coef': model_ols.params.values[1:],
                        'err': err_series.values[1:],
                        'varname': err_series.index.values[1:]
                        })

# Plot the coefficients with their confidence intervals
fig, ax = plt.subplots(figsize=(8, 5))
coef_df.plot(x='varname', y='coef', kind='bar',
             ax=ax, color='none',
             yerr='err', legend=False)
ax.set_ylabel('')
ax.set_xlabel('')
ax.scatter(x=np.arange(coef_df.shape[0]),
           marker='s', s=10,
           y=coef_df['coef'], color='red')
ax.axhline(y=0, linestyle='--', color='black', linewidth=4)
ax.xaxis.set_ticks_position('none')
The above graph shows the impact of the different independent variables on the rental price. As we can see, Vancouver is the most expensive city for rentals in Canada: on average, rentals in Vancouver are $1,303 more expensive than rentals in Abbotsford (the reference group). Rentals in Toronto are cheaper than in Vancouver but more expensive than in Abbotsford.
It is also clear that the number of bedrooms has a non-linear impact on the rental price. For example, 1-bedroom and 2-bedroom places are clearly more expensive than 0-bedroom places, holding other variables constant, yet there is no clear difference between 2-bedroom and 3-bedroom places.
# Finding the root mean squared error on the test set and saving it to a dictionary
from sklearn.metrics import mean_squared_error

mse = {}
X_test_ols = X_test.drop(columns=['bedroom_0', 'types_Apt/Condo', 'city_Abbotsford'])  # drop the same reference categories
y_hat = model_ols.predict(X_test_ols)
sqr = mean_squared_error(y_test, y_hat)
ols_mse = np.sqrt(sqr).round(2)
mse['OLS'] = ols_mse
Linear regression with the log transformation
The distribution of the rental price is skewed to the right, so I log-transform the price values to make the distribution closer to normal.
import seaborn as sns

# Log-transform the target and check that its distribution looks closer to normal
y_train_log = np.log1p(y_train)
sns.set_style('whitegrid')
sns.distplot(y_train_log)
plt.show()

# Fit OLS on the log-transformed price and transform the predictions back
model_ols_log = sm.OLS(y_train_log, X_train).fit()
y_hat = model_ols_log.predict(X_test)
y_hat_1 = np.exp(y_hat) - 1  # invert the log1p transform
sqr = mean_squared_error(y_test, y_hat_1)
ols_mse_log = np.sqrt(sqr).round(2)
mse['OLS with log'] = ols_mse_log
Decision tree
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Compute the cost-complexity pruning path to get candidate values of ccp_alpha
clf = DecisionTreeRegressor(random_state=0)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Cross-validate a tree for each candidate alpha and record the tree size
test = []
nodes = []
for val in ccp_alphas:
    model = DecisionTreeRegressor(random_state=0, ccp_alpha=val)
    score = cross_val_score(model, X=X_train, y=y_train, scoring="neg_mean_squared_error", cv=5)
    test.append((val, score.mean()))
    model.fit(X_train, y_train)
    nodes.append((val, model.tree_.node_count))

df1 = pd.DataFrame(test, columns=['alpha', 'mse'])
df2 = pd.DataFrame(nodes, columns=['alpha', 'nodes'])
df2['mse'] = -df1['mse']  # flip the sign of the negative MSE

# Plot MSE against tree size and against alpha
df2.plot(x='nodes', y='mse', kind='line')
plt.ylabel('MSE')
plt.xlabel('Nodes')
plt.title('The relationship between the number of nodes and MSE')
df2.plot(x='alpha', y='mse', kind='line')

# Refit with the alpha chosen from cross-validation and evaluate on the test set
model_cross = tree.DecisionTreeRegressor(random_state=0, ccp_alpha=11647.355982)
model_cross.fit(X_train, y_train)
y_hat = model_cross.predict(X_test)
sqr = mean_squared_error(y_test, y_hat)
mse_Decision_tree = np.sqrt(sqr).round(2)
mse['Decision Tree'] = mse_Decision_tree  # adding the root MSE to the dictionary
Random forest
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Set parameters for the grid search
param_grid = {'n_estimators': [200, 300, 400, 500, 600],
              'max_features': [0.1, 0.3, 0.6]
              }

# Initialise the random forest model
RandForest = RandomForestRegressor(n_jobs=-1, random_state=0, bootstrap=True)

# Initialise GridSearchCV with 5-fold cross-validation and negative root mean squared error
Tuned_RandForest = GridSearchCV(estimator=RandForest, param_grid=param_grid, scoring='neg_root_mean_squared_error', cv=5)
Tuned_RandForest.fit(X_train, y_train)
Results = pd.DataFrame(Tuned_RandForest.cv_results_)
Results_Best = Results.loc[Results.rank_test_score == 1]
print('Random Forest Regressor')

# Create a plot to show all models
for col in ['param_max_features']:
    grid = sns.FacetGrid(Results, col=col, hue='rank_test_score', palette="tab20c", legend_out=False,
                         col_wrap=5, height=5)
    # Draw a horizontal line to show the starting point
    grid.map(plt.axhline, y=0, ls=":", c=".5")
    # Draw markers on the plot and decide what parameters to plot
    grid.map(plt.plot, "param_n_estimators", "mean_test_score", marker="o")
    # Adjust the arrangement of the plots
    grid.fig.tight_layout(w_pad=1)
    # Add a legend to the grid
    grid.add_legend()
The above graph shows the results of all the fitted models. The best result is obtained with param_n_estimators=600 and param_max_features=0.6.
RandForest = RandomForestRegressor(n_jobs= -1, random_state = 0, bootstrap=True,n_estimators=600,max_features=0.6)
RandForest.fit(X_train, y_train)
y_hat=RandForest.predict(X_test)
mm=mean_squared_error(y_test, y_hat)
mse_random_forest=np.sqrt(mm).round(2)
mse['Random Forest']=mse_random_forest # saving root mse
Ridge
from sklearn.linear_model import RidgeCV

alphas = 10**np.linspace(10, -2, 100)*0.5  # defining a range of candidate alphas
regr_cv = RidgeCV(alphas=alphas)
reg_model = regr_cv.fit(X_train, y_train)
y_hat = reg_model.predict(X_test)
sqr = mean_squared_error(y_test, y_hat)
mse_ridge = np.sqrt(sqr).round(2)
mse['Ridge'] = mse_ridge  # saving the root MSE of the model
Lasso
from sklearn import linear_model

# First define all the possible alphas, then create GridSearchCV
alphas = 10**np.linspace(10, -2, 100)*0.5  # selecting a wide range of alphas
param_grid = {'alpha': alphas}
lasso = linear_model.Lasso(random_state=42, normalize=True, tol=1e-2)
Tuned_Lasso = GridSearchCV(estimator=lasso, param_grid=param_grid, scoring='neg_root_mean_squared_error', cv=5)
Tuned_Lasso.fit(X_train, y_train)
y_hat = Tuned_Lasso.predict(X_test)
sqr = mean_squared_error(y_test, y_hat)
mse_lasso = np.sqrt(sqr).round(2)
mse['Lasso'] = mse_lasso
mse
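The mse dictionary now holds the test-set root mean squared error of each model. As a quick visual check before the formal comparison (a minimal sketch, reusing the pandas and matplotlib imports from above):
# Plot the test-set RMSE of each model side by side
pd.Series(mse).sort_values().plot(kind='bar')
plt.ylabel('Test RMSE ($)')
plt.title('Model comparison')
plt.show()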
Model Comparison