The Boston Housing Dataset


This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston). The analytics task is to predict the median value of a home. The dataset ships with the sklearn library through load_boston, and we will use that copy for our analytics problem. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so an older version is needed to run this notebook as written.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
from sklearn.datasets import load_boston
%matplotlib inline

We will load the dataset and read the available description of the Boston dataset.

In [2]:
boston = load_boston()
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

    :Number of Instances: 506
    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
    :Missing Attribute Values: None
    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

In [3]:
df = pd.DataFrame(columns=boston.feature_names, data=boston.data)
df['PRICE'] = boston.target
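On scikit-learn 1.2 and later, load_boston no longer exists, so the same DataFrame has to be built from the raw StatLib file. The following is a minimal sketch, assuming the file keeps its usual layout (22 header lines, then each record wrapped over two lines of 11 and 3 values). The names load_boston_frame and FEATURE_NAMES are introduced here for illustration and are not part of any library.

```python
import numpy as np
import pandas as pd

# Column order as listed in the dataset description (assumed to match the file).
FEATURE_NAMES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]


def load_boston_frame(source, header_lines=22):
    """Build a DataFrame with 13 features plus PRICE from the raw StatLib text.

    `source` may be a URL, a file path or a file-like object.
    `header_lines` is the number of descriptive lines to skip (assumed 22).
    """
    raw = pd.read_csv(source, sep=r"\s+", skiprows=header_lines, header=None)
    values = raw.values
    # Even rows hold the first 11 features; odd rows hold B, LSTAT and MEDV.
    features = np.hstack([values[::2, :], values[1::2, :2]])
    target = values[1::2, 2]
    frame = pd.DataFrame(features, columns=FEATURE_NAMES)
    frame["PRICE"] = target
    return frame


# Hypothetical usage (requires network access):
# df = load_boston_frame("http://lib.stat.cmu.edu/datasets/boston")
```

If the layout assumption holds, the resulting frame should have the same 14 columns as the df built above.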

In [4]:
df.info()

RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
PRICE      506 non-null float64
dtypes: float64(14)
memory usage: 55.4 KB

We can see from the above information that the data contains no missing values and that all features are numeric.

In [51]:
plt.style.use('seaborn-whitegrid')
plot = pd.plotting.scatter_matrix(df, figsize=(21, 20), alpha=0.8, grid=True)

We can see some relationships in the above pair plot:

Price ~ 1/LSTAT
Price ~ RM
NOX ~ 1/DIS
RM ~ LSTAT
AGE ~ DIS

The INDUS~RAD, INDUS~TAX, NOX~TAX and NOX~RAD plots also appear to contain outliers. Next we visualize the data to check the correlation between variables and the presence of outliers.
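An inverse relationship read off a scatter plot can be cross-checked numerically by comparing the correlation of the target with a feature against its correlation with the feature's reciprocal. The sketch below uses synthetic stand-in data, not the Boston values, and the helper name compare_raw_vs_inverse is hypothetical.

```python
import numpy as np
import pandas as pd


def compare_raw_vs_inverse(frame, feature, target):
    """Return Pearson correlations of target with feature and with 1/feature."""
    raw_corr = frame[target].corr(frame[feature])
    inv_corr = frame[target].corr(1.0 / frame[feature])
    return raw_corr, inv_corr


# Synthetic data for illustration: y depends on 1/x plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 40.0, size=500)
demo = pd.DataFrame({"x": x, "y": 50.0 / x + rng.normal(0.0, 0.5, size=500)})

raw_corr, inv_corr = compare_raw_vs_inverse(demo, "x", "y")
print(f"corr(y, x)   = {raw_corr:.3f}")
print(f"corr(y, 1/x) = {inv_corr:.3f}")
```

On the notebook's data the call would be compare_raw_vs_inverse(df, 'LSTAT', 'PRICE'); a clearly larger absolute correlation for the reciprocal would support the Price ~ 1/LSTAT reading.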

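In [32]:
# Left half: heatmap of absolute correlations. Right half: 2x2 grid of boxplots.
# plt.figure is used instead of plt.subplots(2, 4) because every panel is created by plt.subplot below.
f = plt.figure(figsize=(18, 6))
plt.subplot(121)
sns.heatmap(abs(df.corr()), cmap='Blues')
plt.subplot(243)
df[['NOX']].boxplot()
plt.subplot(244)
df[['TAX']].boxplot()
plt.subplot(247)
df[['RAD']].boxplot()
plt.subplot(248)
df[['INDUS']].boxplot()
Out[32]: [figure: correlation heatmap; boxplots of NOX, TAX, RAD and INDUS]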

The above heatmap shows a strong correlation between RAD and TAX. The boxplots, however, do not confirm the presence of outliers, since no points fall beyond the whiskers. Because the scatter plots suggested inverse relationships, we add two new variables to the dataset; these should be useful when we fit a linear regression.

In [33]:
df1 = df.copy()
df1['I_LSTAT'] = 1 / df1['LSTAT']
df1['I_DIS'] = 1 / df1['DIS']
x = df1.drop('PRICE', axis=1)
y = df1['PRICE']
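The boxplot check for outliers can also be done numerically. By default a boxplot draws its whiskers at 1.5 times the interquartile range beyond the quartiles, and points past them are drawn as dots. A minimal sketch of that rule follows; the helper name count_whisker_outliers and the toy series are made up for illustration.

```python
import pandas as pd


def count_whisker_outliers(series, whisker=1.5):
    """Count values outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return int(((series < lower) | (series > upper)).sum())


# Toy series: Q1 = 2, Q3 = 4, so the fences are -1 and 7 and only 100 lies outside.
toy = pd.Series([1, 2, 3, 4, 100])
print(count_whisker_outliers(toy))  # prints 1
```

Applied to the notebook's columns, for example count_whisker_outliers(df['TAX']), a result of 0 would agree with a boxplot that shows no dots.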

We will now fit a linear regression model to the dataset and see how well it predicts house prices. The score that cross_val_score reports for a regressor is the coefficient of determination (R²), averaged here over 7 folds.

In [113]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score
lm = LinearRegression()
scores = cross_val_score(lm, x, y, cv=7)
scores.mean()
Out[113]: 0.5117435579867984

In [114]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=8)
scores = cross_val_score(rf, x, y, cv=7)
scores.mean()
Out[114]: 0.6081295296332616
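What cross_val_score does with cv=7 can be made explicit. For a regressor, an integer cv means an unshuffled KFold split, and the default score is the estimator's own score method, which is R². The sketch below spells this out on synthetic stand-in data (not the Boston values) and also shows a shuffled variant; the choice of random_state=0 is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data for illustration only.
X, y = make_regression(n_samples=210, n_features=5, noise=10.0, random_state=0)

lm = LinearRegression()

# Default call, as in the notebook.
default_scores = cross_val_score(lm, X, y, cv=7)

# The same thing written out: unshuffled 7-fold split, R^2 scoring.
explicit_scores = cross_val_score(lm, X, y, cv=KFold(n_splits=7), scoring="r2")

# Shuffled folds, which matter when the rows are stored in a meaningful order.
shuffled_cv = KFold(n_splits=7, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(lm, X, y, cv=shuffled_cv, scoring="r2")

print(np.allclose(default_scores, explicit_scores))
print(default_scores.mean(), shuffled_scores.mean())
```

If the rows of a dataset are grouped in some order, unshuffled folds can give a pessimistic and unstable estimate, so comparing both variants is a cheap sanity check. Whether that applies to the scores above has not been verified here.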

We see that the Random Forest Regressor, which is an ensemble of multiple regressors (decision trees), improves the mean score from about 0.51 to about 0.61. The plot below shows the importance of individual features.

In [115]:
rf.fit(x, y)
feat_imp = pd.Series(rf.feature_importances_, index=x.columns)
feat_imp.nlargest(18).plot(kind='barh')
Out[115]: [figure: horizontal bar chart of feature importances]
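The feature_importances_ attribute is computed from impurity reductions on the training data. A common cross-check, not used in the notebook itself, is permutation importance on held-out data via sklearn.inspection.permutation_importance. The sketch below runs both on synthetic stand-in data in which feature "a" dominates by construction; the column names and coefficients are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: y depends strongly on "a", weakly on "b", not on "c" or "d".
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(400, 4)), columns=["a", "b", "c", "d"])
y = 5.0 * X["a"] + 1.0 * X["b"] + rng.normal(0.0, 0.1, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=8, random_state=0)
rf.fit(X_train, y_train)

# Impurity-based importances (normalized to sum to 1).
impurity_imp = pd.Series(rf.feature_importances_, index=X.columns)

# Permutation importances: drop in R^2 on the test set when a column is shuffled.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
perm_imp = pd.Series(result.importances_mean, index=X.columns)

print(impurity_imp.sort_values(ascending=False))
print(perm_imp.sort_values(ascending=False))
```

When both rankings agree on which features matter least, dropping those features is easier to justify.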

We will remove the less significant features from the model and see whether this improves the score.

In [117]:
xx = x.drop(['CHAS', 'ZN', 'RAD', 'INDUS', 'B'], axis=1)
scores = cross_val_score(rf, xx, y, cv=7)
scores.mean()
Out[117]: 0.6074480065154999

Dropping the less significant features does not improve the score (0.607 versus 0.608), so we keep the model as it is. We have thus built a model that predicts house prices with a mean cross-validated R² of about 0.61, meaning that it accounts for roughly 60% of the variance in prices.
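The scores reported by cross_val_score for these regressors are R² values rather than a percentage of correct predictions. As a small worked example of what the number measures, the sketch below computes R² by hand as 1 - SS_res/SS_tot and compares it with sklearn.metrics.r2_score, alongside the root mean squared error, which is in the same units as the target. The five price values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted prices in $1000s.
y_true = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
y_pred = np.array([22.0, 24.0, 27.0, 36.0, 41.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

# RMSE reports the typical error in the target's own units ($1000s here).
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(r2_manual, r2_score(y_true, y_pred), rmse)
```

Reporting an error in price units next to R² makes the result easier to interpret for a prediction task like this one.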