Continuous Variable Algorithms - Class Notes

Day #4 - Continuous Variable Algorithms

Introduction to Machine Learning

Types of Algorithms to use and when:

Continuous Variables = Regression Analysis
Categorical Variables/Data = Classification Algorithms
Clustering Algorithms

Algorithm = Input (Independent Variables) -> Model -> Output

How to Build a Model:
1) Get Data Set
2) Clean Data
3) Split Data
4) Train Model
5) Iterate Until Model is Optimized
6) Test Model
7) Make Predictions on New Data with Optimized Model

To test your model, hold out part of the data: use most of it (about 80%) to train the model and the remainder (about 20%) to measure how accurate the model is on data it has not seen.

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

y = np.arange(0,5)
x = np.arange(0,10).reshape(5,2)
In [2]:
# 60% training data, 40% testing data

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.40)
print(x_train)
print(x_test)
print(y_train)
print(y_test)
[[0 1]
 [4 5]
 [8 9]]
[[6 7]
 [2 3]]
[0 2 4]
[3 1]
In [3]:
# 60% training data, 40% testing data
# random_state seeds the shuffle, so the same rows land in train and test on every run (useful for reproducing and validating results later)

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.40, random_state = 0)
print(x_train)
print(x_test)
print(y_train)
print(y_test)
[[2 3]
 [6 7]
 [8 9]]
[[4 5]
 [0 1]]
[1 3 4]
[2 0]
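
What a seeded split does can be sketched in plain Python. This is only an illustration, not scikit-learn's implementation: `split_indices` is a hypothetical helper, and the row order it produces will not match `train_test_split` for the same seed.

```python
import random

def split_indices(n_rows, test_size, seed=None):
    """Shuffle row indices and cut them into train and test lists (hypothetical helper)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # a fixed seed gives the same shuffle on every run
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]

# Same seed: identical split on every call
train_a, test_a = split_indices(5, test_size=0.40, seed=0)
train_b, test_b = split_indices(5, test_size=0.40, seed=0)
print(train_a == train_b and test_a == test_b)

# Every row ends up in exactly one of the two sets
print(sorted(train_a + test_a))
```

Leaving the seed as None gives a different split on each run, which is what the first split in these notes does.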

Cross Validation - Train and test the model on several different splits (folds) of the data, then average the scores across the folds

In [4]:
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20

x = np.arange(16).reshape(8,2)
y = np.arange(8)
print(x)
print(y)
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]
 [12 13]
 [14 15]]
[0 1 2 3 4 5 6 7]
In [5]:
# In model_selection, KFold takes n_splits and receives the data later, in split()
kf = KFold(n_splits=4)
print(kf.get_n_splits(x))
print(len(x))
4
8
In [6]:
for train_index, test_index in kf.split(x):
    print('TRAIN:', train_index, 'TEST:', test_index)
TRAIN: [2 3 4 5 6 7] TEST: [0 1]
TRAIN: [0 1 4 5 6 7] TEST: [2 3]
TRAIN: [0 1 2 3 6 7] TEST: [4 5]
TRAIN: [0 1 2 3 4 5] TEST: [6 7]
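
The "average the folds" step can be sketched by hand. The example below is a rough illustration under stated assumptions: `kfold_indices` is a hypothetical helper that builds the same contiguous, unshuffled folds as the output above, and the "model" is just the mean of the training targets, chosen only to keep the sketch short.

```python
def kfold_indices(n_rows, n_folds):
    """Yield (train, test) index lists for contiguous, unshuffled folds (hypothetical helper)."""
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0) for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, test
        start += size

y = list(range(8))  # same target values as the KFold example

fold_errors = []
for train, test in kfold_indices(len(y), 4):
    # Toy "model": predict the mean of the training targets
    prediction = sum(y[i] for i in train) / len(train)
    # Score the fold with the mean absolute error on the held-out rows
    error = sum(abs(y[i] - prediction) for i in test) / len(test)
    fold_errors.append(error)

# Cross-validated score: the average of the per-fold scores
cv_error = sum(fold_errors) / len(fold_errors)
print(fold_errors, cv_error)
```

Each row is used for testing exactly once, so the averaged score depends less on one lucky or unlucky split.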

Regression Analysis

Most common Regression Techniques:
1) Multiple Regression
2) Lasso Regression
3) Ridge Regression

Linear Regression:
r = correlation between the 2 variables (value from -1 to 1; the sign gives the direction, and values near 0 mean low correlation)
R^2 = how close the data points are to the least squares line (value from 0 to 1, not close to very close); with a single predictor, R^2 is the square of r
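
The two measures can be checked by hand. The sketch below computes the Pearson correlation coefficient and the R^2 of a least squares line in plain Python, using the first five rows of heightweight.csv as printed in these notes. The helper names are made up for illustration; with one predictor, squaring the correlation should reproduce R^2.

```python
import math

# First five rows of heightweight.csv as printed in these notes
weight = [85.0, 105.0, 108.0, 92.0, 112.5]
height = [56.3, 62.3, 63.3, 59.0, 62.5]

def mean(values):
    return sum(values) / len(values)

def pearson_r(x, y):
    """Correlation coefficient: covariance divided by the product of the spreads."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def least_squares_line(x, y):
    """Slope and intercept of the line that minimizes the squared errors."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    m = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - m) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

r = pearson_r(weight, height)
slope, intercept = least_squares_line(weight, height)
predicted = [slope * w + intercept for w in weight]
r2 = r_squared(height, predicted)
print(r, r2, r ** 2)
```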

In [7]:
%matplotlib inline
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('heightweight.csv')
df.head(10)
Out[7]:
  sex   ageYear  ageMonth  heightIn  weightLb
0   f  11.91667       143      56.3      85.0
1   f  12.91667       155      62.3     105.0
2   f  12.75000       153      63.3     108.0
3   f  13.41667       161      59.0      92.0
4   f  15.91667       191      62.5     112.5
5   f  14.25000       171      62.5     112.0
6   f  15.41667       185      59.0     104.0
7   f  11.83333       142      56.5      69.0
8   f  13.33333       160      62.0      94.5
9   f  11.66667       140      53.8      68.5
In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 237 entries, 0 to 236
Data columns (total 5 columns):
sex         237 non-null object
ageYear     237 non-null float64
ageMonth    237 non-null int64
heightIn    237 non-null float64
weightLb    237 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 9.3+ KB
In [9]:
df.describe()
Out[9]:
          ageYear    ageMonth    heightIn    weightLb
count  237.000000  237.000000  237.000000  237.000000
mean    13.702532  164.430380   61.364557  101.308017
std      1.535481   18.425767    3.945402   19.440698
min     11.583330  139.000000   50.500000   50.500000
25%     12.333330  148.000000   58.800000   85.000000
50%     13.583330  163.000000   61.500000  101.000000
75%     14.833330  178.000000   64.300000  112.000000
max     20.833330  250.000000   72.000000  171.500000
In [10]:
df.isnull().sum()
Out[10]:
sex         0
ageYear     0
ageMonth    0
heightIn    0
weightLb    0
dtype: int64
In [11]:
df.sex.nunique()
Out[11]:
2
In [12]:
df.sex.value_counts()
Out[12]:
m    126
f    111
Name: sex, dtype: int64
In [13]:
# Creating dummy variables for the column 'sex' because 'sex' is an object (text) column
In [14]:
df_dummies = pd.get_dummies(df,drop_first=True)
# By default, get_dummies only encodes columns of dtype object (or category); here that is 'sex'
# drop_first=True drops one dummy column (sex_f) because the remaining column (sex_m) implies it
df_dummies.head()
Out[14]:
    ageYear  ageMonth  heightIn  weightLb  sex_m
0  11.91667       143      56.3      85.0      0
1  12.91667       155      62.3     105.0      0
2  12.75000       153      63.3     108.0      0
3  13.41667       161      59.0      92.0      0
4  15.91667       191      62.5     112.5      0
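
What dummy encoding with a dropped first level does can be sketched without pandas. `get_dummies_sketch` is a hypothetical stand-in written for these notes, not the pandas function, and the labels are made up.

```python
def get_dummies_sketch(values, prefix, drop_first=False):
    """One-hot encode a list of labels into {column_name: [0/1, ...]} (hypothetical helper)."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]   # the dropped level is implied when every remaining column is 0
    return {f"{prefix}_{level}": [1 if v == level else 0 for v in values] for level in levels}

sex = ['f', 'f', 'm', 'f', 'm']   # made-up labels for illustration

print(get_dummies_sketch(sex, 'sex'))                   # one column per level
print(get_dummies_sketch(sex, 'sex', drop_first=True))  # only sex_m; sex_m == 0 means 'f'
```

Keeping both columns would add no information, since sex_f is always 1 - sex_m.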
In [15]:
df_dummies.corr()

# Gives you the correlation between the different variables
Out[15]:
           ageYear  ageMonth  heightIn  weightLb     sex_m
ageYear   1.000000  1.000000  0.648857  0.634636  0.001275
ageMonth  1.000000  1.000000  0.648857  0.634636  0.001275
heightIn  0.648857  0.648857  1.000000  0.774876  0.199880
weightLb  0.634636  0.634636  0.774876  1.000000  0.117550
sex_m     0.001275  0.001275  0.199880  0.117550  1.000000
In [16]:
plt.hist(df.heightIn,30)
Out[16]:
(array([  1.,   2.,   1.,   2.,   2.,   2.,   6.,   5.,  11.,  12.,  14.,
         10.,  13.,  19.,   7.,  28.,  14.,  16.,  11.,  13.,  12.,   8.,
         12.,   6.,   3.,   2.,   2.,   0.,   2.,   1.]),
 array([ 50.5       ,  51.21666667,  51.93333333,  52.65      ,
         53.36666667,  54.08333333,  54.8       ,  55.51666667,
         56.23333333,  56.95      ,  57.66666667,  58.38333333,
         59.1       ,  59.81666667,  60.53333333,  61.25      ,
         61.96666667,  62.68333333,  63.4       ,  64.11666667,
         64.83333333,  65.55      ,  66.26666667,  66.98333333,
         67.7       ,  68.41666667,  69.13333333,  69.85      ,
         70.56666667,  71.28333333,  72.        ]),
 <a list of 30 Patch objects>)
In [17]:
plt.scatter(df.weightLb,df.heightIn)
Out[17]:
<matplotlib.collections.PathCollection at 0x2195d668>
In [18]:
df[df['weightLb']>171]
Out[18]:
     sex   ageYear  ageMonth  heightIn  weightLb
130    m  20.83333       250      67.5     171.5
213    m  17.16667       206      69.5     171.5
In [19]:
from scipy import stats

estheight = stats.linregress(df.weightLb,df.heightIn)
estheight
Out[19]:
LinregressResult(slope=0.15725760723145538, intercept=45.433100634484191, rvalue=0.77487610662760176, pvalue=1.0286858314030378e-48, stderr=0.0083683585485207733)
In [20]:
def predict(x):
    return estheight.slope*x + estheight.intercept
In [21]:
# Creating a new column called 'Predicted Height' based off of least squares line

df['Predicted Height'] = predict(df.weightLb)
df['Predicted Height']
df.head()
Out[21]:
  sex   ageYear  ageMonth  heightIn  weightLb  Predicted Height
0   f  11.91667       143      56.3      85.0         58.799997
1   f  12.91667       155      62.3     105.0         61.945149
2   f  12.75000       153      63.3     108.0         62.416922
3   f  13.41667       161      59.0      92.0         59.900800
4   f  15.91667       191      62.5     112.5         63.124581
In [22]:
df['Height Error'] = abs(df['heightIn'] - df['Predicted Height'])
df.head()
Out[22]:
  sex   ageYear  ageMonth  heightIn  weightLb  Predicted Height  Height Error
0   f  11.91667       143      56.3      85.0         58.799997      2.499997
1   f  12.91667       155      62.3     105.0         61.945149      0.354851
2   f  12.75000       153      63.3     108.0         62.416922      0.883078
3   f  13.41667       161      59.0      92.0         59.900800      0.900800
4   f  15.91667       191      62.5     112.5         63.124581      0.624581
In [23]:
from sklearn.metrics import mean_squared_error, r2_score
(mean_squared_error(df.heightIn,df['Predicted Height']))**0.5
Out[23]:
2.4886733321477559
In [24]:
r2_score(df.heightIn,df['Predicted Height'])
Out[24]:
0.60043298062235007
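
The `mean_squared_error(...)**0.5` expression is the root mean squared error (RMSE). A minimal hand-rolled version is sketched below on the first five actual and predicted heights printed above, so its result will not match the 2.49 computed on all 237 rows. The function names are chosen for this sketch only.

```python
import math

# Actual and predicted heights for the first five rows printed above
actual = [56.3, 62.3, 63.3, 59.0, 62.5]
predicted = [58.799997, 61.945149, 62.416922, 59.900800, 63.124581]

def rmse(actual, predicted):
    """Root mean squared error: square the errors, average them, take the square root."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mean_absolute_error(actual, predicted):
    """Average of the absolute errors (the mean of the 'Height Error' column)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

print(rmse(actual, predicted))
print(mean_absolute_error(actual, predicted))
```

RMSE is in the same units as the target (inches here) and penalizes large misses more heavily than the mean absolute error does, so it is never smaller than it.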

Multi-Variable Regression

In [25]:
df = pd.read_excel('cars.xls')
df.head()
Out[25]:
          Price  Mileage   Make    Model      Trim   Type  Cylinder  Liter  Doors  Cruise  Sound  Leather
0  17314.103129     8221  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      1        1
1  17542.036083     9135  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      1        0
2  16218.847862    13196  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      1        0
3  16336.913140    16342  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      0        0
4  16339.170324    19832  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      0        1
In [26]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 804 entries, 0 to 803
Data columns (total 12 columns):
Price       804 non-null float64
Mileage     804 non-null int64
Make        804 non-null object
Model       804 non-null object
Trim        804 non-null object
Type        804 non-null object
Cylinder    804 non-null int64
Liter       804 non-null float64
Doors       804 non-null int64
Cruise      804 non-null int64
Sound       804 non-null int64
Leather     804 non-null int64
dtypes: float64(2), int64(6), object(4)
memory usage: 75.4+ KB
In [27]:
df.nunique()
Out[27]:
Price       798
Mileage     791
Make          6
Model        32
Trim         47
Type          5
Cylinder      3
Liter        16
Doors         2
Cruise        2
Sound         2
Leather       2
dtype: int64
In [28]:
df.describe()
Out[28]:
              Price       Mileage    Cylinder       Liter       Doors      Cruise       Sound     Leather
count    804.000000    804.000000  804.000000  804.000000  804.000000  804.000000  804.000000  804.000000
mean   21343.143767  19831.934080    5.268657    3.037313    3.527363    0.752488    0.679104    0.723881
std     9884.852801   8196.319707    1.387531    1.105562    0.850169    0.431836    0.467111    0.447355
min     8638.930895    266.000000    4.000000    1.600000    2.000000    0.000000    0.000000    0.000000
25%    14273.073870  14623.500000    4.000000    2.200000    4.000000    1.000000    0.000000    0.000000
50%    18024.995019  20913.500000    6.000000    2.800000    4.000000    1.000000    1.000000    1.000000
75%    26717.316636  25213.000000    6.000000    3.800000    4.000000    1.000000    1.000000    1.000000
max    70755.466717  50387.000000    8.000000    6.000000    4.000000    1.000000    1.000000    1.000000
In [29]:
df.corr()
Out[29]:
             Price   Mileage  Cylinder     Liter     Doors    Cruise     Sound   Leather
Price     1.000000 -0.143051  0.569086  0.558146 -0.138750  0.430851 -0.124348  0.157197
Mileage  -0.143051  1.000000 -0.029461 -0.018641 -0.016944  0.025037 -0.026146  0.001005
Cylinder  0.569086 -0.029461  1.000000  0.957897  0.002206  0.354285 -0.089704  0.075520
Liter     0.558146 -0.018641  0.957897  1.000000 -0.079259  0.377509 -0.065527  0.087332
Doors    -0.138750 -0.016944  0.002206 -0.079259  1.000000 -0.047674 -0.062530 -0.061969
Cruise    0.430851  0.025037  0.354285  0.377509 -0.047674  1.000000 -0.091730 -0.070573
Sound    -0.124348 -0.026146 -0.089704 -0.065527 -0.062530 -0.091730  1.000000  0.165444
Leather   0.157197  0.001005  0.075520  0.087332 -0.061969 -0.070573  0.165444  1.000000
In [30]:
# Model inputs (independent variables)
x = df[['Mileage','Cylinder','Liter','Cruise']]

# Target to predict (dependent variable): the price
y = df[['Price']]
In [31]:
print(x.head())
y.head()
   Mileage  Cylinder  Liter  Cruise
0     8221         6    3.1       1
1     9135         6    3.1       1
2    13196         6    3.1       1
3    16342         6    3.1       1
4    19832         6    3.1       1
Out[31]:
          Price
0  17314.103129
1  17542.036083
2  16218.847862
3  16336.913140
4  16339.170324
In [32]:
import statsmodels.api as sm
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
In [33]:
# Add a constant column of 1s so the fitted model includes a y-intercept
x1 = sm.add_constant(x)
x1.head()
Out[33]:
   const  Mileage  Cylinder  Liter  Cruise
0    1.0     8221         6    3.1       1
1    1.0     9135         6    3.1       1
2    1.0    13196         6    3.1       1
3    1.0    16342         6    3.1       1
4    1.0    19832         6    3.1       1
In [34]:
# statsmodels Ordinary Least Squares (OLS)
# Define the model, then fit it
est = sm.OLS(y,x1)
est_price = est.fit()
est_price.summary()
Out[34]:
OLS Regression Results
Dep. Variable:     Price              R-squared:           0.403
Model:             OLS                Adj. R-squared:      0.400
Method:            Least Squares      F-statistic:         134.6
Date:              Sun, 18 Feb 2018   Prob (F-statistic):  6.79e-88
Time:              18:32:24           Log-Likelihood:      -8329.0
No. Observations:  804                AIC:                 1.667e+04
Df Residuals:      799                BIC:                 1.669e+04
Df Model:          4
Covariance Type:   nonrobust

               coef   std err       t   P>|t|     [0.025     0.975]
const     2799.8569  1543.265   1.814   0.070   -229.475   5829.189
Mileage     -0.1644     0.033  -4.977   0.000     -0.229     -0.100
Cylinder  3007.8027   679.271   4.428   0.000   1674.436   4341.169
Liter      455.7311   860.596   0.530   0.597  -1233.565   2145.027
Cruise    6076.0488   676.570   8.981   0.000   4747.984   7404.114

Omnibus:        187.995   Durbin-Watson:     0.217
Prob(Omnibus):  0.000     Jarque-Bera (JB):  395.298
Skew:           1.297     Prob(JB):          1.45e-86
Kurtosis:       5.253     Cond. No.          1.38e+05
In [35]:
# Use the built-in predict method in statsmodels to get the predicted prices
y_pred = est_price.predict(x1)
type(y_pred)
Out[35]:
pandas.core.series.Series
In [36]:
x1['Predicted Price'] = y_pred
x1.head()
Out[36]:
   const  Mileage  Cylinder  Liter  Cruise  Predicted Price
0    1.0     8221         6    3.1       1     26984.050024
1    1.0     9135         6    3.1       1     26833.798929
2    1.0    13196         6    3.1       1     26166.217203
3    1.0    16342         6    3.1       1     25649.050961
4    1.0    19832         6    3.1       1     25075.335072
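
A prediction from a fitted linear model is just the coefficients multiplied by the feature values and summed. The sketch below redoes the first row by hand with the coefficients as rounded in the OLS summary, so the result should land within about a dollar of the 26984.05 shown in the table rather than match it exactly.

```python
# Coefficients as rounded in the OLS summary
coef = {'const': 2799.8569, 'Mileage': -0.1644, 'Cylinder': 3007.8027,
        'Liter': 455.7311, 'Cruise': 6076.0488}

# Row 0 of x1 (the first car in the data)
row = {'const': 1.0, 'Mileage': 8221, 'Cylinder': 6, 'Liter': 3.1, 'Cruise': 1}

# Prediction = sum of coefficient * feature value; 'const' contributes the y-intercept
predicted_price = sum(coef[name] * row[name] for name in coef)
print(predicted_price)
```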
In [37]:
from sklearn.metrics import mean_squared_error, r2_score
(mean_squared_error(df['Price'],x1['Predicted Price']))**0.5
Out[37]:
7635.5141159257228
In [38]:
r2_score(df['Price'],x1['Predicted Price'])
Out[38]:
0.40258426189243068
In [39]:
x1.drop('Liter',axis=1,inplace=True)
x1.head()
Out[39]:
   const  Mileage  Cylinder  Cruise  Predicted Price
0    1.0     8221         6       1     26984.050024
1    1.0     9135         6       1     26833.798929
2    1.0    13196         6       1     26166.217203
3    1.0    16342         6       1     25649.050961
4    1.0    19832         6       1     25075.335072
In [40]:
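# Note: 'Predicted Price' still holds the predictions from the model fit with 'Liter',
# so dropping the column without refitting leaves RMSE and R^2 unchanged
print(mean_squared_error(df['Price'],x1['Predicted Price'])**0.5)
r2_score(df['Price'],x1['Predicted Price'])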
7635.51411593
Out[40]:
0.40258426189243068

Scikit-learn

In [41]:
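from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
# Note: x1 still contains 'const' and the statsmodels 'Predicted Price' column, so both are passed to the models below as features
X_train, X_test, y_train, y_test = train_test_split(x1,y,train_size=0.8)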
In [42]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape
Out[42]:
((643, 5), (643, 1), (161, 5), (161, 1))

Steps:
1) Define the Algorithm
2) Fit the Algorithm
3) Predict from the Algorithm

In [43]:
from sklearn import linear_model

# Linear Regression
reg = linear_model.LinearRegression()
regmodel = reg.fit(X_train,y_train)
In [44]:
y_predtest = regmodel.predict(X_test)
In [45]:
from sklearn.metrics import mean_squared_error, r2_score
In [46]:
mean_squared_error(y_test, y_predtest)**0.5
Out[46]:
7507.4298119835139
In [47]:
r2_score(y_test,y_predtest)
Out[47]:
0.43135697172021248
In [48]:
print(regmodel.intercept_)
print(regmodel.coef_)
[ 3698.06614391]
[[  0.00000000e+00  -2.13857816e-01   4.43557098e+03   8.80595925e+03
   -3.68265040e-01]]

Ridge Regression

In [49]:
from sklearn.linear_model import Ridge,Lasso
In [50]:
# Create Model using Ridge Regression
ridgereg = linear_model.Ridge()
ridgereg.fit(X_train,y_train)

# Get Predicted Values using Ridge Regression Model
y_pred_ridge = ridgereg.predict(X_test)
In [51]:
# Test Predicted Values using Ridge Regression Model

print(mean_squared_error(y_test,y_pred_ridge)**0.5)
r2_score(y_test,y_pred_ridge)
7487.98904709
Out[51]:
0.43429820253318729

Lasso Regression

In [52]:
# Create Model using Lasso Regression
lassoreg = linear_model.Lasso()
lassoreg.fit(X_train,y_train)

# Get Predicted Values using Lasso Regression Model
y_pred_lasso = lassoreg.predict(X_test)
C:\Users\nwerner\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\linear_model\coordinate_descent.py:491: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)
In [53]:
# Test Predicted Values using Lasso Regression Model

print(mean_squared_error(y_test,y_pred_lasso)**0.5)
r2_score(y_test,y_pred_lasso)
7501.13081891
Out[53]:
0.43231079371436343
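
How Ridge and Lasso differ from plain least squares can be seen in a one-feature toy case. The sketch below uses textbook penalized objectives with no intercept (written in the comments) and made-up data. It is not how scikit-learn fits these models: its estimators also fit an intercept and may scale the objective differently, so alpha values here are not comparable to the `alpha` parameter of `Ridge()` or `Lasso()`.

```python
def ols_slope(x, y):
    # Minimizes sum((y - w*x)^2)
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def ridge_slope(x, y, alpha):
    # Minimizes sum((y - w*x)^2) + alpha * w^2: the penalty inflates the denominator
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)

def lasso_slope(x, y, alpha):
    # Minimizes 0.5 * sum((y - w*x)^2) + alpha * |w|: soft-thresholding of sum(x*y)
    xy = sum(a * b for a, b in zip(x, y))
    shrunk = max(abs(xy) - alpha, 0.0)
    return (1 if xy >= 0 else -1) * shrunk / sum(a * a for a in x)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]   # made-up, centered feature
y = [-4.1, -1.9, 0.2, 2.1, 3.9]   # made-up target, roughly y = 2x

for alpha in (0.0, 1.0, 10.0, 100.0):
    print(alpha, ols_slope(x, y), ridge_slope(x, y, alpha), lasso_slope(x, y, alpha))
```

Ridge shrinks the coefficient toward zero without reaching it, while a large enough Lasso penalty sets it to exactly zero, which is why Lasso can act as a variable selector.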

Class Mini Project - Predicting Housing Prices

In [54]:
# Correlation Matrix
# Select Best Variables
# Remove Text Variables

# Shooting for an error of less than $100k
In [55]:
df = pd.read_csv('kc_house_data.csv')
df.head()
Out[55]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 20141013T000000 221900.0 3 1.00 1180 5650 1.0 0 0 ... 7 1180 0 1955 0 98178 47.5112 -122.257 1340 5650
1 6414100192 20141209T000000 538000.0 3 2.25 2570 7242 2.0 0 0 ... 7 2170 400 1951 1991 98125 47.7210 -122.319 1690 7639
2 5631500400 20150225T000000 180000.0 2 1.00 770 10000 1.0 0 0 ... 6 770 0 1933 0 98028 47.7379 -122.233 2720 8062
3 2487200875 20141209T000000 604000.0 4 3.00 1960 5000 1.0 0 0 ... 7 1050 910 1965 0 98136 47.5208 -122.393 1360 5000
4 1954400510 20150218T000000 510000.0 3 2.00 1680 8080 1.0 0 0 ... 8 1680 0 1987 0 98074 47.6168 -122.045 1800 7503

5 rows × 21 columns

In [56]:
df['age'] = 2017 - df['yr_built']
In [57]:
df['renovated?'] = df['yr_renovated'].apply(lambda i: 1 if i>0 else 0)
In [58]:
df['basement?'] = df['sqft_basement'].apply(lambda i: 1 if i>0 else 0)
In [59]:
# Despite the name, this column stores the difference sqft_living - sqft_living15, not a 0/1 flag
df['greaterThan15?'] = df['sqft_living'] - df['sqft_living15']
In [60]:
df['bathrooms'] = df['bathrooms']*1000
In [61]:
x = pd.get_dummies(df,columns=['zipcode'],drop_first=True)

# x = pd.get_dummies(df,columns=['lat'],drop_first=True)
# Will not print when I include 'lat' and 'long' in get_dummies: they are continuous floats, so one-hot encoding creates a dummy column for every distinct value
In [62]:
# Eliminating outlier housing prices: keep only homes priced strictly between the 10th and 90th percentiles

high = np.percentile(x.price,90)
low = np.percentile(x.price,10)

x_no_outs = x[(x.price < high) & (x.price > low)]

x_no_outs.head()
Out[62]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... zipcode_98146 zipcode_98148 zipcode_98155 zipcode_98166 zipcode_98168 zipcode_98177 zipcode_98178 zipcode_98188 zipcode_98198 zipcode_98199
1 6414100192 20141209T000000 538000.0 3 2250.0 2570 7242 2.0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 2487200875 20141209T000000 604000.0 4 3000.0 1960 5000 1.0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 1954400510 20150218T000000 510000.0 3 2000.0 1680 8080 1.0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 1321400060 20140627T000000 257500.0 3 2250.0 1715 6819 2.0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 2008000270 20150115T000000 291850.0 3 1500.0 1060 9711 1.0 0 0 ... 0 0 0 0 0 0 0 0 1 0

5 rows × 93 columns
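
The percentile trimming can be sketched in plain Python on made-up prices. The `percentile` helper below is written for illustration and uses linear interpolation between the two nearest ranks, which is also the default rule of `np.percentile`; the strict inequalities mirror the filter above.

```python
def percentile(values, q):
    """q-th percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

prices = [100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 5000]   # made-up prices

low = percentile(prices, 10)
high = percentile(prices, 90)

# Keep only the prices strictly between the two cutoffs
trimmed = [p for p in prices if low < p < high]
print(low, high, trimmed)
```

Trimming by percentile removes the cheapest and most expensive tails regardless of whether they are true outliers, so roughly 20% of the rows are dropped (17280 of the original rows remain in the notes' data).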

In [63]:
x_no_outs.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 17280 entries, 1 to 21612
Data columns (total 93 columns):
id                17280 non-null int64
date              17280 non-null object
price             17280 non-null float64
bedrooms          17280 non-null int64
bathrooms         17280 non-null float64
sqft_living       17280 non-null int64
sqft_lot          17280 non-null int64
floors            17280 non-null float64
waterfront        17280 non-null int64
view              17280 non-null int64
condition         17280 non-null int64
grade             17280 non-null int64
sqft_above        17280 non-null int64
sqft_basement     17280 non-null int64
yr_built          17280 non-null int64
yr_renovated      17280 non-null int64
lat               17280 non-null float64
long              17280 non-null float64
sqft_living15     17280 non-null int64
sqft_lot15        17280 non-null int64
age               17280 non-null int64
renovated?        17280 non-null int64
basement?         17280 non-null int64
greaterThan15?    17280 non-null int64
zipcode_98002     17280 non-null uint8
zipcode_98003     17280 non-null uint8
zipcode_98004     17280 non-null uint8
zipcode_98005     17280 non-null uint8
zipcode_98006     17280 non-null uint8
zipcode_98007     17280 non-null uint8
zipcode_98008     17280 non-null uint8
zipcode_98010     17280 non-null uint8
zipcode_98011     17280 non-null uint8
zipcode_98014     17280 non-null uint8
zipcode_98019     17280 non-null uint8
zipcode_98022     17280 non-null uint8
zipcode_98023     17280 non-null uint8
zipcode_98024     17280 non-null uint8
zipcode_98027     17280 non-null uint8
zipcode_98028     17280 non-null uint8
zipcode_98029     17280 non-null uint8
zipcode_98030     17280 non-null uint8
zipcode_98031     17280 non-null uint8
zipcode_98032     17280 non-null uint8
zipcode_98033     17280 non-null uint8
zipcode_98034     17280 non-null uint8
zipcode_98038     17280 non-null uint8
zipcode_98039     17280 non-null uint8
zipcode_98040     17280 non-null uint8
zipcode_98042     17280 non-null uint8
zipcode_98045     17280 non-null uint8
zipcode_98052     17280 non-null uint8
zipcode_98053     17280 non-null uint8
zipcode_98055     17280 non-null uint8
zipcode_98056     17280 non-null uint8
zipcode_98058     17280 non-null uint8
zipcode_98059     17280 non-null uint8
zipcode_98065     17280 non-null uint8
zipcode_98070     17280 non-null uint8
zipcode_98072     17280 non-null uint8
zipcode_98074     17280 non-null uint8
zipcode_98075     17280 non-null uint8
zipcode_98077     17280 non-null uint8
zipcode_98092     17280 non-null uint8
zipcode_98102     17280 non-null uint8
zipcode_98103     17280 non-null uint8
zipcode_98105     17280 non-null uint8
zipcode_98106     17280 non-null uint8
zipcode_98107     17280 non-null uint8
zipcode_98108     17280 non-null uint8
zipcode_98109     17280 non-null uint8
zipcode_98112     17280 non-null uint8
zipcode_98115     17280 non-null uint8
zipcode_98116     17280 non-null uint8
zipcode_98117     17280 non-null uint8
zipcode_98118     17280 non-null uint8
zipcode_98119     17280 non-null uint8
zipcode_98122     17280 non-null uint8
zipcode_98125     17280 non-null uint8
zipcode_98126     17280 non-null uint8
zipcode_98133     17280 non-null uint8
zipcode_98136     17280 non-null uint8
zipcode_98144     17280 non-null uint8
zipcode_98146     17280 non-null uint8
zipcode_98148     17280 non-null uint8
zipcode_98155     17280 non-null uint8
zipcode_98166     17280 non-null uint8
zipcode_98168     17280 non-null uint8
zipcode_98177     17280 non-null uint8
zipcode_98178     17280 non-null uint8
zipcode_98188     17280 non-null uint8
zipcode_98198     17280 non-null uint8
zipcode_98199     17280 non-null uint8
dtypes: float64(5), int64(18), object(1), uint8(69)
memory usage: 4.4+ MB
In [64]:
x_no_outs.corr()
Out[64]:
id price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition ... zipcode_98146 zipcode_98148 zipcode_98155 zipcode_98166 zipcode_98168 zipcode_98177 zipcode_98178 zipcode_98188 zipcode_98198 zipcode_98199
id 1.000000 0.025003 0.005132 0.030501 0.015257 -0.141358 0.032259 -0.000446 0.019340 -0.039690 ... 0.001460 0.019428 0.004035 -0.020824 -0.037117 -0.023158 -0.014553 0.004211 0.020111 -0.036516
price 0.025003 1.000000 0.218041 0.321041 0.505100 0.085042 0.183024 0.036774 0.172145 0.044692 ... -0.047665 -0.047400 -0.073215 -0.031079 -0.091548 0.029505 -0.070786 -0.060059 -0.075009 0.088184
bedrooms 0.005132 0.218041 1.000000 0.464432 0.572837 0.016306 0.109824 -0.041325 0.008489 0.028000 ... -0.007079 -0.004261 -0.002393 0.019327 -0.000704 -0.005586 0.022182 0.024950 0.004861 -0.048317
bathrooms 0.030501 0.321041 0.464432 1.000000 0.672830 0.049415 0.472426 -0.028300 0.044584 -0.151723 ... -0.039758 -0.004273 -0.064857 -0.026581 -0.040804 -0.027797 -0.012042 0.013796 -0.009957 -0.036033
sqft_living 0.015257 0.505100 0.572837 0.672830 1.000000 0.151776 0.279252 -0.010924 0.117881 -0.078037 ... -0.027206 -0.013488 -0.048314 0.013262 -0.019172 0.003335 0.002903 0.014412 0.001799 -0.041607
sqft_lot -0.141358 0.085042 0.016306 0.049415 0.151776 1.000000 -0.033157 0.015360 0.090070 0.008647 ... -0.015421 -0.006879 -0.017788 -0.003559 -0.003929 -0.014203 -0.015312 -0.007046 -0.007937 -0.028812
floors 0.032259 0.183024 0.109824 0.472426 0.279252 -0.033157 1.000000 -0.016143 -0.037225 -0.300903 ... -0.042592 -0.020051 -0.084405 -0.052245 -0.037037 -0.056613 -0.047729 -0.024656 -0.036357 -0.029076
waterfront -0.000446 0.036774 -0.041325 -0.028300 -0.010924 0.015360 -0.016143 1.000000 0.299867 0.015918 ... 0.026583 -0.002569 -0.008375 0.044280 -0.004422 -0.005915 0.073863 -0.003727 0.084290 -0.006194
view 0.019340 0.172145 0.008489 0.044584 0.117881 0.090070 -0.037225 0.299867 1.000000 0.028564 ... 0.045507 -0.012774 -0.010439 0.067675 -0.017433 0.048496 0.102690 0.008440 0.115757 0.010946
condition -0.039690 0.044692 0.028000 -0.151723 -0.078037 0.008647 -0.300903 0.015918 0.028564 1.000000 ... -0.000580 -0.022754 0.022500 0.036199 -0.014719 0.010520 0.000918 -0.006151 0.009113 0.016368
grade 0.042787 0.526507 0.257400 0.554000 0.640211 0.079411 0.428700 -0.026470 0.082694 -0.200611 ... -0.052570 -0.025302 -0.069663 -0.018610 -0.062231 0.005703 -0.048629 -0.020074 -0.031451 -0.011463
sqft_above 0.020153 0.422028 0.436234 0.595395 0.830403 0.153917 0.502136 -0.018199 0.025344 -0.198297 ... -0.034080 -0.006267 -0.058595 -0.005371 -0.030342 -0.018864 -0.035570 -0.001502 -0.015296 -0.075661
sqft_basement -0.007297 0.171052 0.265339 0.171936 0.347285 0.005781 -0.357868 0.011569 0.163038 0.197582 ... 0.009888 -0.012987 0.014317 0.032181 0.017615 0.037570 0.064934 0.027677 0.028884 0.054743
yr_built 0.025875 -0.010143 0.138653 0.553601 0.341695 0.038631 0.524942 -0.047645 -0.079718 -0.382869 ... -0.042173 -0.005811 -0.056478 -0.051522 -0.035865 -0.046047 -0.040616 -0.001615 -0.008958 -0.068162
yr_renovated -0.020053 0.065906 -0.006490 0.008492 0.008218 0.015188 -0.017623 0.059604 0.051425 -0.059040 ... 0.031913 -0.009300 0.000126 0.038978 -0.001021 0.023975 0.000636 -0.004548 0.009849 0.012810
lat 0.002947 0.368550 -0.101505 -0.146746 -0.141875 -0.132476 -0.056329 -0.056577 -0.075202 0.013081 ... -0.051015 -0.046500 0.211658 -0.094052 -0.048495 0.139597 -0.050346 -0.063099 -0.131219 0.066572
long 0.021436 0.068074 0.147883 0.265490 0.309147 0.225690 0.133000 -0.064348 -0.086554 -0.126502 ... -0.108866 -0.039508 -0.100804 -0.106189 -0.056350 -0.120743 -0.022756 -0.033502 -0.072726 -0.149578
sqft_living15 0.022375 0.460563 0.336654 0.468683 0.700420 0.141574 0.217445 -0.002009 0.141564 -0.131867 ... -0.057549 -0.021051 -0.060647 -0.005693 -0.049157 0.012808 -0.022951 -0.021600 -0.010231 -0.037481
sqft_lot15 -0.151596 0.074105 0.012337 0.049246 0.165413 0.735087 -0.037051 0.034714 0.084813 0.007855 ... -0.017703 -0.006940 -0.016221 -0.000650 -0.006656 -0.013235 -0.017073 -0.007127 -0.010318 -0.033526
age -0.025875 0.010143 -0.138653 -0.553601 -0.341695 -0.038631 -0.524942 0.047645 0.079718 0.382869 ... 0.042173 0.005811 0.056478 0.051522 0.035865 0.046047 0.040616 0.001615 0.008958 0.068162
renovated? -0.020044 0.065586 -0.006708 0.008054 0.008063 0.015252 -0.017627 0.059866 0.051609 -0.058605 ... 0.032051 -0.009300 0.000101 0.038950 -0.000908 0.023986 0.000575 -0.004554 0.009816 0.012844
basement? -0.002479 0.111738 0.121646 0.082818 0.132705 -0.038494 -0.344237 0.012306 0.122284 0.150601 ... 0.008239 -0.009576 0.008280 0.025237 0.024587 0.035156 0.047419 0.018072 0.023651 0.090427
greaterThan15? -0.003718 0.188506 0.420265 0.412152 0.608300 0.053106 0.145569 -0.012919 0.006098 0.038389 ... 0.026258 0.004699 0.000422 0.024729 0.028069 -0.009616 0.029550 0.044013 0.013874 -0.016036
zipcode_98002 0.002621 -0.087525 0.035289 0.030896 0.005729 -0.012829 0.006593 -0.003793 -0.016214 0.011635 ... -0.007319 -0.003426 -0.011172 -0.007814 -0.005898 -0.007890 -0.006920 -0.004972 -0.006985 -0.008262
zipcode_98003 0.002595 -0.090122 0.018842 0.028055 0.028719 -0.008497 -0.017116 -0.005440 0.024972 -0.007588 ... -0.010496 -0.004914 -0.016023 -0.011206 -0.008459 -0.011316 -0.009925 -0.007130 -0.010017 -0.011849
zipcode_98004 -0.003893 0.116797 0.000848 -0.036770 -0.020377 -0.008030 -0.044550 -0.003923 -0.015661 0.020667 ... -0.007568 -0.003543 -0.011553 -0.008080 -0.006100 -0.008159 -0.007156 -0.005141 -0.007223 -0.008544
zipcode_98005 0.021594 0.104474 0.034138 0.016470 0.032532 -0.001319 -0.050190 -0.004459 -0.015399 0.045693 ... -0.008604 -0.004028 -0.013134 -0.009185 -0.006934 -0.009275 -0.008135 -0.005845 -0.008211 -0.009713
zipcode_98006 -0.007427 0.123464 0.049871 0.028312 0.058826 -0.010633 -0.035935 -0.007395 -0.007814 0.077435 ... -0.014267 -0.006679 -0.021779 -0.015231 -0.011498 -0.015380 -0.013490 -0.009692 -0.013615 -0.016106
zipcode_98007 -0.004238 0.043006 0.036637 -0.001625 0.003947 -0.011379 -0.022188 -0.004552 -0.020423 0.030854 ... -0.008782 -0.004112 -0.013407 -0.009376 -0.007078 -0.009468 -0.008304 -0.005966 -0.008381 -0.009914
zipcode_98008 -0.008666 0.047054 0.047100 -0.016946 -0.007176 -0.016273 -0.080471 -0.006539 0.010186 0.051988 ... -0.012617 -0.005907 -0.019260 -0.013470 -0.010168 -0.013601 -0.011929 -0.008571 -0.012040 -0.014243
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
zipcode_98092 -0.031101 -0.097740 0.034232 0.052894 0.064362 0.070696 0.046144 -0.006893 -0.001079 -0.033526 ... -0.013300 -0.006227 -0.020303 -0.014199 -0.010719 -0.014338 -0.012575 -0.009035 -0.012692 -0.015014
zipcode_98102 -0.013103 0.053960 -0.026006 -0.000481 -0.034964 -0.019497 0.051789 -0.003401 -0.009529 -0.020131 ... -0.006562 -0.003072 -0.010017 -0.007005 -0.005288 -0.007074 -0.006204 -0.004457 -0.006262 -0.007407
zipcode_98103 -0.001394 0.077554 -0.072693 -0.060086 -0.108644 -0.052070 0.109134 -0.009741 -0.005990 0.021563 ... -0.018794 -0.008799 -0.028691 -0.020065 -0.015147 -0.020261 -0.017771 -0.012767 -0.017936 -0.021217
zipcode_98105 0.017373 0.072923 -0.005229 -0.034625 -0.043008 -0.025463 0.002599 -0.005057 -0.020157 0.031064 ... -0.009756 -0.004568 -0.014894 -0.010416 -0.007863 -0.010518 -0.009225 -0.006628 -0.009311 -0.011014
zipcode_98106 -0.011359 -0.103079 -0.016730 -0.028195 -0.070874 -0.029427 -0.019122 -0.006744 -0.014695 -0.020534 ... -0.013011 -0.006091 -0.019862 -0.013891 -0.010486 -0.014027 -0.012302 -0.008839 -0.012417 -0.014688
zipcode_98107 -0.056313 0.050207 -0.057059 -0.017303 -0.081890 -0.035718 0.076347 -0.006539 0.003198 0.007691 ... -0.012617 -0.005907 -0.019260 -0.013470 -0.010168 -0.013601 -0.011929 -0.008571 -0.012040 -0.014243
zipcode_98108 0.007762 -0.061715 -0.004553 -0.010971 -0.028309 -0.023039 -0.021710 -0.005204 -0.008410 0.006168 ... -0.010040 -0.004700 -0.015327 -0.010719 -0.008092 -0.010824 -0.009493 -0.006820 -0.009581 -0.011334
zipcode_98109 -0.023787 0.066939 -0.029023 -0.021188 -0.034565 -0.018502 0.010736 -0.003401 0.014101 -0.008707 ... -0.006562 -0.003072 -0.010017 -0.007005 -0.005288 -0.007074 -0.006204 -0.004457 -0.006262 -0.007407
zipcode_98112 0.039650 0.091588 -0.031629 -0.030295 -0.027822 -0.024244 0.021511 -0.004607 -0.016345 -0.012036 ... -0.008888 -0.004161 -0.013568 -0.009489 -0.007163 -0.009582 -0.008404 -0.006038 -0.008482 -0.010033
zipcode_98115 0.047026 0.084607 -0.044090 -0.085640 -0.073873 -0.041937 -0.040578 -0.009402 -0.014940 0.029639 ... -0.018140 -0.008493 -0.027692 -0.019367 -0.014620 -0.019556 -0.017152 -0.012323 -0.017312 -0.020478
zipcode_98116 0.027202 0.055004 -0.050627 -0.041520 -0.058379 -0.033290 -0.003224 -0.006955 0.045148 0.024048 ... -0.013418 -0.006282 -0.020484 -0.014325 -0.010814 -0.014465 -0.012687 -0.009115 -0.012805 -0.015148
zipcode_98117 -0.020878 0.064132 -0.080612 -0.088710 -0.098295 -0.045095 -0.023945 -0.009328 -0.014330 0.040663 ... -0.017996 -0.008425 -0.027472 -0.019213 -0.014504 -0.019401 -0.017016 -0.012225 -0.017175 -0.020316
zipcode_98118 -0.015644 -0.052790 -0.027328 -0.067726 -0.060287 -0.036521 -0.042249 -0.001331 0.015522 0.006379 ... -0.016219 -0.007593 -0.024759 -0.017315 -0.013071 -0.017485 -0.015335 -0.011018 -0.015478 -0.018309
zipcode_98119 -0.007253 0.088882 -0.037944 -0.013067 -0.038978 -0.025313 0.030463 -0.004661 0.012511 -0.012112 ... -0.008992 -0.004210 -0.013727 -0.009600 -0.007247 -0.009694 -0.008503 -0.006109 -0.008582 -0.010151
zipcode_98122 0.071459 0.052194 -0.035553 -0.018286 -0.059711 -0.035012 0.055085 -0.006500 -0.012020 -0.031378 ... -0.012541 -0.005872 -0.019145 -0.013389 -0.010108 -0.013520 -0.011858 -0.008520 -0.011969 -0.014158
zipcode_98125 0.014865 -0.033516 -0.025351 -0.063729 -0.060496 -0.027721 -0.037010 -0.008082 0.003497 -0.011473 ... -0.015593 -0.007300 -0.023804 -0.016647 -0.012567 -0.016810 -0.014744 -0.010593 -0.014881 -0.017603
zipcode_98126 0.029264 -0.040090 -0.071494 -0.078276 -0.082130 -0.033167 -0.036542 -0.007348 0.050861 0.022639 ... -0.014177 -0.006637 -0.021642 -0.015136 -0.011426 -0.015284 -0.013405 -0.009631 -0.013530 -0.016005
zipcode_98133 -0.002722 -0.087903 -0.032849 -0.082177 -0.081896 -0.031615 -0.028574 -0.008907 -0.038521 0.040608 ... -0.017185 -0.008045 -0.026233 -0.018347 -0.013850 -0.018526 -0.016249 -0.011674 -0.016400 -0.019400
zipcode_98136 0.003114 0.018865 -0.061939 -0.053006 -0.058694 -0.027710 -0.008444 0.002906 0.061794 -0.001649 ... -0.012262 -0.005741 -0.018718 -0.013091 -0.009882 -0.013219 -0.011594 -0.008330 -0.011702 -0.013842
zipcode_98144 -0.011139 0.000104 -0.033298 -0.032471 -0.055552 -0.036940 0.028072 -0.007016 0.012985 0.015424 ... -0.013536 -0.006337 -0.020663 -0.014451 -0.010909 -0.014592 -0.012799 -0.009195 -0.012918 -0.015280
zipcode_98146 0.001460 -0.047665 -0.007079 -0.039758 -0.027206 -0.015421 -0.042592 0.026583 0.045507 -0.000580 ... 1.000000 -0.004956 -0.016159 -0.011301 -0.008531 -0.011412 -0.010009 -0.007191 -0.010102 -0.011950
zipcode_98148 0.019428 -0.047400 -0.004261 -0.004273 -0.013488 -0.006879 -0.020051 -0.002569 -0.012774 -0.022754 ... -0.004956 1.000000 -0.007565 -0.005291 -0.003994 -0.005343 -0.004686 -0.003367 -0.004729 -0.005595
zipcode_98155 0.004035 -0.073215 -0.002393 -0.064857 -0.048314 -0.017788 -0.084405 -0.008375 -0.010439 0.022500 ... -0.016159 -0.007565 1.000000 -0.017252 -0.013023 -0.017420 -0.015279 -0.010977 -0.015421 -0.018242
zipcode_98166 -0.020824 -0.031079 0.019327 -0.026581 0.013262 -0.003559 -0.052245 0.044280 0.067675 0.036199 ... -0.011301 -0.005291 -0.017252 1.000000 -0.009108 -0.012183 -0.010686 -0.007677 -0.010785 -0.012758
zipcode_98168 -0.037117 -0.091548 -0.000704 -0.040804 -0.019172 -0.003929 -0.037037 -0.004422 -0.017433 -0.014719 ... -0.008531 -0.003994 -0.013023 -0.009108 1.000000 -0.009197 -0.008067 -0.005795 -0.008142 -0.009631
zipcode_98177 -0.023158 0.029505 -0.005586 -0.027797 0.003335 -0.014203 -0.056613 -0.005915 0.048496 0.010520 ... -0.011412 -0.005343 -0.017420 -0.012183 -0.009197 1.000000 -0.010790 -0.007752 -0.010890 -0.012882
zipcode_98178 -0.014553 -0.070786 0.022182 -0.012042 0.002903 -0.015312 -0.047729 0.073863 0.102690 0.000918 ... -0.010009 -0.004686 -0.015279 -0.010686 -0.008067 -0.010790 1.000000 -0.006799 -0.009552 -0.011299
zipcode_98188 0.004211 -0.060059 0.024950 0.013796 0.014412 -0.007046 -0.024656 -0.003727 0.008440 -0.006151 ... -0.007191 -0.003367 -0.010977 -0.007677 -0.005795 -0.007752 -0.006799 1.000000 -0.006862 -0.008118
zipcode_98198 0.020111 -0.075009 0.004861 -0.009957 0.001799 -0.007937 -0.036357 0.084290 0.115757 0.009113 ... -0.010102 -0.004729 -0.015421 -0.010785 -0.008142 -0.010890 -0.009552 -0.006862 1.000000 -0.011404
zipcode_98199 -0.036516 0.088184 -0.048317 -0.036033 -0.041607 -0.028812 -0.029076 -0.006194 0.010946 0.016368 ... -0.011950 -0.005595 -0.018242 -0.012758 -0.009631 -0.012882 -0.011299 -0.008118 -0.011404 1.000000

92 rows × 92 columns
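
A 92 x 92 correlation matrix is hard to scan by eye. As a minimal sketch, the strongly correlated pairs can be listed directly. The helper name `top_correlated_pairs`, the 0.8 threshold, and the toy columns are assumptions for illustration and are not part of the class notebook.

```python
import numpy as np

def top_correlated_pairs(names, corr, threshold=0.8):
    """Return (name_a, name_b, r) for each pair with |r| >= threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):   # upper triangle only, skips the diagonal
            r = float(corr[i, j])
            if abs(r) >= threshold:
                pairs.append((names[i], names[j], r))
    # Strongest relationships first
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Toy data (hypothetical): 'sqft_above' is built to track 'sqft_living'
rng = np.random.RandomState(0)
sqft_living = rng.uniform(500, 4000, size=200)
sqft_above = 0.8 * sqft_living + rng.normal(0, 50, size=200)
noise = rng.normal(0, 1, size=200)

names = ['sqft_living', 'sqft_above', 'noise']
data = np.column_stack([sqft_living, sqft_above, noise])
corr = np.corrcoef(data, rowvar=False)   # Pearson r, the default for DataFrame.corr()

pairs = top_correlated_pairs(names, corr)
print(pairs)
```

With a pandas DataFrame `df`, the same helper could be called as `top_correlated_pairs(list(df.columns), df.corr().values)`.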

In [65]:
# Target (y): the sale price we want to predict
y = x_no_outs[['price']]

# Features (x): drop the target and the columns not used as model inputs
x_no_outs.drop(['date','id','price','age'],axis=1,inplace=True)
# Other candidates to drop: 'yr_built','condition','waterfront'
C:\Users\nwerner\AppData\Local\Continuum\Anaconda2\lib\site-packages\ipykernel_launcher.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
In [66]:
# statsmodels OLS does not add an intercept on its own, so add a constant column (const = 1.0)
x_no_outs = sm.add_constant(x_no_outs)
x_no_outs.head()
Out[66]:
const bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade ... zipcode_98146 zipcode_98148 zipcode_98155 zipcode_98166 zipcode_98168 zipcode_98177 zipcode_98178 zipcode_98188 zipcode_98198 zipcode_98199
1 1.0 3 2250.0 2570 7242 2.0 0 0 3 7 ... 0 0 0 0 0 0 0 0 0 0
3 1.0 4 3000.0 1960 5000 1.0 0 0 5 7 ... 0 0 0 0 0 0 0 0 0 0
4 1.0 3 2000.0 1680 8080 1.0 0 0 3 8 ... 0 0 0 0 0 0 0 0 0 0
6 1.0 3 2250.0 1715 6819 2.0 0 0 3 7 ... 0 0 0 0 0 0 0 0 0 0
7 1.0 3 1500.0 1060 9711 1.0 0 0 3 7 ... 0 0 0 0 0 0 0 0 1 0

5 rows × 90 columns

In [67]:
# statsmodels: Ordinary Least Squares (OLS)
# Fit price against all features and show the regression summary
est = sm.OLS(y, x_no_outs)
est_price = est.fit()
est_price.summary()
Out[67]:
OLS Regression Results
Dep. Variable: price R-squared: 0.786
Model: OLS Adj. R-squared: 0.785
Method: Least Squares F-statistic: 727.8
Date: Sun, 18 Feb 2018 Prob (F-statistic): 0.00
Time: 18:32:26 Log-Likelihood: -2.1862e+05
No. Observations: 17280 AIC: 4.374e+05
Df Residuals: 17192 BIC: 4.381e+05
Df Model: 87
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -8.507e+06 3.6e+06 -2.366 0.018 -1.56e+07 -1.46e+06
bedrooms -2083.3047 827.108 -2.519 0.012 -3704.520 -462.090
bathrooms 10.1584 1.446 7.025 0.000 7.324 12.993
sqft_living 45.6760 0.879 51.982 0.000 43.954 47.398
sqft_lot 0.2843 0.022 13.073 0.000 0.242 0.327
floors -1.309e+04 1693.136 -7.734 0.000 -1.64e+04 -9775.592
waterfront 1.312e+05 1.19e+04 11.028 0.000 1.08e+05 1.55e+05
view 2.291e+04 1058.149 21.649 0.000 2.08e+04 2.5e+04
condition 2.125e+04 1045.685 20.326 0.000 1.92e+04 2.33e+04
grade 3.823e+04 1010.051 37.854 0.000 3.63e+04 4.02e+04
sqft_above 50.1094 1.422 35.248 0.000 47.323 52.896
sqft_basement -4.4334 1.853 -2.393 0.017 -8.065 -0.802
yr_built -487.7769 35.268 -13.831 0.000 -556.906 -418.648
yr_renovated 1587.6713 190.029 8.355 0.000 1215.196 1960.147
lat 1.724e+05 3.37e+04 5.116 0.000 1.06e+05 2.38e+05
long -8172.9289 2.77e+04 -0.295 0.768 -6.25e+04 4.61e+04
sqft_living15 42.5552 1.015 41.939 0.000 40.566 44.544
sqft_lot15 0.0448 0.034 1.326 0.185 -0.021 0.111
renovated? -3.143e+06 3.79e+05 -8.290 0.000 -3.89e+06 -2.4e+06
basement? 1.282e+04 2333.907 5.494 0.000 8247.800 1.74e+04
greaterThan15? 3.1207 0.888 3.513 0.000 1.380 4.862
zipcode_98002 -1.039e+04 9807.321 -1.060 0.289 -2.96e+04 8832.455
zipcode_98003 -8718.9095 7761.213 -1.123 0.261 -2.39e+04 6493.860
zipcode_98004 3.882e+05 1.41e+04 27.529 0.000 3.61e+05 4.16e+05
zipcode_98005 2.593e+05 1.36e+04 19.051 0.000 2.33e+05 2.86e+05
zipcode_98006 2.04e+05 1.14e+04 17.839 0.000 1.82e+05 2.26e+05
zipcode_98007 1.87e+05 1.39e+04 13.481 0.000 1.6e+05 2.14e+05
zipcode_98008 1.793e+05 1.34e+04 13.360 0.000 1.53e+05 2.06e+05
zipcode_98010 9.44e+04 1.27e+04 7.443 0.000 6.95e+04 1.19e+05
zipcode_98011 6.544e+04 1.69e+04 3.862 0.000 3.22e+04 9.86e+04
zipcode_98014 5.802e+04 1.9e+04 3.051 0.002 2.07e+04 9.53e+04
zipcode_98019 2.686e+04 1.9e+04 1.413 0.158 -1.04e+04 6.41e+04
zipcode_98022 1.243e+04 1.13e+04 1.102 0.270 -9682.776 3.46e+04
zipcode_98023 -3.737e+04 7203.376 -5.188 0.000 -5.15e+04 -2.33e+04
zipcode_98024 9.755e+04 1.74e+04 5.622 0.000 6.35e+04 1.32e+05
zipcode_98027 1.533e+05 1.18e+04 12.947 0.000 1.3e+05 1.76e+05
zipcode_98028 5.627e+04 1.65e+04 3.415 0.001 2.4e+04 8.86e+04
zipcode_98029 1.861e+05 1.35e+04 13.761 0.000 1.6e+05 2.13e+05
zipcode_98030 -4596.7442 8093.903 -0.568 0.570 -2.05e+04 1.13e+04
zipcode_98031 -6908.3003 8281.601 -0.834 0.404 -2.31e+04 9324.482
zipcode_98032 -3.841e+04 1.07e+04 -3.597 0.000 -5.93e+04 -1.75e+04
zipcode_98033 2.275e+05 1.45e+04 15.640 0.000 1.99e+05 2.56e+05
zipcode_98034 1.057e+05 1.53e+04 6.896 0.000 7.57e+04 1.36e+05
zipcode_98038 3.046e+04 9407.876 3.237 0.001 1.2e+04 4.89e+04
zipcode_98039 5.23e+05 4.54e+04 11.523 0.000 4.34e+05 6.12e+05
zipcode_98040 3.304e+05 1.27e+04 26.014 0.000 3.06e+05 3.55e+05
zipcode_98042 -5838.3596 7989.699 -0.731 0.465 -2.15e+04 9822.266
zipcode_98045 8.373e+04 1.74e+04 4.816 0.000 4.96e+04 1.18e+05
zipcode_98052 1.845e+05 1.48e+04 12.482 0.000 1.56e+05 2.14e+05
zipcode_98053 1.703e+05 1.62e+04 10.536 0.000 1.39e+05 2.02e+05
zipcode_98055 1.822e+04 9234.149 1.973 0.049 117.067 3.63e+04
zipcode_98056 6.954e+04 9859.732 7.053 0.000 5.02e+04 8.89e+04
zipcode_98058 1.145e+04 8757.731 1.308 0.191 -5712.535 2.86e+04
zipcode_98059 6.35e+04 9704.197 6.543 0.000 4.45e+04 8.25e+04
zipcode_98065 9.585e+04 1.57e+04 6.109 0.000 6.51e+04 1.27e+05
zipcode_98070 8.872e+04 1.12e+04 7.897 0.000 6.67e+04 1.11e+05
zipcode_98072 9.072e+04 1.72e+04 5.289 0.000 5.71e+04 1.24e+05
zipcode_98074 1.596e+05 1.43e+04 11.149 0.000 1.32e+05 1.88e+05
zipcode_98075 1.82e+05 1.39e+04 13.046 0.000 1.55e+05 2.09e+05
zipcode_98077 9.416e+04 1.8e+04 5.220 0.000 5.88e+04 1.3e+05
zipcode_98092 -1.724e+04 7437.490 -2.318 0.020 -3.18e+04 -2665.387
zipcode_98102 2.921e+05 1.51e+04 19.356 0.000 2.63e+05 3.22e+05
zipcode_98103 2.363e+05 1.37e+04 17.225 0.000 2.09e+05 2.63e+05
zipcode_98105 2.721e+05 1.44e+04 18.890 0.000 2.44e+05 3e+05
zipcode_98106 8.337e+04 1.04e+04 8.037 0.000 6.3e+04 1.04e+05
zipcode_98107 2.402e+05 1.4e+04 17.122 0.000 2.13e+05 2.68e+05
zipcode_98108 7.618e+04 1.12e+04 6.776 0.000 5.41e+04 9.82e+04
zipcode_98109 3.162e+05 1.52e+04 20.817 0.000 2.86e+05 3.46e+05
zipcode_98112 3.124e+05 1.36e+04 22.920 0.000 2.86e+05 3.39e+05
zipcode_98115 2.299e+05 1.4e+04 16.440 0.000 2.02e+05 2.57e+05
zipcode_98116 2.315e+05 1.14e+04 20.314 0.000 2.09e+05 2.54e+05
zipcode_98117 2.254e+05 1.41e+04 15.983 0.000 1.98e+05 2.53e+05
zipcode_98118 1.278e+05 1.01e+04 12.608 0.000 1.08e+05 1.48e+05
zipcode_98119 3.064e+05 1.4e+04 21.874 0.000 2.79e+05 3.34e+05
zipcode_98122 2.29e+05 1.23e+04 18.630 0.000 2.05e+05 2.53e+05
zipcode_98125 1.171e+05 1.5e+04 7.806 0.000 8.77e+04 1.46e+05
zipcode_98126 1.469e+05 1.05e+04 13.981 0.000 1.26e+05 1.68e+05
zipcode_98133 7.221e+04 1.55e+04 4.661 0.000 4.18e+04 1.03e+05
zipcode_98136 2.005e+05 1.07e+04 18.739 0.000 1.8e+05 2.21e+05
zipcode_98144 1.741e+05 1.15e+04 15.165 0.000 1.52e+05 1.97e+05
zipcode_98146 9.564e+04 1.01e+04 9.437 0.000 7.58e+04 1.16e+05
zipcode_98148 3.628e+04 1.38e+04 2.636 0.008 9301.053 6.33e+04
zipcode_98155 5.655e+04 1.61e+04 3.508 0.000 2.5e+04 8.81e+04
zipcode_98166 7.39e+04 8948.872 8.258 0.000 5.64e+04 9.14e+04
zipcode_98168 1.587e+04 1.06e+04 1.502 0.133 -4835.871 3.66e+04
zipcode_98177 1.2e+05 1.62e+04 7.410 0.000 8.82e+04 1.52e+05
zipcode_98178 2.172e+04 1.02e+04 2.122 0.034 1657.944 4.18e+04
zipcode_98188 1.081e+04 1.07e+04 1.006 0.315 -1.03e+04 3.19e+04
zipcode_98198 1.943e+04 8344.430 2.328 0.020 3072.447 3.58e+04
zipcode_98199 2.647e+05 1.36e+04 19.487 0.000 2.38e+05 2.91e+05
Omnibus: 943.097 Durbin-Watson: 1.993
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2771.542
Skew: 0.262 Prob(JB): 0.00
Kurtosis: 4.891 Cond. No. 1.34e+17
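
The `Cond. No.` of 1.34e+17 in the summary above is extremely large. That usually signals that some input columns are exact, or nearly exact, linear combinations of others. These notes do not establish which columns are responsible. The sketch below uses made-up data to show how one redundant column drives the condition number up. It assumes `np.linalg.cond` (ratio of largest to smallest singular value) is a reasonable stand-in for the figure statsmodels reports.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 100
a = rng.normal(size=n)
b = rng.normal(size=n)

# Well-conditioned design: a constant plus two unrelated columns
X_ok = np.column_stack([np.ones(n), a, b])

# Ill-conditioned design: the last column is an exact sum of two others
X_bad = np.column_stack([np.ones(n), a, b, a + b])

cond_ok = np.linalg.cond(X_ok)
cond_bad = np.linalg.cond(X_bad)
rank_bad = np.linalg.matrix_rank(X_bad)   # fewer independent columns than actual columns

print(cond_ok)
print(cond_bad)
print(rank_bad)
```

When the condition number is this large, individual coefficients and their standard errors can be unstable even if the overall fit (R-squared) looks fine.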
In [68]:
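# sklearn.cross_validation is deprecated (see the warning in cell 1); use model_selection instead
from sklearn.model_selection import train_test_split

# 80% training data, 20% testing data
X_train, X_test, y_train, y_test = train_test_split(x_no_outs,y,train_size=0.8)
X_train.shape, y_train.shape, X_test.shape, y_test.shape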
Out[68]:
((13824, 90), (13824, 1), (3456, 90), (3456, 1))
In [69]:
from sklearn import linear_model

# Linear Regression
reg = linear_model.LinearRegression()
regmodel = reg.fit(X_train,y_train)
In [70]:
# Predict prices for the held-out test set
y_predtest = regmodel.predict(X_test)
In [71]:
from sklearn.metrics import mean_squared_error, r2_score
In [72]:
# Root mean squared error (RMSE) on the test set, in the same units as price
mean_squared_error(y_test, y_predtest)**0.5
Out[72]:
73768.106058077668
In [73]:
# R-squared on the test set
r2_score(y_test,y_predtest)
Out[73]:
0.79396540202174837
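
The two metrics above can be worked out by hand to see what they measure. This is a minimal sketch with four made-up prices; the function names `rmse` and `r2` are illustrative and are not the scikit-learn implementations.

```python
def rmse(y_true, y_pred):
    """Square root of the average squared prediction error."""
    n = float(len(y_true))
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return (sum(squared_errors) / n) ** 0.5

def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / float(len(y_true))
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and predicted prices
y_true = [300000.0, 450000.0, 500000.0, 750000.0]
y_pred = [320000.0, 430000.0, 540000.0, 710000.0]

rmse_value = rmse(y_true, y_pred)   # about 31622.78
r2_value = r2(y_true, y_pred)       # about 0.9619
print(rmse_value)
print(r2_value)
```

RMSE is in the same units as the target, so a value near 73,768 means a typical test-set error of roughly that much in price. R-squared is unitless and compares the model against always predicting the mean.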
In [74]:
from sklearn.linear_model import Ridge,Lasso
In [75]:
# Create Model using Ridge Regression
ridgereg = linear_model.Ridge()
ridgereg.fit(X_train,y_train)

# Get Predicted Values using Ridge Regression Model
y_pred_ridge = ridgereg.predict(X_test)
In [76]:
# Evaluate the Ridge predictions on the test set: RMSE (printed), then R-squared

print(mean_squared_error(y_test,y_pred_ridge)**0.5)
r2_score(y_test,y_pred_ridge)
73978.8268305
Out[76]:
0.79278663300683838
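
Ridge regression adds a penalty of alpha times the sum of squared coefficients to the least-squares objective, which shrinks the coefficients toward zero. scikit-learn's `Ridge()` uses alpha=1.0 when none is given, which may explain why the scores above barely differ from plain linear regression. The sketch below uses the closed-form solution on made-up, centered data to show the shrinkage as alpha grows. The function name `ridge_coefficients` and the toy data are assumptions for illustration.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution on centered data (intercept not penalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = Xc.shape[1]
    # Solve (Xc'Xc + alpha * I) w = Xc'yc
    A = Xc.T.dot(Xc) + alpha * np.eye(n_features)
    return np.linalg.solve(A, Xc.T.dot(yc))

# Toy data (hypothetical) with known coefficients 3.0, -2.0, 0.5
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
y = X.dot(np.array([3.0, -2.0, 0.5])) + rng.normal(scale=0.1, size=50)

w_ols = ridge_coefficients(X, y, alpha=0.0)      # alpha = 0 is ordinary least squares
w_small = ridge_coefficients(X, y, alpha=1.0)
w_large = ridge_coefficients(X, y, alpha=1000.0)

print(w_ols)
print(w_small)
print(w_large)
```

In the notebook itself, a larger penalty could be tried with `linear_model.Ridge(alpha=...)` and compared on the test set.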
In [77]:
# Create Model using Lasso Regression
lassoreg = linear_model.Lasso()
lassoreg.fit(X_train,y_train)

# Get Predicted Values using Lasso Regression Model
y_pred_lasso = lassoreg.predict(X_test)
In [78]:
# Evaluate the Lasso predictions on the test set: RMSE (printed), then R-squared

print(mean_squared_error(y_test,y_pred_lasso)**0.5)
r2_score(y_test,y_pred_lasso)
73827.3657178
Out[78]:
0.79363424415642869
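
Lasso regression penalizes the sum of absolute coefficient values, which can set some coefficients exactly to zero and so acts as a form of feature selection. The sketch below is a small coordinate-descent solver on made-up data. It is a teaching aid under stated assumptions, not scikit-learn's implementation. The objective is assumed to be (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1, and the names `soft_threshold` and `lasso_coordinate_descent` are hypothetical.

```python
import numpy as np

def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; return exactly 0.0 inside [-alpha, alpha]."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1/(2n))*||y - Xw||^2 + alpha*||w||_1, one coefficient at a time."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back
            residual = y - X.dot(w) + X[:, j] * w[j]
            rho = X[:, j].dot(residual) / n
            z = X[:, j].dot(X[:, j]) / n
            w[j] = soft_threshold(rho, alpha) / z
    return w

# Toy data (hypothetical): features 1 and 3 have no effect on y
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 4))
y = X.dot(np.array([2.0, 0.0, -1.5, 0.0])) + rng.normal(scale=0.1, size=100)

w_ols = lasso_coordinate_descent(X, y, alpha=0.0)    # no penalty: least squares
w_lasso = lasso_coordinate_descent(X, y, alpha=0.2)  # penalty zeroes the unused features

print(w_ols)
print(w_lasso)
```

With the default `Lasso()` penalty and prices in the hundreds of thousands, little shrinkage would be expected, which is consistent with the Lasso score above being close to the plain linear regression score.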