Goal: use the given data to build a classification model that accurately predicts whether a person will default on their loan.
import numpy as np
from sklearn.cross_validation import train_test_split as tts
from sklearn import linear_model
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import time
import seaborn as sns
# Read in Data
df = pd.read_csv('LoansTrainingSet.csv') #,dtype=str
df.head(10)
df.info()
df.isnull().sum()
for col in df.columns:
    print 'Number of Unique {} values:'.format(col), df[col].nunique()
# Creating a fake data set to explore drop duplicates
data = {'fruit':['apple','banana','carrot','apple','apple'],'dessert':['cake','cookie','ice cream','cake','brownie']}
dataFr = pd.DataFrame(data)
print dataFr, '\n' # Fruit DataFrame
print dataFr.duplicated().sum() # Number of Duplicates
dataFr.duplicated(subset='dessert') # Bools of Number of Duplicates
# Using drop_duplicates, if I specify multiple columns it only drops a row when the values are identical in all of the specified columns
#dataFr.drop_duplicates(subset=['fruit','dessert'],inplace=True)
dataFr.drop_duplicates(inplace=True)
#dataFr = dataFr.drop_duplicates()
print dataFr, '\n'
dataFr.info()
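# Illustrative sketch of how the subset argument changes which rows count as duplicates,
# on a rebuilt copy of the toy DataFrame (dataFr was modified in place above; demoFr is just for this example)
demoFr = pd.DataFrame({'fruit':['apple','banana','carrot','apple','apple'],
                       'dessert':['cake','cookie','ice cream','cake','brownie']})
print demoFr.drop_duplicates(subset='dessert'), '\n'        # duplicates judged on 'dessert' alone
print demoFr.drop_duplicates(subset=['fruit','dessert'])    # duplicates judged on both columns together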
# Drop Duplicates where values are equal on ALL features
df.drop_duplicates(inplace=True)
print 'Down to {} rows.'.format(len(df))
print 'There are', df['Loan ID'].nunique(), 'values.'
print (len(df)-df['Loan ID'].nunique()), 'more Rows to drop.'
# There are still some duplicated rows
x = df.duplicated(subset=['Loan ID'])
rowsToDrop = []
for i, isDup in enumerate(x):
    if isDup:
        rowsToDrop.append(i)
print len(rowsToDrop)
rowsToDrop
def removeDups(indexOfDups, DF):
    # This function loops through the positions of duplicated 'Loan ID' rows and, for each
    # group of duplicates, keeps the row with the fewest nulls/odd values and returns the
    # positions of the remaining rows so they can be dropped
    indexesToDrop = []
    skip = 0
    for i, value in enumerate(indexOfDups):
        if skip > 0:
            skip -= 1
            continue
        j = 1
        indexesToCompare = [value-1, value]
        nulls = []
        if (i+j) < (len(indexOfDups)-1):
            while (i+j) < len(indexOfDups) and indexOfDups[i+j] == value+j:
                indexesToCompare.append(value+j)
                j += 1
                skip += 1
        for index in indexesToCompare:
            nullCount = 0
            for feature in DF.iloc[index]:
                if (pd.isnull(feature)==True)|(feature=='n/a')|(feature=='NA')|(feature=='#VALUE!')|(feature==99999999)|(feature=='nan'):
                    nullCount += 1
            nulls.append(nullCount)
        leastNulls = np.argmin(nulls)
        del indexesToCompare[leastNulls]
        indexesToDrop.append(indexesToCompare)
        # print 'Still Working. Just Finished Looping on Index:', i
    return indexesToDrop
print 'Number of Rows to Drop:', len(rowsToDrop), '\n'
start_time = time.time()
rowsToDelete = removeDups(rowsToDrop,df)
rowsToDelete = [item for sublist in rowsToDelete for item in sublist]
print '\n','Run Time:', (time.time()-start_time), 'seconds'
print len(rowsToDelete)
df.drop(df.index[rowsToDelete],inplace=True,axis=0)
'''for value in rowsToDelete:
df.drop(df.index[value])
print 'Just finished dropping row',value'''
print len(df['Loan ID']), len(df['Years in current job'])
# Can skip all those lines of code with this (1) line of code using groupby
'''import time
start_time = time.time()
df = df.groupby(df['Loan ID']).max()
print 'Run Time:', (time.time()-start_time)'''
# Kept the longer loop-based code above because it was actually significantly faster than the groupby
# Run time: approx 360 seconds for the groupby vs approx 24 seconds for the loop
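# An alternative sketch (not run, kept as a string like the groupby version above): rank each row by its
# count of null/odd values, then keep the best row per Loan ID with sort_values + drop_duplicates.
# The 'oddCount' column name is introduced here only for illustration.
'''
oddMarkers = ['n/a', 'NA', '#VALUE!', 'nan', 99999999]
oddCount = df.isnull().sum(axis=1) + df.isin(oddMarkers).sum(axis=1)
df = (df.assign(oddCount=oddCount)
        .sort_values('oddCount')
        .drop_duplicates(subset='Loan ID', keep='first')
        .drop('oddCount', axis=1))
'''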
df.info()
print df.isnull().sum(), '\n'
print df.duplicated().sum()
# Dropping Loan ID & Customer ID
df.drop(['Loan ID','Customer ID'],axis=1,inplace=True)
for col in df.columns:
print col, df[col].unique(), '\n'
# Fill Current Loan Amount
# Replace 99999999 Values
print df['Current Loan Amount'].value_counts(), '\n'
x = df[df['Current Loan Amount'] != 99999999]
df['Current Loan Amount'] = df['Current Loan Amount'].replace({99999999:x['Current Loan Amount'].median()})
print df['Current Loan Amount'].value_counts()
# Assuming the credit scores > 1000 are typos with an extra zero appended
# Removing the trailing 0 from (4) digit credit scores
df['Credit Score'] = df['Credit Score'].apply(lambda x: x/10 if x>=1000 else x)
print df['Credit Score'].unique(), '\n'
# Fico Credit Scores range from 300 to 850. Checking to make sure the values fall within this range.
count = 0
for score in df['Credit Score']:
    if (score<300)|(score>850):
        print score
        count += 1
print count
df['Credit Score'].fillna(df['Credit Score'].median(), inplace=True)
# Fill 'n/a' in Years in current job
# Assuming 'n/a' means they do not currently have a job
print df['Years in current job'].value_counts(), '\n'
df['Years in current job'] = df['Years in current job'].replace({'n/a':'No Job'})
print df['Years in current job'].value_counts()
# Fill Annual Income
print df['Annual Income'].median()
df['Annual Income'].fillna(df['Annual Income'].median(), inplace=True)
# Purpose has (2) values of 'other' & 'Other'
# Convert 'Other' values to 'other'
df['Purpose'] = df['Purpose'].apply(lambda x: x.lower() if x=='Other' else x)
print df['Purpose'].unique()
# For 'Months since last delinquent', fill NULL values (borrowers who have never been delinquent) with max+1
# A 'Been Delinquent?' (Y/N) feature is created from this below
monthsSinceLastDelinquentMaxPlus1 = df['Months since last delinquent'].max()+1
df['Months since last delinquent'].fillna(monthsSinceLastDelinquentMaxPlus1, inplace=True)
print df['Months since last delinquent'].max()
# Maximum Open Credit has nulls encoded as '#VALUE!'
print len(df[df['Maximum Open Credit']=='#VALUE!']), '\n'
print df['Maximum Open Credit'].value_counts(), '\n'
df['Maximum Open Credit'].replace(to_replace='#VALUE!',value=0,inplace=True)
# Replace null Bankruptcy values with 0
df['Bankruptcies'].fillna(0,inplace=True)
# Replace null Tax Lien values with 0
df['Tax Liens'].fillna(0,inplace=True)
df.isnull().sum()
# Creating 'Has Job?' Feature
# 1 if have job, else 0
df['Has Job?'] = df['Years in current job'].apply(lambda x: 0 if x == 'No Job' else 1)
# Creating 'Been Delinquent?' Feature
# 1 if been delinquent, else 0
df['Been Delinquent?'] = df['Months since last delinquent'].apply(lambda x: 0 if x == monthsSinceLastDelinquentMaxPlus1 else 1)
# Creating 'Credit Problem?' Feature
# 1 if had Credit Problem, else 0
df['Credit Problem?'] = df['Number of Credit Problems'].apply(lambda x: 0 if x == 0 else 1)
# Creating 'Been Bankrupt?' Feature
# 1 if been Bankrupt, else 0
df['Been Bankrupt?'] = df['Bankruptcies'].apply(lambda x: 0 if x == 0 else 1)
# Creating 'Had Tax Lien?' Feature
# 1 if Had Tax Lien, else 0
df['Had Tax Lien?'] = df['Tax Liens'].apply(lambda x: 0 if x==0 else 1)
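# Equivalent sketch (not run): the last three yes/no flags all follow the same
# "0 stays 0, anything else becomes 1" pattern, so they could also be built in one loop
'''
for src, flag in [('Number of Credit Problems', 'Credit Problem?'),
                  ('Bankruptcies', 'Been Bankrupt?'),
                  ('Tax Liens', 'Had Tax Lien?')]:
    df[flag] = (df[src] != 0).astype(int)
'''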
df.info()
# For Monthly Debt, remove '$' and ',' characters, then convert the columns to numeric types
df['Monthly Debt'] = df['Monthly Debt'].str.replace(r'\$','')   # escape '$' so it is treated literally, not as a regex anchor
df['Monthly Debt'] = df['Monthly Debt'].str.replace(',','')
df['Monthly Debt'] = df['Monthly Debt'].astype(float)
df['Maximum Open Credit'] = df['Maximum Open Credit'].astype(int)
df.info()
corr = df.corr()
fig, ax = plt.subplots(figsize=(14,14))
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values, ax=ax)
def plotStackedBars(xbars, ybars, df):
    # Builds a DataFrame of ybars value counts for each unique value of xbars,
    # ready to plot as stacked bars
    plot_df = pd.DataFrame()
    for value in df[xbars].unique():
        x = df[df[xbars]==value][ybars].value_counts()
        plot_df = plot_df.append(x)
    plot_df.index = df[xbars].unique()
    return plot_df
# Purpose
plotStackedBars('Purpose','Loan Status',df).plot(kind='bar', stacked=True, rot=0, figsize=(14,6), color = ['red', 'green'])
# Years in current job
plotStackedBars('Years in current job','Loan Status',df).plot(kind='bar', stacked=True, rot=0, figsize=(14,6), color = ['red', 'green'])
# Term
plotStackedBars('Term','Loan Status',df).plot(kind='bar', stacked=True, rot=0, figsize=(14,6), color = ['red', 'green'])
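# The same counts behind these stacked bars can also be produced with pd.crosstab;
# a sketch for the Term plot only (column order may differ from plotStackedBars, which orders by frequency)
pd.crosstab(df['Term'], df['Loan Status']).plot(kind='bar', stacked=True, rot=0, figsize=(14,6), color=['red','green'])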
fig = plt.figure(figsize=(10,6))
plt.hist([df[df['Loan Status'] == 'Charged Off']['Credit Score'], df[df['Loan Status'] == 'Fully Paid']['Credit Score']], stacked=True, color= ['red', 'green'], bins=30, label=['Charged Off', 'Fully Paid'])
plt.xlabel('Credit Score')
plt.ylabel('Number of Loans')
plt.legend()
plt.show()
fig = plt.figure(figsize=(10,6))
plt.hist([df[df['Loan Status'] == 'Charged Off']['Monthly Debt'], df[df['Loan Status'] == 'Fully Paid']['Monthly Debt']], stacked=True, bins=50, color= ['red', 'green'], label=['Charged Off', 'Fully Paid'])
plt.xlabel('Monthly Debt')
plt.ylabel('Number of Loans')
plt.legend()
plt.show()
df.info()
d = {'Fully Paid':1,'Charged Off':0}
df['Loan Status'] = df['Loan Status'].map(d)
print df['Loan Status'].value_counts()
df = pd.get_dummies(df, columns = ['Term', 'Years in current job', 'Home Ownership', 'Purpose'], drop_first=True)
df.head()
y = df['Loan Status']
X = df.drop(['Loan Status'], axis=1)
X_train, X_test, y_train,y_test = tts(X,y,train_size=0.8, random_state=17)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
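# Defaults ('Charged Off' -> 0) are typically the minority class, which is why F1 is
# reported alongside accuracy below; a quick check of the class balance in the training split:
print y_train.value_counts(normalize=True)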
# Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn import grid_search
logRegModel = LogisticRegression().fit(X_train, y_train)
predictionLR = logRegModel.predict(X_test)
accScoreLR = accuracy_score(y_test, predictionLR)
f1ScoreLR = f1_score(y_test, predictionLR)
print 'Logistic Regression Accuracy Score:', accScoreLR
print 'Logistic Regression F1 Score:', f1ScoreLR
# Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier as GBC
clfGB = GBC(n_estimators=10).fit(X_train, y_train)
predictionGB = clfGB.predict(X_test)
accScoreGB = accuracy_score(y_test, predictionGB)
f1ScoreGB = f1_score(y_test, predictionGB)
print 'Gradient Boosting Accuracy Score:', accScoreGB
print 'Gradient Boosting F1 Score:', f1ScoreGB
# Random Forest
from sklearn.ensemble import RandomForestClassifier as RFC
clfRF = RFC(n_estimators=10).fit(X_train, y_train)
predictionRF = clfRF.predict(X_test)
accScoreRF = accuracy_score(y_test, predictionRF)
f1ScoreRF = f1_score(y_test, predictionRF)
print 'Random Forest Accuracy Score:', accScoreRF
print 'Random Forest F1 Score:', f1ScoreRF
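# A sketch (not run) of how the grid_search module imported above could be used to tune
# the random forest; the parameter grid below is illustrative, not a setting from this notebook
'''
paramGrid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 5, 10]}
gridRF = grid_search.GridSearchCV(RFC(), paramGrid, scoring='f1', cv=3)
gridRF.fit(X_train, y_train)
print 'Best Params:', gridRF.best_params_
print 'Best CV F1 Score:', gridRF.best_score_
'''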
from sklearn.metrics import confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion Matrix', cmap=plt.cm.Blues):
    # This function plots the confusion matrix and prints the underlying array
    # Normalization can be applied by setting normalize=True
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    if normalize:
        cm = cm.astype(float)/cm.sum(axis=1)[:, np.newaxis]
        print 'Normalized Confusion Matrix'
    else:
        print 'Confusion Matrix without Normalization'
    print cm
    thresh = cm.max()/2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i,j], horizontalalignment='center', color='white' if cm[i,j] > thresh else 'black')
    plt.tight_layout()
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
cnf_matrix = confusion_matrix(y_test, predictionLR)
np.set_printoptions(precision=2)
plt.figure()
class_names = ['Loan Default', 'Fully Paid']
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True, title='Normalized Confusion Matrix')
plt.show()
from sklearn import metrics
predsLR = logRegModel.predict_proba(X_test)[:,1]
fpr, tpr, _ = metrics.roc_curve(y_test, predsLR)
roc_auc = metrics.auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0,1], [0,1], 'r--')
plt.xlim([0,1])
plt.ylim([0,1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
from ggplot import *
predsLR = logRegModel.predict_proba(X_test)[:,1]
fpr, tpr, _ = metrics.roc_curve(y_test, predsLR)
ROCdfLR = pd.DataFrame(dict(fpr=fpr, tpr=tpr))
ggplot(ROCdfLR, aes(x='fpr', y='tpr')) + geom_line() + geom_abline(linetype='dashed') + ggtitle("ROC Curve")
auc = metrics.auc(fpr, tpr)
ggplot(ROCdfLR, aes(x='fpr', y='tpr', ymin='0', ymax='tpr')) + geom_area(alpha=.2) + geom_line(y='tpr') + ggtitle("ROC Curve w/ AUC= " + str(auc))
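# For comparison, a short sketch repeating the ROC/AUC computation for all three fitted models
for name, model in [('Logistic Regression', logRegModel), ('Gradient Boosting', clfGB), ('Random Forest', clfRF)]:
    probs = model.predict_proba(X_test)[:,1]
    fpr_m, tpr_m, _ = metrics.roc_curve(y_test, probs)
    print name, 'AUC:', metrics.auc(fpr_m, tpr_m)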