Project 1 - Loan Predictions 3.0

Loan Predictions - Classifier Algorithms

Purpose/Goal:

To use the given data to build a classification algorithm that accurately predicts whether a person will default on their loan.

Process

  1. Read in Data
  2. Remove Duplicates
  3. Fill Nulls / Fix Odd Values
  4. Feature Engineering
  5. Convert Features to Numbers
  6. Train Algorithms
  7. Confusion Matrix, ROC & AUC

Read in Libraries

In [1]:
import numpy as np
from sklearn.cross_validation import train_test_split as tts
from sklearn import linear_model
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import time
import seaborn as sns
C:\Users\nwerner\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
In [2]:
# Read in Data

df = pd.read_csv('LoansTrainingSet.csv') #,dtype=str
C:\Users\nwerner\AppData\Local\Continuum\Anaconda2\lib\site-packages\IPython\core\interactiveshell.py:2717: DtypeWarning: Columns (16) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
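The DtypeWarning above refers to column 16, which appears to be 'Maximum Open Credit' (it mixes numbers with '#VALUE!' strings and gets cleaned up later in this notebook). A minimal sketch of quieting the warning along the lines the message itself suggests:

# A sketch only: read the messy column explicitly as text so pandas does not have to guess,
# and parse the whole file in one pass instead of in chunks.
df = pd.read_csv('LoansTrainingSet.csv',
                 dtype={'Maximum Open Credit': str},
                 low_memory=False)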

Preliminary Data Exploration

In [3]:
df.head(10)
Out[3]:
Loan ID Customer ID Loan Status Current Loan Amount Term Credit Score Years in current job Home Ownership Annual Income Purpose Monthly Debt Years of Credit History Months since last delinquent Number of Open Accounts Number of Credit Problems Current Credit Balance Maximum Open Credit Bankruptcies Tax Liens
0 000025bb-5694-4cff-b17d-192b1a98ba44 5ebc8bb1-5eb9-4404-b11b-a6eebc401a19 Fully Paid 11520 Short Term 741.0 10+ years Home Mortgage 33694.0 Debt Consolidation $584.03 12.3 41.0 10 0 6760 16056 0.0 0.0
1 00002c49-3a29-4bd4-8f67-c8f8fbc1048c 927b388d-2e01-423f-a8dc-f7e42d668f46 Fully Paid 3441 Short Term 734.0 4 years Home Mortgage 42269.0 other $1,106.04 26.3 NaN 17 0 6262 19149 0.0 0.0
2 00002d89-27f3-409b-aa76-90834f359a65 defce609-c631-447d-aad6-1270615e89c4 Fully Paid 21029 Short Term 747.0 10+ years Home Mortgage 90126.0 Debt Consolidation $1,321.85 28.8 NaN 5 0 20967 28335 0.0 0.0
3 00005222-b4d8-45a4-ad8c-186057e24233 070bcecb-aae7-4485-a26a-e0403e7bb6c5 Fully Paid 18743 Short Term 747.0 10+ years Own Home 38072.0 Debt Consolidation $751.92 26.2 NaN 9 0 22529 43915 0.0 0.0
4 0000757f-a121-41ed-b17b-162e76647c1f dde79588-12f0-4811-bab0-e2b07f633fcd Fully Paid 11731 Short Term 746.0 4 years Rent 50025.0 Debt Consolidation $355.18 11.5 NaN 12 0 17391 37081 0.0 0.0
5 0000a149-b055-4a57-b762-280783ccc25e 62ddc017-7023-4ba7-af23-1a7cd16c1ce5 Fully Paid 10208 Short Term 716.0 10+ years Rent 41853.0 Business Loan $561.52 13.2 NaN 4 1 2289 4671 1.0 0.0
6 0000afa6-8902-4f8f-b870-25a8fdad0aeb e49c1a82-a0f7-45e8-9f46-2f75c43f9fbc Charged Off 24613 Long Term 6640.0 6 years Rent 49225.0 Business Loan $542.29 17.6 73.0 7 0 14123 16954 0.0 0.0
7 0000afa6-8902-4f8f-b870-25a8fdad0aeb e49c1a82-a0f7-45e8-9f46-2f75c43f9fbc Charged Off 24613 Long Term NaN 6 years Rent NaN Business Loan $542.29 17.6 73.0 7 0 14123 16954 0.0 0.0
8 00011dfc-31c1-4178-932a-fbeb3f341efb ef6e098c-6c83-4752-8d00-ff793e476b8c Fully Paid 10036 Short Term NaN 5 years Rent NaN Debt Consolidation $386.36 17.7 NaN 7 0 11970 16579 0.0 0.0
9 0001cb86-af28-4011-bb86-183786e473ae 4aae67bb-d54b-41ae-8bce-1d62022ed8dd Fully Paid 2036 Short Term 733.0 n/a Home Mortgage 55985.0 Debt Consolidation $741.79 19.8 29.0 7 0 10926 15676 0.0 0.0
In [4]:
df.info()
df.isnull().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 256984 entries, 0 to 256983
Data columns (total 19 columns):
Loan ID                         256984 non-null object
Customer ID                     256984 non-null object
Loan Status                     256984 non-null object
Current Loan Amount             256984 non-null int64
Term                            256984 non-null object
Credit Score                    195308 non-null float64
Years in current job            256984 non-null object
Home Ownership                  256984 non-null object
Annual Income                   195308 non-null float64
Purpose                         256984 non-null object
Monthly Debt                    256984 non-null object
Years of Credit History         256984 non-null float64
Months since last delinquent    116601 non-null float64
Number of Open Accounts         256984 non-null int64
Number of Credit Problems       256984 non-null int64
Current Credit Balance          256984 non-null int64
Maximum Open Credit             256984 non-null object
Bankruptcies                    256455 non-null float64
Tax Liens                       256961 non-null float64
dtypes: float64(6), int64(4), object(9)
memory usage: 37.3+ MB
Out[4]:
Loan ID                              0
Customer ID                          0
Loan Status                          0
Current Loan Amount                  0
Term                                 0
Credit Score                     61676
Years in current job                 0
Home Ownership                       0
Annual Income                    61676
Purpose                              0
Monthly Debt                         0
Years of Credit History              0
Months since last delinquent    140383
Number of Open Accounts              0
Number of Credit Problems            0
Current Credit Balance               0
Maximum Open Credit                  0
Bankruptcies                       529
Tax Liens                           23
dtype: int64

Removing Duplicates

In [5]:
for col in df.columns:
    print 'Number of Unique {} values: '.format(col), df[col].nunique()
Number of Unique Loan ID values:  215700
Number of Unique Customer ID values:  215700
Number of Unique Loan Status values:  2
Number of Unique Current Loan Amount values:  27347
Number of Unique Term values:  2
Number of Unique Credit Score values:  334
Number of Unique Years in current job values:  12
Number of Unique Home Ownership values:  4
Number of Unique Annual Income values:  60558
Number of Unique Purpose values:  10
Number of Unique Monthly Debt values:  129115
Number of Unique Years of Credit History values:  541
Number of Unique Months since last delinquent values:  131
Number of Unique Number of Open Accounts values:  59
Number of Unique Number of Credit Problems values:  12
Number of Unique Current Credit Balance values:  45704
Number of Unique Maximum Open Credit values:  87188
Number of Unique Bankruptcies values:  8
Number of Unique Tax Liens values:  12
In [6]:
# Creating a fake data set to explore drop duplicates

data = {'fruit':['apple','banana','carrot','apple','apple'],'dessert':['cake','cookie','ice cream','cake','brownie']}
dataFr = pd.DataFrame(data)
print dataFr, '\n' # Fruit DataFrame
print dataFr.duplicated().sum() # Number of Duplicates
dataFr.duplicated(subset='dessert') # Bools of Number of Duplicates
     dessert   fruit
0       cake   apple
1     cookie  banana
2  ice cream  carrot
3       cake   apple
4    brownie   apple

1
Out[6]:
0    False
1    False
2    False
3     True
4    False
dtype: bool
In [7]:
# Using drop_duplicates: if I specify multiple columns, it only drops a row when the values are identical in all of the specified columns

#dataFr.drop_duplicates(subset=['fruit','dessert'],inplace=True)
dataFr.drop_duplicates(inplace=True)
#dataFr = dataFr.drop_duplicates()
print dataFr, '\n'
dataFr.info()
     dessert   fruit
0       cake   apple
1     cookie  banana
2  ice cream  carrot
4    brownie   apple

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 4
Data columns (total 2 columns):
dessert    4 non-null object
fruit      4 non-null object
dtypes: object(2)
memory usage: 96.0+ bytes
In [8]:
# Drop Duplicates where values are equal on ALL features

df.drop_duplicates(inplace=True)
print 'Down to {} rows.'.format(len(df))
print 'There are', df['Loan ID'].nunique(), 'values.'
print (len(df)-df['Loan ID'].nunique()), 'more Rows to drop.'
Down to 240374 rows.
There are 215700 values.
24674 more Rows to drop.
In [9]:
# There are still some duplicated rows

x = df.duplicated(subset=['Loan ID'])
rowsToDrop = []

for i, isDup in enumerate(x):
    if isDup:
        rowsToDrop.append(i)

print len(rowsToDrop)
rowsToDrop
24674
Out[9]:
[7,
 13,
 14,
 19,
 42,
 51,
 ...]
In [10]:
def removeDups(indexOfDups,DF):

# This function walks through the indexes of the duplicated rows,
# compares each run of consecutive duplicates, keeps the copy with the fewest nulls/odd values,
# and returns the indexes of the remaining copies so they can be dropped

    indexesToDrop=[]
    i=0
    skip=0

    for i,value in enumerate(indexOfDups):

        if skip>0:
            skip-=1
            continue

        j=1
        indexesToCompare = [value-1,value]
        nulls=[]

        if (i+j)<(len(indexOfDups)-1):
            while (indexOfDups[i+j]==value+j):
                indexesToCompare.append(value+j)
                j+=1
                skip+=1

        for index in indexesToCompare:
            nullCount=0
            for feature in DF.iloc[index]:
                if (pd.isnull(feature)==True)|(feature=='n/a')|(feature=='NA')|(feature=='#VALUE!')|(feature==99999999)|(feature=='nan'):
                    nullCount+=1
            nulls.append(nullCount)

        leastNulls = np.argmin(nulls)
        del indexesToCompare[leastNulls]

        indexesToDrop.append(indexesToCompare)
        # print 'Still Working. Just Finished Looping on Index:',i

    return indexesToDrop
In [11]:
print 'Number of Rows to Drop:', len(rowsToDrop), '\n'

start_time = time.time()

rowsToDelete = removeDups(rowsToDrop,df)
rowsToDelete = [item for sublist in rowsToDelete for item in sublist]

print '\n','Run Time:', (time.time()-start_time), 'seconds'
Number of Rows to Drop: 24674


Run Time: 11.5829999447 seconds
In [12]:
print len(rowsToDelete)

df.drop(df.index[rowsToDelete],inplace=True,axis=0)

'''for value in rowsToDelete:
    df.drop(df.index[value])
    print 'Just finished dropping row',value'''

print len(df['Loan ID']), len(df['Years in current job'])
24674
215700 215700
In [13]:
# Can skip all those lines of code with this (1) line of code using groupby

'''import time
start_time = time.time()

df = df.groupby(df['Loan ID']).max()

print 'Run Time:', (time.time()-start_time)'''

# Kept the long lines of code above because it was actually significantly faster to loop through the data than to perform the groupby
# Run time approx 360 seconds vs 24 seconds for the large loop
Out[13]:
"import time\nstart_time = time.time()\n\ndf = df.groupby(df['Loan ID']).max()\n\nprint 'Run Time:', (time.time()-start_time)"
In [14]:
df.info()
print df.isnull().sum(), '\n'
print df.duplicated().sum()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 215700 entries, 0 to 256983
Data columns (total 19 columns):
Loan ID                         215700 non-null object
Customer ID                     215700 non-null object
Loan Status                     215700 non-null object
Current Loan Amount             215700 non-null int64
Term                            215700 non-null object
Credit Score                    171202 non-null float64
Years in current job            215700 non-null object
Home Ownership                  215700 non-null object
Annual Income                   171202 non-null float64
Purpose                         215700 non-null object
Monthly Debt                    215700 non-null object
Years of Credit History         215700 non-null float64
Months since last delinquent    97438 non-null float64
Number of Open Accounts         215700 non-null int64
Number of Credit Problems       215700 non-null int64
Current Credit Balance          215700 non-null int64
Maximum Open Credit             215700 non-null object
Bankruptcies                    215248 non-null float64
Tax Liens                       215678 non-null float64
dtypes: float64(6), int64(4), object(9)
memory usage: 32.9+ MB
Loan ID                              0
Customer ID                          0
Loan Status                          0
Current Loan Amount                  0
Term                                 0
Credit Score                     44498
Years in current job                 0
Home Ownership                       0
Annual Income                    44498
Purpose                              0
Monthly Debt                         0
Years of Credit History              0
Months since last delinquent    118262
Number of Open Accounts              0
Number of Credit Problems            0
Current Credit Balance               0
Maximum Open Credit                  0
Bankruptcies                       452
Tax Liens                           22
dtype: int64

0
In [15]:
# Dropping Loan ID & Customer ID

df.drop(['Loan ID','Customer ID'],axis=1,inplace=True)

Fill Nulls / Fix Odd Values

  • Current Loan Amount (99999999 values)
  • Credit Score (4 Digit FICO Scores & Nulls)
  • Years in current job (Nulls)
  • Annual Income (Nulls)
  • Purpose ('Other' & 'other')
  • Months since last delinquent (Nulls)
  • Max Open Credit ('#Value!')
  • Bankruptcies (Nulls)
  • Tax Liens (Nulls)
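Each of these is handled column by column in the cells below. As a quick illustration of the overall plan, here is a minimal sketch that normalizes several of the odd values in one pass with df.replace; mapping them to NaN (or to 'No Job' / 'other') is an assumption for illustration only, since the cells below actually use medians, zeros, and max + 1 instead.

# A minimal sketch only - the actual per-column fixes are in the cells that follow
sentinelFixes = {'Current Loan Amount': {99999999: np.nan},
                 'Maximum Open Credit': {'#VALUE!': np.nan},
                 'Years in current job': {'n/a': 'No Job'},
                 'Purpose': {'Other': 'other'}}

dfSketch = df.replace(sentinelFixes)   # normalize the odd values in one pass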
In [16]:
for col in df.columns:
    print col, df[col].unique(), '\n'
Loan Status ['Fully Paid' 'Charged Off']

Current Loan Amount [11520  3441 21029 ..., 15352 29071 11524]

Term ['Short Term' 'Long Term']

Credit Score [  741.   734.   747.   746.   716.  6640.    nan   733.   742.   701.
   744.   729.   745.   728.   724.   692.   739.   740.   607.   735.
   623.   750.   737.   636.   736.   715.   727.   743.  7320.   751.
   703.   732.   707.   665.   730.   717.  7180.  6670.   748.  7270.
  7070.   738.   699.  7370.   700.   675.   725.   676.   711.  7060.
   679.   705.  7410.  7330.   713.   673.   722.   704.   656.   697.
   709.   749.  7130.   726.   609.  6620.   693.   687.   720.   712.
   718.   708.   710.   680.  7440.   657.   714.   678.   685.   723.
  6720.   653.  7000.   719.   731.   690.  6800.   671.   650.  7430.
   706.   696.  6600.  6760.   721.   615.   662.   640.  7460.   677.
   669.  6950.   612.   702.   683.   684.  7300.   698.   667.   666.
   606.   674.  7170.   655.   625.   652.  7470.   691.   689.  7010.
   695.  6490.   682.   668.  6780.  6030.  7390.  7380.   686.  7400.
   694.   670.   642.  6920.  7280.  6690.   661.  7310.   622.   635.
   632.  6740.  7230.  7420.   688.  7040.  6890.   610.   663.   658.
   613.  7210.  7200.  7340.  6880.  7150.  6510.  7250.  7050.   637.
  6980.  7260.  6960.  6790.  6810.  7100.  7360.  7350.  6900.   616.
  7030.   602.  7080.   639.  6380.  6430.   651.  6710.  6680.   601.
  7480.  7020.   619.  6840.   641.  6450.   605.   631.  6910.   654.
   664.  6480.  6280.  7120.  7240.   681.  6440.  7510.   633.  7190.
  6550.  6370.   660.   648.  6520.  6500.  6580.  6610.  7500.  7110.
   647.  7490.   624.   646.   659.  6130.   621.  6850.  7160.  7140.
  7090.   611.   618.   649.  6570.  6250.  6560.  7220.   643.   595.
  6990.  7290.   638.   645.  6730.   591.   594.  7450.   672.   644.
  6820.  6870.   620.  6240.   630.   626.  6940.  6770.  6630.  6700.
   599.  6270.  6660.   614.  6110.  6590.  6360.  6050.  6310.  6650.
  6530.  6390.   627.   628.  6830.  6930.  5970.  6750.   629.   604.
   587.  6350.   593.  6410.  6180.   634.   586.  6860.  6460.  6400.
   592.   608.  6220.  6260.  6170.  6970.   603.  6140.  5990.  6200.
   589.  6160.  6420.   600.  6540.  6320.  6470.   585.  6150.  5940.
  6080.  6100.  6210.   596.  6040.  6330.  6190.   617.  6290.  6300.
   597.  6230.  5980.  5950.  6000.  5890.   598.  6020.  5870.  6120.
   588.   590.  6060.  5920.  5900.  6090.  5850.  5880.  6340.  6010.
  5860.  5930.  6070.  5960.  5910.]

Years in current job ['10+ years' '4 years' '6 years' '5 years' 'n/a' '3 years' '2 years'
 '< 1 year' '1 year' '7 years' '9 years' '8 years']

Home Ownership ['Home Mortgage' 'Own Home' 'Rent' 'HaveMortgage']

Annual Income [ 33694.  42269.  90126. ...,  38649.  34749.  30854.]

Purpose ['Debt Consolidation' 'other' 'Business Loan' 'Home Improvements'
 'Buy House' 'Other' 'Buy a Car' 'Medical Bills' 'Take a Trip'
 'Educational Expenses']

Monthly Debt ['$584.03' '$1,106.04 ' '$1,321.85 ' ..., '$707.08' '$47.11' '$2,525.82 ']

Years of Credit History [ 12.3  26.3  28.8  26.2  11.5  13.2  17.6  17.7  19.8  26.8  19.2   9.9
  21.   18.8  15.   14.9  17.8  15.4  30.3  19.9  14.   29.6  13.6  13.8
  15.5  18.   15.1  16.2  14.6  27.2  22.   14.1  26.1  23.1  20.8  28.6
  24.7  23.6  10.8   8.   18.4  21.3  14.3  31.8  22.1  25.2  25.5  16.5
  30.   16.1  20.4  17.1  10.2  14.7   7.8  30.5  22.7  12.5  17.3  10.9
  21.5  14.8  10.3  27.4  14.2  17.4  21.4  13.7  19.5  32.5  44.5  32.
  16.9  19.4  24.1  23.4  20.5  13.3  16.   19.6  18.6  16.3  17.9  32.4
  18.7  23.3  25.   11.4  11.7  21.9  16.4  16.8  38.3  21.2   8.9  27.
  38.5  15.3  25.9  15.9  26.4  19.   17.2  40.2   6.6  27.7  23.5  24.8
  29.   26.7  11.8  11.   35.6  10.    6.2  19.3  23.    6.9  31.4  13.5
  17.   18.1  34.6  13.   18.2  25.1   9.3  29.4  16.7  23.8  17.5  16.6
  15.7  21.7  18.5  14.5  22.4  22.6  15.6  28.5   9.6  10.4   9.5  20.9
  20.1  15.8  24.9  20.   24.4  27.9   8.5  24.5  23.7  29.5  18.9  12.9
  10.5  23.2  10.6  21.6  43.2   7.6  13.4  25.7  11.9  12.6   9.2  21.8
  12.4  25.3  12.    6.5  27.1  28.   35.8  29.3  25.4  41.4  26.   20.3
  10.1  12.7  27.3  22.5  13.1  20.2  36.7  28.4  28.2  33.   11.1  11.2
  11.6   8.6  33.7   7.4  34.9  12.8  32.9  22.8  33.5   8.3  29.9  29.2
  11.3  35.   25.8   9.   13.9  40.4  20.7  18.3  19.1  22.2  36.8  10.7
  14.4  26.5  40.5  19.7   8.1  30.1  31.1  22.3  21.1  30.8   5.7  27.8
  32.8  32.2  30.4   5.2  29.7  37.7   6.8  31.7  28.1   8.4  27.5   6.4
  31.    9.1  47.7  15.2  41.1   8.8   5.9  24.6  12.1  28.3  31.2  34.8
  24.3   7.   41.2   9.7   9.8  37.3  47.4  34.1  31.9   5.8  12.2  38.9
  23.9  37.5  27.6  34.2  37.9   9.4  30.9  22.9  33.2  30.2  37.   45.2
  20.6  34.   38.   38.1  37.8  24.2  25.6  35.3  47.8  34.3   6.3   4.7
  33.4   7.9  38.2   7.7   7.5  31.6   7.1  37.2  24.   36.5  37.6  34.5
   6.   36.9  49.3  35.4   8.7  32.1  33.8  43.4  28.9   6.1  32.6  33.1
  50.5  39.2  38.7  33.6  41.6  33.9  36.1  42.8  26.6   4.9  40.9   8.2
  30.7  39.   33.3  53.5  28.7  29.1   4.8  34.4  31.3  31.5  38.8  30.6
  35.9   5.5   7.3  39.7  29.8   6.7  36.   51.   40.6   7.2   5.   51.4
  41.5  46.1  41.3  40.   46.   43.9  44.4  26.9  32.3  35.5  44.3  36.4
  37.4   5.4  42.2   5.1  32.7  45.7  46.8  45.8  47.   39.6   3.7   4.5
  39.9  39.4  35.1  47.9  39.5  45.3  39.1  40.3  47.3  37.1  46.5  48.5
  35.2  41.   38.6  44.   46.6  43.5  39.3  39.8  34.7  36.3  49.   35.7
  50.   48.6  52.3  40.8  43.1  48.4  36.6   5.6  43.7   5.3  45.4  55.3
  44.8  48.3  42.5  45.   51.5  49.5  38.4   3.9  42.4  44.7  54.6  44.2
  46.2  47.5  40.1  43.8  41.8  46.9  50.6  60.5  42.3  44.6  42.9   4.6
  42.7  52.5  43.   45.5  45.6  45.1  45.9  36.2   4.4  50.1  53.1  46.7
  57.5  56.1  51.1  48.7  42.   51.6  46.3  40.7  47.6  43.3  50.3   3.8
  46.4  59.7  49.8  50.9   4.   50.7  49.4  51.3   4.3  43.6  44.9  42.6
  49.6  41.9   4.1  50.8  52.1   4.2  62.5  47.1  50.2  48.2  52.9  41.7
  48.1  42.1  44.1  47.2  48.   61.3  61.7  51.7  53.9  58.   54.5  53.6
  51.8  52.4  51.9  53.7  56.   48.9  53.8  52.8  49.9  61.5  50.4  65.8
  64.6  49.1  57.7  56.5  55.8  49.7  49.2  57.1  55.4  52.2  58.8  56.8
  55.   61.8  70.5  54.2  55.5  52.6  48.8  53.4  57.8  53.   55.6  56.9
  54.3  59.9   3.4  55.7  57.   57.6  54.7  65.   54.1  54.   66.   59.1
  60.7]

Months since last delinquent [  41.   nan   73.   29.   43.   79.    2.   15.   27.    5.   55.   21.
   56.   25.   67.   37.   74.   45.   76.   46.   35.   60.   28.   34.
   13.    7.   17.   22.   42.   53.   12.   39.   71.   20.   38.    0.
   81.   78.   44.   63.   77.   49.   30.   70.   26.   14.    8.   52.
   68.   75.   51.   59.   40.    4.   18.   47.   24.   31.   69.   32.
   36.   16.   10.   62.    9.   11.   66.   58.   54.    3.   50.   82.
   61.   72.   57.   23.   19.   80.    6.    1.   86.   48.   33.   65.
   64.   84.   83.  104.   90.  149.   96.  110.   99.   91.   95.  152.
   87.  131.  100.  116.  106.  107.   85.   89.  101.  122.   93.  135.
  139.   88.  148.   92.   94.  114.  143.  115.  108.  140.  130.   98.
   97.  120.  102.  176.  151.  118.  109.  134.  112.  113.  133.  119.]

Number of Open Accounts [10 17  5  9 12  4  7 11  8 13 14 16 22 34 28  6 15  3 24  2 21 20 19 25 29
 18 23 35 32 31 26 47 30 27 37 40 33 36 39 38 42 48 76 50 44 41  1 56 43 53
  0 46 49 45 55 52 51 58 54]

Number of Credit Problems [ 0  1  2  5  3  4  6  7  9 10  8 11]

Current Credit Balance [ 6760  6262 20967 ..., 31006 33895 35089]

Maximum Open Credit [16056L 19149L 28335L ..., 55520L 37004L 62371L]

Bankruptcies [  0.   1.   2.  nan   3.   4.   5.   7.   6.]

Tax Liens [  0.   5.  nan   1.   2.   4.   3.   6.   7.   9.   8.  10.  11.]

In [17]:
# Fill Current Loan Amount
# Replace 99999999 Values

print df['Current Loan Amount'].value_counts(), '\n'
x = df[df['Current Loan Amount'] != 99999999]
df['Current Loan Amount'] = df['Current Loan Amount'].replace({99999999:x['Current Loan Amount'].median()})
print df['Current Loan Amount'].value_counts()
99999999    35210
9793           50
9820           48
5920           43
10192          43
10096          42
9781           42
10132          42
10025          42
10137          42
9941           42
10226          41
10211          41
9839           41
8154           41
9981           40
9984           40
9851           40
9827           40
10214          40
9767           40
7970           40
6122           39
7938           39
10085          39
9947           39
10248          39
9931           39
9826           39
9980           39
            ...
19943           1
23378           1
19773           1
25429           1
31574           1
30394           1
20157           1
22206           1
26009           1
12387           1
30778           1
21791           1
26680           1
2100            1
22462           1
20413           1
31318           1
27258           1
32154           1
26137           1
20285           1
30334           1
30698           1
17105           1
25211           1
23250           1
1557            1
26082           1
24255           1
21933           1
Name: Current Loan Amount, Length: 27306, dtype: int64

11968.0    35234
9793.0        50
9820.0        48
5920.0        43
10192.0       43
10132.0       42
10137.0       42
9941.0        42
9781.0        42
10096.0       42
10025.0       42
10226.0       41
9839.0        41
10211.0       41
8154.0        41
9981.0        40
7970.0        40
10214.0       40
9851.0        40
9827.0        40
9984.0        40
9767.0        40
6122.0        39
10248.0       39
9816.0        39
9931.0        39
9947.0        39
9826.0        39
7938.0        39
10085.0       39
           ...
27092.0        1
2868.0         1
17462.0        1
29329.0        1
25871.0        1
24136.0        1
19831.0        1
31501.0        1
25740.0        1
6670.0         1
12302.0        1
32487.0        1
24327.0        1
21772.0        1
17464.0        1
23313.0        1
18181.0        1
16327.0        1
1455.0         1
27368.0        1
1235.0         1
29677.0        1
18143.0        1
22479.0        1
8381.0         1
31502.0        1
26622.0        1
1869.0         1
15705.0        1
20165.0        1
Name: Current Loan Amount, Length: 27305, dtype: int64
In [18]:
# I am assuming the credit scores > 1000 are a typo where they added a zero to the end of them
# Removing the last 0 in (4) digit credit scores

df['Credit Score'] = df['Credit Score'].apply(lambda x: x/10 if x>=1000 else x)
print df['Credit Score'].unique(), '\n'

# Fico Credit Scores range from 300 to 850. Checking to make sure the values fall within this range.
count = 0
for score in df['Credit Score']:
    if (score<300)|(score>850):
        print score
        count +=1
print count

df['Credit Score'].fillna(df['Credit Score'].median(), inplace=True)
[ 741.  734.  747.  746.  716.  664.   nan  733.  742.  701.  744.  729.
  745.  728.  724.  692.  739.  740.  607.  735.  623.  750.  737.  636.
  736.  715.  727.  743.  732.  751.  703.  707.  665.  730.  717.  718.
  667.  748.  738.  699.  700.  675.  725.  676.  711.  706.  679.  705.
  713.  673.  722.  704.  656.  697.  709.  749.  726.  609.  662.  693.
  687.  720.  712.  708.  710.  680.  657.  714.  678.  685.  723.  672.
  653.  719.  731.  690.  671.  650.  696.  660.  721.  615.  640.  677.
  669.  695.  612.  702.  683.  684.  698.  666.  606.  674.  655.  625.
  652.  691.  689.  649.  682.  668.  603.  686.  694.  670.  642.  661.
  622.  635.  632.  688.  610.  663.  658.  613.  651.  637.  681.  616.
  602.  639.  638.  643.  601.  619.  641.  645.  605.  631.  654.  648.
  628.  644.  633.  647.  624.  646.  659.  621.  611.  618.  595.  591.
  594.  620.  630.  626.  599.  627.  614.  597.  629.  604.  587.  593.
  634.  586.  592.  608.  617.  589.  600.  585.  596.  598.  588.  590.]

0
In [19]:
# Fill 'n/a' in Years in current job
# Assuming 'n/a' means they do not currently have a job

print df['Years in current job'].value_counts(), '\n'
df['Years in current job'] = df['Years in current job'].replace({'n/a':'No Job'})
print df['Years in current job'].value_counts()
10+ years    66711
2 years      19831
< 1 year     17544
3 years      17428
5 years      14987
1 year       14130
4 years      13632
6 years      12230
7 years      11713
8 years      10232
n/a           8990
9 years       8272
Name: Years in current job, dtype: int64

10+ years    66711
2 years      19831
< 1 year     17544
3 years      17428
5 years      14987
1 year       14130
4 years      13632
6 years      12230
7 years      11713
8 years      10232
No Job        8990
9 years       8272
Name: Years in current job, dtype: int64
In [20]:
# Fill Annual Income

print df['Annual Income'].median()
df['Annual Income'].fillna(df['Annual Income'].median(), inplace=True)
62105.0
In [21]:
# Purpose has (2) values of 'other' & 'Other'
# Convert 'Other' values to 'other'

df['Purpose'] = df['Purpose'].apply(lambda x: x.lower() if x=='Other' else x)
print df['Purpose'].unique()
['Debt Consolidation' 'other' 'Business Loan' 'Home Improvements'
 'Buy House' 'Buy a Car' 'Medical Bills' 'Take a Trip'
 'Educational Expenses']
In [22]:
# For 'Months since last delinquent', fill null values (those who have never been delinquent on a payment) with max + 1
# A 'Been Delinquent?' (Y/N) flag is created from this later in the Feature Engineering section

monthsSinceLastDelinquentMaxPlus1 = df['Months since last delinquent'].max()+1
df['Months since last delinquent'].fillna(monthsSinceLastDelinquentMaxPlus1, inplace=True)
print df['Months since last delinquent'].max()
177.0
In [23]:
# Maximum Open Credit Has Nulls as '#VALUE!'

print len(df[df['Maximum Open Credit']=='#VALUE!']), '\n'
print df['Maximum Open Credit'].value_counts(), '\n'
df['Maximum Open Credit'].replace(to_replace='#VALUE!',value=0,inplace=True)
2

0         1381
0          206
15662       15
10198       15
10597       14
9501        14
15348       14
14799       14
12202       14
8589        14
8792        14
11128       14
21523       14
7067        14
13449       13
10440       13
12195       13
12051       13
10648       13
16303       13
16861       13
10694       13
16709       13
21517       13
14770       13
11345       13
13376       13
19467       13
11553       13
18386       13
          ...
7444         1
65548        1
41301        1
38879        1
65557        1
65509        1
15649        1
26679        1
65561        1
7446         1
65565        1
26678        1
65542        1
65541        1
65539        1
65538        1
65531        1
327674       1
7181         1
21367        1
36963        1
65521        1
28192        1
65517        1
28196        1
28197        1
28194        1
65511        1
65510        1
34645        1
Name: Maximum Open Credit, Length: 87024, dtype: int64

In [24]:
# Replace null Bankruptcy values with 0

df['Bankruptcies'].fillna(0,inplace=True)
In [25]:
# Replace null Tax Lien values with 0

df['Tax Liens'].fillna(0,inplace=True)
In [26]:
df.isnull().sum()
Out[26]:
Loan Status                     0
Current Loan Amount             0
Term                            0
Credit Score                    0
Years in current job            0
Home Ownership                  0
Annual Income                   0
Purpose                         0
Monthly Debt                    0
Years of Credit History         0
Months since last delinquent    0
Number of Open Accounts         0
Number of Credit Problems       0
Current Credit Balance          0
Maximum Open Credit             0
Bankruptcies                    0
Tax Liens                       0
dtype: int64

Feature Engineering

In [27]:
# Creating 'Has Job?' Feature
# 1 if have job, else 0

df['Has Job?'] = df['Years in current job'].apply(lambda x: 0 if x == 'No Job' else 1)
In [28]:
# Creating 'Been Delinquent?' Feature
# 1 if been delinquent, else 0

df['Been Delinquent?'] = df['Months since last delinquent'].apply(lambda x: 0 if x == monthsSinceLastDelinquentMaxPlus1 else 1)
In [29]:
# Creating 'Credit Problem?' Feature
# 1 if had Credit Problem, else 0

df['Credit Problem?'] = df['Number of Credit Problems'].apply(lambda x: 0 if x == 0 else 1)
In [30]:
# Creating 'Been Bankrupt?' Feature
# 1 if been Bankrupt, else 0

df['Been Bankrupt?'] = df['Bankruptcies'].apply(lambda x: 0 if x == 0 else 1)
In [31]:
# Creating 'Had Tax Lien?' Feature
# 1 if Had Tax Lien, else 0

df['Had Tax Lien?'] = df['Tax Liens'].apply(lambda x: 0 if x==0 else 1)
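The same five flags can be built without apply by comparing the whole column at once and casting the resulting booleans to integers. A sketch of the equivalent vectorized version:

# Vectorized equivalents of the .apply() calls above (same results expected)
df['Has Job?'] = (df['Years in current job'] != 'No Job').astype(int)
df['Been Delinquent?'] = (df['Months since last delinquent'] != monthsSinceLastDelinquentMaxPlus1).astype(int)
df['Credit Problem?'] = (df['Number of Credit Problems'] > 0).astype(int)
df['Been Bankrupt?'] = (df['Bankruptcies'] > 0).astype(int)
df['Had Tax Lien?'] = (df['Tax Liens'] > 0).astype(int)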

Convert to Numbers

In [32]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 215700 entries, 0 to 256983
Data columns (total 22 columns):
Loan Status                     215700 non-null object
Current Loan Amount             215700 non-null float64
Term                            215700 non-null object
Credit Score                    215700 non-null float64
Years in current job            215700 non-null object
Home Ownership                  215700 non-null object
Annual Income                   215700 non-null float64
Purpose                         215700 non-null object
Monthly Debt                    215700 non-null object
Years of Credit History         215700 non-null float64
Months since last delinquent    215700 non-null float64
Number of Open Accounts         215700 non-null int64
Number of Credit Problems       215700 non-null int64
Current Credit Balance          215700 non-null int64
Maximum Open Credit             215700 non-null object
Bankruptcies                    215700 non-null float64
Tax Liens                       215700 non-null float64
Has Job?                        215700 non-null int64
Been Delinquent?                215700 non-null int64
Credit Problem?                 215700 non-null int64
Been Bankrupt?                  215700 non-null int64
Had Tax Lien?                   215700 non-null int64
dtypes: float64(7), int64(8), object(7)
memory usage: 37.9+ MB
In [33]:
# For Monthly Debt, Remove '$' and ',' Characters

df['Monthly Debt'] = df['Monthly Debt'].str.replace('$','')
df['Monthly Debt'] = df['Monthly Debt'].str.replace(',','')
df['Monthly Debt'] = map(float, df['Monthly Debt'])
In [34]:
df['Maximum Open Credit'] = map(int, df['Maximum Open Credit'])
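A pandas-native alternative to the str.replace + map() calls above (a sketch meant to replace the two cells above, not to run after them, since Monthly Debt is already a float at this point):

# Strip the '$' prefix and thousands separators, then let pandas handle the numeric cast
df['Monthly Debt'] = pd.to_numeric(df['Monthly Debt'].str.replace(',', '').str.lstrip('$'))
df['Maximum Open Credit'] = pd.to_numeric(df['Maximum Open Credit'])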

Data Exploration

In [35]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 215700 entries, 0 to 256983
Data columns (total 22 columns):
Loan Status                     215700 non-null object
Current Loan Amount             215700 non-null float64
Term                            215700 non-null object
Credit Score                    215700 non-null float64
Years in current job            215700 non-null object
Home Ownership                  215700 non-null object
Annual Income                   215700 non-null float64
Purpose                         215700 non-null object
Monthly Debt                    215700 non-null float64
Years of Credit History         215700 non-null float64
Months since last delinquent    215700 non-null float64
Number of Open Accounts         215700 non-null int64
Number of Credit Problems       215700 non-null int64
Current Credit Balance          215700 non-null int64
Maximum Open Credit             215700 non-null int64
Bankruptcies                    215700 non-null float64
Tax Liens                       215700 non-null float64
Has Job?                        215700 non-null int64
Been Delinquent?                215700 non-null int64
Credit Problem?                 215700 non-null int64
Been Bankrupt?                  215700 non-null int64
Had Tax Lien?                   215700 non-null int64
dtypes: float64(8), int64(9), object(5)
memory usage: 37.9+ MB
In [36]:
corr = df.corr()

fig, ax = plt.subplots(figsize=(14,14))
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values, ax=ax)
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x2071a978>
In [37]:
def plotStackedBars(xbars, ybars, df):

    plot_df = pd.DataFrame()

    for value in df[xbars].unique():
        x = df[df[xbars]==value][ybars].value_counts()
        plot_df = plot_df.append(x)

    plot_df.index = df[xbars].unique()

    return plot_df
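The same stacked counts can be produced directly with pd.crosstab; a sketch of the equivalent call for the Purpose plot that follows:

# Rows are the x-axis categories, columns are the Loan Status values, cell counts drive the stacked bars
pd.crosstab(df['Purpose'], df['Loan Status']).plot(kind='bar', stacked=True, rot=0, figsize=(14,6), color=['red','green'])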
In [38]:
# Purpose

plotStackedBars('Purpose','Loan Status',df).plot(kind='bar', stacked=True, rot=0, figsize=(14,6), color = ['red', 'green'])
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f8d37f0>
In [39]:
# Years in current job

plotStackedBars('Years in current job','Loan Status',df).plot(kind='bar', stacked=True, rot=0, figsize=(14,6), color = ['red', 'green'])
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x268099e8>
In [40]:
# Term

plotStackedBars('Term','Loan Status',df).plot(kind='bar', stacked=True, rot=0, figsize=(14,6), color = ['red', 'green'])
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x2510db00>
In [41]:
fig = plt.figure(figsize=(10,6))
plt.hist([df[df['Loan Status'] == 'Charged Off']['Credit Score'], df[df['Loan Status'] == 'Fully Paid']['Credit Score']], stacked=True, color= ['red', 'green'], bins=30)
plt.xlabel('Credit Score')
plt.ylabel('Number of Loans')
plt.legend()
plt.show()
C:\Users\nwerner\AppData\Local\Continuum\Anaconda2\lib\site-packages\matplotlib\axes\_axes.py:545: UserWarning: No labelled objects found. Use label='...' kwarg on individual plots.
  warnings.warn("No labelled objects found. "
In [42]:
fig = plt.figure(figsize=(10,6))
plt.hist([df[df['Loan Status'] == 'Charged Off']['Monthly Debt'], df[df['Loan Status'] == 'Fully Paid']['Monthly Debt']], stacked=True, bins=50, color= ['red', 'green'])
plt.xlabel('Monthly Debt')
plt.ylabel('Number of Loans')
plt.legend()
plt.show()

Get Dummies

In [43]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 215700 entries, 0 to 256983
Data columns (total 22 columns):
Loan Status                     215700 non-null object
Current Loan Amount             215700 non-null float64
Term                            215700 non-null object
Credit Score                    215700 non-null float64
Years in current job            215700 non-null object
Home Ownership                  215700 non-null object
Annual Income                   215700 non-null float64
Purpose                         215700 non-null object
Monthly Debt                    215700 non-null float64
Years of Credit History         215700 non-null float64
Months since last delinquent    215700 non-null float64
Number of Open Accounts         215700 non-null int64
Number of Credit Problems       215700 non-null int64
Current Credit Balance          215700 non-null int64
Maximum Open Credit             215700 non-null int64
Bankruptcies                    215700 non-null float64
Tax Liens                       215700 non-null float64
Has Job?                        215700 non-null int64
Been Delinquent?                215700 non-null int64
Credit Problem?                 215700 non-null int64
Been Bankrupt?                  215700 non-null int64
Had Tax Lien?                   215700 non-null int64
dtypes: float64(8), int64(9), object(5)
memory usage: 37.9+ MB
In [44]:
d = {'Fully Paid':1,'Charged Off':0}

df['Loan Status'] = df['Loan Status'].map(d)
In [45]:
print df['Loan Status'].value_counts()
1    176191
0     39509
Name: Loan Status, dtype: int64
In [46]:
df = pd.get_dummies(df, columns = ['Term', 'Years in current job', 'Home Ownership', 'Purpose'], drop_first=True)
df.head()
Out[46]:
Loan Status Current Loan Amount Credit Score Annual Income Monthly Debt Years of Credit History Months since last delinquent Number of Open Accounts Number of Credit Problems Current Credit Balance ... Home Ownership_Own Home Home Ownership_Rent Purpose_Buy House Purpose_Buy a Car Purpose_Debt Consolidation Purpose_Educational Expenses Purpose_Home Improvements Purpose_Medical Bills Purpose_Take a Trip Purpose_other
0 1 11520.0 741.0 33694.0 584.03 12.3 41.0 10 0 6760 ... 0 0 0 0 1 0 0 0 0 0
1 1 3441.0 734.0 42269.0 1106.04 26.3 177.0 17 0 6262 ... 0 0 0 0 0 0 0 0 0 1
2 1 21029.0 747.0 90126.0 1321.85 28.8 177.0 5 0 20967 ... 0 0 0 0 1 0 0 0 0 0
3 1 18743.0 747.0 38072.0 751.92 26.2 177.0 9 0 22529 ... 1 0 0 0 1 0 0 0 0 0
4 1 11731.0 746.0 50025.0 355.18 11.5 177.0 12 0 17391 ... 0 1 0 0 1 0 0 0 0 0

5 rows × 41 columns

Algorithm Training

In [47]:
y = df['Loan Status']
X = df.drop(['Loan Status'], axis=1)
In [48]:
X_train, X_test, y_train,y_test = tts(X,y,train_size=0.8, random_state=17)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
Out[48]:
((172560, 40), (172560L,), (43140, 40), (43140L,))
In [49]:
# Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn import grid_search

logRegModel = LogisticRegression().fit(X_train, y_train)
predictionLR = logRegModel.predict(X_test)

accScoreLR = accuracy_score(y_test, predictionLR)
f1ScoreLR = f1_score(y_test, predictionLR)

print 'Logistic Regression Accuracy Score:', accScoreLR
print 'Logistic Regression F1 Score:', f1ScoreLR
C:\Users\nwerner\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
Logistic Regression Accuracy Score: 0.815345387112
Logistic Regression F1 Score: 0.898281277932
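The grid_search import above is never actually used; a minimal sketch of how it could tune the regularization strength C for the logistic regression (the parameter values here are arbitrary placeholders, and sklearn's newer model_selection.GridSearchCV works the same way):

paramGridLR = {'C': [0.01, 0.1, 1, 10]}

gridLR = grid_search.GridSearchCV(LogisticRegression(), paramGridLR, scoring='f1', cv=3)
gridLR.fit(X_train, y_train)

print 'Best C:', gridLR.best_params_['C']
print 'Best CV F1 Score:', gridLR.best_score_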
In [50]:
# Gradient Boosting

from sklearn.ensemble import GradientBoostingClassifier as GBC

clfGB = GBC(n_estimators=10).fit(X_train, y_train)
predictionGB = clfGB.predict(X_test)

accScoreGB = accuracy_score(y_test, predictionGB)
f1ScoreGB = f1_score(y_test, predictionGB)

print 'Gradient Boosting Accuracy Score:', accScoreGB
print 'Gradient Boosting F1 Score:', f1ScoreGB
Gradient Boosting Accuracy Score: 0.815368567455
Gradient Boosting F1 Score: 0.898295345719
In [51]:
# Random Forest

from sklearn.ensemble import RandomForestClassifier as RFC

clfRF = RFC(n_estimators=10).fit(X_train, y_train)
predictionRF = clfRF.predict(X_test)

accScoreRF = accuracy_score(y_test, predictionRF)
f1ScoreRF = f1_score(y_test, predictionRF)

print 'Random Forest Accuracy Score:', accScoreRF
print 'Random Forest F1 Score:', f1ScoreRF
Random Forest Accuracy Score: 0.803407510431
Random Forest F1 Score: 0.884820664648

Confusion Matrix

In [52]:
from sklearn.metrics import confusion_matrix
import itertools

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion Matrix', cmap=plt.cm.Blues):

    # This function plots the confusion matrix and prints its values
    # Normalization can be applied by setting normalize=True

    if normalize:
        cm = cm.astype(float)/cm.sum(axis=1)[:, np.newaxis]
        print 'Normalized Confusion Matrix'
    else:
        print 'Confusion Matrix without Normalization'

    print cm

    # Draw after normalization so the colors and text annotations match the printed matrix
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max()/2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i,j], horizontalalignment='center', color='white' if cm[i,j] > thresh else 'black')

    plt.tight_layout()
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
In [53]:
cnf_matrix = confusion_matrix(y_test, predictionLR)
np.set_printoptions(precision=2)

plt.figure()

class_names = ['Loan Default', 'Fully Paid']
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True, title='Normalized Confusion Matrix')

plt.show()
Normalized Confusion Matrix
[[  0.00e+00   1.00e+00]
 [  2.84e-05   1.00e+00]]
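The normalized matrix shows the model predicting 'Fully Paid' for essentially every loan, which is what roughly 81.5% accuracy on a target split of 176,191 vs 39,509 corresponds to. One common mitigation is to re-fit with class weighting; a minimal sketch using LogisticRegression's class_weight='balanced' option (the accuracy/recall trade-off is not evaluated here):

# Re-fit with balanced class weights so charged-off loans are not swamped by the majority class
logRegBalanced = LogisticRegression(class_weight='balanced').fit(X_train, y_train)
predictionBalanced = logRegBalanced.predict(X_test)

plt.figure()
plot_confusion_matrix(confusion_matrix(y_test, predictionBalanced), classes=class_names,
                      normalize=True, title='Normalized Confusion Matrix (Balanced Weights)')
plt.show()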

ROC Curve

In [54]:
from sklearn import metrics

predsLR = logRegModel.predict_proba(X_test)[:,1]
fpr, tpr, _ =  metrics.roc_curve(y_test, predsLR)

roc_auc = metrics.auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0,1], [0,1], 'r--')
plt.xlim([0,1])
plt.ylim([0,1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
In [55]:
from ggplot import *

predsLR = logRegModel.predict_proba(X_test)[:,1]
fpr, tpr, _ =  metrics.roc_curve(y_test, predsLR)

ROCdfLR = pd.DataFrame(dict(fpr=fpr, tpr=tpr))

ggplot(ROCdfLR, aes(x='fpr', y='tpr')) + geom_line() + geom_abline(linetype='dashed') + ggtitle("ROC Curve")
C:\Users\nwerner\AppData\Local\Continuum\Anaconda2\lib\site-packages\ggplot\utils.py:81: FutureWarning: pandas.tslib is deprecated and will be removed in a future version.
You can access Timestamp as pandas.Timestamp
  pd.tslib.Timestamp,
C:\Users\nwerner\AppData\Local\Continuum\Anaconda2\lib\site-packages\ggplot\stats\smoothers.py:4: FutureWarning: The pandas.lib module is deprecated and will be removed in a future version. These are private functions and can be accessed from pandas._libs.lib instead
  from pandas.lib import Timestamp
C:\Users\nwerner\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\compat\pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools
Out[55]:
<ggplot: (56592185)>
In [56]:
auc = metrics.auc(fpr, tpr)

ggplot(ROCdfLR, aes(x='fpr', y='tpr', ymin='0', ymax='tpr')) + geom_area(alpha=.2) + geom_line(y='tpr') + ggtitle("ROC Curve w/ AUC= " + str(auc))
Out[56]:
<ggplot: (58305407)>
In [ ]:
