k = number of clusters. The algorithm places k centroids, assigns each data point to its nearest centroid, recomputes each centroid as the mean of the points assigned to it, and repeats these two steps until the centroids stop moving or no data points switch between clusters.
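The assign-and-update loop described above can be sketched in plain NumPy. This is only an illustration, not scikit-learn's implementation: the function name `kmeans`, the toy points, and the choice of starting centroids are all assumptions made for the example.

```python
import numpy as np

def kmeans(points, init_centroids, max_iter=100):
    """Minimal k-means sketch: assign points, recompute means, repeat."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Each point joins the cluster of its nearest centroid
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it if the cluster is empty)
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy data: two small groups of three points each
toy_points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
# Start both centroids inside the first group to show that they still separate
centroids, labels = kmeans(toy_points, init_centroids=toy_points[[0, 1]])
print(centroids)
print(labels)
```

The starting centroids matter in general: a poor start can leave k-means in a worse local solution, which is one reason scikit-learn's KMeans tries several initializations.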
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline
data = pd.read_excel('cars.xls')
data.head()
x = data[['Price','Mileage','Cylinder']]
x.head()
model = KMeans(n_clusters = 5)
model
model = model.fit(x)
pred = model.predict(x)
pred
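# Store the labels in a copy so the label column does not leak into x, which is reused as the feature matrix below
clustered = x.copy()
clustered['cluster'] = pred
clustered.head()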
The silhouette value measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, the clustering configuration is appropriate. If many points have a low or negative value, the clustering configuration may have too many or too few clusters. The Silhouette Coefficient for a sample is (b-a)/max(a,b), where a is the mean distance from the sample to the other points in its own cluster and b is the mean distance from the sample to the points in the nearest neighboring cluster.
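As a quick check of the (b-a)/max(a,b) formula, the coefficient for one sample can be worked out by hand and compared with scikit-learn's `silhouette_samples`. The four 1-D points and their labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Hypothetical 1-D data: two tight, well-separated clusters
X = np.array([[0.0], [1.0], [5.0], [6.0]])
labels = np.array([0, 0, 1, 1])

# Coefficient for the first sample (the point at 0), computed by hand
a = np.mean([abs(0.0 - 1.0)])                  # mean distance to the rest of its own cluster
b = np.mean([abs(0.0 - 5.0), abs(0.0 - 6.0)])  # mean distance to the nearest other cluster
manual = (b - a) / max(a, b)

# The same value as computed by scikit-learn
library = silhouette_samples(X, labels)[0]
print('manual = {:.4f}, sklearn = {:.4f}'.format(manual, library))
```

`silhouette_score` is the mean of these per-sample values over the whole data set.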
silhouette_score(x,pred)
# Visualize the data clusters
plt.figure(figsize=(10,6))
plt.scatter(x.Mileage,x.Price,c=pred)
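# Label the axes; plt.legend() is dropped because the scatter has no labeled artists, and the colorbar already identifies the clusters
plt.xlabel('Mileage')
plt.ylabel('Price')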
plt.colorbar()
plt.show()
model.cluster_centers_
model.labels_
# Loop over a range of cluster counts and report which count gives the best silhouette score,
# and therefore the best model by this measure
silScores = []
def clust(clusters):
    bestClustNum = 0
    silScore = -1  # lowest possible silhouette score
    # Try k = 2 .. clusters (the silhouette score needs at least 2 clusters)
    for num_of_cluster in range(2, clusters + 1):
        model = KMeans(n_clusters=num_of_cluster)
        model = model.fit(x)
        pred = model.predict(x)
        score = silhouette_score(x,pred)
        print('Number of Clusters {}, silhouette score {}'.format(num_of_cluster,score))
        if score > silScore:
            silScore = score
            bestClustNum = num_of_cluster
        silScores.append(score)
    print('The best number of Clusters is {} with a silhouette score of {}'.format(bestClustNum,silScore))
clust(20)
silScores
k-nearest neighbors classifies a point by finding the k training points closest to it and assigning the class that is most common among those neighbors. It is highly recommended to scale or normalize your data when using k-nearest neighbors so that features with large numeric ranges do not carry significantly more weight in the distance calculation than others.
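The idea can be sketched from scratch in a few lines: standardize the features, measure distances, and take a majority vote. The function name `knn_predict`, the toy weight/age rows, and the class labels 'A' and 'B' are hypothetical and are not taken from heightweight.csv.

```python
from collections import Counter
import numpy as np

def knn_predict(train_X, train_y, query, k):
    """Minimal k-nearest-neighbors sketch with standardized features."""
    train_X = np.asarray(train_X, dtype=float)
    # Standardize with the training mean and standard deviation so that
    # no single feature dominates the Euclidean distance
    mean = train_X.mean(axis=0)
    std = train_X.std(axis=0)
    scaled_train = (train_X - mean) / std
    scaled_query = (np.asarray(query, dtype=float) - mean) / std
    # Distance from the query to every training point
    distances = np.linalg.norm(scaled_train - scaled_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical rows of (weight in lb, age in years) with made-up class labels
toy_X = [[100, 12], [105, 13], [110, 12], [150, 17], [155, 18], [160, 17]]
toy_y = ['A', 'A', 'A', 'B', 'B', 'B']
prediction = knn_predict(toy_X, toy_y, query=[108, 12.5], k=3)
print(prediction)
```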
data = pd.read_csv('heightweight.csv')
data_weight = data[['weightLb','ageYear']]
data_weight.head()
from sklearn.preprocessing import scale
data_scaled = scale(data_weight)
data_scaled
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=25)
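# Note: the model is fit on the unscaled data_weight, so the query point below is given in raw pounds and years
model = neigh.fit(data_weight,data.sex)
print('Predicted Gender = {}'.format(model.predict([[110,19.0]])))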
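The cells above compute a scaled copy of the data but fit the classifier on the unscaled columns. One possible way to apply the same scaling to both the training data and any later query point is to chain `StandardScaler` and `KNeighborsClassifier` in a scikit-learn pipeline. The sketch below uses made-up rows and labels rather than heightweight.csv, and n_neighbors=3 is chosen only because the toy set is tiny.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical rows of (weight in lb, age in years) with made-up class labels
toy_X = [[100, 12], [105, 13], [110, 12], [150, 17], [155, 18], [160, 17]]
toy_y = ['A', 'A', 'A', 'B', 'B', 'B']

# The scaler learns the mean and standard deviation during fit and reuses
# them on every query, so queries can be passed in raw units
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn_pipeline.fit(toy_X, toy_y)

predicted = knn_pipeline.predict([[108, 12.5]])
print('Predicted class = {}'.format(predicted[0]))
```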
Naive Bayes is called naive because it assumes that all features are conditionally independent of each other given the class. This assumption makes the math easier and the algorithm faster, although it is generally not true in practice. Top uses for this algorithm are spam filters and document classification.
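For text, the independence assumption means each word's count contributes to the class score on its own. A minimal sketch with `CountVectorizer` and `MultinomialNB` is shown below; the six short messages and their labels are invented for illustration and are far smaller than a real email corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training messages and labels
messages = [
    'free money now',
    'win free prize now',
    'free prize money',
    'meeting at noon tomorrow',
    'lunch tomorrow with team',
    'project meeting notes',
]
classes = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham']

# Turn each message into a vector of word counts
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(messages)

# Fit the classifier on the word counts
classifier = MultinomialNB()
classifier.fit(counts, classes)

# New messages must go through the same fitted vectorizer
examples = ['free prize', 'team meeting tomorrow']
predictions = classifier.predict(vectorizer.transform(examples))
print(list(predictions))
```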
import os
import io
import numpy as np
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
!cd
filenames = []
# os.walk yields one (root, dirnames, filenames) tuple per directory
for files in os.walk('C:\\Users\\nwerner\\DevMasters\\Class Notes\\Day #5 - Classifier Algorithms'):
    filenames.append(files)
filenames
print(os.path)
loca = 'C:\\Users\\nwerner\\AppData\\Local\\Continuum\\Anaconda2\\lib\\ntpath.pyc'
print(os.path.split(loca))
print(os.path.splitdrive(loca))
print(os.path.dirname(loca))
print(os.path.basename(loca))
def readFiles(path):
    # Walk the directory tree and yield (file path, file contents) for every file found
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            path = os.path.join(root, filename)
            lines = []
            f = io.open(path, 'r', encoding='latin1')
            for line in f:
                lines.append(line)
            f.close()
            message = '\n'.join(lines)
            yield path, message
def dataFrameFromDirectory(path, classification):
    # Build a DataFrame with one row per file, indexed by file path
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)
    return DataFrame(rows, index=index)
from pandas import concat
# Combine the spam and ham frames; concat is used because DataFrame.append was removed in pandas 2.0
data = concat([
    dataFrameFromDirectory('C:\\Users\\nwerner\\DevMasters\\Class Notes\\Day #5 - Classifier Algorithms\\Emails\\spam','spam'),
    dataFrameFromDirectory('C:\\Users\\nwerner\\DevMasters\\Class Notes\\Day #5 - Classifier Algorithms\\Emails\\ham','ham'),
])
data