Topics
If any NLTK resources are missing (nltk.xxxx errors), call nltk.download().
You can either download the missing resources individually or download all packages at once.
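A minimal sketch of both options (download whichever resource the error message names; 'punkt' is just an example):
import nltk
nltk.download('punkt')   # download a single resource, e.g. the Punkt sentence tokenizer models
# nltk.download()        # or open the interactive downloader and grab everything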
from nltk.tokenize import word_tokenize
word_tokenize("Hello world")
message = 'I am Nolan Werner, from west africa'
message.split()
word_tokenize("This's a car")
import nltk
from nltk.tokenize import sent_tokenize
# sent_tokenize splits the text into a list of sentences
text="Welcome readers. I hope you find it interesting. Please do reply"
print(sent_tokenize(text))
# As usual, we use loops to do a lot of things. But list comprehensions make
# things much easier for us, and usually read more cleanly.
# Suppose that we have a list of numbers, i.e.
integerNumbers = [0,1,2,3,4,5,6,7,8,9]
# Create a list that contains the square of each element
size = len(integerNumbers)
reservoir = [0]*size # need this to be populated
for i in range(len(integerNumbers)):
    reservoir[i] = integerNumbers[i]**2
reservoir
# The same result could also have been obtained by just appending
reservoir = []
for i in range(size):
    reservoir.append(integerNumbers[i]**2)
reservoir
# The same job with list comprehension
[i**2 for i in range(10)]
# list comprehension selecting the even numbers from 0 to 9
[i for i in range(10) if i%2==0]
# two for-clauses in one comprehension: all (x, y) pairs
[(x,y)
 for x in range(2)
 for y in ['a','b','c']]
# conditional if can be used within a list comprehension
even_numbers = [val for val in range(10) if val%2==0]
even_numbers
[i**2 if i%2==0 else i**3 for i in range(10)] # Please take a minute and
# see what is going on here
vowels = ['a', 'e', 'i', 'o', 'u']
[ch.upper() if ch in vowels else ch.lower() for ch in 'africa']
# dictionary comprehension
{i:i**2 for i in range(10)}
# Take a list of words and map each word to its length.
wordList = ['Moussa','John','Aristide','Abraham','Obama']
{word:len(word) for word in wordList}
# map takes a function and an iterable and applies the function to each
# element of the iterable. In Python 3 it returns an iterator, so wrap it
# in list() to see the results.
def square(x): return x**2
list(map(square,range(10)))
list(map(lambda x: x**2,range(10)))
list(map(lambda word:len(word),'Arshad is very smart'.split()))
# split breaks the string into a list of words
# filter, as its name suggests, selects the elements that satisfy a given
# condition
even_numbers = list(filter(lambda x:x%2==0, range(10)))
even_numbers
word_with_len_2 = list(filter(lambda w:len(w)==2,['I','am','from','India']))
word_with_len_2
# reduce can be looked at as follows:
# suppose we have a list of integers [val1,val2,val3,val4,val5]
# and we aim to sum all of its elements. Here, our function is
# add (+), so the total sum is computed as:
# ((((val1+val2)+val3)+val4)+val5)
from functools import reduce # needed in Python 3 (reduce is a builtin in Python 2)
reduce(lambda x,y: x+y,[1,2,3,4,5])
# Guess what? One can initialize the sum
reduce(lambda x,y: x+y,[1,2,3,4,5],30) # here the running total starts
# at 30 instead of 0
reduce(lambda x,y:x*y,[1,2,3,4,5]) # I am multiplying all the elements of
# the list
reduce(lambda x,y:x*y,[1,2,3,4,5],30) # here the product is initialized to 30
# let's put map, filter and reduce to work together
# Suppose we have the integers from 0 to 9. Let's compute
# the sum of the squares of all the odd elements among them
reduce(lambda x,y: x+y,map(lambda x:x**2,filter(lambda x: x%2==1,range(10))))
import nltk
# word_tokenize is used to find the list of words in strings
text = nltk.word_tokenize("Pierre Vinken, 59 years old, will join as a nonexecutive director on Nov. 29.")
print(text)
# Treebank tokenizer uses regular expressions to tokenize texts
import nltk
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
print (tokenizer.tokenize("Have a nice day. I hope you find the book interesting"))
print (tokenizer.tokenize("Don't hesitate to ask questions"))
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
print (tokenizer.tokenize("Don't hesitate to ask questions"))
import nltk
from nltk.tokenize import RegexpTokenizer
sent = "She secures 90.56% in class X. She is a meritorious student"
capt = RegexpTokenizer(r'[A-Z]\w+') # grab tokens that start with a capital letter
capt.tokenize(sent)
import nltk
from nltk.tokenize import BlanklineTokenizer
sent = '''She secures

90.56% in class X.

She is a meritorious student'''
BlanklineTokenizer().tokenize(sent)
from nltk.stem import PorterStemmer
stemmerporter = PorterStemmer()
print(stemmerporter.stem('talking'))
print (stemmerporter.stem('happiness'))
print (stemmerporter.stem('happy'))
print (stemmerporter.stem('unhappy'))
print (stemmerporter.stem('ran'))
print (stemmerporter.stem('is'))
words = ['houses', 'trains', 'pens', 'cars', 'eaten','sick', 'nice', 'bought', 'selling', 'sized',
'speech', 'rolling', 'marching', 'identification', 'universal', 'beautiful', 'references', 'countries','called']
single = [stemmerporter.stem(word) for word in words]
single
import nltk
from nltk.stem import LancasterStemmer
stemmerLan = LancasterStemmer()
print (stemmerLan.stem('happiness'))
print (stemmerLan.stem('happy'))
print (stemmerLan.stem('unhappy'))
print (stemmerLan.stem('ran'))
print (stemmerLan.stem('is'))
import nltk
from nltk.stem import RegexpStemmer
stemmerreg = RegexpStemmer('ing')
print (stemmerreg.stem('working'))
print (stemmerreg.stem('happiness'))
print (stemmerreg.stem('pairing'))
import nltk
from nltk.stem import SnowballStemmer
print (SnowballStemmer.languages)
spanishstemmer = SnowballStemmer('spanish')
print (spanishstemmer.stem('comiendo'))
frenchstemmer = SnowballStemmer('french')
print (frenchstemmer.stem('manger'))
print (frenchstemmer.stem('danser'))
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer_output = WordNetLemmatizer()
print (lemmatizer_output.lemmatize('working', pos='v'))
print (lemmatizer_output.lemmatize('ran', pos='v'))
print (lemmatizer_output.lemmatize('took', pos='v'))
print (lemmatizer_output.lemmatize('is', pos='v'))
print (lemmatizer_output.lemmatize('happiness'))
print (lemmatizer_output.lemmatize('took'))
- coordinating conjunctions get mapped to CC
- adverbs get mapped to RB
- prepositions get mapped to IN
- singular nouns get mapped to NN
- adjectives get mapped to JJ
- verbs in the third-person singular present get mapped to VBZ
import nltk
from nltk import word_tokenize
text = word_tokenize("It is a pleasant day today")
nltk.pos_tag(text) # pos_tag is NLTK's part-of-speech tagger
text = word_tokenize("They buy the permit in order to be able to attend the event")
nltk.pos_tag(text)
import nltk
from nltk.tag import DefaultTagger
tag = DefaultTagger('NN') # DefaultTagger assigns the same tag ('NN' here) to every token
tag.tag(['Beautiful', 'morning'])
!pip install --trusted-host pypi.python.org autocorrect
from autocorrect import spell
spell("Tghe")
!pip install --trusted-host pypi.python.org textblob
from textblob import TextBlob
b = TextBlob("I havv good speling!")
print(b.detect_language())
print (b.correct())
from textblob import Word
w = Word('falability')
w.spellcheck()
!pip install --trusted-host pypi.python.org langdetect
from langdetect import detect
print (detect("War doesn't show who's right, just who's left."))
print (detect("Ein, zwei, drei, vier"))
print (detect("Eu gosto de mulher"))
#en_blob = TextBlob(u'Simple is better than complex.')
#en_blob.translate(to='vi') # vi stands for vietnamese
en_blob = TextBlob(u'I am a free black man loved by Jesus Christ.')
en_blob.translate(to='pt')
There are two ways to compute TF-IDF features:
- by using CountVectorizer (introduced below) and then feeding its output into TfidfTransformer.
- by directly feeding the collection of text documents to TfidfVectorizer.
Vectorization turns a collection of text documents into numerical feature vectors.
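A small sketch of both routes on a toy corpus (variable names are just illustrative); with default settings they produce the same matrix:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
docs = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
# route 1: raw token counts first, then TF-IDF weighting on top
counts = CountVectorizer().fit_transform(docs)
tfidf_1 = TfidfTransformer().fit_transform(counts)
# route 2: go straight from raw text to TF-IDF features
tfidf_2 = TfidfVectorizer().fit_transform(docs)
print(np.allclose(tfidf_1.toarray(), tfidf_2.toarray())) # True: both routes agree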
# Questions?
# what is the difference between fit(), transform(), and fit_transform()?
# fit(): learns the model parameters (e.g. the vocabulary) from the training data
# transform(): applies the parameters learned by fit() to produce the
#   transformed data set
# fit_transform(): combines fit() and transform() on the same data set
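# A tiny sketch of the difference (toy strings, purely illustrative):
from sklearn.feature_extraction.text import CountVectorizer
train_docs = ['call me tonight', 'call me a cab']
new_docs = ['please call me']
cv = CountVectorizer()
cv.fit(train_docs)                     # fit(): learn the vocabulary from the training data
X_train = cv.transform(train_docs)     # transform(): encode documents with the learned vocabulary
X_new = cv.transform(new_docs)         # words unseen during fit ('please') are simply ignored
X_again = cv.fit_transform(train_docs) # fit_transform(): both steps at once on the same data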
import numpy as np
import scipy as sp
import pandas as pd
# we need to import and instantiate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
simple_train = ['Call you tonight', 'Call me a cab', 'please call me... PLEASE!']
vect=CountVectorizer() # CountVectorizer converts a collection of text documents
# to a matrix of token counts. The output is a sparse matrix, i.e. a matrix that
# only stores its nonzero entries (most counts are zero, so this saves memory)
tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())
# learn the vocabulary from the documents and build the document-term matrix
tf
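# To see the sparse representation itself, print the matrix without .toarray()
# (a quick illustration; the exact display depends on the scipy version):
print(vect.fit_transform(simple_train)) # each line reads '(row, column)  count' for a nonzero entry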
# we can see that it is not displaying the word 'a'. This is because the default
# token_pattern only keeps tokens of two or more characters, so single-character words are dropped
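# A sketch of how to keep single-character tokens, by overriding the default token_pattern
# (vect_single is just an illustrative name):
vect_single = CountVectorizer(token_pattern=r'(?u)\b\w+\b') # one or more word characters per token
pd.DataFrame(vect_single.fit_transform(simple_train).toarray(), columns=vect_single.get_feature_names())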
vect = CountVectorizer(binary=True)
df = vect.fit_transform(simple_train).toarray().sum(axis=0) # axis=0 sums over the documents (rows),
# so with binary=True each entry is the number of documents containing that term
pd.DataFrame(df.reshape(1,6), columns=vect.get_feature_names())
# df above is the document frequency of each term
tf/df # dividing term frequency by document frequency is a crude form of tf-idf:
# terms that appear in many documents get down-weighted
vect = TfidfVectorizer()
pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1) # min_df is a threshold: while building
# the vocabulary, ignore terms whose document frequency is strictly lower
# than the given value (min_df=1 keeps everything)
print (vectorizer)
corpus = ['This is the first document','This is the second second document', 'And the third one', 'Is this the first document?']
X = vectorizer.fit_transform(corpus)
tf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
tf
print (X)
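# A quick sketch of what min_df actually removes (vectorizer2 is just an illustrative name):
# with min_df=2, terms that occur in only one document ('and', 'one', 'second', 'third') drop out
vectorizer2 = CountVectorizer(min_df=2)
vectorizer2.fit_transform(corpus)
print (vectorizer2.get_feature_names()) # only terms appearing in at least two documents remain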
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split # (sklearn.cross_validation in older scikit-learn versions)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline
yelp = pd.read_csv('yelp.csv')
yelp.head()
yelp.info()
#create a new DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]
yelp_best_worst.reset_index(drop=True, inplace=True) # reset the indices. And instead of
# creating another data frame, let's just do it inplace
x = yelp_best_worst.text #reviews
y = yelp_best_worst.stars #ratings
# print(x) to take a look at the reviews
# print(y) to take a look at the ratings
print (x.shape)
#split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x,y,random_state=1)
print (x)
# use CountVetorizer to create document-term matrices from x_train and x_test
vect = CountVectorizer()
x_train_dtm = vect.fit_transform(x_train) # learn the vocabulary and create the document-term matrix
print (x_train_dtm)
#print (x_train_dtm.shape)
x_test_dtm= vect.transform(x_test)
#print x_test_dtm
#x_test_dtm.shape
print (x_test)
tf = pd.DataFrame(x_train_dtm.toarray(), columns=vect.get_feature_names())
tf.head()
x_train.head()
#don't lowercase
vect = CountVectorizer(lowercase=False)
x_train_dtm = vect.fit_transform(x_train)
x_train_dtm.shape
# include 1-grams and 2-grams (an n-gram is a contiguous sequence of n adjacent words
# or letters taken from your source text)
vect = CountVectorizer(ngram_range=(1,2))
x_train_dtm = vect.fit_transform(x_train)
x_train_dtm.shape
print (vect.get_feature_names()[-50:]) # The last 50 words
vect = CountVectorizer()
x_train_dtm = vect.fit_transform(x_train)
x_test_dtm = vect.transform(x_test)
# Recall: fit() learns the model parameters (the vocabulary) from the training data,
# transform() applies them to produce the transformed data set, and
# fit_transform() combines both steps on the same data.
#Naive Bayes
nb = MultinomialNB()
nb.fit(x_train_dtm, y_train)
y_pred_class = nb.predict(x_test_dtm)
print (metrics.accuracy_score(y_test, y_pred_class))
y_test_binary = np.where(y_test==5, 1, 0) # recode the ratings as binary
max(y_test_binary.mean(), 1-y_test_binary.mean()) # null accuracy: always predicting the majority class
#define a function that accepts a vectorizer and calculates the accuracy
def tokenize_test(vect):
x_train_dtm = vect.fit_transform(x_train)
print ('Features: ', x_train_dtm.shape[1])
x_test_dtm = vect.transform(x_test)
nb = MultinomialNB()
nb.fit(x_train_dtm, y_train)
y_pred_class = nb.predict(x_test_dtm)
print ('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))
#include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1,2))
tokenize_test(vect)
vect = CountVectorizer()
tokenize_test(vect)
#remove English stop words
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)
# set of stop words
print (vect.get_stop_words())
#max_features
vect = CountVectorizer(stop_words='english', max_features=100)
tokenize_test(vect)
print(vect.get_feature_names())
vect = CountVectorizer(ngram_range=(1,2), max_features=100000)
tokenize_test(vect)
#min_df sets the minimum document frequency a term needs in order to enter the vocabulary
vect = CountVectorizer(ngram_range=(1,2), min_df=2)
tokenize_test(vect)
print (yelp_best_worst.text[0])
review = TextBlob(yelp_best_worst.text[0])
review.words
review.sentences
review.lower()
stemmer = SnowballStemmer('english')
print ([stemmer.stem(word) for word in review.words])
print ([word.lemmatize() for word in review.words])
#assume every word is a verb
print ([word.lemmatize(pos='v') for word in review.words])
def split_into_lemmas(text):
text = unicode(text, 'utf-8').lower() #Python 2
#text = str(text).lower() #Python 3
words = TextBlob(text).words
#return [word.lemmatize() for word in words]
return [stemmer.stem(word) for word in words]
#split review text into stems (or lemmas, if you swap in the commented-out return line) rather than into words (default)
vect = CountVectorizer(analyzer=split_into_lemmas)
tokenize_test(vect)
print (vect.get_feature_names()[-50:])
This is a repeat of the code from the TF-IDF intro section
#example documents
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
#term frequency
vect = CountVectorizer()
tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())
tf
#document frequency
vect = CountVectorizer(binary=True)
df = vect.fit_transform(simple_train).toarray().sum(axis=0)
pd.DataFrame(df.reshape(1,6), columns=vect.get_feature_names())
#term frequency- inverse document frequency (tf-idf) - Simple version
tf/df
vect = TfidfVectorizer()
pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())
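# For reference, a sketch of what TfidfVectorizer computes with its default settings
# (smooth_idf=True, norm='l2'): each raw count is weighted by
#   idf(t) = ln((1 + n_documents) / (1 + df(t))) + 1
# and each row of the resulting matrix is then scaled to unit length, which is why
# these numbers differ from the simple tf/df ratio above.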
#create a document-term matrix using TF-IDF
vect = TfidfVectorizer(stop_words='english')
dtm = vect.fit_transform(yelp.text)
features = vect.get_feature_names()
dtm.shape
def summarize():
#choose a random review that is at least 300 characters
review_length = 0
while review_length < 300:
review_id = np.random.randint(0, len(yelp))
review_text = unicode(yelp.text[review_id], 'utf-8') #Python 2
#review_text = str(yelp.text[review_id]) #Python3
review_length = len(review_text)
#create a dictionary of words and their TF-IDF scores
word_scores = {}
for word in TextBlob(review_text).words:
word = word.lower()
if word in features:
word_scores[word] = dtm[review_id, features.index(word)]
#print words with the top 5 TF-IDF scores
print ('TOP SCORING WORDS:')
top_scores = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:5]
for word, score in top_scores:
print (word)
#print the review
print ('\n' + review_text)
summarize()
print (review)
max_i = 0
max_polarity = -float('inf')
min_i = 0
min_polarity = float('inf')
for i in range(len(yelp_best_worst.text)):
review_text = unicode(yelp_best_worst.text[i], 'utf-8') #Python 2
#review_text = str(yelp_best_worst.text[i]) #Python3
this_polarity = TextBlob(review_text).sentiment.polarity
if this_polarity > max_polarity:
max_i = i
max_polarity = this_polarity
if this_polarity < min_polarity:
min_i = i
min_polarity = this_polarity
print (TextBlob(yelp_best_worst.text[max_i]))
print (TextBlob(yelp_best_worst.text[min_i]))
#polarity ranges from -1 (most negative) to 1 (most positive)
print(review.sentiment.polarity)
print(max_polarity)
print(min_polarity)
#understanding the apply method
yelp['length'] = yelp.text.apply(len)
yelp.head(10)
#define a function that accepts text and returns polarity
def detect_sentiment(text):
return TextBlob(text.decode('utf-8')).sentiment.polarity #Python 2
#return TextBlob(text).sentiment.polarity Python 3
#create a new DataFrame column for sentiment
yelp['sentiment'] = yelp.text.apply(detect_sentiment)
yelp.boxplot(column='sentiment', by='stars')
#reviews with most positive sentiment
yelp[yelp.sentiment == 1].text.head()
#reviews with most negative sentiment
yelp[yelp.sentiment == -1].text.head()
# spelling correction
TextBlob('15 minuets late').correct()
# spellcheck
Word('parot').spellcheck()
# definitions
Word('bank').define('v')
# language identification
TextBlob('Hola amigos').detect_language()
# translation (the source language is detected automatically)
TextBlob('Hola amigos').translate(from_lang='auto', to='en')
#sentiment
TextBlob('That movie was good.').sentiment