Natural Language Processing - Class Notes

Natural Language Processing

  • Is a field at the intersection of computer science, computational linguistics, and artificial intelligence that allows a computer program to understand human language as it is spoken or written.
  • Focuses on the interactions between human language and computers.
  • Allows machines to understand how humans speak.
  • Is used for sentiment analysis, topic extraction, speech tagging, relationship extraction, and stemming.
  • Is a very hard field because human speech is rarely stated precisely.

Topics

  • Sentence tokenization: sent_tokenize()
  • Word tokenization: word_tokenize()
  • Part-of-speech tagging: pos_tag()
  • Stemming (word root): from nltk.stem import SnowballStemmer - stem() finds the linguistic basis of a word
  • Lemmatization: from nltk.stem import WordNetLemmatizer - finds the conceptual (dictionary) basis of a word
  • Named Entity Recognition: from nltk.tag.stanford import NERTagger - classifies text elements into pre-defined categories
  • Spelling correction: correct()
  • Translation and language detection: from langdetect import detect - detect()
  • TextBlob: .detect_language(), .translate()
  • TF-IDF (term frequency-inverse document frequency): term freq = # of times a word appears in a document, doc freq = # of documents the word appears in

If any NLTK modules or data packages are missing (nltk.xxxx errors), call nltk.download()

You can either download the missing modules individually, or download all packages
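For example, individual resources can be fetched by name, or everything at once. This is a minimal sketch; the exact resource names depend on which NLTK features you use.

In [ ]:
import nltk

# download individual data packages...
nltk.download('punkt')                        # tokenizers used by word_tokenize / sent_tokenize
nltk.download('averaged_perceptron_tagger')   # used by pos_tag
nltk.download('wordnet')                      # used by WordNetLemmatizer
nltk.download('stopwords')                    # stop word lists

# ...or open the interactive downloader / grab everything
# nltk.download()        # opens the GUI picker
# nltk.download('all')   # downloads every package (large)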

Words!

  • In natural language processing, word forms fall into two categories:
    • inflections :
      • Adding the suffix does not change the word's grammatical category.
        • e.g. plural forms of nouns
    • derivations :
      • Adding the suffix does change the word's grammatical category.
        • e.g. beauty → beautiful, nation → national (see the sketch below)
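A minimal sketch of the distinction, assuming NLTK's WordNet data has been downloaded: the lemmatizer undoes an inflection (plural noun back to singular noun) but leaves a derived form alone, since it is already a dictionary word in its own category.

In [ ]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# inflection: 'nations' maps back to 'nation', still a noun
print(lemmatizer.lemmatize('nations'))
# derivation: 'national' is an adjective derived from 'nation' and is
# already a dictionary form, so it comes back unchanged
print(lemmatizer.lemmatize('national'))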
In [1]:
from nltk.tokenize import word_tokenize
word_tokenize("Hello world")
Out[1]:
['Hello', 'world']

What is the difference between split() and word_tokenize()?

In [1]:
message = 'I am Nolan Werner, from West Africa'
message.split()
Out[1]:
['I', 'am', 'Nolan', 'Werner,', 'from', 'West', 'Africa']
In [2]:
word_tokenize("This's a car")
Out[2]:
['This', "'s", 'a', 'car']

Sentence Tokenization

In [5]:
import nltk
from nltk.tokenize import sent_tokenize
# sent_tokenize tokenizes by sentences. It is used to find the list of sentences 
text="Welcome readers.  I hope you find it interesting.  Please do reply"
print(sent_tokenize(text))
['Welcome readers.', 'I hope you find it interesting.', 'Please do reply']

Glance at list and dictionary comprehensions

In [ ]:
# As usual, we use loops to do lot of things. But, list comprehension makes 
# things so much easier for us, mostly in a very beautiful fashion.
In [1]:
# Suppose that we have a list of numbers, ie
integerNumbers = [0,1,2,3,4,5,6,7,8,9]
In [5]:
# Create an array that contains the square of each elements
size = len(integerNumbers)
reservoir = [0]*size # need this to be populated
for i in xrange(len(integerNumbers)):
    reservoir[i] = integerNumbers[i]**2
reservoir
Out[5]:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
In [7]:
# The above work could also have been done by just appending
reservoir = []
for i in xrange(size):
    reservoir.append(integerNumbers[i]**2)

reservoir
Out[7]:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
In [ ]:
# The same job with list comprehension
In [1]:
[i**2 for i in xrange(10)]
Out[1]:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
In [3]:
# list comprehension selecting the even numbers from 0 to 9
[i for i in xrange(10) if i%2==0]
Out[3]:
[0, 2, 4, 6, 8]
In [7]:
[(x,y)
 for x in xrange(2)
 for y in['a','b','c']]
Out[7]:
[(0, 'a'), (0, 'b'), (0, 'c'), (1, 'a'), (1, 'b'), (1, 'c')]
In [ ]:
# conditional if can be used within a list comprehension
In [18]:
even_numbers = [val for val in xrange(10) if val%2==0]
even_numbers
Out[18]:
[0, 2, 4, 6, 8]
In [20]:
[i**2 if i%2==0 else i**3 for i in xrange(10)] # Please take a minute and
# see what is going on here
Out[20]:
[0, 1, 4, 27, 16, 125, 36, 343, 64, 729]
In [22]:
vowels = ['a', 'e', 'i', 'o', 'u']
[letter.upper() if letter in vowels else letter.lower() for letter in 'africa']
Out[22]:
['A', 'f', 'r', 'I', 'c', 'A']
In [ ]:
# dictionary comprehension
In [2]:
{i:i**2 for i in xrange(10)}
Out[2]:
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}
In [8]:
# Take a list words and attach to each word, its length.
wordList = ['Moussa','John','Aristide','Abraham','Obama']
{word:len(word) for word in wordList}
Out[8]:
{'Abraham': 7, 'Aristide': 8, 'John': 4, 'Moussa': 6, 'Obama': 5}

Glance at Map,Reduce, and Filter

In [ ]:
# Map takes as arguments a function and an iterable object and applies
# the function to each element of the iterable object.
# The result is a list (in Python 2; in Python 3, map returns an iterator).
In [9]:
def square(x): return x**2
In [11]:
list(map(square,xrange(10)))
Out[11]:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
In [12]:
list(map(lambda x: x**2,xrange(10)))
Out[12]:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
In [16]:
map(lambda word:len(word),'Arshad is very smart'.split())
# split takes the string and breaks it into a list of words
Out[16]:
[6, 2, 4, 5]
In [26]:
# Filter, as its name suggests, selects the elements that satisfy a given
# condition
even_numbers = list(filter(lambda x:x%2==0, xrange(10)))
even_numbers
Out[26]:
[0, 2, 4, 6, 8]
In [28]:
word_with_len_2 = filter(lambda w:len(w)==2,['I','am','from','India'])
word_with_len_2
Out[28]:
['am']
In [ ]:
# Reduce can be looked at as follows:
# suppose we have a list of integers [val1,val2,val3,val4,val5]
# and we aim to sum all the elements present in the list. Here,
# our function is add, or +. The total sum is then computed as:
# ((((val1+val2)+val3)+val4)+val5)
In [30]:
reduce(lambda x,y: x+y,[1,2,3,4,5])
Out[30]:
15
In [32]:
# Guess what? One can initialize the sum
reduce(lambda x,y: x+y,[1,2,3,4,5],30) # I suppose my total sum is 30 at 
# the beginning
Out[32]:
45
In [33]:
reduce(lambda x,y:x*y,[1,2,3,4,5]) # I am multiplying all the elements of 
# the list
Out[33]:
120
In [34]:
reduce(lambda x,y:x*y,[1,2,3,4,5],30) # In here, i initialize my product by 30
Out[34]:
3600
In [44]:
# let's put map, filter, and reduce to work together.
# Suppose we have the elements from 0 to 9. Let's compute
# the sum of the squares of all the odd elements in our list
reduce(lambda x,y: x+y,map(lambda x:x**2,filter(lambda x: x%2==1,range(10))))
Out[44]:
165

Word Tokenization

In [6]:
import nltk
# word_tokenize is used to find the list of words in strings
text = nltk.word_tokenize("PierreVinke, 59 years old, will join as a nonexecutive director on Nov. 29.")
print(text)
['PierreVinke', ',', '59', 'years', 'old', ',', 'will', 'join', 'as', 'a', 'nonexecutive', 'director', 'on', 'Nov.', '29', '.']

TreeBankWordTokenizer

In [7]:
import nltk
from nltk.tokenize import TreebankWordTokenizer
# Treebank tokenizer uses regular expressions to tokenize text
tokenizer = TreebankWordTokenizer()
print (tokenizer.tokenize("Have a nice day. I hope you find the book interesting"))
print (tokenizer.tokenize("Don't hesitate to ask questions"))
['Have', 'a', 'nice', 'day.', 'I', 'hope', 'you', 'find', 'the', 'book', 'interesting']
['Do', "n't", 'hesitate', 'to', 'ask', 'questions']

WordPunctTokenizer

In [8]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
print (tokenizer.tokenize("Don't hesitate to ask questions"))
['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions']

RegexpTokenizer

In [10]:
import nltk
from nltk.tokenize import RegexpTokenizer
sent = "She secures 90.56% in class X.  She is a meritorious student"
capt = RegexpTokenizer('[A-Z]\w+')
capt.tokenize(sent)
Out[10]:
['She', 'She']
In [12]:
import nltk
from nltk.tokenize import BlanklineTokenizer
sent = '''She secures 

90.56% in class X.  

She is a meritorious student'''
BlanklineTokenizer().tokenize(sent)
Out[12]:
['She secures', '90.56% in class X.', 'She is a meritorious student']

Stemmer & Lemmatizer

  • Stemmers and lemmatizers are two ways of handling inflections.
  • Stemming and lemmatization both tend to "normalize" words to their common base form.
  • Stemmers aim to remove the morphological affixes from words, leaving only the word stem.
  • Lemmatization brings a word to its conventional form as it appears in a dictionary (see the comparison sketch below).
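A minimal comparison sketch (assuming the WordNet corpus is available): the stemmer may produce non-words, while the lemmatizer returns dictionary forms.

In [ ]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['studies', 'happiness', 'countries']:
    # e.g. 'studies' -> stem 'studi' (not a word) vs. lemma 'study'
    print('{} -> {} | {}'.format(word, stemmer.stem(word), lemmatizer.lemmatize(word)))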

PorterStemming & PorterStemmer

  • Porter stemming is an algorithm, a collection of rules, that provides a way to better handle English inflections.
  • It is a process for removing suffixes from words in English.
  • It is very important in information retrieval.
In [12]:
from nltk.stem import PorterStemmer
stemmerporter = PorterStemmer()
print(stemmerporter.stem('talking'))
print (stemmerporter.stem('happiness'))
print (stemmerporter.stem('happy'))
print (stemmerporter.stem('unhappy'))
print (stemmerporter.stem('ran'))
print (stemmerporter.stem('is'))
talk
happi
happi
unhappi
ran
is
In [13]:
words = ['houses', 'trains', 'pens', 'cars', 'eaten','sick', 'nice', 'bought', 'selling', 'sized',
           'speech', 'rolling', 'marching', 'identification', 'universal', 'beautiful', 'references', 'countries','called']
In [14]:
single = [stemmerporter.stem(word) for word in words]
single
Out[14]:
[u'hous',
 u'train',
 u'pen',
 u'car',
 u'eaten',
 u'sick',
 u'nice',
 u'bought',
 u'sell',
 u'size',
 u'speech',
 u'roll',
 u'march',
 u'identif',
 u'univers',
 u'beauti',
 u'refer',
 u'countri',
 u'call']

LancasterStemmer

In [15]:
import nltk
from nltk.stem import LancasterStemmer
stemmerLan = LancasterStemmer()
print (stemmerLan.stem('happiness'))
print (stemmerLan.stem('happy'))
print (stemmerLan.stem('unhappy'))
print (stemmerLan.stem('ran'))
print (stemmerLan.stem('is'))
happy
happy
unhappy
ran
is

RegexpStemmer

  • Uses a regular expression to identify morphological affixes; any substring that matches the expression is automatically removed.
In [16]:
import nltk
from nltk.stem import RegexpStemmer
stemmerreg = RegexpStemmer('ing')
print (stemmerreg.stem('working'))
print (stemmerreg.stem('happiness'))
print (stemmerreg.stem('pairing'))
work
happiness
pair

SnowballStemmer

  • It contains 15 stemming algorithms (Danish, Dutch, English, Finnish, French, German, Hungarian, ...), including the original Porter stemmer.
In [13]:
import nltk
from nltk.stem import SnowballStemmer
print (SnowballStemmer.languages)
spanishstemmer = SnowballStemmer('spanish')
print (spanishstemmer.stem('comiendo'))

frenchstemmer = SnowballStemmer('french')
print (frenchstemmer.stem('manger'))
(u'danish', u'dutch', u'english', u'finnish', u'french', u'german', u'hungarian', u'italian', u'norwegian', u'porter', u'portuguese', u'romanian', u'russian', u'spanish', u'swedish')
com
mang
In [16]:
frenchstemmer = SnowballStemmer('french')
print (frenchstemmer.stem('danser'))
dans

Lemmatization

  • Relies on a vocabulary and the morphological analysis of words to normalize them properly.
  • Aims at removing inflectional endings only; its purpose is to return the base or dictionary form (the lemma) of a given word.
In [18]:
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer_output = WordNetLemmatizer()
print (lemmatizer_output.lemmatize('working', pos='v'))
print (lemmatizer_output.lemmatize('ran', pos='v'))
print (lemmatizer_output.lemmatize('took', pos='v'))
print (lemmatizer_output.lemmatize('is', pos='v'))
print (lemmatizer_output.lemmatize('happiness'))
print (lemmatizer_output.lemmatize('took'))
work
run
take
be
happiness
took

Part of speech tagging

  • Tags available at Penn Treebank
  • Tagging is the process of classifying words into their parts of speech and labeling them accordingly
  • examples :
      - coordinating conjunctions get mapped to CC
      - adverbs get mapped to RB
      - prepositions get mapped to IN
      - singular nouns get mapped to NN
      - adjectives get mapped to JJ
      - verbs (3rd person singular present) get mapped to VBZ
In [19]:
import nltk
from nltk import word_tokenize
text = word_tokenize("It is a pleasant day today")
nltk.pos_tag(text) #pos_tagger stands for part of speech tagger
Out[19]:
[('It', 'PRP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('pleasant', 'JJ'),
 ('day', 'NN'),
 ('today', 'NN')]
In [20]:
text = word_tokenize("They buy the permit in order to be able to attend the event")
nltk.pos_tag(text)
Out[20]:
[('They', 'PRP'),
 ('buy', 'VBP'),
 ('the', 'DT'),
 ('permit', 'NN'),
 ('in', 'IN'),
 ('order', 'NN'),
 ('to', 'TO'),
 ('be', 'VB'),
 ('able', 'JJ'),
 ('to', 'TO'),
 ('attend', 'VB'),
 ('the', 'DT'),
 ('event', 'NN')]
In [ ]:
import nltk
from nltk.tag import DefaultTagger
# DefaultTagger assigns the same tag to every token; it takes a tag, not a sentence
tag = DefaultTagger('NN')
tag.tag(['Beautiful', 'morning'])

# Semantic Analysis - Named Entity Recognizer (NER)
# locations = [('Jaipur', 'IN', 'Rajasthan'), ('Ajmer', 'IN', 'Rajasthan'), ('Udaipur', 'IN', 'Rajasthan')]
# from nltk.tag import StanfordNERTagger
# sentence = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
# sentence.tag('John goes to NY'.split())
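The Stanford tagger above needs external model and jar files. As a lighter-weight alternative (not the same tagger, just a sketch assuming the 'maxent_ne_chunker' and 'words' resources are downloaded), NLTK's built-in chunker can label named entities from POS-tagged tokens:

In [ ]:
import nltk

tokens = nltk.word_tokenize("John goes to New York")
tagged = nltk.pos_tag(tokens)
# ne_chunk groups the tagged tokens into named-entity chunks (PERSON, GPE, ...)
print(nltk.ne_chunk(tagged))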

Language conversion & Text formatting & Grammar

!pip install --trusted-host pypi.python.org autocorrect

In [22]:
from autocorrect import spell
spell("Tghe")
Out[22]:
'The'

!pip install --trusted-host pypi.python.org textblob

In [20]:
from textblob import TextBlob
b = TextBlob("I havv good speling!")
print(b.detect_language())
print (b.correct())
en
I have good spelling!
In [24]:
from textblob import Word
w = Word('falability')
w.spellcheck()
Out[24]:
[(u'fallibility', 0.3333333333333333),
 (u'capability', 0.3333333333333333),
 (u'affability', 0.3333333333333333)]
In [21]:
from langdetect import detect
print (detect("War doesn't show who's right, just who's left."))
print (detect("Ein, zwei, drei, vier"))
print (detect("Eu gosto de mulher"))
en
de
pt

!pip install --trusted-host pypi.python.org langdetect

In [24]:
#en_blob = TextBlob(u'Simple is better than complex.')
#en_blob.translate(to='vi') # vi stands for vietnamese
en_blob = TextBlob(u'I am a free black man loved by Jesus Christ.')
en_blob.translate(to='pt')
Out[24]:
TextBlob("Eu sou um homem negro livre amado por Jesus Cristo.")

TF-IDF Term Frequency - Inverse Document Frequency

  • It is a statistical measure that aims to reflect how important a word is to a document in a collection or corpus.
  • It can be seen as a weighting factor.
  • How to generate TF-IDF scores for tokens or phrases? Two routes (see the sketch below):
    - by using CountVectorizer and then feeding its output into TfidfTransformer
    - by directly feeding the collection of text documents to TfidfVectorizer
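A minimal sketch of the first route, CountVectorizer followed by TfidfTransformer; with default settings it should produce the same matrix as TfidfVectorizer.

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ['Call you tonight', 'Call me a cab', 'please call me... PLEASE!']

counts = CountVectorizer().fit_transform(docs)     # raw term counts per document
tfidf = TfidfTransformer().fit_transform(counts)   # reweight counts by inverse document frequency
print(tfidf.toarray())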

Extraction of numerical features from texts

  • Tokenize strings and assign an integer ID to each obtained token
  • Count the number of times each token appears in a given document
  • Normalize and down-weight tokens that occur in the majority of the documents
  • In doing so, we should note that the frequency of each token is a FEATURE
In [ ]:

A corpus of documents can thus be represented as a matrix where each row represents a specific document and each column denotes the occurrence of a token (e.g. a word) in the corpus.
In [ ]:
To vectorize (vectorization) means turning a collection of text documents into numerical feature vectors.
In [ ]:
# Questions?
# what is the difference between fit(), transform(), and fit_transform()?
# fit(): learns the model parameters (here, the vocabulary) from the training data
# transform(): applies the parameters learned by fit() to a data set to
# produce the transformed (encoded) data set
# fit_transform(): combines fit() and transform() applied on the same data set
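A minimal sketch of the usual pattern: fit on the training text only, then transform both the training and test sets with the vocabulary learned from training (the toy documents here are made up for illustration).

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ['call you tonight', 'call me a cab']
test_docs = ['please call me']

vect = CountVectorizer()
vect.fit(train_docs)                    # learn the vocabulary from the training data only
train_dtm = vect.transform(train_docs)  # encode the training docs with that vocabulary
test_dtm = vect.transform(test_docs)    # words unseen in training are simply ignored
# vect.fit_transform(train_docs) is equivalent to fit(...) followed by transform(...)
print(vect.get_feature_names())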
In [25]:
import numpy as np
import scipy as sp
import pandas as pd
# we need to import and instantiate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

simple_train = ['Call you tonight', 'Call me a cab', 'please call me... PLEASE!']

vect=CountVectorizer() # CountVectorizer converts a collection of text
# documents to a matrix of token counts. The output is a sparse matrix,
# i.e. one that stores only the non-zero entries to save memory.
tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())
# Take the text or the document and learn the vocabulary
tf
Out[25]:
cab call me please tonight you
0 0 1 0 0 1 1
1 1 1 1 0 0 0
2 0 1 1 2 0 0
In [ ]:
# we can see that it is not displaying the character 'a'. This is because the default
# token_pattern only keeps tokens made of two or more word characters.
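If single-character tokens such as 'a' should be kept, the default token_pattern can be relaxed (a minimal sketch):

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['Call you tonight', 'Call me a cab', 'please call me... PLEASE!']

# the default pattern r'(?u)\b\w\w+\b' requires 2+ word characters; allow 1+ instead
vect = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
vect.fit(simple_train)
print(vect.get_feature_names())   # now includes 'a'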
In [2]:
vect = CountVectorizer(binary=True)
df = vect.fit_transform(simple_train).toarray().sum(axis=0) # axis=0 sums over the rows (documents),
# giving the number of documents each term appears in
pd.DataFrame(df.reshape(1,6), columns=vect.get_feature_names())
# This is about the document frequency
Out[2]:
cab call me please tonight you
0 1 3 2 1 1 1
In [29]:
tf/df # term frequency divided by document frequency: a crude, unnormalized form of TF-IDF
Out[29]:
cab call me please tonight you
0 0.0 0.333333 0.0 0.0 1.0 1.0
1 1.0 0.333333 0.5 0.0 0.0 0.0
2 0.0 0.333333 0.5 2.0 0.0 0.0
In [30]:
vect = TfidfVectorizer()
pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())
Out[30]:
cab call me please tonight you
0 0.000000 0.385372 0.000000 0.000000 0.652491 0.652491
1 0.720333 0.425441 0.547832 0.000000 0.000000 0.000000
2 0.000000 0.266075 0.342620 0.901008 0.000000 0.000000
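With its default settings (smooth_idf=True, norm='l2'), scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then L2-normalizes each row, which is why these numbers differ from the simple tf/df above. A minimal sketch reproducing TfidfVectorizer's output by hand:

In [ ]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['Call you tonight', 'Call me a cab', 'please call me... PLEASE!']

vect = CountVectorizer()
tf = vect.fit_transform(simple_train).toarray().astype(float)  # raw term counts
df = (tf > 0).sum(axis=0)                                      # document frequency of each term
n = tf.shape[0]                                                # number of documents

idf = np.log((1.0 + n) / (1.0 + df)) + 1.0                     # smoothed inverse document frequency
tfidf = tf * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)   # L2-normalize each row
print(np.round(tfidf, 6))                                      # matches the TfidfVectorizer output above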

CountVectorizer - Fit Transform with NLP

In [31]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer  = CountVectorizer(min_df=1) # min_df represents a threshold: while building
# the vocabulary, ignore terms that have a document frequency strictly lower
# than the given threshold
print (vectorizer)

corpus = ['This is the first document','This is the second second document', 'And the third one', 'Is this the first document?']

X = vectorizer.fit_transform(corpus)

tf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
tf
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
Out[31]:
and document first is one second the third this
0 0 1 1 1 0 0 1 0 1
1 0 1 0 1 0 2 1 0 1
2 1 0 0 0 1 0 1 1 0
3 0 1 1 1 0 0 1 0 1
In [32]:
print (X)
  (0, 8)	1
  (0, 3)	1
  (0, 6)	1
  (0, 2)	1
  (0, 1)	1
  (1, 8)	1
  (1, 3)	1
  (1, 6)	1
  (1, 1)	1
  (1, 5)	2
  (2, 6)	1
  (2, 0)	1
  (2, 7)	1
  (2, 4)	1
  (3, 8)	1
  (3, 3)	1
  (3, 6)	1
  (3, 2)	1
  (3, 1)	1

Yelp review analysis

In [30]:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline
In [31]:
yelp = pd.read_csv('yelp.csv')
yelp.head()
Out[31]:
business_id date review_id stars text type user_id cool useful funny
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf... review rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0
1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 I have no idea why some people give bad review... review 0a2KyEL0d3Yb1V6aivbIuQ 0 0 0
2 6oRAC4uyJCsJl1X0WZpVSA 2012-06-14 IESLBzqUCLdSzSqm0eCSxQ 4 love the gyro plate. Rice is so good and I als... review 0hT2KtfLiobPvh6cDC8JQg 0 1 0
3 _1QQZuf4zZOyFCvXc0o6Vg 2010-05-27 G-WvGaISbqqaMHlNnByodA 5 Rosie, Dakota, and I LOVE Chaparral Dog Park!!... review uZetl9T0NcROGOyFfughhg 1 2 0
4 6ozycU1RpktNG2-1BroVtw 2012-01-05 1uJFq2r5QfJG_6ExMRCaGw 5 General Manager Scott Petello is a good egg!!!... review vYmM4KTsC8ZfQBg-j5MWkw 0 0 0
In [32]:
yelp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
business_id    10000 non-null object
date           10000 non-null object
review_id      10000 non-null object
stars          10000 non-null int64
text           10000 non-null object
type           10000 non-null object
user_id        10000 non-null object
cool           10000 non-null int64
useful         10000 non-null int64
funny          10000 non-null int64
dtypes: int64(4), object(6)
memory usage: 781.3+ KB
In [33]:
#create a new DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

yelp_best_worst.reset_index(drop=True, inplace=True) # reset the indices. And instead of 
# creating another data frame, let's just do it inplace

x = yelp_best_worst.text #reviews
y = yelp_best_worst.stars #ratings
# print x to look at x
# print y to take a look at
print (x.shape)

#split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x,y,random_state=1)
(4086L,)
In [39]:
print (x)
0       My wife took me here on my birthday for breakf...
1       I have no idea why some people give bad review...
2       Rosie, Dakota, and I LOVE Chaparral Dog Park!!...
3       General Manager Scott Petello is a good egg!!!...
4       Drop what you're doing and drive here. After I...
5       Nobuo shows his unique talents with everything...
6       The oldish man who owns the store is as sweet ...
7       Wonderful Vietnamese sandwich shoppe. Their ba...
8       They have a limited time thing going on right ...
9       okay this is the best place EVER! i grew up sh...
10      This place shouldn't even be reviewed - becaus...
11      first time my friend and I went there... it wa...
12      U can go there n check the car out. If u wanna...
13      I love this place! I have been coming here for...
14      I love love LOVE this place. My boss (who is i...
15      Disclaimer: Like many of you, I am a sucker fo...
16      Disgusting!  Had a Groupon so my daughter and ...
17      Never having dealt with a Discount Tire in Pho...
18      I've eaten here many times, but none as bad as...
19      (Un)fortunately for me, lux is close to my hou...
20      Fred M. pretty much said what I would say, so ...
21      Alright, I have been away from Yelp for quite ...
22      This restaurant is incredible, and has the bes...
23      I have always been a fan of Burlington's deals...
24      Another night meeting friends here.  I have to...
25      Not busy at all but took nearly 45 min to get ...
26      This an incredible church that embraces the pr...
27      This is our favorite breakfast place. The food...
28      I had looked at several invitation websites al...
29      Yikes, reading other reviews I realize my bad ...
                              ...
4056    I have a fond place in my heart for this estab...
4057    Cork is an enigma.\n\nWhat makes it enigmatic ...
4058    Went to Yogurt Kingdom for the first time toni...
4059    I find it hilarious that someone would referen...
4060                                      LOVE Five Guys!
4061    This is a great Mexican food restaurant. I eat...
4062    "Hipster,Trendy" ????-I think NOT !!!! Very di...
4063    "So Jimmy, tell the class what you saw at Swee...
4064    Standard Mexican fare - but quite delicious.  ...
4065    My profile says....\n\nMy Last Meal On Earth: ...
4066    Treats: We tried the cookies (chocolate chip a...
4067    I first joined 24 hr fitness about a year ago,...
4068    Leah, the trainer, at Dog House Training Acade...
4069    This place is super cute lunch joint.  I had t...
4070    The staff is great, the food is great, even th...
4071    Wow!  Went on a Sunday around 11am - busy but ...
4072    When I lived in Phoenix, I was a regular at Fe...
4073    Why did I wait so long to try this neighborhoo...
4074    This is the place for a fabulos breakfast!! I ...
4075    Highly recommend. This is my second time here ...
4076    5 stars for the great $5 happy hour specials. ...
4077    We brought the entire family to Giuseppe's las...
4078    Went last night to Whore Foods to get basics t...
4079    The food is delicious.  The service:  discrimi...
4080    Great food and service! Country food at its best!
4081    Yes I do rock the hipster joints.  I dig this ...
4082    Only 4 stars? \n\n(A few notes: The folks that...
4083    I'm not normally one to jump at reviewing a ch...
4084    Let's see...what is there NOT to like about Su...
4085    4-5 locations.. all 4.5 star average.. I think...
Name: text, dtype: object

Tokenization

In [21]:
# use CountVectorizer to create document-term matrices from x_train and x_test
vect = CountVectorizer()
x_train_dtm = vect.fit_transform(x_train) # learn the vocabulary and create the document-term matrix
print (x_train_dtm)
#print (x_train_dtm.shape)
x_test_dtm= vect.transform(x_test)
#print x_test_dtm
#x_test_dtm.shape
  (0, 5773)	1
  (0, 10362)	2
  (0, 12465)	1
  (0, 10069)	1
  (0, 10180)	1
  (0, 16612)	2
  (0, 4631)	1
  (0, 9578)	1
  (0, 15093)	1
  (0, 11186)	1
  (0, 136)	1
  (0, 4809)	1
  (0, 15136)	1
  (0, 10413)	2
  (0, 16195)	1
  (0, 15834)	1
  (0, 12514)	2
  (0, 2789)	1
  (0, 14838)	1
  (0, 10286)	2
  (0, 3679)	1
  (0, 15032)	2
  (0, 1018)	1
  (0, 2286)	2
  (0, 1003)	1
  :	:
  (3063, 2312)	1
  (3063, 9318)	1
  (3063, 879)	1
  (3063, 10352)	2
  (3063, 15968)	1
  (3063, 7181)	1
  (3063, 15042)	1
  (3063, 5333)	1
  (3063, 8189)	2
  (3063, 1548)	1
  (3063, 9807)	1
  (3063, 2818)	1
  (3063, 2735)	1
  (3063, 14836)	1
  (3063, 6718)	1
  (3063, 16599)	1
  (3063, 6974)	1
  (3063, 14137)	1
  (3063, 5139)	1
  (3063, 4538)	1
  (3063, 10805)	1
  (3063, 14994)	1
  (3063, 9438)	1
  (3063, 16162)	1
  (3063, 6616)	1
In [22]:
print (x_test)
1607    Looking a cutting edge, wanting the best for e...
3409    Greatness in the form of food, just like the o...
1751    The Flower Studio far exceeded my expectations...
2275        So yummy! Strange combination but great place
230     I've been hearing about these cheesecakes from...
902     This has to be the worst restaurant in terms o...
1865    I ate at Scramble last Friday and I have to sa...
636     We decided to eat here on a whim. My husband g...
2625    I LOVE BURRITO EXPRESS. My fiance has been goi...
943     Just open.  I had the roast beef sandwich and ...
1171    Cute busy place in Central Phoenix. Not hiding...
1247    I'm a big fan of Silver Mine. I have been for ...
200     I have now visited Herb n' Flavors several tim...
891     I love to come here from time to time when I'm...
443     Went to Fatburger with our family tonight and ...
2497    This review pertains to carnitas, and as such ...
1673    TIP #1 to Mesa-Gateway fliers: This is the ONL...
745     If you like the stuck up Scottsdale vibe this ...
1105    I'm sorry to be what seems to be the lone one ...
3227    Unprofessional, disorganized, and extremely lo...
1164    Bad music, slow service, disgusting overpriced...
3896    I can't remember the name of the special salad...
4001    Went here last night when on our last night st...
3868    There is only one reason  why I shouldn't love...
4071    Wow!  Went on a Sunday around 11am - busy but ...
1767    Wow! The Penang Curry (chicken) was absolutely...
936     This place is what Desert Ridge wishes it was!...
1249    Andrea is absolutely wonderful. She's pet-sit ...
2095    I am from Chicago - and the italian beef here ...
3803    I was given a $100 gift card to use at Willow ...
                              ...
2867    Totally excited to try this place out, my gran...
1533    Went here for the first time today.  Loved it....
3266    Still a place that is unacceptable in my book-...
407     I took my family here and this was a disappoin...
137     So your going to Scottsdale via Paradise Valle...
973     Very good place to eat.. I go here atleast 3 t...
797     Just did take out.  Great experience.  Easy to...
1808    my husband and i LOVE this place! \n\nno, we c...
1094    Best Greek food I had in Arizona and excellent...
1545              Best ribs in Arizona (besides my own)!!
2346    The greatest community that I've ever been a p...
3330    Love Belly Rubz!! At first little Oscar was ti...
2970    A yummy Mexican Sunnyslope dive.  The oatmeal ...
1540    Have been going to LGO since 2003 and have alw...
3255    I have taken acting classes from Verve Studios...
834     I have been going to the Matador since I was l...
516     Fantastic donuts! Great selection! Coffee was ...
2969    Thanks for helping me to find Valley Eyecare C...
1291    Very good place for breakfast and their pies a...
1156    What an awesome business! Friendly, knowledgab...
220     HELLISH HELLISH SUMMER WEATHER (March thru Oct...
3424    Love this spot - it's pretty close to the conv...
3344    Found the Tuck Shop on my Urban Spoon AP and w...
3685    I usually do not complain about bad food but t...
3141    Wow, this place is still here? I went there as...
2793    Honey jalapeño chicken lollipops and sweet pot...
671                    probably my favorite restaurant :)
3441    A philosophical elder of my profession commonl...
3224    First, I'm sorry this review is lengthy, but i...
3362    You speak Italian to me and provide mouth wate...
Name: text, dtype: object
In [23]:
tf = pd.DataFrame(x_train_dtm.toarray(), columns=vect.get_feature_names())
tf.head()
Out[23]:
00 000 00a 00am 00pm 01 02 03 03342 04 ... zucchini zuchinni zumba zupa zuzu zwiebel zzed éclairs école ém
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 16825 columns

In [24]:
x_train.head()
Out[24]:
2790    FILLY-B's!!!!!  only 8 reviews?? NINE now!!!\n...
725     My husband and I absolutely LOVE this restaura...
1578    We went today after lunch. I got my usual of l...
282     Totally dissapointed.  I had purchased a coupo...
2024    Costco Travel - My husband and I recently retu...
Name: text, dtype: object
In [25]:
#don't lowercase
vect = CountVectorizer(lowercase=False)
x_train_dtm = vect.fit_transform(x_train)
x_train_dtm.shape
Out[25]:
(3064, 20838)
In [26]:
# include 1-grams and 2-grams (an n-gram is just a contiguous sequence of n adjacent words
# or characters that you can find in your source text)
vect = CountVectorizer(ngram_range=(1,2))
x_train_dtm = vect.fit_transform(x_train)
x_train_dtm.shape
Out[26]:
(3064, 169847)
In [27]:
print (vect.get_feature_names()[-50:]) # The last 50 words
[u'zone out', u'zone when', u'zones', u'zones dolls', u'zoning', u'zoning issues', u'zoo', u'zoo and', u'zoo is', u'zoo not', u'zoo the', u'zoo ve', u'zoyo', u'zoyo for', u'zucca', u'zucca appetizer', u'zucchini', u'zucchini and', u'zucchini bread', u'zucchini broccoli', u'zucchini carrots', u'zucchini fries', u'zucchini pieces', u'zucchini strips', u'zucchini veal', u'zucchini very', u'zucchini with', u'zuchinni', u'zuchinni again', u'zuchinni the', u'zumba', u'zumba class', u'zumba or', u'zumba yogalates', u'zupa', u'zupa flavors', u'zuzu', u'zuzu in', u'zuzu is', u'zuzu the', u'zwiebel', u'zwiebel kr\xe4uter', u'zzed', u'zzed in', u'\xe9clairs', u'\xe9clairs napoleons', u'\xe9cole', u'\xe9cole len\xf4tre', u'\xe9m', u'\xe9m all']
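A minimal sketch of what ngram_range=(1,2) produces on a toy sentence (unigrams plus adjacent word pairs):

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(1, 2))
vect.fit(['the quick brown fox'])
print(vect.get_feature_names())
# ['brown', 'brown fox', 'fox', 'quick', 'quick brown', 'the', 'the quick']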

Predict the star rating

In [28]:
vect = CountVectorizer()

x_train_dtm = vect.fit_transform(x_train)
x_test_dtm = vect.transform(x_test)

# Questions?
# what is the difference between fit(), transform(), and fit_transform()?
# fit(): learns the model parameters (the vocabulary) from the training data
# transform(): applies the parameters learned by fit() to
# produce the transformed (encoded) data set
# fit_transform(): combines fit() and transform() applied on the same data set

#Naive Bayes
nb = MultinomialNB()
nb.fit(x_train_dtm, y_train)
y_pred_class = nb.predict(x_test_dtm)

print (metrics.accuracy_score(y_test, y_pred_class))
0.918786692759

calculate null accuracy

# null accuracy: the accuracy achievable by always predicting the most frequent class
y_test_binary = np.where(y_test==5, 1, 0)
max(y_test_binary.mean(), 1-y_test_binary.mean())

In [29]:
#define a function that accepts a vectorizer and calculates the accuracy
def tokenize_test(vect):
    x_train_dtm = vect.fit_transform(x_train)
    print ('Features: ', x_train_dtm.shape[1])
    x_test_dtm = vect.transform(x_test)
    nb = MultinomialNB()
    nb.fit(x_train_dtm, y_train)
    y_pred_class = nb.predict(x_test_dtm)
    print ('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))
In [30]:
#include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1,2))
tokenize_test(vect)
('Features: ', 169847)
('Accuracy: ', 0.85420743639921726)
In [31]:
vect = CountVectorizer()
tokenize_test(vect)
('Features: ', 16825)
('Accuracy: ', 0.91878669275929548)

Stopword removal

In [32]:
#remove English stop words
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)
('Features: ', 16528)
('Accuracy: ', 0.91585127201565553)
In [33]:
# set of stop words
print (vect.get_stop_words())
frozenset(['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', 'fify', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', 'seeming', 'under', 'ours', 'has', 'might', 'thereafter', 'latterly', 'do', 'them', 'his', 'around', 'than', 'get', 'very', 'de', 'none', 'cannot', 'every', 'whether', 'they', 'front', 'during', 'thus', 'now', 'him', 'nor', 'name', 'several', 'hereafter', 'always', 'who', 'cry', 'whither', 'this', 'someone', 'either', 'each', 'become', 'thereupon', 'sometime', 'side', 'two', 'therein', 'twelve', 'because', 'often', 'ten', 'our', 'eg', 'some', 'back', 'up', 'go', 'namely', 'towards', 'are', 'further', 'beyond', 'ourselves', 'yet', 'out', 'even', 'will', 'what', 'still', 'for', 'bottom', 'mine', 'since', 'please', 'forty', 'per', 'its', 'everything', 'behind', 'un', 'above', 'between', 'it', 'neither', 'seemed', 'ever', 'across', 'she', 'somehow', 'be', 'we', 'full', 'never', 'sixty', 'however', 'here', 'otherwise', 'were', 'whereupon', 'nowhere', 'although', 'found', 'alone', 're', 'along', 'fifteen', 'by', 'both', 'about', 'last', 'would', 'anything', 'via', 'many', 'could', 'thence', 'put', 'against', 'keep', 'etc', 'amount', 'became', 'ltd', 'hence', 'onto', 'or', 'con', 'among', 'already', 'co', 'afterwards', 'formerly', 'within', 'seems', 'into', 'others', 'while', 'whatever', 'except', 'down', 'hers', 'everyone', 'done', 'least', 'another', 'whoever', 'moreover', 'couldnt', 'throughout', 'anyhow', 'yourself', 'three', 'from', 'her', 'few', 'together', 'top', 'there', 'due', 'been', 'next', 'anyone', 'eleven', 'much', 'call', 'therefore', 'interest', 'then', 'thru', 'themselves', 'hundred', 'was', 'sincere', 'empty', 'more', 'himself', 'elsewhere', 'mostly', 'on', 'fire', 'am', 'becoming', 'hereby', 'amongst', 'else', 'part', 'everywhere', 'too', 'herself', 'former', 'those', 'he', 'me', 'myself', 'made', 'twenty', 'these', 'bill', 'cant', 'us', 'until', 'besides', 'nevertheless', 'below', 'anywhere', 'nine', 'can', 'of', 'toward', 'my', 'something', 'and', 'whereafter', 'whenever', 'give', 'almost', 'wherever', 'is', 'describe', 'beforehand', 'herein', 'an', 'as', 'itself', 'at', 'have', 'in', 'seem', 'whence', 'ie', 'any', 'fill', 'again', 'hasnt', 'inc', 'thereby', 'thin', 'no', 'perhaps', 'latter', 'meanwhile', 'when', 'detail', 'same', 'wherein', 'beside', 'also', 'that', 'other', 'take', 'which', 'becomes', 'you', 'if', 'nobody', 'see', 'though', 'may', 'after', 'upon', 'most', 'hereupon', 'eight', 'but', 'serious', 'nothing', 'such', 'your', 'why', 'a', 'off', 'whereby', 'third', 'i', 'whole', 'noone', 'sometimes', 'well', 'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once'])
In [34]:
#max_features
vect = CountVectorizer(stop_words='english', max_features=100)
tokenize_test(vect)
('Features: ', 100)
('Accuracy: ', 0.86986301369863017)
In [35]:
print(vect.get_feature_names())
[u'amazing', u'area', u'atmosphere', u'awesome', u'bad', u'bar', u'best', u'better', u'big', u'came', u'cheese', u'chicken', u'clean', u'coffee', u'come', u'day', u'definitely', u'delicious', u'did', u'didn', u'dinner', u'don', u'eat', u'excellent', u'experience', u'favorite', u'feel', u'food', u'free', u'fresh', u'friendly', u'friends', u'going', u'good', u'got', u'great', u'happy', u'home', u'hot', u'hour', u'just', u'know', u'like', u'little', u'll', u'location', u'long', u'looking', u'lot', u'love', u'lunch', u'make', u'meal', u'menu', u'minutes', u'need', u'new', u'nice', u'night', u'order', u'ordered', u'people', u'perfect', u'phoenix', u'pizza', u'place', u'pretty', u'prices', u'really', u'recommend', u'restaurant', u'right', u'said', u'salad', u'sandwich', u'sauce', u'say', u'service', u'staff', u'store', u'sure', u'table', u'thing', u'things', u'think', u'time', u'times', u'took', u'town', u'tried', u'try', u've', u'wait', u'want', u'way', u'went', u'wine', u'work', u'worth', u'years']
In [36]:
vect = CountVectorizer(ngram_range=(1,2), max_features=100000)
tokenize_test(vect)
('Features: ', 100000)
('Accuracy: ', 0.88551859099804309)
In [37]:
#min_df sets the minimum document frequency allowed when creating vocab
vect = CountVectorizer(ngram_range=(1,2), min_df=2)
tokenize_test(vect)
('Features: ', 43957)
('Accuracy: ', 0.93248532289628183)

TextBlob

  • is a Python (2 and 3) library for processing textual data.
In [38]:
print (yelp_best_worst.text[0])
My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I've ever had.

Anyway, I can't wait to go back!
In [39]:
review = TextBlob(yelp_best_worst.text[0])
In [40]:
review.words
Out[40]:
WordList(['My', 'wife', 'took', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', 'was', 'excellent', 'The', 'weather', 'was', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', 'grounds', 'an', 'absolute', 'pleasure', 'Our', 'waitress', 'was', 'excellent', 'and', 'our', 'food', 'arrived', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', 'looked', 'like', 'the', 'place', 'fills', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', 'was', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', "'ve", 'ever', 'had', 'I', "'m", 'pretty', 'sure', 'they', 'only', 'use', 'ingredients', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', 'was', 'amazing', 'While', 'EVERYTHING', 'on', 'the', 'menu', 'looks', 'excellent', 'I', 'had', 'the', 'white', 'truffle', 'scrambled', 'eggs', 'vegetable', 'skillet', 'and', 'it', 'was', 'tasty', 'and', 'delicious', 'It', 'came', 'with', '2', 'pieces', 'of', 'their', 'griddled', 'bread', 'with', 'was', 'amazing', 'and', 'it', 'absolutely', 'made', 'the', 'meal', 'complete', 'It', 'was', 'the', 'best', 'toast', 'I', "'ve", 'ever', 'had', 'Anyway', 'I', 'ca', "n't", 'wait', 'to', 'go', 'back'])
In [41]:
review.sentences
Out[41]:
[Sentence("My wife took me here on my birthday for breakfast and it was excellent."),
 Sentence("The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure."),
 Sentence("Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning."),
 Sentence("It looked like the place fills up pretty quickly so the earlier you get here the better."),
 Sentence("Do yourself a favor and get their Bloody Mary."),
 Sentence("It was phenomenal and simply the best I've ever had."),
 Sentence("I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it."),
 Sentence("It was amazing."),
 Sentence("While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious."),
 Sentence("It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete."),
 Sentence("It was the best "toast" I've ever had."),
 Sentence("Anyway, I can't wait to go back!")]
In [42]:
review.lower()
Out[42]:
TextBlob("my wife took me here on my birthday for breakfast and it was excellent.  the weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  our waitress was excellent and our food arrived quickly on the semi-busy saturday morning.  it looked like the place fills up pretty quickly so the earlier you get here the better.

do yourself a favor and get their bloody mary.  it was phenomenal and simply the best i've ever had.  i'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  it was amazing.

while everything on the menu looks excellent, i had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  it came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  it was the best "toast" i've ever had.

anyway, i can't wait to go back!")

Stemming and lemmatization

In [43]:
stemmer = SnowballStemmer('english')
print ([stemmer.stem(word) for word in review.words])
[u'my', u'wife', u'took', u'me', u'here', u'on', u'my', u'birthday', u'for', u'breakfast', u'and', u'it', u'was', u'excel', u'the', u'weather', u'was', u'perfect', u'which', u'made', u'sit', u'outsid', u'overlook', u'their', u'ground', u'an', u'absolut', u'pleasur', u'our', u'waitress', u'was', u'excel', u'and', u'our', u'food', u'arriv', u'quick', u'on', u'the', u'semi-busi', u'saturday', u'morn', u'it', u'look', u'like', u'the', u'place', u'fill', u'up', u'pretti', u'quick', u'so', u'the', u'earlier', u'you', u'get', u'here', u'the', u'better', u'do', u'yourself', u'a', u'favor', u'and', u'get', u'their', u'bloodi', u'mari', u'it', u'was', u'phenomen', u'and', u'simpli', u'the', u'best', u'i', u've', u'ever', u'had', u'i', u"'m", u'pretti', u'sure', u'they', u'onli', u'use', u'ingredi', u'from', u'their', u'garden', u'and', u'blend', u'them', u'fresh', u'when', u'you', u'order', u'it', u'it', u'was', u'amaz', u'while', u'everyth', u'on', u'the', u'menu', u'look', u'excel', u'i', u'had', u'the', u'white', u'truffl', u'scrambl', u'egg', u'veget', u'skillet', u'and', u'it', u'was', u'tasti', u'and', u'delici', u'it', u'came', u'with', u'2', u'piec', u'of', u'their', u'griddl', u'bread', u'with', u'was', u'amaz', u'and', u'it', u'absolut', u'made', u'the', u'meal', u'complet', u'it', u'was', u'the', u'best', u'toast', u'i', u've', u'ever', u'had', u'anyway', u'i', u'ca', u"n't", u'wait', u'to', u'go', u'back']
In [44]:
print ([word.lemmatize() for word in review.words])
['My', 'wife', 'took', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', u'wa', 'excellent', 'The', 'weather', u'wa', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', u'ground', 'an', 'absolute', 'pleasure', 'Our', 'waitress', u'wa', 'excellent', 'and', 'our', 'food', 'arrived', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', 'looked', 'like', 'the', 'place', u'fill', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', u'wa', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', "'ve", 'ever', 'had', 'I', "'m", 'pretty', 'sure', 'they', 'only', 'use', u'ingredient', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', u'wa', 'amazing', 'While', 'EVERYTHING', 'on', 'the', 'menu', u'look', 'excellent', 'I', 'had', 'the', 'white', 'truffle', 'scrambled', u'egg', 'vegetable', 'skillet', 'and', 'it', u'wa', 'tasty', 'and', 'delicious', 'It', 'came', 'with', '2', u'piece', 'of', 'their', 'griddled', 'bread', 'with', u'wa', 'amazing', 'and', 'it', 'absolutely', 'made', 'the', 'meal', 'complete', 'It', u'wa', 'the', 'best', 'toast', 'I', "'ve", 'ever', 'had', 'Anyway', 'I', 'ca', "n't", 'wait', 'to', 'go', 'back']
In [45]:
#assume every word is a verb
print ([word.lemmatize(pos='v') for word in review.words])
['My', 'wife', u'take', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', u'be', 'excellent', 'The', 'weather', u'be', 'perfect', 'which', u'make', u'sit', 'outside', u'overlook', 'their', u'ground', 'an', 'absolute', 'pleasure', 'Our', 'waitress', u'be', 'excellent', 'and', 'our', 'food', u'arrive', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', u'look', 'like', 'the', 'place', u'fill', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', u'be', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', "'ve", 'ever', u'have', 'I', "'m", 'pretty', 'sure', 'they', 'only', 'use', 'ingredients', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', u'be', u'amaze', 'While', 'EVERYTHING', 'on', 'the', 'menu', u'look', 'excellent', 'I', u'have', 'the', 'white', 'truffle', u'scramble', u'egg', 'vegetable', 'skillet', 'and', 'it', u'be', 'tasty', 'and', 'delicious', 'It', u'come', 'with', '2', u'piece', 'of', 'their', u'griddle', 'bread', 'with', u'be', u'amaze', 'and', 'it', 'absolutely', u'make', 'the', 'meal', 'complete', 'It', u'be', 'the', 'best', 'toast', 'I', "'ve", 'ever', u'have', 'Anyway', 'I', 'ca', "n't", 'wait', 'to', 'go', 'back']
In [46]:
def split_into_lemmas(text):
    text = unicode(text, 'utf-8').lower() #Python 2
    #text = str(text).lower() #Python 3
    words = TextBlob(text).words
    #return [word.lemmatize() for word in words]
    return [stemmer.stem(word) for word in words]
In [47]:
#split review text into lemmas rather than into words (default)
vect = CountVectorizer(analyzer=split_into_lemmas)
tokenize_test(vect)
('Features: ', 13273)
('Accuracy: ', 0.92465753424657537)
In [48]:
print (vect.get_feature_names()[-50:])
[u'yuuuuummmmmyyi', u'yuuuuuuum', u'yuyuyummi', u'yuzu', u'z', u'z-grill', u'z11', u'zach', u'zam', u'zanella', u'zankou', u'zappo', u'zatsiki', u'zen', u'zen-lik', u'zero', u'zero-star', u'zest', u'zexperi', u'zha', u'zhou', u'zia', u'zilch', u'zin', u'zinburg', u'zinburgergeist', u'zinc', u'zinfandel', u'zing', u'zip', u'zipcar', u'zipp', u'zipper', u'ziti', u'zoe', u'zombi', u'zone', u'zoo', u'zoyo', u'zucca', u'zucchini', u'zuchinni', u'zumba', u'zupa', u'zuzu', u'zwiebel-kr\xe4ut', u'zzed', u'\xe9clair', u'\xe9cole', u'\xe9m']

Term Frequency - Inverse Document Frequency (TF-IDF)

This is a repeat of the code from the TF-IDF intro section

In [64]:
#example documents
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
In [65]:
#term frequency
vect = CountVectorizer()
tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())
tf
Out[65]:
cab call me please tonight you
0 0 1 0 0 1 1
1 1 1 1 0 0 0
2 0 1 1 2 0 0
In [66]:
#document frequency
vect = CountVectorizer(binary=True)
df = vect.fit_transform(simple_train).toarray().sum(axis=0)
pd.DataFrame(df.reshape(1,6), columns=vect.get_feature_names())
Out[66]:
cab call me please tonight you
0 1 3 2 1 1 1
In [67]:
#term frequency- inverse document frequency (tf-idf) - Simple version
tf/df
Out[67]:
cab call me please tonight you
0 0.0 0.333333 0.0 0.0 1.0 1.0
1 1.0 0.333333 0.5 0.0 0.0 0.0
2 0.0 0.333333 0.5 2.0 0.0 0.0
In [68]:
vect = TfidfVectorizer()
pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())
Out[68]:
cab call me please tonight you
0 0.000000 0.385372 0.000000 0.000000 0.652491 0.652491
1 0.720333 0.425441 0.547832 0.000000 0.000000 0.000000
2 0.000000 0.266075 0.342620 0.901008 0.000000 0.000000

Using TF-IDF to summarize a Yelp review

In [49]:
#create a document-term matrix using TF-IDF
vect = TfidfVectorizer(stop_words='english')
dtm = vect.fit_transform(yelp.text)
features = vect.get_feature_names()
dtm.shape
Out[49]:
(10000, 28881)
In [50]:
def summarize():

    #choose a random review that is at least 300 characters
    review_length = 0
    while review_length < 300:
        review_id = np.random.randint(0, len(yelp))
        review_text = unicode(yelp.text[review_id], 'utf-8') #Python 2
        #review_text = str(yelp.text[review_id]) #Python3
        review_length = len(review_text)

    #create a dictionary of words and their TF-IDF scores
    word_scores = {}
    for word in TextBlob(review_text).words:
        word = word.lower()
        if word in features:
            word_scores[word] = dtm[review_id, features.index(word)]

    #print words with the top 5 TF-IDF scores
    print ('TOP SCORING WORDS:')
    top_scores = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:5]
    for word, score in top_scores:
        print (word)

    #print the review
    print ('\n' + review_text)
In [51]:
summarize()
TOP SCORING WORDS:
bowl
smashing
soba
circus
facing

I freakin love this place. My favorite thing is to sit and eat facing the counter and watch new people come in and get all confused. Now that's just funny. My first time I was the same way, like what the hell do I do here. Now I'm a pro. Stack it deep and use another bowl for smashing, Soba noodles piled so high it looks like a circus act getting the bowl to the cook. Mmmm....good.

Sentiment Analysis

  • Aims to sense people's mood based on the text they write.
  • Requires turning the text into something quantifiable (a numeric score).
  • Sentiment can be positive or negative.
In [52]:
print (review)
My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I've ever had.

Anyway, I can't wait to go back!
In [54]:
max_i = 0
max_polarity = -float('inf')

min_i = 0
min_polarity = float('inf')

for i in range(len(yelp_best_worst.text)):
    review_text = unicode(yelp_best_worst.text[i], 'utf-8') #Python 2
    #review_text = str(yelp_best_worst.text[i]) #Python3
    this_polarity = TextBlob(review_text).sentiment.polarity

    if this_polarity > max_polarity:
        max_i = i
        max_polarity = this_polarity

    if this_polarity < min_polarity:
        min_i = i
        min_polarity = this_polarity

print (TextBlob(yelp_best_worst.text[max_i]))
print (TextBlob(yelp_best_worst.text[min_i]))
Our server Gary was awesome. Food was amazing...an experience.
This was absolutely horrible. I got the supreme pizza with the mystery meats.  I threw it in the trash. I will wait until I get to my destination to eat. Horrible!!!
In [55]:
#polarity ranges from -1 (most negative) to 1 (most positive)
print(review.sentiment.polarity)
print(max_polarity)
print(min_polarity)
0.402469135802
1.0
-1.0
In [56]:
#understanding the apply method
yelp['length'] = yelp.text.apply(len)
In [57]:
yelp.head(10)
Out[57]:
business_id date review_id stars text type user_id cool useful funny length
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf... review rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0 889
1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 I have no idea why some people give bad review... review 0a2KyEL0d3Yb1V6aivbIuQ 0 0 0 1345
2 6oRAC4uyJCsJl1X0WZpVSA 2012-06-14 IESLBzqUCLdSzSqm0eCSxQ 4 love the gyro plate. Rice is so good and I als... review 0hT2KtfLiobPvh6cDC8JQg 0 1 0 76
3 _1QQZuf4zZOyFCvXc0o6Vg 2010-05-27 G-WvGaISbqqaMHlNnByodA 5 Rosie, Dakota, and I LOVE Chaparral Dog Park!!... review uZetl9T0NcROGOyFfughhg 1 2 0 419
4 6ozycU1RpktNG2-1BroVtw 2012-01-05 1uJFq2r5QfJG_6ExMRCaGw 5 General Manager Scott Petello is a good egg!!!... review vYmM4KTsC8ZfQBg-j5MWkw 0 0 0 469
5 -yxfBYGB6SEqszmxJxd97A 2007-12-13 m2CKSsepBCoRYWxiRUsxAg 4 Quiessence is, simply put, beautiful. Full wi... review sqYN3lNgvPbPCTRsMFu27g 4 3 1 2094
6 zp713qNhx8d9KCJJnrw1xA 2010-02-12 riFQ3vxNpP4rWLk_CSri2A 5 Drop what you're doing and drive here. After I... review wFweIWhv2fREZV_dYkz_1g 7 7 4 1565
7 hW0Ne_HTHEAgGF1rAdmR-g 2012-07-12 JL7GXJ9u4YMx7Rzs05NfiQ 4 Luckily, I didn't have to travel far to make m... review 1ieuYcKS7zeAv_U15AB13A 0 1 0 274
8 wNUea3IXZWD63bbOQaOH-g 2012-08-17 XtnfnYmnJYi71yIuGsXIUA 4 Definitely come for Happy hour! Prices are ama... review Vh_DlizgGhSqQh4qfZ2h6A 0 0 0 349
9 nMHhuYan8e3cONo3PornJA 2010-08-11 jJAIXA46pU1swYyRCdfXtQ 5 Nobuo shows his unique talents with everything... review sUNkXg8-KFtCMQDV6zRzQg 0 1 0 186
In [77]:
#define a function that accepts text and returns polarity
def detect_sentiment(text):
    return TextBlob(text.decode('utf-8')).sentiment.polarity #Python 2
    #return TextBlob(text).sentiment.polarity Python 3
In [78]:
#create a new DataFrame column for sentiment
yelp['sentiment'] = yelp.text.apply(detect_sentiment)
In [79]:
yelp.boxplot(column='sentiment', by='stars')
Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x3af4f630>
In [80]:
#reviews with most positive sentiment
yelp[yelp.sentiment == 1].text.head()
Out[80]:
254    Our server Gary was awesome. Food was amazing....
347    3 syllables for this place. \nA-MAZ-ING!\n\nTh...
420                                    LOVE the food!!!!
459    Love it!!! Wish we still lived in Arizona as C...
679                                     Excellent burger
Name: text, dtype: object
In [81]:
#reviews with most negative sentiment
yelp[yelp.sentiment == -1].text.head()
Out[81]:
773     This was absolutely horrible. I got the suprem...
1517                  Nasty workers and over priced trash
3266    Absolutely awful... these guys have NO idea wh...
4766                                       Very bad food!
5812        I wouldn't send my worst enemy to this place.
Name: text, dtype: object

TextBlob features

In [82]:
# spelling correction
TextBlob('15 minuets late').correct()
Out[82]:
TextBlob("15 minutes late")
In [83]:
# spellcheck
Word('parot').spellcheck()
Out[83]:
[('part', 0.9929478138222849), (u'parrot', 0.007052186177715092)]
In [84]:
# definitions
Word('bank').define('v')
Out[84]:
[u'tip laterally',
 u'enclose with a bank',
 u'do business with a bank or keep an account at a bank',
 u'act as the banker in a game or in gambling',
 u'be in the banking business',
 u'put into a bank account',
 u'cover with ashes so to control the rate of burning',
 u'have confidence or faith in']
In [85]:
# language identification
TextBlob('Hola amigos').detect_language()
Out[85]:
u'es'
In [86]:
# translation
TextBlob('Hola amigos').translate(from_lang='auto', to='en')
Out[86]:
TextBlob("Hello friends")
In [87]:
#sentiment
TextBlob('That movie was good.').sentiment
Out[87]:
Sentiment(polarity=0.7, subjectivity=0.6000000000000001)
In [ ]:
