Recommendation System - Based on Popularity

Day #9 - Recommendation System Based on Popularity

When the classes are imbalanced, we need the minority class to have close to the same number of samples as the majority class.

  1. Undersample. Take a subset of data points from the majority class and combine them with the minority points.
  2. Oversample. Add samples to the minority class using SMOTE. Do this after you split the data, so that only the training set is resampled.
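
The two options above can be sketched with plain pandas. This is a minimal illustration, not the method used in the rest of this notebook: `undersample`, `oversample`, and the `toy` frame are hypothetical names made up for the example, and the oversampling here is simple random duplication of minority rows rather than SMOTE (which generates synthetic points and lives in a separate library).

```python
import pandas as pd

def undersample(df, label_col, random_state=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    n_min = df[label_col].value_counts().min()
    parts = [grp.sample(n=n_min, random_state=random_state)
             for _, grp in df.groupby(label_col)]
    return pd.concat(parts).reset_index(drop=True)

def oversample(df, label_col, random_state=0):
    """Randomly duplicate rows of smaller classes until they match the largest class.

    Note: this is random oversampling, a simpler stand-in for SMOTE.
    """
    n_max = df[label_col].value_counts().max()
    parts = []
    for _, grp in df.groupby(label_col):
        parts.append(grp)  # keep every original row
        shortfall = n_max - len(grp)
        if shortfall > 0:
            # draw extra rows with replacement to fill the gap
            parts.append(grp.sample(n=shortfall, replace=True,
                                    random_state=random_state))
    return pd.concat(parts).reset_index(drop=True)

# Hypothetical toy data: 8 rows of class 0, 2 rows of class 1
toy = pd.DataFrame({'feature': list(range(10)),
                    'label': [0] * 8 + [1] * 2})

under = undersample(toy, 'label')
over = oversample(toy, 'label')
print(under['label'].value_counts())
print(over['label'].value_counts())
```

In practice either function would be applied to the training portion only, after the split, so the test set keeps its original class balance.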

Recommendation Based on Popularity

In [1]:
#import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# sklearn.cross_validation was deprecated in 0.18 and removed in 0.20; use model_selection
from sklearn.model_selection import train_test_split
In [2]:
ratings = pd.read_csv('u.data',header=None,sep='\t') # '\t' is the tab character: the file is tab-separated
ratings.head()
Out[2]:
     0    1  2          3
0    0   50  5  881250949
1    0  172  5  881250949
2    0  133  1  881250949
3  196  242  3  881250949
4  186  302  3  891717742
In [3]:
r_cols = ['user_id','movie_id','rating']

ratings = pd.read_csv('u.data',sep='\t',names=r_cols,usecols=range(3))
print(ratings.head())
m_cols=['movie_id','title']
movies = pd.read_csv('u.item',sep='|',names=m_cols,usecols=range(2))
print(movies.head())
   user_id  movie_id  rating
0        0        50       5
1        0       172       5
2        0       133       1
3      196       242       3
4      186       302       3
   movie_id              title
0         1   Toy Story (1995)
1         2   GoldenEye (1995)
2         3  Four Rooms (1995)
3         4  Get Shorty (1995)
4         5     Copycat (1995)
In [4]:
# Merging the dataframes 

df = pd.merge(ratings,movies,on='movie_id')
df.head()
#df['movie_id'].unique()
Out[4]:
   user_id  movie_id  rating             title
0        0        50       5  Star Wars (1977)
1      290        50       5  Star Wars (1977)
2       79        50       4  Star Wars (1977)
3        2        50       5  Star Wars (1977)
4        8        50       5  Star Wars (1977)
In [5]:
# Per title: number of ratings (size), total rating points (sum), and average rating (mean)
movies_grouped = df.groupby('title').agg({'rating':['size','sum','mean']})
movies_grouped
Out[5]:
rating
size sum mean
title
'Til There Was You (1997) 9 21 2.333333
1-900 (1994) 5 13 2.600000
101 Dalmatians (1996) 109 317 2.908257
12 Angry Men (1957) 125 543 4.344000
187 (1997) 41 124 3.024390
2 Days in the Valley (1996) 93 300 3.225806
20,000 Leagues Under the Sea (1954) 72 252 3.500000
2001: A Space Odyssey (1968) 259 1028 3.969112
3 Ninjas: High Noon At Mega Mountain (1998) 5 5 1.000000
39 Steps, The (1935) 59 239 4.050847
8 1/2 (1963) 38 145 3.815789
8 Heads in a Duffel Bag (1997) 4 13 3.250000
8 Seconds (1994) 4 15 3.750000
A Chef in Love (1996) 8 33 4.125000
Above the Rim (1994) 5 15 3.000000
Absolute Power (1997) 127 428 3.370079
Abyss, The (1989) 151 542 3.589404
Ace Ventura: Pet Detective (1994) 103 314 3.048544
Ace Ventura: When Nature Calls (1995) 37 99 2.675676
Across the Sea of Time (1995) 4 11 2.750000
Addams Family Values (1993) 87 245 2.816092
Addicted to Love (1997) 54 171 3.166667
Addiction, The (1995) 11 24 2.181818
Adventures of Pinocchio, The (1996) 39 119 3.051282
Adventures of Priscilla, Queen of the Desert, The (1994) 111 399 3.594595
Adventures of Robin Hood, The (1938) 67 254 3.791045
Affair to Remember, An (1957) 26 109 4.192308
African Queen, The (1951) 152 636 4.184211
Afterglow (1997) 18 56 3.111111
Age of Innocence, The (1993) 65 220 3.384615
... ... ... ...
Window to Paris (1994) 1 4 4.000000
Wings of Courage (1995) 1 4 4.000000
Wings of Desire (1987) 57 228 4.000000
Wings of the Dove, The (1997) 75 276 3.680000
Winnie the Pooh and the Blustery Day (1968) 75 285 3.800000
Winter Guest, The (1997) 9 31 3.444444
Wishmaster (1997) 27 66 2.444444
With Honors (1994) 46 141 3.065217
Withnail and I (1987) 13 42 3.230769
Witness (1985) 1 4 4.000000
Wizard of Oz, The (1939) 246 1003 4.077236
Wolf (1994) 67 181 2.701493
Woman in Question, The (1950) 1 1 1.000000
Women, The (1939) 15 55 3.666667
Wonderful, Horrible Life of Leni Riefenstahl, The (1993) 10 40 4.000000
Wonderland (1997) 10 32 3.200000
Wooden Man's Bride, The (Wu Kui) (1994) 3 8 2.666667
World of Apu, The (Apur Sansar) (1959) 6 24 4.000000
Wrong Trousers, The (1993) 118 527 4.466102
Wyatt Earp (1994) 50 155 3.100000
Yankee Zulu (1994) 1 1 1.000000
Year of the Horse (1997) 7 23 3.285714
You So Crazy (1994) 1 3 3.000000
Young Frankenstein (1974) 200 789 3.945000
Young Guns (1988) 101 324 3.207921
Young Guns II (1990) 44 122 2.772727
Young Poisoner's Handbook, The (1995) 41 137 3.341463
Zeus and Roxanne (1997) 6 13 2.166667
unknown 9 31 3.444444
Á köldum klaka (Cold Fever) (1994) 1 3 3.000000

1664 rows × 3 columns

In [6]:
# Sort by average rating, highest first. Titles with only one or two ratings can top this list.
popular_movies = movies_grouped.sort_values(('rating','mean'),ascending=False)
popular_movies
Out[6]:
rating
size sum mean
title
They Made Me a Criminal (1939) 1 5 5.000000
Marlene Dietrich: Shadow and Light (1996) 1 5 5.000000
Saint of Fort Washington, The (1993) 2 10 5.000000
Someone Else's America (1995) 1 5 5.000000
Star Kid (1997) 3 15 5.000000
Great Day in Harlem, A (1994) 1 5 5.000000
Aiqing wansui (1994) 1 5 5.000000
Santa with Muscles (1996) 2 10 5.000000
Prefontaine (1997) 3 15 5.000000
Entertaining Angels: The Dorothy Day Story (1996) 1 5 5.000000
Pather Panchali (1955) 8 37 4.625000
Some Mother's Son (1996) 2 9 4.500000
Maya Lin: A Strong Clear Vision (1994) 4 18 4.500000
Anna (1996) 2 9 4.500000
Everest (1998) 2 9 4.500000
Close Shave, A (1995) 112 503 4.491071
Schindler's List (1993) 298 1331 4.466443
Wrong Trousers, The (1993) 118 527 4.466102
Casablanca (1942) 243 1083 4.456790
Wallace & Gromit: The Best of Aardman Animation (1996) 67 298 4.447761
Shawshank Redemption, The (1994) 283 1258 4.445230
Rear Window (1954) 209 917 4.387560
Usual Suspects, The (1995) 267 1171 4.385768
Star Wars (1977) 584 2546 4.359589
12 Angry Men (1957) 125 543 4.344000
Bitter Sugar (Azucar Amargo) (1996) 3 13 4.333333
Letter From Death Row, A (1998) 3 13 4.333333
Third Man, The (1949) 72 312 4.333333
Citizen Kane (1941) 198 850 4.292929
Some Folks Call It a Sling Blade (1993) 41 176 4.292683
... ... ... ...
New Age, The (1994) 1 1 1.000000
Good Morning (1971) 1 1 1.000000
Further Gesture, A (1996) 1 1 1.000000
Nobody Loves Me (Keiner liebt mich) (1994) 1 1 1.000000
Lotto Land (1995) 1 1 1.000000
Liebelei (1933) 1 1 1.000000
Girl in the Cadillac (1995) 1 1 1.000000
Vie est belle, La (Life is Rosey) (1987) 1 1 1.000000
Baton Rouge (1988) 1 1 1.000000
Very Natural Thing, A (1974) 1 1 1.000000
Bird of Prey (1996) 1 1 1.000000
Office Killer (1997) 1 1 1.000000
Lashou shentan (1992) 1 1 1.000000
August (1996) 1 1 1.000000
Venice/Venice (1992) 2 2 1.000000
Death in the Garden (Mort en ce jardin, La) (1956) 1 1 1.000000
Careful (1992) 1 1 1.000000
Tigrero: A Film That Was Never Made (1994) 1 1 1.000000
Butterfly Kiss (1995) 1 1 1.000000
Low Life, The (1994) 1 1 1.000000
To Cross the Rubicon (1991) 1 1 1.000000
Modern Affair, A (1995) 1 1 1.000000
Boys in Venice (1996) 2 2 1.000000
Hedd Wyn (1992) 1 1 1.000000
Wend Kuuni (God's Gift) (1982) 1 1 1.000000
Eye of Vichy, The (Oeil de Vichy, L') (1993) 1 1 1.000000
King of New York (1990) 1 1 1.000000
Touki Bouki (Journey of the Hyena) (1973) 1 1 1.000000
Bloody Child, The (1996) 1 1 1.000000
Crude Oasis, The (1995) 1 1 1.000000

1664 rows × 3 columns

In [7]:
# Total rating points summed over every movie
group_sum = movies_grouped['rating']['sum'].sum()
group_sum
Out[7]:
352997L
In [8]:
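# Each movie's share (in percent) of all rating points; this rewards movies that are rated often and rated well
popular_movies['percentage'] = movies_grouped['rating']['sum']/group_sum*100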
popular_movies.sort_values('percentage',ascending=False)
Out[8]:
rating percentage
size sum mean
title
Star Wars (1977) 584 2546 4.359589 0.721253
Fargo (1996) 508 2111 4.155512 0.598022
Return of the Jedi (1983) 507 2032 4.007890 0.575642
Contact (1997) 509 1936 3.803536 0.548447
Raiders of the Lost Ark (1981) 420 1786 4.252381 0.505953
Godfather, The (1972) 413 1769 4.283293 0.501137
English Patient, The (1996) 481 1759 3.656965 0.498305
Toy Story (1995) 452 1753 3.878319 0.496605
Silence of the Lambs, The (1991) 390 1673 4.289744 0.473942
Scream (1996) 478 1645 3.441423 0.466010
Pulp Fiction (1994) 394 1600 4.060914 0.453262
Air Force One (1997) 431 1565 3.631090 0.443347
Empire Strikes Back, The (1980) 368 1548 4.206522 0.438531
Liar Liar (1997) 485 1531 3.156701 0.433715
Twelve Monkeys (1995) 392 1489 3.798469 0.421817
Titanic (1997) 350 1486 4.245714 0.420967
Independence Day (ID4) (1996) 429 1475 3.438228 0.417851
Chasing Amy (1997) 379 1455 3.839050 0.412185
Jerry Maguire (1996) 384 1425 3.710938 0.403686
Rock, The (1996) 378 1396 3.693122 0.395471
Fugitive, The (1993) 336 1359 4.044643 0.384989
Princess Bride, The (1987) 324 1352 4.172840 0.383006
Back to the Future (1985) 350 1342 3.834286 0.380173
Star Trek: First Contact (1996) 365 1336 3.660274 0.378473
Schindler's List (1993) 298 1331 4.466443 0.377057
Indiana Jones and the Last Crusade (1989) 331 1301 3.930514 0.368558
Monty Python and the Holy Grail (1974) 316 1285 4.066456 0.364026
Shawshank Redemption, The (1994) 283 1258 4.445230 0.356377
Full Monty, The (1997) 315 1237 3.926984 0.350428
Forrest Gump (1994) 321 1237 3.853583 0.350428
... ... ... ... ...
Further Gesture, A (1996) 1 1 1.000000 0.000283
New Age, The (1994) 1 1 1.000000 0.000283
Terror in a Texas Town (1958) 1 1 1.000000 0.000283
Leopard Son, The (1996) 1 1 1.000000 0.000283
Man from Down Under, The (1943) 1 1 1.000000 0.000283
Hungarian Fairy Tale, A (1987) 1 1 1.000000 0.000283
Mat' i syn (1997) 1 1 1.000000 0.000283
Yankee Zulu (1994) 1 1 1.000000 0.000283
Invitation, The (Zaproszenie) (1986) 1 1 1.000000 0.000283
T-Men (1947) 1 1 1.000000 0.000283
Symphonie pastorale, La (1946) 1 1 1.000000 0.000283
Police Story 4: Project S (Chao ji ji hua) (1993) 1 1 1.000000 0.000283
Shadows (Cienie) (1988) 1 1 1.000000 0.000283
Hostile Intentions (1994) 1 1 1.000000 0.000283
Pharaoh's Army (1995) 1 1 1.000000 0.000283
Power 98 (1995) 1 1 1.000000 0.000283
Daens (1992) 1 1 1.000000 0.000283
Mostro, Il (1994) 1 1 1.000000 0.000283
Shadow of Angels (Schatten der Engel) (1976) 1 1 1.000000 0.000283
Quartier Mozart (1992) 1 1 1.000000 0.000283
I, Worst of All (Yo, la peor de todas) (1990) 1 1 1.000000 0.000283
Woman in Question, The (1950) 1 1 1.000000 0.000283
Vermont Is For Lovers (1992) 1 1 1.000000 0.000283
Every Other Weekend (1990) 1 1 1.000000 0.000283
Homage (1995) 1 1 1.000000 0.000283
Getting Away With Murder (1996) 1 1 1.000000 0.000283
The Courtyard (1995) 1 1 1.000000 0.000283
Promise, The (Versprechen, Das) (1994) 1 1 1.000000 0.000283
JLG/JLG - autoportrait de décembre (1994) 1 1 1.000000 0.000283
Crude Oasis, The (1995) 1 1 1.000000 0.000283

1664 rows × 4 columns
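
As a small worked check of the percentage step, the sketch below repeats the same calculation on a hypothetical three-movie frame (the `toy` data and the titles A, B, C are made up for illustration), then recomputes the Star Wars share from the sum (2546) and grand total (352997) shown in the outputs above.

```python
import pandas as pd

# Hypothetical ratings: A gets 5+4, B gets 3+3+2, C gets 1
toy = pd.DataFrame({
    'title': ['A', 'A', 'B', 'B', 'B', 'C'],
    'rating': [5, 4, 3, 3, 2, 1],
})

# Total rating points per title: A=9, B=8, C=1 (grand total 18)
rating_sum = toy.groupby('title')['rating'].sum()

# Share of all rating points, in percent
share = rating_sum / float(rating_sum.sum()) * 100
print(share.sort_values(ascending=False))

# Same arithmetic for Star Wars, using the numbers printed in the notebook
star_wars_share = 2546 / 352997.0 * 100
print(round(star_wars_share, 6))
```

Because the shares are fractions of one grand total, they always add up to 100, and a movie's position depends on both how many ratings it has and how high they are.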

In [12]:
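# Rank by share of rating points, 1 = most popular; method='max' gives tied movies the lowest (largest-numbered) rank of their group
popular_movies['Rank'] = popular_movies['percentage'].rank(ascending=False,method='max')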
popular_movies.sort_values('Rank',ascending=True)
Out[12]:
rating percentage Rank
size sum mean
title
Star Wars (1977) 584 2546 4.359589 0.721253 1.0
Fargo (1996) 508 2111 4.155512 0.598022 2.0
Return of the Jedi (1983) 507 2032 4.007890 0.575642 3.0
Contact (1997) 509 1936 3.803536 0.548447 4.0
Raiders of the Lost Ark (1981) 420 1786 4.252381 0.505953 5.0
Godfather, The (1972) 413 1769 4.283293 0.501137 6.0
English Patient, The (1996) 481 1759 3.656965 0.498305 7.0
Toy Story (1995) 452 1753 3.878319 0.496605 8.0
Silence of the Lambs, The (1991) 390 1673 4.289744 0.473942 9.0
Scream (1996) 478 1645 3.441423 0.466010 10.0
Pulp Fiction (1994) 394 1600 4.060914 0.453262 11.0
Air Force One (1997) 431 1565 3.631090 0.443347 12.0
Empire Strikes Back, The (1980) 368 1548 4.206522 0.438531 13.0
Liar Liar (1997) 485 1531 3.156701 0.433715 14.0
Twelve Monkeys (1995) 392 1489 3.798469 0.421817 15.0
Titanic (1997) 350 1486 4.245714 0.420967 16.0
Independence Day (ID4) (1996) 429 1475 3.438228 0.417851 17.0
Chasing Amy (1997) 379 1455 3.839050 0.412185 18.0
Jerry Maguire (1996) 384 1425 3.710938 0.403686 19.0
Rock, The (1996) 378 1396 3.693122 0.395471 20.0
Fugitive, The (1993) 336 1359 4.044643 0.384989 21.0
Princess Bride, The (1987) 324 1352 4.172840 0.383006 22.0
Back to the Future (1985) 350 1342 3.834286 0.380173 23.0
Star Trek: First Contact (1996) 365 1336 3.660274 0.378473 24.0
Schindler's List (1993) 298 1331 4.466443 0.377057 25.0
Indiana Jones and the Last Crusade (1989) 331 1301 3.930514 0.368558 26.0
Monty Python and the Holy Grail (1974) 316 1285 4.066456 0.364026 27.0
Shawshank Redemption, The (1994) 283 1258 4.445230 0.356377 28.0
Forrest Gump (1994) 321 1237 3.853583 0.350428 30.0
Full Monty, The (1997) 315 1237 3.926984 0.350428 30.0
... ... ... ... ... ...
Getting Away With Murder (1996) 1 1 1.000000 0.000283 1664.0
Bloody Child, The (1996) 1 1 1.000000 0.000283 1664.0
Stefano Quantestorie (1993) 1 1 1.000000 0.000283 1664.0
JLG/JLG - autoportrait de décembre (1994) 1 1 1.000000 0.000283 1664.0
Promise, The (Versprechen, Das) (1994) 1 1 1.000000 0.000283 1664.0
The Courtyard (1995) 1 1 1.000000 0.000283 1664.0
Homage (1995) 1 1 1.000000 0.000283 1664.0
Pharaoh's Army (1995) 1 1 1.000000 0.000283 1664.0
Man from Down Under, The (1943) 1 1 1.000000 0.000283 1664.0
Mille bolle blu (1993) 1 1 1.000000 0.000283 1664.0
Terror in a Texas Town (1958) 1 1 1.000000 0.000283 1664.0
Hungarian Fairy Tale, A (1987) 1 1 1.000000 0.000283 1664.0
Mat' i syn (1997) 1 1 1.000000 0.000283 1664.0
Invitation, The (Zaproszenie) (1986) 1 1 1.000000 0.000283 1664.0
T-Men (1947) 1 1 1.000000 0.000283 1664.0
Symphonie pastorale, La (1946) 1 1 1.000000 0.000283 1664.0
Police Story 4: Project S (Chao ji ji hua) (1993) 1 1 1.000000 0.000283 1664.0
Shadows (Cienie) (1988) 1 1 1.000000 0.000283 1664.0
Hostile Intentions (1994) 1 1 1.000000 0.000283 1664.0
Power 98 (1995) 1 1 1.000000 0.000283 1664.0
Daens (1992) 1 1 1.000000 0.000283 1664.0
Mostro, Il (1994) 1 1 1.000000 0.000283 1664.0
Shadow of Angels (Schatten der Engel) (1976) 1 1 1.000000 0.000283 1664.0
Quartier Mozart (1992) 1 1 1.000000 0.000283 1664.0
I, Worst of All (Yo, la peor de todas) (1990) 1 1 1.000000 0.000283 1664.0
Woman in Question, The (1950) 1 1 1.000000 0.000283 1664.0
Vermont Is For Lovers (1992) 1 1 1.000000 0.000283 1664.0
Every Other Weekend (1990) 1 1 1.000000 0.000283 1664.0
Leopard Son, The (1996) 1 1 1.000000 0.000283 1664.0
Crude Oasis, The (1995) 1 1 1.000000 0.000283 1664.0

1664 rows × 5 columns
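
The ranking output shows Forrest Gump and Full Monty both at 30.0 with no movie at 29, and every single-rating movie at 1664.0. That follows from `method='max'` in `rank`. The sketch below reproduces the tie on a three-row Series built from the rating sums shown above (the Series name `totals` is just for this example) and compares a few of the tie-breaking methods pandas offers.

```python
import pandas as pd

# Rating sums taken from the table above; the last two are tied
totals = pd.Series(
    [1258, 1237, 1237],
    index=['Shawshank Redemption, The (1994)',
           'Forrest Gump (1994)',
           'Full Monty, The (1997)'],
)

# Compare how each method numbers the tied pair
for method in ['max', 'min', 'average', 'dense']:
    ranks = totals.rank(ascending=False, method=method)
    print(method, ranks.tolist())

rank_max = totals.rank(ascending=False, method='max')
rank_min = totals.rank(ascending=False, method='min')
```

Here the tied pair occupies positions 2 and 3, so `'max'` assigns 3.0 to both, just as positions 29 and 30 both become 30.0 in the full table. Switching to `'min'` would give both the better rank instead.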
