%matplotlib inline
import numpy as np
import pandas as pd
import utils; reload(utils)
from utils import *
!echo "PWD: "`pwd`

CUR_DIR=!pwd
CUR_DIR=CUR_DIR[0]
!echo "CUR_DIR:" $CUR_DIR

DATA_DIR="%s/../data/movielens" % CUR_DIR
!echo "DATA_DIR: "$DATA_DIR
PWD: /Applications/GITHub/keras/course_fast_ai/lesson_1
CUR_DIR: /Applications/GITHub/keras/course_fast_ai/lesson_1
DATA_DIR: /Applications/GITHub/keras/course_fast_ai/lesson_1/../data/movielens

One-time setup

Create the data directories

!mkdir -p $DATA_DIR
!mkdir -p $DATA_DIR/sample

Download the 100k sample data (from a GitHub mirror, since the main site is currently down)

!wget https://raw.githubusercontent.com/nchah/movielens-recommender/master/data/ml-movies.csv -O $DATA_DIR/sample/movies.csv
!wget https://raw.githubusercontent.com/nchah/movielens-recommender/master/data/ml-ratings-100k-sample.csv -O $DATA_DIR/sample/ratings.csv  
--2017-06-06 18:40:59--  https://raw.githubusercontent.com/nchah/movielens-recommender/master/data/ml-movies.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.112.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.112.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1397542 (1,3M) [text/plain]
Saving to: ‘/Applications/GITHub/keras/course_fast_ai/lesson_1/../data/movielens/sample/movies.csv’

/Applications/GITHu 100%[===================>]   1,33M  3,99MB/s    in 0,3s    

2017-06-06 18:41:01 (3,99 MB/s) - ‘/Applications/GITHub/keras/course_fast_ai/lesson_1/../data/movielens/sample/movies.csv’ saved [1397542/1397542]

--2017-06-06 18:41:01--  https://raw.githubusercontent.com/nchah/movielens-recommender/master/data/ml-ratings-100k-sample.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.112.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.112.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1215898 (1,2M) [text/plain]
Saving to: ‘/Applications/GITHub/keras/course_fast_ai/lesson_1/../data/movielens/sample/ratings.csv’

/Applications/GITHu 100%[===================>]   1,16M  2,60MB/s    in 0,4s    

2017-06-06 18:41:02 (2,60 MB/s) - ‘/Applications/GITHub/keras/course_fast_ai/lesson_1/../data/movielens/sample/ratings.csv’ saved [1215898/1215898]

Load the data and check that it works

DATA = "%s/sample" % DATA_DIR
ratings = pd.read_csv(DATA+"/ratings.csv")
ratings.head()
   userId  movieId  rating   timestamp
0       1       31     2.5  1260759144
1       1     1029     3.0  1260759179
2       1     1061     3.0  1260759182
3       1     1129     2.0  1260759185
4       1     1172     4.0  1260759205
len(ratings)
100004
movie_names = pd.read_csv(DATA+"/movies.csv")
movie_names.head()
   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy
movie_names.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9125 entries, 0 to 9124
Data columns (total 3 columns):
movieId    9125 non-null int64
title      9125 non-null object
genres     9125 non-null object
dtypes: int64(1), object(2)
memory usage: 213.9+ KB
movie_names.set_index('movieId')['title']
movieId
1                                          Toy Story (1995)
2                                            Jumanji (1995)
3                                   Grumpier Old Men (1995)
4                                  Waiting to Exhale (1995)
5                        Father of the Bride Part II (1995)
6                                               Heat (1995)
7                                            Sabrina (1995)
8                                       Tom and Huck (1995)
9                                       Sudden Death (1995)
10                                         GoldenEye (1995)
11                           American President, The (1995)
12                       Dracula: Dead and Loving It (1995)
13                                             Balto (1995)
14                                             Nixon (1995)
15                                  Cutthroat Island (1995)
16                                            Casino (1995)
17                             Sense and Sensibility (1995)
18                                        Four Rooms (1995)
19                    Ace Ventura: When Nature Calls (1995)
20                                       Money Train (1995)
21                                        Get Shorty (1995)
22                                           Copycat (1995)
23                                         Assassins (1995)
24                                            Powder (1995)
25                                 Leaving Las Vegas (1995)
26                                           Othello (1995)
27                                      Now and Then (1995)
28                                        Persuasion (1995)
29        City of Lost Children, The (Cité des enfants p...
30        Shanghai Triad (Yao a yao yao dao waipo qiao) ...
                                ...                        
159690    Teenage Mutant Ninja Turtles: Out of the Shado...
159755            Popstar: Never Stop Never Stopping (2016)
159858                               The Conjuring 2 (2016)
159972                       Approaching the Unknown (2016)
160080                                  Ghostbusters (2016)
160271                          Central Intelligence (2016)
160438                                  Jason Bourne (2016)
160440                               The Maid's Room (2014)
160563                          The Legend of Tarzan (2016)
160565                      The Purge: Election Year (2016)
160567                Mike & Dave Need Wedding Dates (2016)
160590                           Survive and Advance (2013)
160656                                      Tallulah (2016)
160718                                         Piper (2016)
160954                                         Nerve (2016)
161084                         My Friend Rockefeller (2015)
161155                                     Sunspring (2016)
161336                    Author: The JT LeRoy Story (2016)
161582                            Hell or High Water (2016)
161594                 Kingsglaive: Final Fantasy XV (2016)
161830                                          Body (2015)
161918                  Sharknado 4: The 4th Awakens (2016)
161944                The Last Brickmaker in America (2001)
162376                                      Stranger Things
162542                                        Rustom (2016)
162672                                  Mohenjo Daro (2016)
163056                                 Shin Godzilla (2016)
163949    The Beatles: Eight Days a Week - The Touring Y...
164977                             The Gay Desperado (1936)
164979                                Women of '69, Unboxed
Name: title, dtype: object
users = ratings.userId.unique()
movies = ratings.movieId.unique()
print(len(users))
print(len(movies))
671
9066
movieIdx = {idx: i for i, idx in enumerate(movies)}
userIdx = {idx: i for i, idx in enumerate(users)}
print(len(movieIdx))
assert len(movies) == len(movieIdx)

sorted_users = np.array([userIdx[user] for user in ratings.userId])
sorted_movies = np.array([movieIdx[movie] for movie in ratings.movieId])

assert len(sorted_movies) == len(sorted_users)

permutation = np.random.permutation(np.arange(0, len(sorted_users)))

train_user = sorted_users[permutation][:80000]
train_movies = sorted_movies[permutation][:80000]
train_ratings = ratings.rating[permutation][:80000]

val_user = sorted_users[permutation][80000:]
val_movies = sorted_movies[permutation][80000:]
val_ratings = ratings.rating[permutation][80000:]
9066
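The remapping above matters because an Embedding layer expects inputs in the range [0, input_dim), while the raw movieIds are sparse (they go up to ~164979 for only 9066 distinct movies). A toy illustration of the id-to-index dictionaries:

ids = [31, 1029, 1061]                        # raw, sparse movieIds
idx = {mid: i for i, mid in enumerate(ids)}   # {31: 0, 1029: 1, 1061: 2}
assert idx[1029] == 1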

Model 1: Neural Network

from keras.layers import Input, InputLayer, Dense, Embedding, Flatten
from keras.layers import merge
from keras.engine import Model
from keras.layers.core import Dropout
from keras.regularizers import l2

user =  Input(shape=(1,), name="in_user", dtype='int64')
emb_user = Embedding(input_length=1, input_dim=len(users), output_dim=50, W_regularizer=l2(0.0001))(user)

movie =  Input(shape=(1,), name="in_movie",  dtype='int64')
emb_movie = Embedding(input_length=1, input_dim=len(movies), output_dim=50, W_regularizer=l2(0.0001))(movie)

merged = merge([emb_user, emb_movie], mode='concat')
merged = Flatten()(merged)

lm = Dense(output_dim=100,activation="relu")(merged)
lm = Dropout(0.5)(lm)
lm = Dense(output_dim=100,activation="relu")(lm)
lm = Dropout(0.6)(lm)
lm = Dense(output_dim=50,activation="relu")(lm)
lm = Dropout(0.7)(lm)
lm = Dense(output_dim=1,activation="linear")(lm)
model = Model(input=[user, movie], output=lm, name="out_linear")
model.compile(optimizer="adam", loss='mse')
model.summary()
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
in_user (InputLayer)             (None, 1)             0                                            
____________________________________________________________________________________________________
in_movie (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
embedding_183 (Embedding)        (None, 1, 50)         33550       in_user[0][0]                    
____________________________________________________________________________________________________
embedding_184 (Embedding)        (None, 1, 50)         453300      in_movie[0][0]                   
____________________________________________________________________________________________________
merge_118 (Merge)                (None, 1, 100)        0           embedding_183[0][0]              
                                                                   embedding_184[0][0]              
____________________________________________________________________________________________________
flatten_158 (Flatten)            (None, 100)           0           merge_118[0][0]                  
____________________________________________________________________________________________________
dense_24 (Dense)                 (None, 100)           10100       flatten_158[0][0]                
____________________________________________________________________________________________________
dropout_4 (Dropout)              (None, 100)           0           dense_24[0][0]                   
____________________________________________________________________________________________________
dense_25 (Dense)                 (None, 100)           10100       dropout_4[0][0]                  
____________________________________________________________________________________________________
dropout_5 (Dropout)              (None, 100)           0           dense_25[0][0]                   
____________________________________________________________________________________________________
dense_26 (Dense)                 (None, 50)            5050        dropout_5[0][0]                  
____________________________________________________________________________________________________
dropout_6 (Dropout)              (None, 50)            0           dense_26[0][0]                   
____________________________________________________________________________________________________
dense_27 (Dense)                 (None, 1)             51          dropout_6[0][0]                  
====================================================================================================
Total params: 512,151
Trainable params: 512,151
Non-trainable params: 0
____________________________________________________________________________________________________

Model 2: Dot product model

The prediction is a dot product of the latent factors plus per-user and per-movie biases: rating ≈ user_vars · movie_vars + user_bias + movie_bias
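In plain numpy, the prediction for a single (user, movie) pair looks like this minimal sketch (u, m, bu, bm stand in for the learned factors and biases; the random values here are only for shape):

import numpy as np

u = np.random.rand(10)   # user latent factors (hidden_factors = 10 below)
m = np.random.rand(10)   # movie latent factors
bu, bm = 0.1, -0.2       # per-user and per-movie scalar biases
pred_rating = np.dot(u, m) + bu + bm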

user = Input(shape=(1,))
b_user = Embedding(input_length=1, input_dim=len(users), output_dim=1)(user)
b_user = Flatten()(b_user)
print(b_user.get_shape())

hidden_factors = 10

emb_user = Embedding(input_length=1, input_dim=len(users), output_dim=hidden_factors, W_regularizer=l2(0.000001))(user)
emb_user = Flatten()(emb_user)
print(emb_user.get_shape())

movie = Input(shape=(1,))
b_movie = Embedding(input_length=1, input_dim=len(movies), output_dim=1)(movie)
b_movie = Flatten()(b_movie)

emb_movie = Embedding(input_length=1, input_dim=len(movies), output_dim=hidden_factors, W_regularizer=l2(0.000001))(movie)
emb_movie = Flatten()(emb_movie)

merged = merge([emb_user, emb_movie], mode='dot')
# merged = Flatten()(merged)
merged = merge([merged, b_user], mode='sum')
merged = merge([merged, b_movie], mode='sum')

model = Model(input=[user, movie], output=merged)
model.summary()
model.compile(optimizer='adam', loss='mse')
from keras import backend as K
K.set_value(model.optimizer.lr, 0.001)  # note: plain `model.optimizer.lr = ...` would not update the compiled optimizer
(?, ?)
(?, ?)
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
input_93 (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
input_94 (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
embedding_190 (Embedding)        (None, 1, 10)         6710        input_93[0][0]                   
____________________________________________________________________________________________________
embedding_192 (Embedding)        (None, 1, 10)         90660       input_94[0][0]                   
____________________________________________________________________________________________________
flatten_164 (Flatten)            (None, 10)            0           embedding_190[0][0]              
____________________________________________________________________________________________________
flatten_166 (Flatten)            (None, 10)            0           embedding_192[0][0]              
____________________________________________________________________________________________________
embedding_189 (Embedding)        (None, 1, 1)          671         input_93[0][0]                   
____________________________________________________________________________________________________
merge_122 (Merge)                (None, 1)             0           flatten_164[0][0]                
                                                                   flatten_166[0][0]                
____________________________________________________________________________________________________
flatten_163 (Flatten)            (None, 1)             0           embedding_189[0][0]              
____________________________________________________________________________________________________
embedding_191 (Embedding)        (None, 1, 1)          9066        input_94[0][0]                   
____________________________________________________________________________________________________
merge_123 (Merge)                (None, 1)             0           merge_122[0][0]                  
                                                                   flatten_163[0][0]                
____________________________________________________________________________________________________
flatten_165 (Flatten)            (None, 1)             0           embedding_191[0][0]              
____________________________________________________________________________________________________
merge_124 (Merge)                (None, 1)             0           merge_123[0][0]                  
                                                                   flatten_165[0][0]                
====================================================================================================
Total params: 107,107
Trainable params: 107,107
Non-trainable params: 0
____________________________________________________________________________________________________

Model 3: Hybrid model

Add a model with two loss functions, one on the linear layer and the other on the dot product, both updating the same embeddings. The linear layer converges very fast but reaches a plateau, whereas the dot product seems to improve continuously, but really slowly.
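With loss_weights as in the compile call below, the combined objective is roughly the weighted sum of the two per-output MSE losses plus the L2 penalties. A minimal numpy sketch of that idea (my own illustration, not Keras internals):

import numpy as np

def hybrid_loss(y_true, y_lm, y_dot, w_lm=0.2, w_dot=1.0, reg=0.0):
    # weighted sum of the two per-output MSE losses, plus the L2 penalties
    mse = lambda a, b: np.mean((a - b) ** 2)
    return w_lm * mse(y_true, y_lm) + w_dot * mse(y_true, y_dot) + reg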

from keras.layers import Input, InputLayer, Dense, Embedding, Flatten
from keras.layers import merge
from keras.engine import Model
from keras.regularizers import l2
from keras.optimizers import Adam

user = Input(shape=(1,), dtype='int32')
emb_user = Embedding(input_dim=len(users), output_dim=50, input_length=1, W_regularizer=l2(0.0001))(user)
# emb_user = Flatten()(emb_user)
b_user = Embedding(input_dim=len(users), output_dim=1, input_length=1, init='zero')(user)
b_user = Flatten()(b_user)



movie = Input(shape=(1,), dtype='int32')
emb_movie = Embedding(input_dim=len(movies), output_dim=50, input_length=1, W_regularizer=l2(0.0001))(movie)
# emb_movie = Flatten()(emb_movie)
b_movie = Embedding(input_dim=len(movies), output_dim=1, input_length=1, init='zero')(movie)
b_movie = Flatten()(b_movie)

lm = merge([emb_user, emb_movie], mode='concat')
lm = Flatten()(lm)
lm = Dense(output_dim=1, activation='linear', name='output_lm')(lm)

merged = merge([emb_user, emb_movie], mode='dot')
merged = Flatten()(merged)
merged = merge([merged, b_user], mode='sum')
merged = merge([merged, b_movie], mode='sum', name='output_dot')

model = Model(input=[user, movie], output=[lm, merged])
model.compile(optimizer=Adam(0.001), loss={'output_lm': 'mse', 'output_dot': 'mse'}, loss_weights={'output_lm': 0.2, 'output_dot': 1.})
model.summary()

model.fit(x=[train_user, train_movies], y=[train_ratings, train_ratings], batch_size=32, nb_epoch=5, validation_data=([val_user, val_movies], [val_ratings, val_ratings]))
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
input_56 (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
input_57 (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
embedding_110 (Embedding)        (None, 1, 50)         33550       input_56[0][0]                   
____________________________________________________________________________________________________
embedding_112 (Embedding)        (None, 1, 50)         453300      input_57[0][0]                   
____________________________________________________________________________________________________
merge_70 (Merge)                 (None, 1, 1)          0           embedding_110[0][0]              
                                                                   embedding_112[0][0]              
____________________________________________________________________________________________________
embedding_111 (Embedding)        (None, 1, 1)          671         input_56[0][0]                   
____________________________________________________________________________________________________
merge_69 (Merge)                 (None, 1, 100)        0           embedding_110[0][0]              
                                                                   embedding_112[0][0]              
____________________________________________________________________________________________________
flatten_101 (Flatten)            (None, 1)             0           merge_70[0][0]                   
____________________________________________________________________________________________________
flatten_98 (Flatten)             (None, 1)             0           embedding_111[0][0]              
____________________________________________________________________________________________________
embedding_113 (Embedding)        (None, 1, 1)          9066        input_57[0][0]                   
____________________________________________________________________________________________________
flatten_100 (Flatten)            (None, 100)           0           merge_69[0][0]                   
____________________________________________________________________________________________________
merge_71 (Merge)                 (None, 1)             0           flatten_101[0][0]                
                                                                   flatten_98[0][0]                 
____________________________________________________________________________________________________
flatten_99 (Flatten)             (None, 1)             0           embedding_113[0][0]              
____________________________________________________________________________________________________
output_lm (Dense)                (None, 1)             101         flatten_100[0][0]                
____________________________________________________________________________________________________
output_dot (Merge)               (None, 1)             0           merge_71[0][0]                   
                                                                   flatten_99[0][0]                 
====================================================================================================
Total params: 496,688
Trainable params: 496,688
Non-trainable params: 0
____________________________________________________________________________________________________


/home/cristi/Envs/deep/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py:1960: UserWarning: Expected no kwargs, you passed 1
kwargs passed to function are ignored with Tensorflow backend
  warnings.warn('\n'.join(msg))

(This warning appears when compile is given a misspelled kwarg; the original run passed loss_wights, which Keras silently ignored, so both outputs were in fact weighted equally.)


Train on 80000 samples, validate on 20004 samples
Epoch 1/5
80000/80000 [==============================] - 49s - loss: 10.1750 - output_lm_loss: 1.8258 - output_dot_loss: 8.2223 - val_loss: 4.8403 - val_output_lm_loss: 1.1173 - val_output_dot_loss: 3.3941
Epoch 2/5
80000/80000 [==============================] - 49s - loss: 3.2804 - output_lm_loss: 0.7973 - output_dot_loss: 1.9949 - val_loss: 3.6001 - val_output_lm_loss: 1.1193 - val_output_dot_loss: 1.8935
Epoch 3/5
80000/80000 [==============================] - 49s - loss: 2.6331 - output_lm_loss: 0.7730 - output_dot_loss: 1.2329 - val_loss: 3.3541 - val_output_lm_loss: 1.1136 - val_output_dot_loss: 1.5954
Epoch 4/5
80000/80000 [==============================] - 49s - loss: 2.4451 - output_lm_loss: 0.7657 - output_dot_loss: 1.0366 - val_loss: 3.2823 - val_output_lm_loss: 1.1273 - val_output_dot_loss: 1.5230
Epoch 5/5
80000/80000 [==============================] - 48s - loss: 2.3190 - output_lm_loss: 0.7592 - output_dot_loss: 0.9438 - val_loss: 3.1707 - val_output_lm_loss: 1.1157 - val_output_dot_loss: 1.4603





<keras.callbacks.History at 0x7f2dfc6094d0>

Model 4: Pretraining model

Train the embeddings using the first approach, then transfer them to the second model and train again.
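The trick that makes the transfer work is that two Models built from the same layer instances share weights, so training one updates the other. A minimal sketch of the idea (toy sizes, hypothetical names):

from keras.layers import Input, Dense, Embedding, Flatten
from keras.engine import Model

inp = Input(shape=(1,), dtype='int32')
emb = Flatten()(Embedding(input_dim=10, output_dim=4, input_length=1)(inp))
model_a = Model(input=inp, output=Dense(output_dim=1)(emb))  # pretraining head
model_b = Model(input=inp, output=emb)                       # reuses the same (trained) embedding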

from keras.layers import Input, InputLayer, Dense, Embedding, Flatten
from keras.layers import merge
from keras.engine import Model
from keras.regularizers import l2
from keras.optimizers import Adam

user = Input(shape=(1,), dtype='int32')
emb_user = Embedding(input_dim=len(users), output_dim=50, input_length=1, W_regularizer=l2(0.0001))(user)
# emb_user = Flatten()(emb_user)
b_user = Embedding(input_dim=len(users), output_dim=1, input_length=1, init='zero')(user)
b_user = Flatten()(b_user)



movie = Input(shape=(1,), dtype='int32')
emb_movie = Embedding(input_dim=len(movies), output_dim=50, input_length=1, W_regularizer=l2(0.0001))(movie)
# emb_movie = Flatten()(emb_movie)
b_movie = Embedding(input_dim=len(movies), output_dim=1, input_length=1, init='zero')(movie)
b_movie = Flatten()(b_movie)

lm = merge([emb_user, emb_movie], mode='concat')
lm = Flatten()(lm)
lm = Dense(output_dim=100, activation='relu')(lm)
lm = Dense(output_dim=1, activation='linear', name='output_lm')(lm)
lm_model = Model(input=[user, movie], output=lm)
lm_model.compile(optimizer=Adam(0.001), loss='mse')
lm_model.summary()
lm_model.fit(x=[train_user, train_movies], y=train_ratings, batch_size=64, nb_epoch=1, validation_data=([val_user, val_movies], val_ratings))


merged = merge([emb_user, emb_movie], mode='dot')
merged = Flatten()(merged)
merged = merge([merged, b_user], mode='sum')
merged = merge([merged, b_movie], mode='sum', name='output_dot')

merge_model = Model(input=[user, movie], output=merged)
merge_model.compile(optimizer=Adam(0.001), loss='mse')
merge_model.summary()

merge_model.fit(x=[train_user, train_movies], y=train_ratings, batch_size=64, nb_epoch=5, validation_data=([val_user, val_movies], val_ratings))
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
input_64 (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
input_65 (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
embedding_126 (Embedding)        (None, 1, 50)         33550       input_64[0][0]                   
____________________________________________________________________________________________________
embedding_128 (Embedding)        (None, 1, 50)         453300      input_65[0][0]                   
____________________________________________________________________________________________________
merge_77 (Merge)                 (None, 1, 100)        0           embedding_126[0][0]              
                                                                   embedding_128[0][0]              
____________________________________________________________________________________________________
flatten_114 (Flatten)            (None, 100)           0           merge_77[0][0]                   
____________________________________________________________________________________________________
dense_15 (Dense)                 (None, 100)           10100       flatten_114[0][0]                
____________________________________________________________________________________________________
output_lm (Dense)                (None, 1)             101         dense_15[0][0]                   
====================================================================================================
Total params: 497,051
Trainable params: 497,051
Non-trainable params: 0
____________________________________________________________________________________________________
Train on 80000 samples, validate on 20004 samples
Epoch 1/1
80000/80000 [==============================] - 26s - loss: 1.4305 - val_loss: 1.0945
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
input_64 (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
input_65 (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
embedding_126 (Embedding)        (None, 1, 50)         33550       input_64[0][0]                   
____________________________________________________________________________________________________
embedding_128 (Embedding)        (None, 1, 50)         453300      input_65[0][0]                   
____________________________________________________________________________________________________
merge_78 (Merge)                 (None, 1, 1)          0           embedding_126[0][0]              
                                                                   embedding_128[0][0]              
____________________________________________________________________________________________________
embedding_127 (Embedding)        (None, 1, 1)          671         input_64[0][0]                   
____________________________________________________________________________________________________
flatten_115 (Flatten)            (None, 1)             0           merge_78[0][0]                   
____________________________________________________________________________________________________
flatten_112 (Flatten)            (None, 1)             0           embedding_127[0][0]              
____________________________________________________________________________________________________
embedding_129 (Embedding)        (None, 1, 1)          9066        input_65[0][0]                   
____________________________________________________________________________________________________
merge_79 (Merge)                 (None, 1)             0           flatten_115[0][0]                
                                                                   flatten_112[0][0]                
____________________________________________________________________________________________________
flatten_113 (Flatten)            (None, 1)             0           embedding_129[0][0]              
____________________________________________________________________________________________________
output_dot (Merge)               (None, 1)             0           merge_79[0][0]                   
                                                                   flatten_113[0][0]                
====================================================================================================
Total params: 496,587
Trainable params: 496,587
Non-trainable params: 0
____________________________________________________________________________________________________
Train on 80000 samples, validate on 20004 samples
Epoch 1/5
80000/80000 [==============================] - 25s - loss: 7.6557 - val_loss: 3.4311
Epoch 2/5
80000/80000 [==============================] - 24s - loss: 2.4828 - val_loss: 2.5332
Epoch 3/5
80000/80000 [==============================] - 24s - loss: 1.9854 - val_loss: 2.3767
Epoch 4/5
80000/80000 [==============================] - 25s - loss: 1.8386 - val_loss: 2.3174
Epoch 5/5
80000/80000 [==============================] - 25s - loss: 1.7504 - val_loss: 2.2697





<keras.callbacks.History at 0x7f2de621b090>

Model 5: Pretrain + Dot Product

Pretrain using the hybrid model, then reuse the embeddings in the single dot product model and continue training.

from keras.layers import Input, InputLayer, Dense, Embedding, Flatten
from keras.layers import merge
from keras.engine import Model
from keras.regularizers import l2
from keras.optimizers import Adam

user = Input(shape=(1,), dtype='int32')
emb_user = Embedding(input_dim=len(users), output_dim=50, input_length=1, W_regularizer=l2(0.0001))(user)
# emb_user = Flatten()(emb_user)
b_user = Embedding(input_dim=len(users), output_dim=1, input_length=1, init='zero')(user)
b_user = Flatten()(b_user)



movie = Input(shape=(1,), dtype='int32')
emb_movie = Embedding(input_dim=len(movies), output_dim=50, input_length=1, W_regularizer=l2(0.0001))(movie)
# emb_movie = Flatten()(emb_movie)
b_movie = Embedding(input_dim=len(movies), output_dim=1, input_length=1, init='zero')(movie)
b_movie = Flatten()(b_movie)

lm = merge([emb_user, emb_movie], mode='concat')
lm = Flatten()(lm)
lm = Dense(output_dim=1, activation='linear', name='output_lm')(lm)

merged = merge([emb_user, emb_movie], mode='dot')
merged = Flatten()(merged)
merged = merge([merged, b_user], mode='sum')
merged = merge([merged, b_movie], mode='sum', name='output_dot')

model = Model(input=[user, movie], output=[lm, merged])
model.compile(optimizer=Adam(0.001), loss={'output_lm': 'mse', 'output_dot': 'mse'}, loss_weights={'output_lm': 0.2, 'output_dot': 1.})
model.summary()
model.fit(x=[train_user, train_movies], y=[train_ratings, train_ratings], batch_size=32, nb_epoch=15, validation_data=([val_user, val_movies], [val_ratings, val_ratings]))


merge_model = Model(input=[user, movie], output=merged)
merge_model.compile(optimizer=Adam(0.001), loss='mse')
merge_model.summary()
merge_model.fit(x=[train_user, train_movies], y=train_ratings, batch_size=64, nb_epoch=5, validation_data=([val_user, val_movies], val_ratings))
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
input_89 (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
input_90 (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
embedding_177 (Embedding)        (None, 1, 50)         33550       input_89[0][0]                   
____________________________________________________________________________________________________
embedding_179 (Embedding)        (None, 1, 50)         453300      input_90[0][0]                   
____________________________________________________________________________________________________
merge_115 (Merge)                (None, 1, 1)          0           embedding_177[0][0]              
                                                                   embedding_179[0][0]              
____________________________________________________________________________________________________
embedding_178 (Embedding)        (None, 1, 1)          671         input_89[0][0]                   
____________________________________________________________________________________________________
merge_114 (Merge)                (None, 1, 100)        0           embedding_177[0][0]              
                                                                   embedding_179[0][0]              
____________________________________________________________________________________________________
flatten_156 (Flatten)            (None, 1)             0           merge_115[0][0]                  
____________________________________________________________________________________________________
flatten_153 (Flatten)            (None, 1)             0           embedding_178[0][0]              
____________________________________________________________________________________________________
embedding_180 (Embedding)        (None, 1, 1)          9066        input_90[0][0]                   
____________________________________________________________________________________________________
flatten_155 (Flatten)            (None, 100)           0           merge_114[0][0]                  
____________________________________________________________________________________________________
merge_116 (Merge)                (None, 1)             0           flatten_156[0][0]                
                                                                   flatten_153[0][0]                
____________________________________________________________________________________________________
flatten_154 (Flatten)            (None, 1)             0           embedding_180[0][0]              
____________________________________________________________________________________________________
output_lm (Dense)                (None, 1)             101         flatten_155[0][0]                
____________________________________________________________________________________________________
output_dot (Merge)               (None, 1)             0           merge_116[0][0]                  
                                                                   flatten_154[0][0]                
====================================================================================================
Total params: 496,688
Trainable params: 496,688
Non-trainable params: 0
____________________________________________________________________________________________________
Train on 80000 samples, validate on 20004 samples
Epoch 1/15
80000/80000 [==============================] - 53s - loss: 10.5632 - output_lm_loss: 1.8326 - output_dot_loss: 8.6220 - val_loss: 5.2309 - val_output_lm_loss: 1.1616 - val_output_dot_loss: 3.7865
Epoch 2/15
80000/80000 [==============================] - 49s - loss: 3.4042 - output_lm_loss: 0.7963 - output_dot_loss: 2.1609 - val_loss: 3.6635 - val_output_lm_loss: 1.1338 - val_output_dot_loss: 1.9798
Epoch 3/15
80000/80000 [==============================] - 47s - loss: 2.6309 - output_lm_loss: 0.7709 - output_dot_loss: 1.2647 - val_loss: 3.3833 - val_output_lm_loss: 1.1361 - val_output_dot_loss: 1.6275
Epoch 4/15
80000/80000 [==============================] - 48s - loss: 2.4169 - output_lm_loss: 0.7618 - output_dot_loss: 1.0346 - val_loss: 3.2555 - val_output_lm_loss: 1.1396 - val_output_dot_loss: 1.5022
Epoch 5/15
80000/80000 [==============================] - 48s - loss: 2.2907 - output_lm_loss: 0.7572 - output_dot_loss: 0.9339 - val_loss: 3.1425 - val_output_lm_loss: 1.1154 - val_output_dot_loss: 1.4481
Epoch 6/15
80000/80000 [==============================] - 49s - loss: 2.1928 - output_lm_loss: 0.7533 - output_dot_loss: 0.8766 - val_loss: 3.1198 - val_output_lm_loss: 1.1328 - val_output_dot_loss: 1.4446
Epoch 7/15
80000/80000 [==============================] - 51s - loss: 2.1098 - output_lm_loss: 0.7517 - output_dot_loss: 0.8334 - val_loss: 3.0324 - val_output_lm_loss: 1.1188 - val_output_dot_loss: 1.4111
Epoch 8/15
80000/80000 [==============================] - 59s - loss: 2.0325 - output_lm_loss: 0.7491 - output_dot_loss: 0.7978 - val_loss: 2.9974 - val_output_lm_loss: 1.1317 - val_output_dot_loss: 1.3993
Epoch 9/15
80000/80000 [==============================] - 52s - loss: 1.9567 - output_lm_loss: 0.7449 - output_dot_loss: 0.7607 - val_loss: 2.9526 - val_output_lm_loss: 1.1317 - val_output_dot_loss: 1.3879
Epoch 10/15
80000/80000 [==============================] - 55s - loss: 1.8840 - output_lm_loss: 0.7428 - output_dot_loss: 0.7200 - val_loss: 2.9329 - val_output_lm_loss: 1.1527 - val_output_dot_loss: 1.3761
Epoch 11/15
80000/80000 [==============================] - 52s - loss: 1.8170 - output_lm_loss: 0.7416 - output_dot_loss: 0.6819 - val_loss: 2.8622 - val_output_lm_loss: 1.1341 - val_output_dot_loss: 1.3486
Epoch 12/15
80000/80000 [==============================] - 55s - loss: 1.7549 - output_lm_loss: 0.7420 - output_dot_loss: 0.6414 - val_loss: 2.8193 - val_output_lm_loss: 1.1366 - val_output_dot_loss: 1.3255
Epoch 13/15
80000/80000 [==============================] - 53s - loss: 1.6970 - output_lm_loss: 0.7405 - output_dot_loss: 0.6059 - val_loss: 2.8243 - val_output_lm_loss: 1.1454 - val_output_dot_loss: 1.3407
Epoch 14/15
80000/80000 [==============================] - 59s - loss: 1.6442 - output_lm_loss: 0.7392 - output_dot_loss: 0.5735 - val_loss: 2.7540 - val_output_lm_loss: 1.1339 - val_output_dot_loss: 1.2990
Epoch 15/15
80000/80000 [==============================] - 65s - loss: 1.5961 - output_lm_loss: 0.7373 - output_dot_loss: 0.5434 - val_loss: 2.7973 - val_output_lm_loss: 1.1443 - val_output_dot_loss: 1.3481
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
input_89 (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
input_90 (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
embedding_177 (Embedding)        (None, 1, 50)         33550       input_89[0][0]                   
____________________________________________________________________________________________________
embedding_179 (Embedding)        (None, 1, 50)         453300      input_90[0][0]                   
____________________________________________________________________________________________________
merge_115 (Merge)                (None, 1, 1)          0           embedding_177[0][0]              
                                                                   embedding_179[0][0]              
____________________________________________________________________________________________________
embedding_178 (Embedding)        (None, 1, 1)          671         input_89[0][0]                   
____________________________________________________________________________________________________
flatten_156 (Flatten)            (None, 1)             0           merge_115[0][0]                  
____________________________________________________________________________________________________
flatten_153 (Flatten)            (None, 1)             0           embedding_178[0][0]              
____________________________________________________________________________________________________
embedding_180 (Embedding)        (None, 1, 1)          9066        input_90[0][0]                   
____________________________________________________________________________________________________
merge_116 (Merge)                (None, 1)             0           flatten_156[0][0]                
                                                                   flatten_153[0][0]                
____________________________________________________________________________________________________
flatten_154 (Flatten)            (None, 1)             0           embedding_180[0][0]              
____________________________________________________________________________________________________
output_dot (Merge)               (None, 1)             0           merge_116[0][0]                  
                                                                   flatten_154[0][0]                
====================================================================================================
Total params: 496,587
Trainable params: 496,587
Non-trainable params: 0
____________________________________________________________________________________________________
Train on 80000 samples, validate on 20004 samples
Epoch 1/5
80000/80000 [==============================] - 32s - loss: 0.8495 - val_loss: 1.5022
Epoch 2/5
80000/80000 [==============================] - 31s - loss: 0.7631 - val_loss: 1.5018
Epoch 3/5
80000/80000 [==============================] - 30s - loss: 0.7164 - val_loss: 1.4875
Epoch 4/5
80000/80000 [==============================] - 29s - loss: 0.6847 - val_loss: 1.4869
Epoch 5/5
80000/80000 [==============================] - 30s - loss: 0.6592 - val_loss: 1.4939





<keras.callbacks.History at 0x7f2dde263990>
model.optimizer.lr=0.0001
model.fit(x=[train_user, train_movies], y=[train_ratings, train_ratings], batch_size=32, nb_epoch=10, validation_data=([val_user, val_movies], [val_ratings, val_ratings]))
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-59-f43558884ee4> in <module>()
      1 model.optimizer.lr=0.0001
----> 2 model.fit(x=[train_user, train_movies], y=[train_ratings, train_ratings], batch_size=32, nb_epoch=10, validation_data=([val_user, val_movies], [val_ratings, val_ratings]))


/home/cristi/Envs/deep/local/lib/python2.7/site-packages/keras/engine/training.pyc in fit(self, x, y, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch)
   1114             class_weight=class_weight,
   1115             check_batch_axis=False,
-> 1116             batch_size=batch_size)
   1117         # prepare validation data
   1118         if validation_data:


/home/cristi/Envs/deep/local/lib/python2.7/site-packages/keras/engine/training.pyc in _standardize_user_data(self, x, y, sample_weight, class_weight, check_batch_axis, batch_size)
   1031                                    output_shapes,
   1032                                    check_batch_axis=False,
-> 1033                                    exception_prefix='model target')
   1034         sample_weights = standardize_sample_weights(sample_weight,
   1035                                                     self.output_names)


/home/cristi/Envs/deep/local/lib/python2.7/site-packages/keras/engine/training.pyc in standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
     63                                  'the following list of ' + str(len(data)) +
     64                                  ' arrays: ' + str(data)[:200] +
---> 65                                  '...')
     66             else:
     67                 if len(names) == 1:


ValueError: Error when checking model target: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 1 arrays but instead got the following list of 2 arrays: [44935    5.0
44316    5.0
54700    4.0
20183    4.0
50521    5.0
52132    2.5
59109    1.5
86141    4.0
76238    4.5
38188    5.0
27651    4.5
92426    2.5
23686    4.0
94417    4.0
81913    5.0
3222...

(The error means the model object being fit had a single output at that point, so Keras expected one target array rather than a list of two.)

Training sequence

from keras import backend as K

K.set_value(model.optimizer.lr, 0.001)  # plain `model.optimizer.lr = ...` would not change the compiled lr
model.fit(x=[train_user, train_movies], y=train_ratings,
          validation_data=([val_user, val_movies], val_ratings), nb_epoch=3, batch_size=64)

K.set_value(model.optimizer.lr, 0.01)
model.fit(x=[train_user, train_movies], y=train_ratings,
          validation_data=([val_user, val_movies], val_ratings), nb_epoch=6, batch_size=64)

K.set_value(model.optimizer.lr, 0.001)
model.fit(x=[train_user, train_movies], y=train_ratings,
          validation_data=([val_user, val_movies], val_ratings), nb_epoch=10, batch_size=64)

K.set_value(model.optimizer.lr, 0.0001)
model.fit(x=[train_user, train_movies], y=train_ratings,
          validation_data=([val_user, val_movies], val_ratings), nb_epoch=15, batch_size=64)
Train on 80000 samples, validate on 20004 samples
Epoch 1/3
80000/80000 [==============================] - 6s - loss: 0.7197 - val_loss: 1.2516
Epoch 2/3
80000/80000 [==============================] - 6s - loss: 0.7009 - val_loss: 1.2413
Epoch 3/3
80000/80000 [==============================] - 6s - loss: 0.6837 - val_loss: 1.2535





<keras.callbacks.History at 0x7f2dd6419b50>

Inspect the bias variable of movies.

The bias term, once trained, contains a normalized score that reflects how good a movie is according to the predictions of all the users. Even though each user has their own preferences and each movie has its own characteristics, the trained bias is the best single value, independent of all the latent factors that help the model fit the data (ratings).

As such, the bias term of movies is a natural (unbiased :) ) way of scoring their value.

movie_score = Model(input=movie, output=b_movie)

Note: if you build a Keras model only to reuse weights and make predictions (no training), you are not required to call model.compile()
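Alternatively, the trained biases can be read straight from the layer weights. A sketch, under the assumption that you kept a reference to the bias Embedding layer (called bias_layer here, which the code above does not actually keep):

# hypothetical: `bias_layer` is the Embedding instance that produced b_movie above
movie_bias = bias_layer.get_weights()[0].ravel()   # shape: (len(movies),)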

Predict the bias values of all the movies and sort them. List the 10 most liked movies and the 10 least liked ones.

def title_of(movieId):
    return movie_names.where(movie_names.movieId == movieId).dropna().title.values[0]

scores = movie_score.predict(x=np.arange(len(movies))).ravel()
scores_and_titles = np.array([
        (score, title_of(movies[i])) 
        for i, score in enumerate(scores)
    ])

top_args = np.argsort(scores)[::-1]
hiscores = scores_and_titles[top_args]
hiscores[:15], hiscores[-15:]
# top_args[:10]
# movies[top_args[:10]]
# movie_names[movies[top_args[:10]]]
# movie_names.title.where(movie_names['movieId'] == 2791)
# movie_names.movieId[2791]
# movie_names.where(movie_names.movieId.isin(movies[top_args[:10]])).dropna()
# movie_names.values[2235]
# movie_names.movieId.where(movie_names['movieId'] == 2791)
(array([['1.67707', 'Gladiator (1992)'],
        ['1.58143', 'Drunken Master (Jui kuen) (1978)'],
        ['1.57151', 'General, The (1926)'],
        ['1.54792', 'Letters from Iwo Jima (2006)'],
        ['1.53384', 'Mister Roberts (1955)'],
        ['1.50553', 'Shawshank Redemption, The (1994)'],
        ['1.5017', 'Godfather, The (1972)'],
        ['1.49125', "Howl's Moving Castle (Hauru no ugoku shiro) (2004)"],
        ['1.46924', 'My Neighbor Totoro (Tonari no Totoro) (1988)'],
        ['1.46283', "Amores Perros (Love's a Bitch) (2000)"],
        ['1.4622', 'It Happened One Night (1934)'],
        ['1.45984', 'Smoke (1995)'],
        ['1.45612', 'Rush (2013)'],
        ['1.45528', 'Whiplash (2014)'],
        ['1.45415', 'Band of Brothers (2001)']], 
       dtype='|S159'), array([['-0.637723', 'Year One (2009)'],
        ['-0.656114', 'In the Cut (2003)'],
        ['-0.662641', "Barney's Great Adventure (1998)"],
        ['-0.670775', 'Meet the Spartans (2008)'],
        ['-0.684494', 'Turbo: A Power Rangers Movie (1997)'],
        ['-0.685414', 'Human Centipede, The (First Sequence) (2009)'],
        ['-0.692244', 'Rage: Carrie 2, The (1999)'],
        ['-0.716578', 'Ernest Goes to Camp (1987)'],
        ['-0.728922', 'Disaster Movie (2008)'],
        ['-0.745111', 'Wicker Man, The (2006)'],
        ['-0.749972', 'Driven (2001)'],
        ['-0.764651', 'Carnosaur 3: Primal Species (1996)'],
        ['-0.820302', 'Spy Kids 3-D: Game Over (2003)'],
        ['-0.834345', 'Beverly Hills Chihuahua (2008)'],
        ['-0.840344', 'Vampires Suck (2010)']], 
       dtype='|S159'))
hiscores[:10], hiscores[-10:]
(array([['1.67707', 'Gladiator (1992)'],
        ['1.58143', 'Drunken Master (Jui kuen) (1978)'],
        ['1.57151', 'General, The (1926)'],
        ['1.54792', 'Letters from Iwo Jima (2006)'],
        ['1.53384', 'Mister Roberts (1955)'],
        ['1.50553', 'Shawshank Redemption, The (1994)'],
        ['1.5017', 'Godfather, The (1972)'],
        ['1.49125', "Howl's Moving Castle (Hauru no ugoku shiro) (2004)"],
        ['1.46924', 'My Neighbor Totoro (Tonari no Totoro) (1988)'],
        ['1.46283', "Amores Perros (Love's a Bitch) (2000)"]], 
       dtype='|S159'),
 array([['-0.685414', 'Human Centipede, The (First Sequence) (2009)'],
        ['-0.692244', 'Rage: Carrie 2, The (1999)'],
        ['-0.716578', 'Ernest Goes to Camp (1987)'],
        ['-0.728922', 'Disaster Movie (2008)'],
        ['-0.745111', 'Wicker Man, The (2006)'],
        ['-0.749972', 'Driven (2001)'],
        ['-0.764651', 'Carnosaur 3: Primal Species (1996)'],
        ['-0.820302', 'Spy Kids 3-D: Game Over (2003)'],
        ['-0.834345', 'Beverly Hills Chihuahua (2008)'],
        ['-0.840344', 'Vampires Suck (2010)']], 
       dtype='|S159'))

Predict only for the 2000 most rated movies (presumably the best known and most watched ones)

g = ratings.groupby('movieId')['rating'].count()
top_movies = g.sort_values(ascending=False)[:2000]
top_movies = top_movies.index.values

from operator import itemgetter

top_movieIdxs = [movieIdx[top] for top in top_movies]
scores = movie_score.predict(top_movieIdxs).ravel()
movie_scores = [(score, movie_names.where(movie_names.movieId == movieId).dropna().title.values[0]) for movieId, score in zip(top_movies, scores)]
print(sorted(movie_scores, key=itemgetter(0))[-10:][::-1])

top_args = np.argsort(scores)
top_movies[top_args[:10]]  # scores align with top_movies, so index that (not movies)
# movie_names.where(movie_names.movieId.isin(movies[top_args[-10:]])).dropna()
[(1.6770685, 'Gladiator (1992)'), (1.5715058, 'General, The (1926)'), (1.5338421, 'Mister Roberts (1955)'), (1.5055339, 'Shawshank Redemption, The (1994)'), (1.5016968, 'Godfather, The (1972)'), (1.4912499, "Howl's Moving Castle (Hauru no ugoku shiro) (2004)"), (1.4692358, 'My Neighbor Totoro (Tonari no Totoro) (1988)'), (1.462826, "Amores Perros (Love's a Bitch) (2000)"), (1.4622027, 'It Happened One Night (1934)'), (1.4598429, 'Smoke (1995)')]





array([  6534,  56775,  38061, 127136, 122924, 161155,   8366,   3993,   6870, 142507])

Dimensionality reduction of the hidden embeddings

embedding_model = Model(input=movie, output=emb_movie)
embeddings = embedding_model.predict(x=np.arange(len(movies)))
embeddings.shape
(9066, 10)
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
movie_pca = pca.fit(embeddings.T).components_
movie_pca.shape
(3, 9066)
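Note that pca.fit(embeddings.T) treats each movie as a feature, so components_ has shape (3, 9066): one score per movie along each of the three principal directions. A sketch of the more conventional orientation (an alternative I'm assuming, not what ran above; scale and centering differ, but it also yields one 3-D coordinate per movie):

movie_coords = PCA(n_components=3).fit_transform(embeddings)   # shape: (9066, 3)
first_projection = movie_coords[:, 0]                          # roughly analogous to movie_pca[0]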

Projection 1: Try to sort the movies according to this projection

projection_1 = [(score, title_of(movieId=movies[i])) for i, score in enumerate(movie_pca[0])]
sorted(projection_1, key=itemgetter(0))[-10:][::-1]
[(0.018930364238552547, 'Civil War, The (1990)'),
 (0.018886939556339865, 'It Happened One Night (1934)'),
 (0.018756959350428728,
  "Hachiko: A Dog's Story (a.k.a. Hachi: A Dog's Tale) (2009)"),
 (0.018610435255534269, 'Creed (2015)'),
 (0.018590744575493152, 'Grand Illusion (La grande illusion) (1937)'),
 (0.018516467860382291,
  'Dear Zachary: A Letter to a Son About His Father (2008)'),
 (0.018501210982614757, 'In a Lonely Place (1950)'),
 (0.018492615745538662, 'White Balloon, The (Badkonake sefid) (1995)'),
 (0.018395650579599441, '3:10 to Yuma (1957)'),
 (0.01832466861929118, 'Grey Gardens (1975)')]
sorted(projection_1, key=itemgetter(0))[:10]
[(-0.0091007586246863656, 'One from the Heart (1982)'),
 (-0.0090915923102729401, 'The Boyfriend School (1990)'),
 (-0.0089504526829059673, 'Manos: The Hands of Fate (1966)'),
 (-0.0089210133621088622, 'Principal, The (1987)'),
 (-0.0088867791919859038, 'Rumble Fish (1983)'),
 (-0.0088840308811219508, 'Shoot the Moon (1982)'),
 (-0.0082169101866127869, 'Angel at My Table, An (1990)'),
 (-0.0079492079413028029, 'Karla (2006)'),
 (-0.007823210683154537, 'Agony and the Ecstasy of Phil Spector, The (2009)'),
 (-0.0077007008518970932, 'Poltergeist (2015)')]

Projection 2: Try to sort the movies according to this projection. Fantasy?

projection_2 = [(score, title_of(movieId=movies[i])) for i, score in enumerate(movie_pca[1])]
sorted(projection_2, key=itemgetter(0))[-10:][::-1]
[(0.094698585119647613, 'Outbreak (1995)'),
 (0.092277756768438829, 'Ghost (1990)'),
 (0.090228136891318478, 'Drop Zone (1994)'),
 (0.089363800709257701, 'Speed (1994)'),
 (0.089350926405966286, 'Waterworld (1995)'),
 (0.087202608833448186, 'Batman Forever (1995)'),
 (0.084968415640497688, 'Star Trek: Generations (1994)'),
 (0.079538352869493109, 'True Lies (1994)'),
 (0.079431014576049161, 'Firm, The (1993)'),
 (0.079377529877693809, 'GoldenEye (1995)')]
sorted(projection_2, key=itemgetter(0))[:10]
[(-0.031892693513855687, 'Prophecy, The (1995)'),
 (-0.030828004336102745, 'Repulsion (1965)'),
 (-0.028312860188068847, 'Freaks (1932)'),
 (-0.028308622179453437, 'Elephant Man, The (1980)'),
 (-0.027531771946190785, 'Chinatown (1974)'),
 (-0.02733397919614871, 'Manhattan (1979)'),
 (-0.026747841908554844, 'Gold Rush, The (1925)'),
 (-0.026529899866533504, 'Ghost Dog: The Way of the Samurai (1999)'),
 (-0.02639925626646563, 'Texas Chainsaw Massacre, The (1974)'),
 (-0.025022033177043369, 'Station Agent, The (2003)')]

Projection 3: Try to sort the movies according to this projection

projection_3 = [(score, title_of(movieId=movies[i])) for i, score in enumerate(movie_pca[2])]
import sys
stdout, stderr = sys.stdout, sys.stderr # save notebook stdout and stderr
reload(sys)
sys.setdefaultencoding('utf-8')
sys.stdout, sys.stderr = stdout, stderr # restore notebook stdout and stderr
sorted(projection_3, key=itemgetter(0))[-10:][::-1]
[(0.048198463587742699,
  'Tie Me Up! Tie Me Down! (\xc2\xa1\xc3\x81tame!) (1990)'),
 (0.043658370087475369, 'Everything Must Go (2010)'),
 (0.042698816979869318, 'Grown Ups (2010)'),
 (0.042541031844921903, 'Last Days of Disco, The (1998)'),
 (0.040012181056818828, 'Audrey Rose (1977)'),
 (0.039755213568088324, "Look Who's Talking Too (1990)"),
 (0.038648165158394754, 'Career Girls (1997)'),
 (0.038462203128774734, 'Baby Boom (1987)'),
 (0.037932740247598851, 'Confessions of a Shopaholic (2009)'),
 (0.037804393688104697, 'Go Fish (1994)')]
sorted(projection_3, key=itemgetter(0))[:10]
[(-0.056768368009184125, 'Fugitive, The (1993)'),
 (-0.055556016420371439, 'Broken Arrow (1996)'),
 (-0.052805402857436859, "Mr. Holland's Opus (1995)"),
 (-0.050968656000512889, 'Eraser (1996)'),
 (-0.047842044275062792, 'Casper (1995)'),
 (-0.047675176568388551, 'Cliffhanger (1993)'),
 (-0.046578109319507746, 'Waterworld (1995)'),
 (-0.046156592898948953, 'Drop Zone (1994)'),
 (-0.044805085186528322, 'Under Siege 2: Dark Territory (1995)'),
 (-0.043827595810730889, 'White Heat (1949)')]

Plot the first 20 movies on the second projection versus the third

movies_indexed_by_movieId = movie_names.set_index('movieId')['title'].to_dict()
start=0; end=20
X = movie_pca[1][start:end]
Y = movie_pca[2][start:end]
plt.figure(figsize=(15,15))
plt.scatter(X, Y)
for i, x, y in zip(range(start,end), X, Y):
    plt.text(x,y,movies_indexed_by_movieId[movies[i]], color=np.random.rand(3)*0.7, fontsize=14)
plt.show()

(Figure: scatter plot of the first 20 movies, labeled by title, on the second and third principal components.)
