PCI – Chapter Two – Item Based Filtering (MovieLens dataset)

With our recently generated 45 movie set, we will pick three random movies.

Using these as our imaginary baseline (i.e. – these are the three movies the user states he likes and has rated or the first three movies he user clicked and ranked on a website. I am using myself (my ratings) as source of movie ratings in exercise. We can build our recommendation system as described here.

First, lets retrieve three random movies.

SELECT a.CurrentItemId, b.title, c.rating
FROM pearsonsimmovies a INNER JOIN item b ON a.CurrentItemId = b.movieId
INNER JOIN data c ON a.CurrentItemId = c.itemId AND c.userId = 944
WHERE a.PearsonSim <> 1
GROUP BY a.CurrentItemId
ORDER BY rand() LIMIT 3

Random result-set:

CurrentItemId     title                       rating
----------------  --------------------------  ---------
11                Seven (Se7en) (1995)        2
343               Alien: Resurrection (1997)  4
25                Birdcage, The (1996)        4

Now, as described in the book in page 24, Lets generate a table based on these three movies. Also, lets rank all movies (in our recently generated set) based on their similarity to these.

Restricting myself to mySQL for Chapter Two (this may not be a good idea for more complicated chapters), here’s the query that returns data we are going to be working with:

SELECT b.title ‘movie’ , c.title ‘movie’
, sum(a.PearsonSim)
, sum(d.rating*a.pearsonsim) ‘Rx’
FROM pearsonsimmovies a INNER JOIN item b ON a.CurrentItemId = b.movieId
INNER JOIN item c ON a.ItemId = c.movieId AND c.movieId IN (SELECT CurrentItemId FROM pearsonsimmovies)
INNER JOIN data d ON a.CurrentItemId = d.itemId AND d.userId = 944
WHERE a.CurrentItemId IN (11,343,25)
GROUP BY b.title, c.title</font

Partial result-set:

movie                       movie                        sum(a.PearsonSim)     Rx

Alien: Resurrection (1997)  Angels and Insects (1995)    0.0858352035284042    0.343340814113617
Alien: Resurrection (1997)  Antonia's Line (1995)        -0.121566131711006    -0.486264526844025
Alien: Resurrection (1997)  Apollo 13 (1995)             0.168530061841011     0.674120247364044
Birdcage, The (1996)        Angels and Insects (1995)    -0.13074591755867     -0.52298367023468
Birdcage, The (1996)        Antonia's Line (1995)        0.0317920446395874    0.12716817855835
Birdcage, The (1996)        Apollo 13 (1995)             0.0997573956847191    0.399029582738876
Seven (Se7en) (1995)        Angels and Insects (1995)    -0.0412117689847946   -0.0824235379695892
Seven (Se7en) (1995)        Antonia's Line (1995)        0.304266840219498     0.608533680438995
Seven (Se7en) (1995)        Apollo 13 (1995)             0.0346525646746159    0.0693051293492317

Complete result-set is 135 records. This includes the sim score between each of the movies in the data-set multiplied by the score of each movie user provided ratings for.

From a quick trip to excel and back, we get a table lots like the one in the book:



				Angels and Insects (1995)    Antonia's Line (1995)  
	Pearson
Alien: Resurrection (1997)   	0.085835204                  -0.121566132
Birdcage, The (1996)          	-0.130745918                 0.031792045
Seven (Se7en) (1995)          	-0.041211769                 0.30426684
Total Pearson                 	-0.086122483                 0.21449275  

         Rx
Alien: Resurrection (1997)    	0.343340814                  -0.486264527
Birdcage, The (1996)          	-0.52298367                  0.127168179
Seven (Se7en) (1995)          	-0.082423538                 0.60853368
Total Rx                      	-0.262066394                 0.249437332

Normalized                    	3.042949819                  1.162917294

…This is but a sample of data generated. In reality table contains one column for each movie in data-set. Such layout does not display well here.

More importantly, now we have all the data we need to recommend movies to the user.

Pivoting both Movie and Normalized data, we obtain a table of movies we could recommend to user. Most peculiar, we have the user’s predicted rating for all of these movies based on his initial input. The more ratings the user provides in the beginning, the better these recommendations are. Here is a sample from pivoted table from results above:


Leaving Las Vegas        4.944198620
Fallen (1998)            4.192423985
Mad Love (1995)          3.833175273
Apollo 13 (1995)         3.771224915
Batman Forever (1995)    3.673963831
Babe (1995)              2.544153587
Get Shorty (1995)        2.085588587

This has been a great exercise for me. It would be a lot of fun to build a recommendations engine professionally. This is one those great added-value services more and more websites are providing their users. Very impressive indeed.

PCI – Chapter Two – Building the Item Comparison Dataset

In order to do movie recommendations with item based filtering, we need to generate the complete dataset of similar items. In a nutshell this entails calculating (using Pearson’s Similarity Score like before) the similarity any one item has to every other item in the dataset.

Generating this dataset is very cpu intensive. It takes an average of 14 minutes per movie on dataset on my laptop. I believe there are 1682 movies in items table so calculating the complete dataset is out of the question for me. I’ve spent the good part of last weekend generating complete similarity metrics for a 45 movie set.

This will give me a decent feeling of how good this recommendation system is and, as the book notes, coming up with recommendations this way takes no time since we’ve done all the hard work beforehand.

The first step is to create a table like so: PearsonSimMovies(ItemId,CriticsRatedCommon,PearsonSim,CurrentItemId)

The second step is to insert records per movie and store the similarity ‘score’ to every other movie based on the ratings we have from our users. Luckily, this means we wont create a cross-product of this item table since some movies we cannot correlate to each other based on the review set we have. Nice.

Here’s the query I used for generating this dataset:

INSERT INTO PearsonSimMovies(ItemId,CriticsRatedCommon,PearsonSim,CurrentItemId)
SELECT b.ItemId ‘ItemId’,
count(a.rating) ‘CriticsRatedCommon’,
(sum(a.rating*b.rating)-(sum(a.rating)*sum(b.rating)/count(*)))/
(sqrt((sum(pow(a.rating,2))-(pow(sum(a.rating),2)/count(*)))*
(sum(pow(b.rating,2))-(pow(sum(b.rating),2)/count(*)))))
‘PearsonSim’,
a.ItemId ‘CurrentItemId’
FROM data a INNER JOIN data b
ON a.Userid = b.Userid AND a.itemId b.itemId
INNER JOIN data c ON b.itemId = c.itemId
WHERE a.itemId = 38
GROUP BY b.itemid
ORDER BY ‘PearsonSim’ DESC

With this 45 movie set, I’ll be calculating some recommendations on random movies from it.

PCI – Chapter Two – Recommending Movies For Multiple Users

In order to provide recommendations for multiple critics, I have revised the SQL Schema and the queries for both saving pearson similarity metric for obtaining the movie rankings for a user.

First step is to rename table PearsonSim944 to PearsonSimCritics and add a column ‘CurrentUserId’.

For every user who you want to find movie recommendations for, run the following query:

INSERT INTO PearsonSimCritics(UserId,MoviesRatedCommon,PearsonSim,CurrentUserId)
SELECT b.UserId ‘UserId’,
count(a.rating) ‘MoviesRatedCommon’,
(sum(a.rating*b.rating)-(sum(a.rating)*sum(b.rating)/count(*))) / (sqrt((sum(pow(a.rating,2))-(pow(sum(a.rating),2)/count(*)))*(sum(pow(b.rating,2))-(pow(sum(b.rating),2)/count(*))))) ‘PearsonSim’,
a.userId
FROM data a INNER JOIN data b ON a.itemid = b.itemid AND a.userId b.userId
INNER JOIN data c ON b.userId = c.userId
WHERE a.userId = 944
GROUP BY b.userid
ORDER BY ‘PearsonSim’ DESC

…and for obtaining movie recommendations for such users you’ve calculated similarity metrics for:

SELECT sum(b.rating*a.PearsonSim)/sum(a.PearsonSim) ‘Suggested Rating’,
k.rating ‘my rating’ , c.Title, a.CurrentUserId
FROM data b INNER JOIN PearsonSimCritics a ON b.userId = a.UserId
INNER JOIN item c ON b.itemId = c.movieId
LEFT JOIN data k ON c.movieId = k.itemId and k.userId = a.CurrentUserId
WHERE a.CurrentUserId = 944
GROUP BY b.itemID
HAVING count(b.ItemId) > 15
ORDER BY 1 DESC
LIMIT 20

PCI – Chapter Two – Recommending Movies

The query that follows makes use of the previously saver data to generate movie recommendations for me. This time, I’ve excluded movies that had less than 15 ratings. Movies with a small number of reviews in our dataset do not allow us to infer much new information and i did not find the results insightful.

SELECT sum(b.rating*a.PearsonSim)/sum(a.PearsonSim) ‘Suggested Rating’,
k.rating ‘my rating’ , c.Title
FROM data b INNER JOIN PearsonSim944 a ON b.userId = a.UserId
INNER JOIN item c ON b.itemId = c.movieId
LEFT JOIN data k ON c.movieId = k.itemId and k.userId = 944
GROUP BY b.itemID
HAVING count(b.ItemId) > 15
ORDER BY 1 DESC
LIMIT 20

As you can see, I decided not to exclude movies I’ve seen on purpose on this to correlate my ratings to the predicted rating generated from these recommendation.

And here follows a table of the top movies I should put on my to-watch list 🙂

 Suggested Rating     my rating     Title
 -------------------  ------------  ------------------------------------------------------
 4.7323498717054      (null)        Welcome To Sarajevo (1997)
 4.6998348050636      5             Godfather, The (1972)
 4.6711593732692      5             Boogie Nights (1997)
 4.6088872136439      5             Schindler's List (1993)
 4.6088513709185      4             Star Wars (1977)
 4.6035380915543      5             Raiders of the Lost Ark (1981)
 4.5669663298883      3             Wallace & Gromit: The Best of Aardman Animation (1996)
 4.5381657408068      (null)        Wrong Trousers, The (1993)
 4.5149011723813      4             Shawshank Redemption, The (1994)
 4.4973936132774      5             Empire Strikes Back, The (1980)
 4.4941478597302      (null)        Casablanca (1942)
 4.4654235371938      (null)        When We Were Kings (1996)
 4.4652896069685      5             Godfather: Part II, The (1974)
 4.4557483936264      (null)        Lawrence of Arabia (1962)
 4.4432307373246      (null)        Secrets & Lies (1996)
 4.4371513818567      (null)        Usual Suspects, The (1995)
 4.4330745351495      (null)        Close Shave, A (1995)
 4.4309795230429      (null)        One Flew Over the Cuckoo's Nest (1975)
 4.4176616177839      (null)        Laura (1944)
 4.4137219730932      (null)        Paths of Glory (1957)

Not too bad eh? I am happy with these results and have learned how to generate lots of great information from nothing more than a list of items reviewed (item, reviewer, rating). The possibilities how this can be of use in systems is endless and but the tip of the iceberg of what is to come in book.

NOTE: on copy-pasting query into analyzer… WordPress seems to replace my single quotes with sql incompatible single quotes… revise accordingly.

PCI – Chapter Two – Similarity Score (Pearson)

The following query creates an ‘in-between’ table which saves the similarity scores between myself (user id 944) and all the critics.

INSERT INTO PearsonSim944(UserId,PearsonSim)
SELECT b.UserId ‘UserId’,
(sum(a.rating*b.rating)-(sum(a.rating)*sum(b.rating)/count(*))) / (sqrt((sum(pow(a.rating,2))-(pow(sum(a.rating),2)/count(*)))*(sum(pow(b.rating,2))-(pow(sum(b.rating),2)/count(*))))) ‘PearsonSim’
FROM data a INNER JOIN data b ON a.itemid = b.itemid AND a.userId <> b.userId
INNER JOIN data c ON b.userId = c.userId
WHERE a.userId = 944
GROUP BY b.userid
ORDER BY ‘PearsonSim’ DESC

I’ve choose not to exclude any data as to preserve scores for future use. Much more data can be generated and saved to aide future calculations but this is all needed to wrap up exercise set at hand. I’ve used Pearson instead of Euclidean because I found it easier to work with and book outlines benefits when dealing with certain types of dataset.

PCI – Chapter Two – Recommending Items

Creating recommendations for Mario:

 userId     itemId     rating     weighted score     title
---------  ---------  ---------  -----------------  -----------------------------
 808        875        4          3.61813613493316   She's So Lovely (1997)
 808        872        5          4.52267016866645   Love Jones (1997)
 808        751        3          2.71360210119987   Tomorrow Never Dies (1997)
 808        750        5          4.52267016866645   Amistad (1997)
 808        748        4          3.61813613493316   Saint, The (1997)
 519        1617       5          4.76731294622795   Hugo Pool (1997)
 519        1612       5          4.76731294622795   Leading Man, The (1996)
 519        1592       5          4.76731294622795   Magic Hour, The (1998)
 519        1591       5          4.76731294622795   Duoluo tianshi (1995)
 519        1434       5          4.76731294622795   Shooting Fish (1997)
 317        879        3          2.75567596063107   Peacemaker, The (1997)
 317        748        5          4.59279326771845   Saint, The (1997)
 317        683        2          1.83711730708738   Rocket Man (1997)
 317        678        2          1.83711730708738   Volcano (1997)
 317        355        4          3.67423461417476   Sphere (1998)

This is a sample resultset from obtaining weighted scores for movies which might be of interest to me.

Queries for these will follow; I really have to figure out how to display neat code… yellow divs do not help much.

Very interesting exercise thou; more later.

PCI – Chapter Two – Ranking the critics.

I decided to rate the movies myself in order to get a feel for how good the recommendations turn out to be. I will later retrieve these with and without movies I’ve seen in the past to examine how close does ‘my predicted score’ is to my gut feeling.

After adding my ratings to the movies from set (about 1/10th of items table)…

Here follows the query for retrieving top 3 critics who most resemble my movie tastes (I am userId 944). This is inline with the top of page 15 in the book.


SELECT (sum(a.rating*b.rating)-(sum(a.rating)*sum(b.rating)/count(*))) /(sqrt((sum(pow(a.rating,2))-(pow(sum(a.rating),2)/count(*)))*
(sum(pow(b.rating,2))-(pow(sum(b.rating),2)/count(*))))) ‘Pearson’,
b.UserId ‘User’
FROM data a INNER JOIN data b ON a.itemid = b.itemid AND a.userId <> b.userId
INNER JOIN data c ON b.userId = c.userId
WHERE a.userId = 944
GROUP BY b.userid
HAVING count(a.rating) > 90
ORDER BY ‘pearson’ DESC
LIMIT 3

Note: In the query above, I’ve restricted resultset to critics with whom I have at least 90 movies in common. I feel too little movies in common lead to vague recommendations. (i.e – Pearson Correlation Factors very close to one) This sounds like an unusually high number and I am sure would lend itself very well to some sort of ‘slider’ in a GUI for adjustment.

These results are going to be the basis for wrapping up chapter two of the book. Results will shortly follow together with my take on these…

 Pearson           User
 ----------------  -------
 0.95346258924559  519
 0.91855865354369  317
 0.90453403373329  808

PCI – Chapter Two – Pearson Score, MovieLens

Ha, the following query (mySQL) gives the Pearson Score for one critic (userId = 6) against all others in set:

SELECT count(*),
sum(a.rating) ‘sum1’,
sum(pow(a.rating,2)) ‘sum1Sq’,
sum(b.rating) ‘sum2’,
sum(pow(b.rating,2)) ‘sum2Sq’,
sum(a.rating*b.rating) ‘pSum’,
sum(a.rating*b.rating)-(sum(a.rating)*sum(b.rating)/count(*)) ‘num’,
sqrt((sum(pow(a.rating,2))-(pow(sum(a.rating),2)/count(*)))*
(sum(pow(b.rating,2))-(pow(sum(b.rating),2)/count(*)))) ‘den’,
(sum(a.rating*b.rating)-(sum(a.rating)*sum(b.rating)/count(*))) / (sqrt((sum(pow(a.rating,2))-(pow(sum(a.rating),2)/count(*)))*
(sum(pow(b.rating,2))-(pow(sum(b.rating),2)/count(*))))) ‘pearson’
FROM data a INNER JOIN data b ON a.itemid = b.itemid AND a.userId <> b.userId
WHERE a.userId = 6
GROUP BY b.userid

Please excuse my lack of blogging skills… I’ll make pretty as soon as I am able.

Best to index appropriately as work that follows may give pc-fan a workout.

And for a list of critics userId 6 should be paying attention to:

SELECT count(*) ‘movies in common’,
sum(a.rating) ‘sum1’,
sum(pow(a.rating,2)) ‘sum1Sq’,
sum(b.rating) ‘sum2’,
sum(pow(b.rating,2)) ‘sum2Sq’,
sum(a.rating*b.rating) ‘pSum’,
sum(a.rating*b.rating)-(sum(a.rating)*sum(b.rating)/count(*)) ‘num’,
sqrt((sum(pow(a.rating,2))-(pow(sum(a.rating),2)/count(*)))*
(sum(pow(b.rating,2))-(pow(sum(b.rating),2)/count(*)))) ‘den’,
b.UserId ‘User’,
(sum(a.rating*b.rating)-(sum(a.rating)*sum(b.rating)/count(*))) / (sqrt((sum(pow(a.rating,2))-(pow(sum(a.rating),2)/count(*)))*
(sum(pow(b.rating,2))-(pow(sum(b.rating),2)/count(*))))) ‘pearson’
FROM data a INNER JOIN data b ON a.itemid = b.itemid AND a.userId <> b.userId
WHERE a.userId = 6
GROUP BY b.userid
ORDER BY 10 DESC

Interesting Note – These critic recommendations are of limited values if critic in question has but a few movies in common with us (userId 6). I bet companies that use these methods for recommending other users, etc. have a sweet spot. i.e. – Too little items in common would skew score one way and too many the other.

PCI – Chapter Two – Update!

After following all the examples in the chapter (spending most of my time in the Movielens data), I’ve moved the movie data to a mySQL database and calculated sim scores between arbitrary users and whole set. I wish mySQL had better tools…

Before moving on, I’ll play with data and see what I find most interesting. I’ve thought about rating the movies myself to use me as the baseline for comparison… It would surely give me a sense of how ‘good’ these recommendations are.

Darn, sql backup of movieLens (100k reviews, they have a set with one million reviews!) is 3.1 Mb.

Lastly, I am getting stumped with some very simple examples in the book. Seems to me the numbers used in some formulas are incorrect or from slightly different dataset (different ratings). The math sound as are the results. I’m gonna chuck it up to book error data. I’ll follow up on this and see where I miss a step or something.

PCI – Chapter Two – Making Recommendations

Chapter two is all about using freely available datasets to make recommendations. The math is easy to follow and the python examples help digest all the dense bits. Finishing all the examples, I looked up the movielens data and to my surprise, there is a book dataset of some sort as well. This is great!

I am going to follow up with some quick apps as proof of concepts, first with the movie data and then with the book data. I’ll try to put some twist on these to complement what I’ve learned from the book. BTW- the author has made the code available on his blog.

My tools of choice, however, are any flavor of SQL and Coldfusion, maybe rails. I’ll be using mySQL for now, not sure on programming language yet. If anyone cares, I can provide source when I am done.