PCI – Chapter Two – Item Based Filtering (MovieLens dataset)

With our recently generated 45 movie set, we will pick three random movies.

Using these as our imaginary baseline (i.e. – these are the three movies the user states he likes and has rated or the first three movies he user clicked and ranked on a website. I am using myself (my ratings) as source of movie ratings in exercise. We can build our recommendation system as described here.

First, lets retrieve three random movies.

SELECT a.CurrentItemId, b.title, c.rating
FROM pearsonsimmovies a INNER JOIN item b ON a.CurrentItemId = b.movieId
INNER JOIN data c ON a.CurrentItemId = c.itemId AND c.userId = 944
WHERE a.PearsonSim <> 1
GROUP BY a.CurrentItemId
ORDER BY rand() LIMIT 3

Random result-set:

CurrentItemId     title                       rating
----------------  --------------------------  ---------
11                Seven (Se7en) (1995)        2
343               Alien: Resurrection (1997)  4
25                Birdcage, The (1996)        4

Now, as described in the book in page 24, Lets generate a table based on these three movies. Also, lets rank all movies (in our recently generated set) based on their similarity to these.

Restricting myself to mySQL for Chapter Two (this may not be a good idea for more complicated chapters), here’s the query that returns data we are going to be working with:

SELECT b.title ‘movie’ , c.title ‘movie’
, sum(a.PearsonSim)
, sum(d.rating*a.pearsonsim) ‘Rx’
FROM pearsonsimmovies a INNER JOIN item b ON a.CurrentItemId = b.movieId
INNER JOIN item c ON a.ItemId = c.movieId AND c.movieId IN (SELECT CurrentItemId FROM pearsonsimmovies)
INNER JOIN data d ON a.CurrentItemId = d.itemId AND d.userId = 944
WHERE a.CurrentItemId IN (11,343,25)
GROUP BY b.title, c.title</font

Partial result-set:

movie                       movie                        sum(a.PearsonSim)     Rx

Alien: Resurrection (1997)  Angels and Insects (1995)    0.0858352035284042    0.343340814113617
Alien: Resurrection (1997)  Antonia's Line (1995)        -0.121566131711006    -0.486264526844025
Alien: Resurrection (1997)  Apollo 13 (1995)             0.168530061841011     0.674120247364044
Birdcage, The (1996)        Angels and Insects (1995)    -0.13074591755867     -0.52298367023468
Birdcage, The (1996)        Antonia's Line (1995)        0.0317920446395874    0.12716817855835
Birdcage, The (1996)        Apollo 13 (1995)             0.0997573956847191    0.399029582738876
Seven (Se7en) (1995)        Angels and Insects (1995)    -0.0412117689847946   -0.0824235379695892
Seven (Se7en) (1995)        Antonia's Line (1995)        0.304266840219498     0.608533680438995
Seven (Se7en) (1995)        Apollo 13 (1995)             0.0346525646746159    0.0693051293492317

Complete result-set is 135 records. This includes the sim score between each of the movies in the data-set multiplied by the score of each movie user provided ratings for.

From a quick trip to excel and back, we get a table lots like the one in the book:



				Angels and Insects (1995)    Antonia's Line (1995)  
	Pearson
Alien: Resurrection (1997)   	0.085835204                  -0.121566132
Birdcage, The (1996)          	-0.130745918                 0.031792045
Seven (Se7en) (1995)          	-0.041211769                 0.30426684
Total Pearson                 	-0.086122483                 0.21449275  

         Rx
Alien: Resurrection (1997)    	0.343340814                  -0.486264527
Birdcage, The (1996)          	-0.52298367                  0.127168179
Seven (Se7en) (1995)          	-0.082423538                 0.60853368
Total Rx                      	-0.262066394                 0.249437332

Normalized                    	3.042949819                  1.162917294

…This is but a sample of data generated. In reality table contains one column for each movie in data-set. Such layout does not display well here.

More importantly, now we have all the data we need to recommend movies to the user.

Pivoting both Movie and Normalized data, we obtain a table of movies we could recommend to user. Most peculiar, we have the user’s predicted rating for all of these movies based on his initial input. The more ratings the user provides in the beginning, the better these recommendations are. Here is a sample from pivoted table from results above:


Leaving Las Vegas        4.944198620
Fallen (1998)            4.192423985
Mad Love (1995)          3.833175273
Apollo 13 (1995)         3.771224915
Batman Forever (1995)    3.673963831
Babe (1995)              2.544153587
Get Shorty (1995)        2.085588587

This has been a great exercise for me. It would be a lot of fun to build a recommendations engine professionally. This is one those great added-value services more and more websites are providing their users. Very impressive indeed.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s