PCI – Chapter Two – Building the Item Comparison Dataset

In order to do movie recommendations with item based filtering, we need to generate the complete dataset of similar items. In a nutshell this entails calculating (using Pearson’s Similarity Score like before) the similarity any one item has to every other item in the dataset.

Generating this dataset is very cpu intensive. It takes an average of 14 minutes per movie on dataset on my laptop. I believe there are 1682 movies in items table so calculating the complete dataset is out of the question for me. I’ve spent the good part of last weekend generating complete similarity metrics for a 45 movie set.

This will give me a decent feeling of how good this recommendation system is and, as the book notes, coming up with recommendations this way takes no time since we’ve done all the hard work beforehand.

The first step is to create a table like so: PearsonSimMovies(ItemId,CriticsRatedCommon,PearsonSim,CurrentItemId)

The second step is to insert records per movie and store the similarity ‘score’ to every other movie based on the ratings we have from our users. Luckily, this means we wont create a cross-product of this item table since some movies we cannot correlate to each other based on the review set we have. Nice.

Here’s the query I used for generating this dataset:

INSERT INTO PearsonSimMovies(ItemId,CriticsRatedCommon,PearsonSim,CurrentItemId)
SELECT b.ItemId ‘ItemId’,
count(a.rating) ‘CriticsRatedCommon’,
(sum(a.rating*b.rating)-(sum(a.rating)*sum(b.rating)/count(*)))/
(sqrt((sum(pow(a.rating,2))-(pow(sum(a.rating),2)/count(*)))*
(sum(pow(b.rating,2))-(pow(sum(b.rating),2)/count(*)))))
‘PearsonSim’,
a.ItemId ‘CurrentItemId’
FROM data a INNER JOIN data b
ON a.Userid = b.Userid AND a.itemId b.itemId
INNER JOIN data c ON b.itemId = c.itemId
WHERE a.itemId = 38
GROUP BY b.itemid
ORDER BY ‘PearsonSim’ DESC

With this 45 movie set, I’ll be calculating some recommendations on random movies from it.

Leave a comment