PCI – Chapter Two – Similarity Score (Pearson)

The following query creates an ‘in-between’ table which saves the similarity scores between myself (user id 944) and all the critics.

INSERT INTO PearsonSim944(UserId,PearsonSim)
SELECT b.UserId ‘UserId’,
(sum(a.rating*b.rating)-(sum(a.rating)*sum(b.rating)/count(*))) / (sqrt((sum(pow(a.rating,2))-(pow(sum(a.rating),2)/count(*)))*(sum(pow(b.rating,2))-(pow(sum(b.rating),2)/count(*))))) ‘PearsonSim’
FROM data a INNER JOIN data b ON a.itemid = b.itemid AND a.userId <> b.userId
INNER JOIN data c ON b.userId = c.userId
WHERE a.userId = 944
GROUP BY b.userid
ORDER BY ‘PearsonSim’ DESC

I’ve choose not to exclude any data as to preserve scores for future use. Much more data can be generated and saved to aide future calculations but this is all needed to wrap up exercise set at hand. I’ve used Pearson instead of Euclidean because I found it easier to work with and book outlines benefits when dealing with certain types of dataset.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s