Mario Talavera Writes

- My Development Journal

Movielens Datasets for BI – 10 Million Movie Ratings

with 5 comments

The 10 million ratings set from Movielens allows us to create two fact tables (possibly linked): one for ratings and another for tags.

Worth noting that the userIds in these two schemas (one from ratings.dat and the other from tags.dat) do match across sets. That is, userId 1234 in the tags dataset is user 1234 (if present) in the ratings dataset. So we could link these but, for now, it’s simpler not to.
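To make the userId overlap concrete, here is a minimal sketch in Python that parses a few lines in the `::`-delimited layout of ratings.dat (UserID::MovieID::Rating::Timestamp) and tags.dat (UserID::MovieID::Tag::Timestamp) and finds the users present in both. The sample lines and the function names are made up for illustration; they are not taken from the dataset or from my own loading code.

```python
from typing import Iterable, Iterator, NamedTuple, Set


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: float
    timestamp: int


class Tag(NamedTuple):
    user_id: int
    movie_id: int
    tag: str
    timestamp: int


def parse_ratings(lines: Iterable[str]) -> Iterator[Rating]:
    """Parse lines shaped like UserID::MovieID::Rating::Timestamp."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user_id, movie_id, rating, ts = line.split("::")
        yield Rating(int(user_id), int(movie_id), float(rating), int(ts))


def parse_tags(lines: Iterable[str]) -> Iterator[Tag]:
    """Parse lines shaped like UserID::MovieID::Tag::Timestamp."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split the ids off the front and the timestamp off the back,
        # so a tag that happens to contain "::" stays in one piece.
        user_id, movie_id, rest = line.split("::", 2)
        tag, ts = rest.rsplit("::", 1)
        yield Tag(int(user_id), int(movie_id), tag, int(ts))


def shared_user_ids(ratings: Iterable[Rating], tags: Iterable[Tag]) -> Set[int]:
    """Return the userIds that appear in both the ratings and the tags."""
    return {r.user_id for r in ratings} & {t.user_id for t in tags}


# Hypothetical sample rows, invented for this example.
sample_ratings = ["15::122::5::838985046", "20::185::3.5::838983525"]
sample_tags = ["15::4973::excellent!::1215184630", "99::1747::politics::1188263867"]

common = shared_user_ids(parse_ratings(sample_ratings), parse_tags(sample_tags))
```

With the real files, the same functions can be fed open file handles instead of the sample lists; the resulting set is what a shared user dimension would be built from if the two fact tables were linked.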

Information provided in the 10 million ratings set allows us to create similar star schemas as follows:

[Figure: Ratings star schema]

[Figure: Tags star schema]
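The schema diagrams do not survive in this text, so as a rough illustration only, here is a sketch of what a ratings star schema could look like, using Python's built-in sqlite3 module. Every table and column name below (dim_user, dim_movie, dim_date, fact_rating and so on) is an assumption made for this example, not the design from the diagrams.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by three dimensions.
DDL = """
CREATE TABLE dim_user (
    user_key INTEGER PRIMARY KEY,
    user_id  INTEGER NOT NULL UNIQUE      -- userId from ratings.dat
);
CREATE TABLE dim_movie (
    movie_key INTEGER PRIMARY KEY,
    movie_id  INTEGER NOT NULL UNIQUE,    -- movieId from movies.dat
    title     TEXT NOT NULL
);
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,
    full_date TEXT NOT NULL UNIQUE        -- ISO date derived from the timestamp
);
CREATE TABLE fact_rating (
    user_key  INTEGER NOT NULL REFERENCES dim_user(user_key),
    movie_key INTEGER NOT NULL REFERENCES dim_movie(movie_key),
    date_key  INTEGER NOT NULL REFERENCES dim_date(date_key),
    rating    REAL NOT NULL
);
"""


def build_ratings_star(conn: sqlite3.Connection) -> None:
    """Create the (assumed) dimension and fact tables."""
    conn.executescript(DDL)


def average_rating_by_movie(conn: sqlite3.Connection) -> list:
    """A typical star-schema query: join the fact to a dimension and aggregate."""
    return conn.execute(
        """
        SELECT m.title, AVG(f.rating)
        FROM fact_rating AS f
        JOIN dim_movie AS m ON m.movie_key = f.movie_key
        GROUP BY m.title
        ORDER BY m.title
        """
    ).fetchall()
```

A tags star schema could reuse the same dimensions and swap fact_rating for a fact table holding the tag text, which is also where linking the two fact tables through a shared user dimension would come in.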

I’ll be running these proposed schemas by my peers at work and will post back any insight I am sure to be missing. A lot of the design here is based on my perception of what matters, and it’s worth seeking advice before I trail off too much.

It is odd that we have no more information about the users in our set. The smaller 1 million set does provide more interesting user information but no tags from said users. Perhaps it is worth considering both datasets at the same time and treating them as two different sets of study.


Written by mariotalavera

January 20, 2010 at 10:21 pm

5 Responses


  1. I’m an MSc student in Information Technology engineering, and my thesis is about a new personalization approach using web usage and content mining. To demonstrate the effectiveness of my personalization system, I need to test its performance on a general web site that allows me to analyze both web log data and web pages.

    How can I obtain the real IMDb URLs of the movies in the MovieLens data sets? These URLs exist in the u.item file, and I entered them in Internet Explorer, but unfortunately I got the following error:
    The page you were looking for
    hasn’t been found.
    So I couldn’t download the movie content from the IMDb site.
    I would greatly appreciate any guidance you can give me.
    Please help me. I’m waiting for your answer.

    rose

    February 8, 2010 at 4:05 pm

    • Sorry, I misread your comment. The URL format at IMDb has since changed. If you look at old posts of mine, the URLs you have used to work. I do not know how to help you now. Perhaps IMDb has a developer program or an open API? Worth contacting them…

      mariotalavera

      February 10, 2010 at 6:09 pm

  2. Hello Rose,

    I wish you the best on your thesis work. I may have put the wrong link out there. No matter; here is where I got the data from: http://www.grouplens.org/node/73#attachments.

    Again, your thesis work sounds interesting and I hope you do great. Do let me know if you have issues getting the data you need. I may make info available if allowed by Grouplens.

    mariotalavera

    February 9, 2010 at 2:40 pm

  3. There is no relation between the values that you are going to find in the Movielens dataset and IMDb. Even the movie titles are different in a lot of cases.
    Matching the titles between the IMDb text files and the Movielens dataset is a long, manual process. You will be able to match a large portion of the titles but will still be left with a large number of movie titles that do not match. Worst of all, you still won’t have an IMDb id, as it is not provided in the data files. But with the updated titles and a good script that tries to find the IMDb id by “consulting” their website, you should be able to get the id.
    Been there, done that, and I do not recommend it. I had to do it for my master’s too… Different subject though: evaluation of recommendation algorithms.

    Liam

    March 22, 2011 at 6:19 pm

    • Good point. I had a heck of a time the deeper I got into this project. I was, however, using the data as a subject for a variety of things I wanted to learn. It’s remarkable that I am now using this dataset for some data warehousing projects as well. Hope you learned a lot for your master’s as well.

      mariotalavera

      March 23, 2011 at 1:36 pm

