MovieLens Datasets for BI – 10 Million Movie Ratings

The 10 million ratings set from MovieLens allows us to create two fact tables: one for ratings and another for tags.  Whether the two should be linked is still an open question.

Worth noting that the userIds in these two schemas (one built from ratings.dat and the other from tags.dat) do match across sets.  That is, userId 1234 in the tags dataset is the same user as userId 1234 in the ratings dataset, if that user exists there.  So we could link the two but, for now, it’s simpler not to.
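If you want to see how much the two sets of users actually overlap, something like the following would do it. This is only a rough sketch: it assumes the `::`-delimited layout described in the dataset's README (`UserID::MovieID::Rating::Timestamp` for ratings.dat and `UserID::MovieID::Tag::Timestamp` for tags.dat), the function names are my own, and the sample rows are made up for illustration rather than taken from the real files.

```python
def user_ids(lines, sep="::"):
    """Collect the distinct userIds from delimited lines (userId is the first field)."""
    return {int(line.split(sep, 1)[0]) for line in lines if line.strip()}


def overlap_report(rating_lines, tag_lines):
    """Count how many users appear in each source and in both."""
    raters = user_ids(rating_lines)
    taggers = user_ids(tag_lines)
    return {
        "raters": len(raters),
        "taggers": len(taggers),
        "both": len(raters & taggers),
        "taggers_without_ratings": len(taggers - raters),
    }


# Made-up sample rows in the assumed layout, NOT real data from the set.
sample_ratings = [
    "1::122::5::838985046",
    "1::185::5::838983525",
    "2::110::4::868245777",
]
sample_tags = [
    "2::110::epic::1215184630",
    "3::260::space opera::1215184631",
]

report = overlap_report(sample_ratings, sample_tags)
print(report)
```

Against the real files you would pass open file handles instead of the sample lists, e.g. `overlap_report(open("ratings.dat", encoding="utf-8"), open("tags.dat", encoding="utf-8"))`, since only the first field of each line is read.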

Information provided in the 10 million ratings set allows us to create similar star schemas as follows:

Ratings

Tags
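To make the idea of a ratings star schema a bit more concrete, here is a minimal sketch using SQLite from Python. It is not the exact design proposed in this post: the table and column names (`dim_user`, `dim_movie`, `dim_date`, `fact_rating`, `date_key`) are hypothetical, the rows are made up, and the only thing taken from the dataset is the assumed `UserID::MovieID::Rating::Timestamp` layout of ratings.dat. A tags fact table would look much the same, with a tag column in place of the rating.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
# Hypothetical star schema: one fact table surrounded by three small dimensions.
conn.executescript("""
CREATE TABLE dim_user  (user_id  INTEGER PRIMARY KEY);
CREATE TABLE dim_movie (movie_id INTEGER PRIMARY KEY, title TEXT, genres TEXT);
CREATE TABLE dim_date  (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER, day INTEGER);
CREATE TABLE fact_rating (
    user_id  INTEGER REFERENCES dim_user(user_id),
    movie_id INTEGER REFERENCES dim_movie(movie_id),
    date_key INTEGER REFERENCES dim_date(date_key),
    rating   REAL
);
""")


def load_rating(conn, line):
    """Split one ratings line and insert the dimension rows and the fact row."""
    user_id, movie_id, rating, ts = line.strip().split("::")
    day = datetime.fromtimestamp(int(ts), tz=timezone.utc)
    date_key = day.year * 10000 + day.month * 100 + day.day  # e.g. 20240131
    conn.execute("INSERT OR IGNORE INTO dim_user VALUES (?)", (int(user_id),))
    conn.execute("INSERT OR IGNORE INTO dim_movie (movie_id) VALUES (?)", (int(movie_id),))
    conn.execute("INSERT OR IGNORE INTO dim_date VALUES (?, ?, ?, ?)",
                 (date_key, day.year, day.month, day.day))
    conn.execute("INSERT INTO fact_rating VALUES (?, ?, ?, ?)",
                 (int(user_id), int(movie_id), date_key, float(rating)))


# Made-up rows in the assumed layout, NOT real data from the set.
for row in ["1::122::5::838985046", "1::185::4.5::838983525", "2::122::3::868245777"]:
    load_rating(conn, row)

# A typical BI-style question: average rating and rating count per movie.
per_movie = conn.execute("""
    SELECT movie_id, AVG(rating), COUNT(*)
    FROM fact_rating
    GROUP BY movie_id
    ORDER BY movie_id
""").fetchall()
print(per_movie)
```

Titles and genres are left empty here; in practice they would be filled from movies.dat, and the date dimension could carry whatever calendar attributes the reports need.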

I’ll be running these proposed schemas by my peers at work and will post back any insights I am sure to be missing.  A lot of the design here is based on my perception of what matters, and it’s worth seeking advice before I wander off too far.

It is odd that we have no more information about the users in this set.  The smaller 1 million set does provide more interesting user information, but no tags from those users.  Perhaps it is worth considering both datasets and treating them as two separate subjects of study.


5 thoughts on “MovieLens Datasets for BI – 10 Million Movie Ratings”

  1. I’m an MSc student in Information Technology engineering, and my thesis is about a new personalization approach using web usage and content mining. To demonstrate the effectiveness of my personalization system, I need to test its performance on a general website that allows me to analyze both web log data and web pages.

    Please, how can I obtain the real IMDb URLs of the movies in the MovieLens datasets? These URLs exist in the u.item file. I entered them in Internet Explorer, but unfortunately I got the following error:
    The page you were looking for
    hasn’t been found.
    So I couldn’t download the content for the movies from the IMDb site.
    I would greatly appreciate it if you could guide me on this, if possible.
    Please help me. I’m waiting for your answer.

    • Sorry, I misread your comment. The URL format at IMDb has since changed. If you look at my old posts, the URLs you have did work before. I do not know how to help you now. Perhaps IMDb has a developer program or an open API? Worth contacting them…

  2. There is no relation between the values that you are going to find in the MovieLens dataset and IMDb. Even the movie titles are different in a lot of cases.
    Matching the titles between the IMDb text files and the MovieLens dataset is a long and manual process. You will be able to match a large portion of the titles but will still be left with a large number of movie titles that do not match. Worst of all, you still won’t have an IMDb id, as it is not provided in the data files. But with the updated titles and a good script that tries to find the IMDb id by “consulting” their website, you should be able to get the id.
    Been there, done that and I do not recommend it. I had to do it for my masters too… Different subject though – evaluation of recommendation algorithms.

    • Good point. I had a heck of a time the deeper I got into this project. I was, however, using the data as the subject for a variety of things I wanted to learn. It’s remarkable that I am now using this dataset for some data warehousing projects as well. Hope you learned a lot for your masters too.
