Revisiting the MovieLens Database

What a great opportunity to put the MovieLens data to good use (again).  Previously, I used this dataset to build a movie recommendation engine based on the book Collective Intelligence.  Now I will be spending lots of time at work creating OLAP cubes for business reporting.

Immediately, I thought the MovieLens dataset would be as good as it gets for practicing the concepts and techniques I’ll be exposed to at work.  I am going to need all the help and practice I can get, since the world of Business Intelligence is new to me.

A little history about the dataset I am referring to can be found at GroupLens Research.  The dataset I am planning on using can be found here.  This time, I’ll be using the 10 million ratings set, since it seems everything at work is measured in millions.  There are a few other variations of this data.  Look around; you may find something you like.

After downloading the files, I proceeded to import all the data as is into our ‘original’ database.  I am using a recent flavor of MySQL and was able to import the data without much fuss.  The only thing I had to do was replace the delimiters (double colons ::) in the movies data file with something else.  Movie titles with colons were being inadvertently ‘split’ across more than one column on import.  Doing this produces a database with three initial tables.
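For anyone following along, the delimiter swap can be sketched in a few lines of Python.  This is only a rough sketch of one way to do it, not necessarily how I did it: the function name convert_delimiters, the choice of a tab as the new delimiter, and the sample row are all assumptions made up for illustration.

```python
import io


def convert_delimiters(src, dst, old="::", new="\t"):
    """Copy lines from src to dst, swapping the `old` field delimiter for `new`.

    Hypothetical helper: the tab default is an assumption, chosen because a
    single-character delimiter that does not occur in movie titles is easy
    for most import tools to handle.
    """
    for line in src:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split(old)
        # Refuse to write a row that would be ambiguous after conversion.
        if any(new in field for field in fields):
            raise ValueError(f"field already contains the new delimiter: {line!r}")
        dst.write(new.join(fields) + "\n")


# Made-up row in the movieID::title::genres layout, with a colon in the title.
sample = io.StringIO("1::Example Movie: The Sequel (1995)::Comedy|Drama\n")
converted = io.StringIO()
convert_delimiters(sample, converted)
print(repr(converted.getvalue()))
```

With real files, the same function can be called with two open file handles instead of the in-memory buffers used here.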

One table holds the 10 million ratings (every user in the set rated at least 20 movies), another holds the tags users applied to movies, and the last one holds all the movie information.  They look like this:

10m_ratings(userID, movieID, rating, timestamp)
10m_tags(userID, movieID, tag, timestamp)
10m_movies(movieID, title, genres)
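To make the layout above concrete, here is a minimal sketch that creates the three tables.  It uses Python's built-in sqlite3 module as a stand-in for MySQL so it runs anywhere, and the column types and constraints are my assumptions; only the table and column names come from the listing above.

```python
import sqlite3

# Column types are assumptions; names follow the listing above.
# The table names start with a digit, so they are quoted.
SCHEMA = """
CREATE TABLE "10m_movies" (
    movieID   INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    genres    TEXT NOT NULL
);
CREATE TABLE "10m_ratings" (
    userID    INTEGER NOT NULL,
    movieID   INTEGER NOT NULL,
    rating    REAL NOT NULL,
    timestamp INTEGER NOT NULL
);
CREATE TABLE "10m_tags" (
    userID    INTEGER NOT NULL,
    movieID   INTEGER NOT NULL,
    tag       TEXT NOT NULL,
    timestamp INTEGER NOT NULL
);
"""


def build_database(path=":memory:"):
    """Create the three tables in a SQLite database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = build_database()
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```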

The table name prefix helps convey each table's purpose, since I am restricting myself to one database for this project.  Beyond that, SQL styles and the like suit me fine 😉

In total, you end up with a database of a bit less than 400 MB.  Relevant information for the data I will be using can be found here.  If there is any interest and licensing allows it, I’ll post a SQL dump or database backup to share and save someone some time.


