Movielens Dataset – One Million

After some thought, I’ve decided to switch data sources from the ten million movie rating set to the one million movie rating set.  As seen below, this dataset simply has more interesting data, which will provide more dimensions to explore.

The exact same data scrubbing (same sql as well) applies as in the other dataset from a few posts ago.  Also, all the secondary supporting data generated (time and date dimensions) will fit just as well.

Unfortunately, we lose the ability to dig into the description tags applied to movies by movie reviewers.  On the bright side, we gain information on gender, age range, location (zip code) and occupation.  Clearly, the losses are outweighed by all this information gained, and it will make for a much stronger dataset to learn from.

As previously mentioned, all the side work done for the ten million rating set will be reused here, including the time and date dimensions.  No loss in creating these either.

Movielens OLAP Model – Dimension Tables

I ran into ‘time’ difficulties populating the fact tables with timeIds and dateIds.  Even a relatively simple query was taking too long to fetch the correct values into the table.  I guess doing 10 million row joins between two tables (10m_ratings and fact_rating) joined to dim_time/dim_date is a bit too much for my laptop.  Maybe I stared at it too much.

The only thing I can think of is regenerating both fact tables and including the original timestamp in both of them this time.  This will save a join and, hopefully, make populating these fact tables speedy enough not to drive me nuts.  I can remove these columns when they are no longer needed.
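As a rough sketch, regenerating fact_rating would look something like the following.  The measure columns and types are my assumption based on the raw import; adjust to your own schema:

DROP TABLE IF EXISTS fact_rating;
CREATE TABLE fact_rating (
    userId    INT,
    movieId   INT,
    rating    DECIMAL(2,1),
    timestamp INT,  -- original unix timestamp, kept around to save the join
    timeId    INT,  -- filled in later from dim_time
    dateId    INT   -- filled in later from dim_date
);

INSERT INTO fact_rating (userId, movieId, rating, timestamp)
SELECT userID, movieID, rating, timestamp
FROM `10m_ratings`;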

Properly indexing the fields used in the UPDATEs from the prior posting will help tremendously as well.
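Something along these lines should do it (index names are my own; the columns are the ones the UPDATE lookups filter on):

CREATE INDEX idx_time_lookup ON dim_time (military, minute);
CREATE INDEX idx_date_lookup ON dim_date (year, month, day);
-- and, if you restrict the UPDATEs to rows not yet populated:
CREATE INDEX idx_rating_timeId ON fact_rating (timeId);
CREATE INDEX idx_rating_dateId ON fact_rating (dateId);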

This is not even the cool part of this project but more of a foundation so that I know how things work.  Commercial products do these things for you anyway, I think.

This revised schema better illustrates this change.

Movielens – Completing fact tables

With the previous schema changes in place, it is now a matter of running queries in order to complete our fact_rating and fact_tag tables.  Both these tables are missing timeId and dateId.

For the table fact_rating, the following potentially long-running queries need to be executed. For timeId:

UPDATE fact_rating SET timeId = (
    SELECT a.timeId FROM dim_time a
    WHERE a.military = hour(from_unixtime(fact_rating.timestamp))
      AND a.minute = minute(from_unixtime(fact_rating.timestamp))
);

For dateId:

UPDATE fact_rating SET dateId = (
    SELECT a.dateId FROM dim_date a
    WHERE a.year = year(from_unixtime(fact_rating.timestamp))
      AND a.month = month(from_unixtime(fact_rating.timestamp))
      AND a.day = day(from_unixtime(fact_rating.timestamp))
);

Similarly, for the table fact_tag, we run the following for timeId:

UPDATE fact_tag SET timeId = (
    SELECT a.timeId FROM dim_time a
    WHERE a.military = hour(from_unixtime(fact_tag.timestamp))
      AND a.minute = minute(from_unixtime(fact_tag.timestamp))
);

For dateId, again, the snippet is similar to the fact_rating one above:

UPDATE fact_tag SET dateId = (
    SELECT a.dateId FROM dim_date a
    WHERE a.year = year(from_unixtime(fact_tag.timestamp))
      AND a.month = month(from_unixtime(fact_tag.timestamp))
      AND a.day = day(from_unixtime(fact_tag.timestamp))
);

It is worth adding a WHERE clause to each of these to exclude rows where the id being updated is already set (i.e., for dateId, append ‘WHERE dateId IS NULL’ at the end of the update query).  Some of these may (just may) take days, so this is a no-brainer, although I didn’t do it in all cases.
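For example, the dateId update above becomes restartable like so:

UPDATE fact_rating SET dateId = (
    SELECT a.dateId FROM dim_date a
    WHERE a.year = year(from_unixtime(fact_rating.timestamp))
      AND a.month = month(from_unixtime(fact_rating.timestamp))
      AND a.day = day(from_unixtime(fact_rating.timestamp))
)
WHERE dateId IS NULL;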

If anyone is wondering how long these take, I recall the timeId population for table fact_rating took my Core 2 Duo 2.26 GHz, 4 GB of RAM laptop 18 hours and change.  It’s an ‘execute and go to bed’ operation for sure for me.  Looking forward to having a completed star schema to play with!

Movielens Datasets for BI – 10 Million Movie Ratings

The 10 million ratings set from Movielens allows us to create two fact tables (linked?!).  We can create a fact table for ratings and another one for tags.

Worth noting that the userIds between these two schemas (one from ratings.dat and the other from tags.dat) do match across sets.  I.e., userId 1234 in the tags dataset is user 1234 (if existing) in the ratings dataset.  So we could link these but, for now, it’s simpler not to.
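A quick sanity check that the overlap is real (my own ad hoc query, not part of the load):

SELECT COUNT(DISTINCT t.userID) AS users_in_both
FROM `10m_tags` t
JOIN `10m_ratings` r ON r.userID = t.userID;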

Information provided in the 10 million ratings set allows us to create similar star schemas as follows:

Ratings

Tags
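As a rough DDL sketch of the two stars (measure columns and types are my assumptions from the raw files; the timeId/dateId keys point at the dim_time/dim_date dimension tables):

CREATE TABLE fact_rating (
    userId  INT,
    movieId INT,
    rating  DECIMAL(2,1),  -- assumed; ratings run 0.5 to 5.0 in the raw data
    timeId  INT,           -- FK to dim_time
    dateId  INT            -- FK to dim_date
);

CREATE TABLE fact_tag (
    userId  INT,
    movieId INT,
    tag     VARCHAR(255),  -- assumed length
    timeId  INT,
    dateId  INT
);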

I’ll be running these proposed schemas by my peers at work and will post back any insight I am sure to be missing.  A lot of the design here is based on my perception of what matters, and it’s worth seeking advice before I trail off too much.

It is odd that we have no more information on the users in our set.  The smaller, 1 million set does provide more interesting user information but no tags from said users.  Perhaps it is worth considering both datasets at the same time and treating them as two different sets of study.

Revisiting the MovieLens Database

What a great opportunity to put the MovieLens data to good use (again).  Previously, I had used this dataset to build a movie recommendation engine based on the book Collective Intelligence.  I will be spending lots of time at work creating OLAP cubes for business reporting.

Immediately, I thought the Movielens dataset would be as good as it gets for practicing the concepts and techniques I’ll be exposed to at work.  I am going to need all the help and practice I can get, since the world of Business Intelligence is new to me.

A little history about the dataset I am referring to can be found at GroupLens Research.  The dataset I am planning on using can be found here. This time, I’ll be using the 10 million ratings set, since it seems everything at work is measured in millions.  There are a few other variations of this data. Look around; you may find something you like.

After downloading the files, I proceeded to import all the data as is into our ‘original’ database.  I am using some late flavor of mySQL and was able to import the data without much fuss.  The only thing I had to do was replace the delimiters (double colons, ::) in the movies data file with something else.  Movie titles with colons were being inadvertently ‘split’ across more than one column on import.  Doing this produces a database with three initial tables.

One table holds the 10 million ratings (from users who each rated at least 20 movies), another holds the tags those users applied, and the last one holds all the movie information, looking like this:

10m_ratings(userID, movieID, rating, timestamp)
10m_tags(userID, movieID, tag, timestamp)
10m_movies(movieID, title, genres)
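For the curious, the import itself was roughly along these lines (a sketch, assuming the double colons were first swapped for pipes in each .dat file; repeat per table):

LOAD DATA LOCAL INFILE 'ratings.dat'
INTO TABLE `10m_ratings`
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
(userID, movieID, rating, timestamp);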

The table naming prefix helps infer purpose, since I am restricting myself to one database for this project.  Besides, sql styles and the like suit me fine 😉

In total, you end up with a database a bit under 400 MB.  Relevant information for the data I will be using can be found here.  If there is any interest and licensing allows it, I’ll post a sql dump or db backup to share and save someone some time.

What Movie Now?

Finally, I was able to secure low-cost hosting for trying the skills from the book out.

After tons of inconveniences, I’ve launched whatmovienow.com. This is a work in progress, and I will try to add features based on the collective intelligence book as best I can. The site mostly employs the ranking algorithms from the book. It does not re-evaluate movies based on rankings from site visitors, as that would take too many resources. Updates should be a lot quicker now that the hard work is done :).

Getting the site off the ground has taken more time than I had intended.

First, I had to change from mySQL to MSSQL. I was going to use Dreamhost to host the mySQL database, but performance was very irregular and sluggish.

Second, I originally wrote the website in Coldfusion, using the Model-Glue framework. This worked fine on my local computer; however, the hosting provider had some restrictions which further delayed deployment. I ended up with two slightly different versions of the site: one for local dev and one for live :(. I intend to configure my local computer to better reflect the production server.

How apologetic… the only thing that matters is that the site is out and I can resume writing the blog.