Movielens Dataset – One Million

After some thought, I’ve decided to switch data sources from the ten million movie rating set to the one million movie rating set.  As seen below, this dataset simply has a lot more interesting data, which will provide more dimensions to explore.

The exact same data scrubbing (same SQL as well) applies as it did for the other data set a few posts ago.  Also, all the secondary supporting data generated (time and date dimensions) will fit just as well.

Unfortunately, we lose the ability to dig into description tags applied to movies by movie reviewers.  On the bright side, we gain information on gender, age range, location (zip code) and occupation.  Clearly, the losses are outweighed by all this information gained, and it will make for a much stronger dataset to learn from.
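
To give a feel for how that extra user information might land in the model, here is a minimal sketch of a user dimension; dim_user and its column names are my own assumptions, not something shipped with the dataset:

-- Hypothetical user dimension for the 1M set (table and column names assumed).
CREATE TABLE dim_user (
    userId     INT PRIMARY KEY,   -- matches the userId in the ratings file
    gender     CHAR(1),           -- 'M' or 'F'
    ageRange   VARCHAR(10),       -- MovieLens codes ages as ranges, e.g. '18-24'
    occupation VARCHAR(50),
    zipCode    VARCHAR(10)
);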

As previously mentioned, all the side work done for the ten million rating set will be reused here, including the time and date dimensions.  No loss in creating these either.

Movielens OLAP Model – Dimension Tables

I ran into ‘time’ difficulties when populating the fact tables with timeIds and dateIds.  Even a relatively simple query was taking too long to fetch the correct values into the table.  I guess doing 10 million row joins between two tables (10m_ratings and fact_rating) joined to dim_time/dim_date is a bit too much for my laptop.  Maybe I stared at it too much.

The only thing I can think of is regenerating both fact tables and including the original timestamp in both of them this time.  This will save a join and, hopefully, make populating these fact tables speedy enough not to drive me nuts.  I can remove these columns once they are no longer needed.
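
For reference, the change itself is tiny; something along these lines (assuming the helper column is simply called timestamp, matching the queries further down) carries the raw epoch value on the fact tables so the lookups no longer need a join back to the staging tables:

-- Carry the raw epoch timestamp on the fact tables, temporarily.
ALTER TABLE fact_rating ADD COLUMN timestamp INT;
ALTER TABLE fact_tag ADD COLUMN timestamp INT;

-- Once timeId and dateId are populated, the helper columns can be dropped:
-- ALTER TABLE fact_rating DROP COLUMN timestamp;
-- ALTER TABLE fact_tag DROP COLUMN timestamp;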

Properly indexing fields used in the UPDATEs from prior posting will help tremendously as well.
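
Something like the following (index names are my own) should cover the lookups those UPDATEs lean on:

-- Cover the dimension lookups used by the timeId/dateId UPDATEs.
CREATE INDEX idx_dim_time_military_minute ON dim_time (military, minute);
CREATE INDEX idx_dim_date_ymd ON dim_date (year, month, day);

-- And the 'is it populated yet?' checks on the fact tables themselves.
CREATE INDEX idx_fact_rating_timeId ON fact_rating (timeId);
CREATE INDEX idx_fact_rating_dateId ON fact_rating (dateId);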

This is not even the cool part of this project; it’s more of a foundation so that I know how things work.  Commercial products do these things for you anyway, I think.

This revised schema better illustrates this change.

Movielens – Completing fact tables

With the previous schema changes in place, it is now a matter of running queries in order to complete our fact_rating and fact_tag tables.  Both these tables are missing timeId and dateId.

For the table fact_rating, the following potentially long-running queries need to be executed. For timeId:

UPDATE fact_rating SET timeId = (
    SELECT a.timeId FROM dim_time a
    WHERE a.military = hour(from_unixtime(fact_rating.timestamp))
      AND a.minute = minute(from_unixtime(fact_rating.timestamp))
);

For dateId:

UPDATE fact_rating SET dateId = (
    SELECT a.dateId FROM dim_date a
    WHERE a.year = year(from_unixtime(fact_rating.timestamp))
      AND a.month = month(from_unixtime(fact_rating.timestamp))
      AND a.day = day(from_unixtime(fact_rating.timestamp))
);

Similarly, for the table fact_tag, we run the following for timeId:

UPDATE fact_tag SET timeId = (
    SELECT a.timeId FROM dim_time a
    WHERE a.military = hour(from_unixtime(fact_tag.timestamp))
      AND a.minute = minute(from_unixtime(fact_tag.timestamp))
);

For dateId, again, the snippet is similar to the fact_rating one above:

UPDATE fact_tag SET dateId = (
    SELECT a.dateId FROM dim_date a
    WHERE a.year = year(from_unixtime(fact_tag.timestamp))
      AND a.month = month(from_unixtime(fact_tag.timestamp))
      AND a.day = day(from_unixtime(fact_tag.timestamp))
);

It is worth adding a WHERE clause to each of these so that only rows where the id being updated is still null get touched (i.e., for dateId, append ‘WHERE dateId IS NULL’ to the end of the update query).  Some of these may (just may) take days, so this is a no-brainer, although I didn’t do it in all cases.
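
As an example of what that looks like for the dateId case on fact_rating:

-- Same query as above, restricted to rows that have not been populated yet.
UPDATE fact_rating SET dateId = (
    SELECT a.dateId FROM dim_date a
    WHERE a.year = year(from_unixtime(fact_rating.timestamp))
      AND a.month = month(from_unixtime(fact_rating.timestamp))
      AND a.day = day(from_unixtime(fact_rating.timestamp))
)
WHERE dateId IS NULL;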

If anyone is wondering how long these take, I recall the timeId population for the table fact_rating took my Core 2 Duo 2.26 GHz, 4 GB of RAM laptop 18 hours and change.  It’s an ‘execute and go to bed’ operation for sure for me.  Looking forward to having a completed star schema to play with!

Movielens Datasets for BI – 10 Million Movie Ratings

The 10 million ratings set from Movielens allows us to create two fact tables (linked?!).  We can create a fact table for ratings and another one for tags.

Worth noting that userIds between these two schemas (one from ratings.dat and the other from tags.dat) do match across sets, i.e., userId 1234 in the tags dataset is user 1234 (if present) in the ratings dataset.  So we could link these but, for now, it’s simpler not to.
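
If we ever do decide to link them, a quick sanity check along these lines (using the table names from the original import) shows how many tagging users also appear in the ratings set:

-- Count users that show up in both the tags and the ratings tables.
SELECT COUNT(DISTINCT t.userID) AS overlapping_users
FROM `10m_tags` t
JOIN `10m_ratings` r ON r.userID = t.userID;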

Information provided in the 10 million ratings set allows us to create similar star schemas as follows:

Ratings

Tags
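
Roughly, the two proposed fact tables boil down to something like this; the column names are my working assumptions from the diagrams, not a final design:

-- Sketch of the two star-schema fact tables (names are working assumptions).
CREATE TABLE fact_rating (
    userId  INT,
    movieId INT,
    rating  DECIMAL(2,1),   -- ratings come in half-star steps
    timeId  INT,            -- FK to dim_time
    dateId  INT             -- FK to dim_date
);

CREATE TABLE fact_tag (
    userId  INT,
    movieId INT,
    tag     VARCHAR(255),
    timeId  INT,
    dateId  INT
);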

I’ll be running these proposed schemas by my peers at work and will post back any insight I am sure to be missing from these.  A lot of the design here is based on my perception of what matters, and it’s worth seeking advice before I trail off too much.

It is odd that we have no more information for the users in our set.  The smaller, 1 million set does provide more interesting user information but no tags from said users.  Perhaps it is worth considering both datasets at the same time and treating them as two different sets of study.

Revisiting the MovieLens Database

What a great opportunity to put the MovieLens data to good use (again).  Previously, I had used this dataset to build a movie recommendation engine based on the book Programming Collective Intelligence.  I will be spending lots of time at work creating OLAP cubes for business reporting.

Immediately, I thought the Movielens dataset would be as good as it gets for practicing the concepts and techniques I’ll be exposed to at work.  I am going to need all the help and practice I can get, since the world of Business Intelligence is new to me.

A little history about the dataset I am referring to can be found at GroupLens Research.  The dataset I am planning on using can be found here. This time, I’ll be using the 10 million ratings set, since it seems everything at work is measured in millions.  There are a few other variations of this data; look around, you may find something you like.

After downloading the files, I proceeded to import all the data as is into our ‘original’ database.  I am using some recent flavor of MySQL and was able to import the data without much fuss.  The only thing I had to do was replace the delimiters (double colons ::) in the movies data file with something else; movie titles containing colons were being inadvertently ‘split’ across more than one column on import.  Doing this produces a database with three initial tables (a sketch of the load statement follows the table listing below).

One table holds the tags users applied to movies, another one holds all the movie information, and the last one holds the 10 million ratings (from users who rated at least 20 movies each), looking like this:

10m_ratings(userID, movieID, rating, timestamp)
10m_tags(userID, movieID, tag, timestamp)
10m_movies(movieID, title, genres)
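
For anyone doing the same, the load boils down to something like this once the ‘::’ delimiters have been swapped for a single character; the pipe character and the file path here are my choices, not a requirement:

-- Assumes movies.dat has had its '::' delimiters replaced with '|' beforehand,
-- so titles containing colons no longer get split on import.
LOAD DATA LOCAL INFILE 'movies.dat'
INTO TABLE `10m_movies`
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
(movieID, title, genres);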

The table-naming prefix helps infer purpose, since I am restricting myself to one database for this project.  More so, SQL styles and the like suit me fine 😉

In total, you end up with a database a bit less than 400 MBs.  Relevant information for the data I will be using can be found here.  If there is any interest and licensing allows it, I’ll post a SQL dump or db backup to share and save someone some time.

PCI – Programming Collective Intelligence Book Errors

Programming Collective Intelligence is a wonderful book, but being a first edition means it has quite a few errors, which will throw off a Python newbie very easily.

If only to keep me sane, this page at O’Reilly’s website is invaluable while working through this book. Especially the unconfirmed errata page, which seems a lot more up to date.

I’ll be submitting errors as I go if they are not there already. So far, I’ve run into two issues that weren’t listed and submitted them. I hope they update frequently.

PCI – Searching and Ranking – PySQLite tests fail on Mac

Hoping to find an error in my script, I am going over my steps for chapter four of the book Programming Collective Intelligence. So far, my code looks good as far as I can tell (I’m no expert).

Also, I tested all dependencies for this chapter, mainly the installation of BeautifulSoup, an updated version of SQLite, and pysqlite.

BeautifulSoup’s and SQLite’s installations completed successfully (again). PySQLite also installed fine; however, testing the installation with an included script fails on my computer. I did not run this test the first time; I need to be more thorough. It seems there is a defect in the test script on the Mac.

If you’re running pySqlite-2.4.2 on a Mac, you can find more information here. This is not the cause of my previous errors, however, since my script cannot open the files to index them.

More later…

PCI – Chapter Four – Adding to the Index and SQLite

Post of confusion. My Python script does not read the files ‘into’ the SQLite database.

I had to install SQLite3 (my Mac came with a 2.x version) in order to install pysqlite-2.4.2. Maybe pysqlite is using the older SQLite version on the computer… I am following the book to a tee, not trying any fancy ColdFusion or SQL.

Everything seems to be installed properly. When ‘searchengine.py’ runs from the terminal, it reports that it could not open any of the files in my directory. I tried feeding it all sorts of files to no avail. Very frustrating…

On the other hand, I am having a blast playing with SQLite. By the way, SQLite is the coolest thing I’ve fiddled with today. This is very powerful and convenient.

FYI – SQLite is a ‘db in a file’. It’s very convenient for development, and there is much fanfare from people using it on production systems as well. It seems to come embedded in everything from Mail.app (yes it is!) to cell phones and all sorts of electronic devices.

For lots of cool information on this, you can go to Leo Laporte’s FLOSS Weekly (lately it is 😉) episode 26 podcast. Also (this is where I first saw SQLite), Google Video has this great overview of SQLite, though a little dated.

Lastly – Searching on Google Video for ‘genre:educational _stuff_‘ usually returns the most informative videos on topics of interest.

Next on PCI TODO: index my movie files using Python!

PCI – Chapter Four – Cleaning the Data

After retrieving all the movie files from Wikipedia, some cleanup was in order. I decided to remove, based on size, the files which did not point to the actual movie in question. A simple file-size check cut the set down considerably, to a little more than 430 movie files. While this does not make for an amazing data representation of the movies in the database, it will be more than enough data for the sake of the exercises. Conveniently, this operation leaves us with only 21 MBs of textual data to work with, a third of the original set retrieved.

PCI – Chapter Four – Searching and Ranking: Collecting Data

I’ve just completed the first step towards building the movie pages search engine. Having spent the better part of an hour scraping Wikipedia, I’ve been able to save more than 67 MBs of movie pages to my laptop.

DISCLAIMER – I am positive Wikipedia‘s servers do not break a sweat fulfilling requests like these, but please bear in mind that abusing someone else’s servers is in bad taste.

I am going to have to scrub data clean of all HTML tags, etc. Ideally, I will only keep the ‘plot’ of each movie in each page. This will both keep data size down and provide more relevant search results.

Here’s a code (ColdFusion) snippet for grabbing the pages from Wikipedia:

<!--- Base URL for Wikipedia article pages --->
<cfset urlLink = "http://en.wikipedia.com/wiki/">

<!--- Pull every movie title from the MovieLens 100k database --->
<cfquery name="getMovies" datasource="movieLens100k-mysql">
    SELECT itemId, Title
    FROM Items
    WHERE 1 = 1
    ORDER BY 1
</cfquery>

<cfflush>
<cfoutput query="getMovies">

    <cfset movieStr = title />

    <cftry>
        <!--- Strip the trailing "(year)" portion and swap spaces for underscores --->
        <cfif find("(", movieStr)>
            <cfset movieStr =
                replace(left(title, find("(", title) - 2), " ", "_", "all") />
        </cfif>
        <!--- Move a trailing ", The" style article back to the front of the title --->
        <cfif find(",", movieStr)>
            <cfset movieStr =
                trim(right(left(title, find("(", title) - 2), len(left(title, find("(", title) - 2)) - find(",", title) - 1)) & " "
                & left(left(title, find("(", title) - 2), find(",", title) - 1) />
        </cfif>

        <cfset movieUrl = variables.urlLink & movieStr />

        <!--- Fetch the page and write it out as <title>.html --->
        <cfhttp url="#movieUrl#" method="get" resolveurl="yes">

        <cffile action="write" output="#cfhttp.FileContent#" file="#expandPath('pages')#/#title#.html">

        <cfcatch>
            <!--- Flag any title we could not fetch or save --->
            <font color="red">#itemId# - #movieStr#</font><br/>
        </cfcatch>
    </cftry>
</cfoutput>

No, my code does not look like that in my IDE. It is properly indented and spaced, etc.; I still do not know how to use WordPress 😦 This leaves us with a directory of about 1600 HTML files, one per movie.

At this point, my main focus will be to decrease the size of the data to index as effectively as possible, hoping to end up with one clean movie entry per page. This will be the source of my index for the movie text (data) to search on eventually. Wow, that sounds like a lot of indexing even if I can scrub out a lot of the useless parts of the pages.

Since I am starting with about 67 MBs of text, it’s in my best interest to clean up as much as possible. Lots of scrubbing and parsing ahead. Let’s see how much of this textual data can be scraped off.