Movielens – Movie Ratings Analysis with OLAP Cubes
For this post, I will describe how to use the previously provided database to create data cubes from the Movielens Dataset. With these cubes, I will then create a few reports using Adobe Flex to illustrate the advantages of using data cubes for reporting instead of the more traditional ‘query and report’ practices from live databases, etc.
OLAP cubes are structures where data is grouped and arranged as a relation between different pieces of information of the same relationship for ease of consuming instead of storing. For example, a ‘product-sales-time’ cube could be constructed to express the relationship between products and the sale of products over time. Starting from a traditional relational model, a query can be built to return a de-normalized resultset with a record for every sale of every product for a specific span of time. Saving these results to a table would give us our ‘product-sales-time’ cube with two dimensions, products and time, and one fact, sales. Think of a cube now with each axis representing one of these pieces of information.
Revisiting our Movilens OLAP database…
From the schema above, we see that we have our fact table for ratings surrounded by multiple dimensions we can use for analysis.
Using this schema, we could create many different cubes like:
- ratings-zipcode-time (this would overlay beautifully on a map ;)
In general, combining different dimensions and facts will result in all sorts of different cubes to analyze different types of relations and information.
Another option that may be more valuable depending on the purpose of our reporting is to create a cube with our facts and multiple dimensions. This is sometimes called a hypercube and inherently provides more insight into the relationships that a cube with less dimensions.
I decided to go this route and create a cube with as many dimensions as possible to slice and dice our data. A cube like this:
Disclaimer – It is important to state that a movie may belong to more than one genre and this introduces a many-to-many relationship between movies and genres. As long as we use our cube to analyze ratings to our genre dimensions versus other dimensions, our analysis will be sound. However, computations for creating this cube do introduce duplicate entries for every movie rating with more than one genre and this would give false results for some cases. For example, analyzing ratings over time by occupation would not yield accurate insight since every movie will be represented as many times as the number of genres it has and we are not interested (in this case) about genres.
Creating almost the same cube, but without genres, would provide us with an accurate cube for these cases.
None of this matters for this post since creating one cube versus the other requires no different skills nor does it introduce any complications. Suffice it to state the implications of ‘many-to-many’ relationships in our data and use the proper cube for the proper purpose. I am certain there is an OLAP answer to dealing with these; I just haven’t gotten to that part of the book yet.
The following sql snippet will create the cube previously described for us:
Copy / paste here: CREATE TABLE `movieratingsage` ( `uid` int NOT NULL AUTO_INCREMENT, `agerange` varchar(50) NOT NULL, `genre` varchar(50) NOT NULL, `gender` varchar(50) NOT NULL, `occupation` varchar(50) NOT NULL, `title` varchar(255) NOT NULL, `rating` decimal(10,0) NOT NULL, `zipcode` char(20) NOT NULL, `month` varchar(50) DEFAULT NULL, `quarter` int(11) DEFAULT NULL, `year` char(4) DEFAULT NULL, PRIMARY KEY (`uid`) ) SELECT c.agerange, f.genre, h.gender, i.occupation, d.title, a.rating, j.zipcode, g.monthasstring as month, g.quarter, g.year FROM fact_rating_1m a INNER JOIN dim_user_1m b ON a.userid = b.userid INNER JOIN dim_agerange c ON b.agerangeid = c.agerangeid INNER JOIN dim_movie d ON a.movieid = d.movieid INNER JOIN movieGenre_assoc_1m e ON d.movieId = e.movieId INNER JOIN dim_genre f ON e.genreId = f.genreId INNER JOIN dim_date g ON a.dateid = g.dateid INNER JOIN dim_gender h ON b.genderId = h.genderId INNER JOIN dim_occupation i ON b.occupationId = i.occupationId INNER JOIN dim_zipcode j ON b.zipcodeId = j.zipcodeId ORDER BY g.dateId,1,2,3,4,5,6′
Since our data is not growing over time, we only have to create this cube once. Usually, we would be extracting data from a relational database and/or log files on some schedule. Following that, we would be transforming the data into our olap database. Finally, we would be recreating (or adding) this new data to our cube. That, in one brief paragraph, is the practice labeled ETL (extract, transform and load) in a nutshell.
The query above could take a long time to complete depending on the computer you’re in. It took more than an hour to create this cube for me. But, as just mentioned, it is a one time deal. What we end up with is a multi dimensional cube (hypercube) that relates our fact (movie ratings) to all our dimensions gender, genre, zip, occupation, age range, etc.
The basic premise of creating reports from this cube is to return the records of interest by filtering and ‘smashing’ them per dimension by aggregation.
For example, say we are interested in the relation between gender and rating, we could easily construct a sql query to return ratings counts, averages, highs and lows for a time span and then aggregate these by genre.
To illustrate the power of using this cube for reporting, I have created a Flex Application using Flex OLAP Components. Clicking on the image will open the app in a new window.
I have restricted the results to a random five thousand ratings for convenience. The flex application should behave the same regardless of the size of the cube (within reason); it would just take longer. I have locally loaded 100k cubes and load times are comparable to proprietary tools used at work. After loading, crunching reports (on top bar) is as timely and speedy as proprietary tools I’ve used, only prettier ;)
I have a few things I would like to change on app, like proper ordering of months, decent color scheme (I copied this one from an older app), etc but the data speaks for itself ;) Cool stuff.
Its interesting how much information can be inferred from using these olap components with a quick glance at the application. We could determine popularity of movies by age group, occupation, etc. Even nicer, as seen on later reports on app, we can examine multiple dimensions for correlations between tow or more dimensions.
Do you want to know how popular scifi is for the 56 year old + crowd who’s unemployed? Easy… It is easy to see the value of this information to a marketing department or movie studio. These same skills can be applied to all sorts of data even if you (the developer) does not even know what the specific reports or questions management (or whoever) is asking.
I’ll be putting more time into Flex OLAP Components and notes to see where this takes me…