Real Time Dashboards 2 – update

Oops, I missed a field in the log create script in my previous post.

You can pick up the proper file here -> log-revised-sql. The create script for the step table is fine; that file is here -> step.sql.

After running these two scripts (remember to rename them to '.sql'), you should have both the step and log tables in your database.

Next time, we’ll be setting up the Java part of the application…

Real Time Dashboards 2 – SQL layer

This is part 2 of Real Time Dashboards. Reading the previous introductory post (it is short) will put this post in context.

As previously mentioned, our monitor checks up on an ETL process which is itself updating many databases as it executes.

For this post series, we are going to randomly grab an existing workflow (thank you, Mazda) to monitor. We will be monitoring a '100-point' car inspection. Let's imagine a car inspector checking things off on his tricorder or some other digital device. This device is, of course, running a program that records everything to a simple database.

I know, one would actually wait in a lobby while technicians check a car out. Feel free to swap my workflow with your own; the point to take from this is that our monitor will work well regardless of the process being monitored 🙂

Hoping to keep this part basic, I am constructing a table called Step with the following fields: id, label, parentId, workEffort. WorkEffort will be a numeric field, a hypothetical amount of work each respective step takes (it could represent time, hammers to the head, amount of cowbell, etc.). The parentId reference lets this one table describe any number of main steps, each with any number of sub steps, in a tree hierarchy of arbitrary depth. It is important to note that only the inner-most children (the leaf steps) will be assigned a workEffort value; aggregating those values would, in turn, yield the workEffort of each parent. Peachy.
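
For reference, here is a minimal sketch of what the create script could look like in MySQL. The column types are my assumption; the downloadable script at the end of this post is the authoritative version.

-- Sketch of the step table; types are assumptions, see the attached script.
CREATE TABLE step (
  id INT NOT NULL AUTO_INCREMENT,
  label VARCHAR(100) NOT NULL,
  parentId INT NULL,       -- NULL for top-level (main) steps
  workEffort INT NULL,     -- only leaf steps carry a value
  PRIMARY KEY (id)
);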

We will have a second table, Log, where the status of each of these processes can be recorded. The log table will have the following fields: id, active, success, startTime, endTime. Both active and success will be booleans, indicating which process is currently active and which process, if any, failed.
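
Again, only as a sketch (the downloadable script, revised in the update above, is the real one), the log table could be created along these lines. Note that I am assuming a stepId column to tie each log row back to its step:

-- Sketch of the log table; the revised script from the update post is authoritative.
CREATE TABLE log (
  id INT NOT NULL AUTO_INCREMENT,
  stepId INT NOT NULL,     -- assumed reference back to step.id
  active BOOLEAN NOT NULL DEFAULT FALSE,
  success BOOLEAN NOT NULL DEFAULT FALSE,
  startTime DATETIME NULL,
  endTime DATETIME NULL,
  PRIMARY KEY (id)
);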

Simple, but complete enough to represent many real-life processes we could monitor. At work, our typical process consists of a series of main steps, each composed of many sub steps. The tables described here would even allow us to take this a bit further, as long as we remain sensible in our expectations of performance and complexity.

I have included the scripts to create each of these tables, step and log, for you. Download them and replace the .doc extension with .sql. WordPress, free and wonderful, wouldn't allow me to upload .sql files. Bah!

In the next post I will be describing this 'car inspection' process and how we can both model it and mimic it realistically.

Real Time Dashboards 1 – Introduction

Based on my presentation at BarCamp Orlando 2010, I’ve decided to do a series of posts generalizing my work.  Hopefully this will help me remember the skills used and, more importantly, help someone else in turn.

The dashboard built for my presentation consisted of an Adobe Flex application as the front end, a Java backend, and BlazeDS (Adobe's as well) in the middle connecting the two.  I had developed a tech demo for work where this tool would provide real time operational insight into the progress of a long running task.  More specifically, this operational dashboard would relay progress, in real time, on an Extract, Transform and Load (ETL from now on) job we needed to monitor.

This ETL job is a combination of Java tools and different databases where data is extracted, then curated, then processed to be reused in different ways.  It's a thing of beauty in execution but a bit intimidating to someone like me who lacks a thorough understanding of what I do for a living 🙂

I thought it would be nice to build a tool to monitor this process.  It would have to be non-intrusive to my team and to the ETL, and had to be completely independent of the tool it was keeping tabs on.  Thinking this would be a fun way to use Flex and BlazeDS (I miss working with Flex), I set out to build such a tool.  Unbeknownst to me, the task would prove to be a bit more complicated than I thought it would be (ignorance keeps me bold).  In the end, some of the most daunting requirements were easily met with a completely free set of tools, and the end result was not bad looking either.  It even works!

I used quite a few tools to complete a working example and will try to break the posts down into main components that can be completed independently.  Feel free to stay tuned (or not) for the juicy parts. Hopefully, I can at least save you some time by showing you what not to do.

Briefly, the dashboard was constructed with the following technologies, all freely available:
For the software parts, Adobe Flex (ActionScript 3), Java and Hibernate.
For the server components, I used Tomcat and Adobe BlazeDS. Oh yeah, you are going to need a database server of your choice.  I will be using MySQL.
Any text editor would do as well; I am most comfortable in Eclipse but you may use anything you want.

Movielens – Date Dimension

Using the book The Data Warehouse Toolkit as a reference (thanks Peter), I've revised the date dimension.  I figure having a better base in our data will give me more options when creating reports. I am only a few steps in this project from diving into reporting.

Previously, our date dimension looked like this:
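
This is only a sketch, reconstructed from the year/month/day lookups in the fact table UPDATE queries rather than from the exact create script:

-- Roughly what dim_date held before the revision.
CREATE TABLE dim_date (
  dateId INT NOT NULL AUTO_INCREMENT,
  year INT NOT NULL,
  month INT NOT NULL,
  day INT NOT NULL,
  PRIMARY KEY (dateId)
);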

While not bad (I think it looks perfect), I admit the book does a much better job of preparing a date dimension with a lot more information.  Essentially, data warehousing initiatives tend to do a lot of the work ahead of time in order to save cycles on report creation.  Per the book, we could revise our dimension to be better represented as follows:
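
The exact column list below is a guess on my part, but it shows the kind of derived, Kimball-style attributes the book suggests storing ahead of time:

-- Illustrative revision of dim_date with pre-derived attributes.
CREATE TABLE dim_date (
  dateId INT NOT NULL AUTO_INCREMENT,
  fullDate DATE NOT NULL,
  year INT NOT NULL,
  quarter INT NOT NULL,
  month INT NOT NULL,
  monthName VARCHAR(10) NOT NULL,
  day INT NOT NULL,
  dayOfWeek VARCHAR(10) NOT NULL,
  weekOfYear INT NOT NULL,
  isWeekend BOOLEAN NOT NULL,
  PRIMARY KEY (dateId)
);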

Although subtle and seemingly unnecessary, it is common practice to store all of these (and lots more) derived values in our database in order to ease report construction.  Clearly, we could derive all of these recently added bits of information at report run time.  The point is to do that work ahead of time, since deriving it on every run does nothing but diminish report performance and complicate report creation.  Taking things a bit further, we could replace our date dimension with a dateStamp and be on our way, but that would be a chore.  The same holds true, I admit, for the other date information.  Hard drive space is cheap, so I am not going to complain.

Movielens Dataset – One Million

After some thought, I've decided to switch data sources from the ten million movie rating set to the one million movie rating set.  As seen below, this dataset just has lots more interesting data, which will provide us with more dimensions to explore.

The exact same data scrubbing I did on the other data set a few posts ago applies here (same SQL as well).  Also, all the secondary supporting data generated (time and date dimensions) will fit just as well.

Unfortunately, we lose the ability to dig into the description tags applied to movies by movie reviewers.  On the bright side, we gain information on gender, age range, location (zip code) and occupation.  Clearly, the losses are outweighed by all this information gained, and it will make for a much stronger dataset to learn from.

As previously mentioned, all the side work done for the ten million rating set will be reused here, including the time and date dimensions, etc.  No loss in creating those either.

Movielens OLAP Model – Dimension Tables

I ran into 'time' difficulties populating the fact tables with timeIds and dateIds.  Even a relatively simple query was taking too long to fetch the correct values into the table.  I guess doing 10 million row joins between two tables (10m_ratings and fact_rating) joined to dim_time/dim_date is a bit too much for my laptop.  Maybe I stared at it too much.

The only thing that I can think of is regenerating both fact tables and including the original timeStamp in both of them this time.  This will save a join and, hopefully, make populating these fact tables speedy enough not to drive me nuts.  I can remove these columns when they are not needed anymore.

Properly indexing the fields used in the UPDATEs from the prior posting will help tremendously as well.
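
As a sketch of what I mean (the index names here are arbitrary), the correlated subqueries in those UPDATEs look up the dimension tables on these columns, and the fact-side ids are worth indexing for restricted re-runs:

-- Support the dimension lookups performed by the correlated subqueries.
CREATE INDEX idx_time_military_minute ON dim_time (military, minute);
CREATE INDEX idx_date_year_month_day ON dim_date (year, month, day);
-- Optional: helps re-runs filtered with 'WHERE timeId IS NULL' / 'WHERE dateId IS NULL'.
CREATE INDEX idx_fact_rating_timeid ON fact_rating (timeId);
CREATE INDEX idx_fact_rating_dateid ON fact_rating (dateId);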

This is not even the cool part of this project but more of a foundation so that I know how things work.  Commercial products do these things for you anyways I think.

This revised schema better illustrates this change.

Movielens – Completing fact tables

With the previous schema changes in place, it is now a matter of running queries in order to complete our fact_rating and fact_tag tables.  Both these tables are missing timeId and dateId.

For the table fact_rating, the following potentially long-running queries need to be executed. For timeId:

UPDATE fact_rating SET timeId = (
SELECT a.timeId FROM dim_time a
WHERE a.military = hour(from_unixtime(fact_rating.timestamp))
AND a.minute = minute(from_unixtime(fact_rating.timestamp))
)

For dateId:

UPDATE fact_rating SET dateId = (
SELECT a.dateId FROM dim_date a
WHERE a.year = year(from_unixtime(fact_rating.timestamp))
AND a.month = month(from_unixtime(fact_rating.timestamp))
AND a.day = day(from_unixtime(fact_rating.timestamp))
)

Similarly, for the table fact_tag, we run the following for timeId:

UPDATE fact_tag SET timeId = (
SELECT a.timeId FROM dim_time a
WHERE a.military = hour(from_unixtime(fact_tag.timestamp))
AND a.minute = minute(from_unixtime(fact_tag.timestamp))
)

For dateId, again, the snippet is similar to the fact_rating one above:

UPDATE fact_tag SET dateId = (
SELECT a.dateId FROM dim_date a
WHERE a.year = year(from_unixtime(fact_tag.timestamp))
AND a.month = month(from_unixtime(fact_tag.timestamp))
AND a.day = day(from_unixtime(fact_tag.timestamp))
)

It is worth adding a WHERE clause to each of these so that only rows where the id being updated is still null get touched (i.e., for dateId, append 'WHERE dateId IS NULL' at the end of the update query).  Some of these may (just may) take days, so this is a no-brainer, although I didn't do it in all cases.
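
For example, the dateId query for fact_rating would become (only the last line changes):

UPDATE fact_rating SET dateId = (
SELECT a.dateId FROM dim_date a
WHERE a.year = year(from_unixtime(fact_rating.timestamp))
AND a.month = month(from_unixtime(fact_rating.timestamp))
AND a.day = day(from_unixtime(fact_rating.timestamp))
)
WHERE dateId IS NULL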

If anyone is wondering how long these take, I recall the timeId population for table fact_rating took my Core 2 Duo 2.26, 4 GB of RAM laptop 18 hours and change.  It's an 'execute and go to bed' operation for sure for me.  Looking forward to having a completed star schema to play with!

Movielens Datasets for BI – 10 Million Movie Ratings

The 10 million ratings set from Movielens allows us to create two fact tables (linked?!).  We can create a fact table for ratings and another one for tags.

Worth noting that the userIds between these two schemas (one from ratings.dat and the other from tags.dat) do match across sets.  I.e. – userId 1234 in the tags dataset is user 1234 (if existing) in the ratings dataset.  So we could link these but, for now, it's simpler not to.

Information provided in the 10 million ratings set allows us to create similar star schemas as follows:

Ratings

Tags
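
In rough SQL terms, and guessing a bit at the exact columns, the two fact tables would look something like this (illustrative only, not the final diagrams):

-- Rough sketch of the two proposed fact tables; columns are a best guess.
CREATE TABLE fact_rating (
  userId INT NOT NULL,
  movieId INT NOT NULL,
  rating DECIMAL(2,1) NOT NULL,
  timeId INT NULL,
  dateId INT NULL
);

CREATE TABLE fact_tag (
  userId INT NOT NULL,
  movieId INT NOT NULL,
  tag VARCHAR(255) NOT NULL,
  timeId INT NULL,
  dateId INT NULL
);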

I'll be running these proposed schemas by my peers at work and will post back any insight I am sure to be missing.  A lot of the design here is based on my perception of what matters, and it's worth seeking advice before I trail off too much.

It is odd that we have no more information for the users in our set.  The smaller, 1 million set does provide more interesting user information, but no tags from said users.  Perhaps it is worth considering both datasets at the same time and treating them as two different sets of study.

Revisiting the MovieLens Database

What a great opportunity to put the MovieLens data to good use (again).  Previously, I had used this dataset to build a movie recommendation engine based on the book Collective Intelligence.  I will be spending lots of time at work creating OLAP cubes for business reporting.

Immediately, I thought the Movielens dataset would be as good as it gets for practicing the concepts and techniques I'll be exposed to at work.  I am going to need all the help and practice I can get, since the world of Business Intelligence is new to me.

A little history about the dataset I am referring to can be found at GroupLens Research.  The dataset I am planning on using can be found here. This time, I'll be using the 10 million ratings set, since it seems everything at work is measured in millions.  There are a few other variations of this data. Look around; you may find something you like.

After downloading the files, I proceeded to import all the data as-is into our 'original' database.  I am using some late flavor of MySQL and was able to import the data without much fuss.  The only thing I had to do was replace the delimiters (double colons ::) in the movies data file with something else.  Movie titles with colons were being inadvertently 'split' across more than one column on import.  Doing this produces a database with three initial tables.

One table holds the tags users applied to movies, another one holds all the movie information, and the last one holds the 10 million ratings (from users who rated at least 20 movies), looking like this:

10m_ratings(userID, movieID, rating, timestamp)
10m_tags(userID, movieID, tag, timestamp)
10m_movies(movieID, title, genres)
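
As a sketch of how the import could look once the :: delimiters have been replaced (the pipe delimiter and the column types are my assumptions; adjust to taste):

-- Sketch only: assumes ratings.dat had its '::' delimiters replaced with '|'.
CREATE TABLE `10m_ratings` (
  userID INT NOT NULL,
  movieID INT NOT NULL,
  rating DECIMAL(2,1) NOT NULL,
  timestamp INT NOT NULL
);

LOAD DATA LOCAL INFILE 'ratings.dat'
INTO TABLE `10m_ratings`
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
(userID, movieID, rating, timestamp);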

The table-name prefix helps infer purpose, since I am restricting myself to one database for this project.  Besides, these SQL naming styles and the like suit me fine 😉

In total, you end up with a database of a bit less than 400 MB.  Relevant information for the data I will be using can be found here.  If there is any interest and licensing allows it, I'll post an SQL dump or a db backup to share and save someone some time.