Archive for the ‘Movielens’ Category
I’ve added ability to consume URLs whose output is XML as a data-source for Flex OLAP Cube. Check it out
Going thru this exercise has helped me understand a bit about creating data cubes and the uses they serve. This is but one of the many interesting (to me) things I am exposed to at work. Doing this from scratch provides me with insight unattainable with a ‘shrink-wrap’ tool.
Note – I had originally fetched data from accross the web but had to store xml files internally for show and tell. Enjoy.
I wanted to write the post ‘Slicing Your Own OLAP Cube’ but I am not there yet. From my last post, a recurring theme in my friends comments was that the dimensions and measures that can be inspected where set in stone. I thought I was doing fairly well but I can see their point. Having a cube and having it sliced in a way you don’t need is kinda useless.
For this post, I will describe how to use the previously provided database to create data cubes from the Movielens Dataset. With these cubes, I will then create a few reports using Adobe Flex to illustrate the advantages of using data cubes for reporting instead of the more traditional ‘query and report’ practices from live databases, etc.
1. I have broken down mysql dump file to a set of individual files per table. I got some complaints on unreasonable file size.
2. I’ve now included 10 million movie ratings as well which I hadn’t because of size as well. Now its a file to itself and you can skip if you find it difficult to import.
The set can be downloaded from Infochimps here. I will post any updates there as well.
Oops, missed a field in log create script on my previous post.
After running these two scripts (remember to rename to ‘.sql’), you should have the following tables in your database:
Next time, we’ll be setting up the Java part of the application…
This is part 2 of Real Time Dashboards. Reading the previous introductory post (it is short) will put this post in context.
As previously mentioned, our monitor checks up on an ETL process which is itself updating many databases as it executes.
For this post series, we are going to randomly grab an existing workflow (thank you Mazda) to monitor. We will be monitoring a ’100-point’ car inspection. Lets imagine a car inspector checking things off in his tricorder or some other digital device. This device is, of course, running a program that records everything to a simple database.
I know, one would actually wait in a lobby while technicians check a car out. Feel free to swap my workflow with your workflow; the point to take from this is that our monitor will work well regardless the process being monitored
Hoping to keep this part basic, I am constructing a table called Step with the following fields: id, label, parentId, workEffort. WorkEffort will be a numeric field, a hypothetical amount of work each respective step would take (could define time, hammers to the head, amount of cowbell, etc.). This basic structure will allow the definition of any number of main and sub processes in a tree hierarchy. This table can be used to describe any number of main steps, each with any number of sub steps. It is important to note that we would be assigning ‘workEffort’ values to the children of our main steps for as many levels as we may have. I.E. – only the children of these steps (the inner steps) will have workEffort… Aggregating these values would, in turn, yield the workEffort of its parent. Peachy.
We will have a second table, Log, where the status of each of these processes can be recorded. The log table will have the following fields: id, active, success, startTime, endTime. Both active and success will be booleans and will indicate which process is active and which process, if any, fails.
Simple but complete enough to represent many real life processes we could monitor. At work, our typical process consists of a series of main steps each comprised of many sub steps. The tables described here would even allow us to take this a bit further as long as we remain sensible in our expectations on performance and complexity.
I have included the scripts to create each of these tables step and log for you. Download and replace the doc extension for sql. WordPress, free and wonderful, wouldn’t allow me to upload .sql files. Bah!
For the next post I will be describing this ‘car inspection’ process and how we can both model and mimic realistically.
Based on my presentation at BarCamp Orlando 2010, I’ve decided to do a series of posts generalizing my work. Hopefully this will help me remember the skills used and, more importantly, help someone else in turn.
The dashboard built for my presentation consisted of an Adobe Flex application as the front end, a Java backend and BlazeDS (Adobe’s as well) in the middle to connect these two. I had developed a tech demo for work where this tool would provide real time operational insight into the progress of a long running task. More specifically, this operational dashboard would relay progress, realtime, on an Extract, Transform and Load (ETL from now on) job we needed to monitor.
This ETL job is a combination of java tools and different databases where data is extracted, then curated and then processed to be reused in different ways. It’s a thing of beauty in execution but a bit intimidating to someone like me who lacks a thorough understanding of what I do for a living
I thought it would be nice to build a tool to monitor on this process. It would have to be non intrusive to my team or to ETL, and had to be completely independent of the tool it was keeping tabs on. Thinking this would be a fun way to use Flex and BlazeDS (I miss working with Flex), I set out to built such a tool. Unbeknownst to be, the task would prove to be bit more complicated than I thought it would be (ignorance keeps me bold). In the end, some of the most daunting requirements where easily met with a completely free set of tools and the end result was not bad looking either. it even works!
I used quite a few tools to complete a working example and will try to break posts down into main components that can be completed independently. Feel free to stay tuned (or not) for the juicy parts. Hopefully, I can at least save you some time showing you what not to do.
Briefly, the dashboard was constructed with the following technologies, all freely available:
For the software parts, Abode Flex (ActionScript 3), Java and Hibernate.
For the server components, I used Tomcat and Adobe BlazeDS. Oh yeah, you are going to need a database server of your choice. I will be using MySQL.
Any text editor would do as well; I am most comfortable in Eclipse but you may use anything you want.
barcamporlando is here again. All the information is over at http://www.barcamporlando.org/.
I’ll be there with presentation in hand. I’ve been playing with BlazeDS and the tools at Inquira and would like to do a little show and tell. Fun stuff.
Using the book The Data Warehouse Toolkit as a reference (thanks Peter), I’ve revised the date dimension. I figure having a better base in our data will enable me and give me more options creating reports. I am only a few steps in this project from the diving into reporting.
Previously, our date dimension looked like this:
While not bad (I think it looks perfect), I admit book does a lot better job of preparing a date dimension with a lot more information. Essentially, data warehousing initiatives tend to do a lot of the work ahead of time in order to save cycles on report creation. Per the book, we could revise our dimension to be better represented as follows:
Although subtle and seemingly unnecessary, it is common practice to store all these (and lots more) derived values in our database in order to easy report construction. Clearly, we could derive all of these recently added bits of information at report run time. The point is saving this work since it does nothing but diminish report performance and complicate report creation. Taking things a bit further, we could replace our date dimension with a dateStamp and be on our way but this would be a chore. The same hold true, I admit, for other date information. Harddrive space is cheap so I am not going to complain.