Mario Talavera Writes

- My Development Journal

Movielens OLAP Cube – Dimension Tables

leave a comment »

I ran into ‘time’ difficulties on populating fact tables with timeIds and dateIds.  Even a relatively simple query was taking too long to fetch correct values into table.  I guess doing 10 million row joins between two tables (10m_ratings and fact_rating) joined to dime_time/dime_date is a bit too much for my laptop.  Maybe I stared at it too much.

The only thing that I can think of is regenerating both fact tables and including the original timeStamp in both of them this time.  This will save a join and, hopefully, make populating these fact tables speedy enough not to drive me nuts.  I can remove these columns when they are not needed anymore.

Properly indexing fields used in the UPDATEs from prior posting will help tremendously as well.

This is not even the cool part of this project but more of a foundation so that I know how things work.  Commercial products do these things for you anyways I think.

This revised schema better illustrates this change.

Written by mariotalavera

February 1, 2010 at 3:22 pm

Movielens – Completing fact tables

leave a comment »

With the previous schema changes in place, it is now a matter of running queries in order to complete our fact_rating and fact_tag tables.  Both these tables are missing timeId and dateId.

For the table fact_rating, the following potentially long time queries need be executed. For timeId:

UPDATE fact_rating SET timeId = (
SELECT a.timeId FROM dim_time a
WHERE a.military = hour(from_unixtime(fact_rating.timestamp))
AND a.minute = minute(from_unixtime(fact_rating.timestamp))
)

For dateId:

UPDATE fact_rating SET dateId = (
SELECT a.dateId FROM dim_date a
WHERE a.year = year(from_unixtime(fact_rating.timestamp))
AND a.month = month(from_unixtime(fact_rating.timestamp))
AND a.day = day(from_unixtime(fact_rating.timestamp))
)

Similarly, for the table fact_tag, we run the following for timeId:

UPDATE fact_tag SET timeId = (
SELECT a.timeId FROM dim_time a
WHERE a.military = hour(from_unixtime(fact_tag.timestamp))
AND a.minute = minute(from_unixtime(fact_tag.timestamp))
)

For dateId, again, the snippet is similar to the fact_rating one above:

UPDATE fact_tag SET dateId = (
SELECT a.dateId FROM dim_date a
WHERE a.year = year(from_unixtime(fact_tag.timestamp))
AND a.month = month(from_unixtime(fact_tag.timestamp))
AND a.day = day(from_unixtime(fact_tag.timestamp))
)

It is worth adding a where clause to each of these and exclude items with id being updated equals null. (I.E. – for dateId,  append ‘WHERE dateId is null’ at the end of update query.  Some of these may (just may) take days so this is a no brainer although I didn’t do in all cases.

If anyone is thinking about how long these take, I recall the timeId population for table fact_rating took my Core 2 Duo 2.26, 4GB of ram laptop 18 hours and change.  Its an ‘execute and go to bed’ operation for sure for me.  Looking forward to having a completed star-schema to play with!

Written by mariotalavera

January 26, 2010 at 6:54 pm

Movielens Datasets for BI – 10 Million Movie Ratings

with 2 comments

The 10 million ratings set from Movielens allows us to create two fact tables (linked?!).  We can create a fact table for ratings and another one for tags.

Worth noting that a userIds between these two schemas (one from ratings.dat and the other from tags.dat) do match across sets.  I.E. – userId 1234 in tags dataset is user 1234 (if existing) in ratings dataset.  So we could link these but, for now, its simpler not to.

Information provided in the 10 million ratings set allows us to create similar star schemas as follows:

Ratings

Tags

I’ll be running these proposed schemas by my peers at work and post back any insight I am sure to be missing from these.  A lot of design here is based on my perception of what matters and its worth seeking advice before I trail off too much.

It is odd that we have no more information for the users in our set.  The smaller, 1 million set does provide move interesting user information but no tags from said users.  Perhaps it is worth considering both datasets at the same time and treating them as two different sets of study.

Written by mariotalavera

January 20, 2010 at 10:21 pm

Revisiting the MovieLens Database

leave a comment »

What a great opportunity to put the MovieLens data to good use (again).  Previously, I had used this dataset to build a movie recommendation engine based on the book Collective Intelligence.  I will be spending lots of time at work creating OLAP cubes for business reporting.

Immediately, I thought the Movielens dataset would be as good as it comes to practice concepts and practices I’ll be exposed to at work.  I am going to need all the help and practice since the world of Business Intelligence is new to me.

A little history about the dataset I am referring to can be found at the GroupLens Research.  The dataset I am planning on using can be found here. This time, I’ll be using the 10 million ratings set since it seems everything at work is measured in millions.  There are a few other variations of this data. Look around; you may find something you like.

After downloading the files, I proceeded to import all the data as is into our ‘original’ database.  I am using some late flavor of mySQL and was able to import the data without much fuss.  The only thing I had to do was replace the delimiters (double colons ::) in the movies data file for something else.  Movie titles with colons where being inadvertently ’split’ across more than one column on import.  Doing this produces a database with three initial tables.

One table holds information on the users rating at least 20 movies, another one holds all the movie information and the last one holds the 10 million ratings for these movies looking like this:

10m_ratings(userID, movieID, rating, timestamp)
10m_tags(userID, movieID, tag, timestamp)
10m_movies(movieID, title, genres)

Table naming prefix aids infer purpose since I am restricting myself for one database for this project.  More so, sql styles and the like suit me fine ;)

In total, you end up with a database a bit less than 400 MBs.  Relevant information for data I will be using can be found here.  If there is any interest and licensing allows it, I’ll post a sql dump or db backup to share and same someone some time.

Written by mariotalavera

January 16, 2010 at 4:50 am

Top Eleven Podcasts I’m Listening To

leave a comment »

I am stealing the post idea from a friend I think. Every time I update my podcast subscriptions, I think about posting new finds but never do… Here are my top ten podcasts (in no particular order) making my commute to new job less arduous.

1. FLOSS Weekly – From the TWIT network (lots of goodies there), an interview per podcast, a bit less than an hour in length, with interesting people (most of the time) from the open source community. TWIT allows you to grab any episode form their site so it may be best to pick and choose…  They have even been streaming for the past few months at live.twit.tv.  Many memorable episodes, like the processing one, are worth saving for a second listening if topic spikes your interest.

2. TJ’s Twillight Zone – Kick ass music podcast for hitting your zone and don’t care who’s close by ;)

4. Mixergy – An entrepreneurial, interview per podcast with people who’ve made it and their stories. Must listen I think because it gives perspective on both life and success in it from regular people (or not so regular) inspiring either way.

5. Dave Ramsey – Best podcast on money. Simple down to earth financial advice with zero fluff or hidden agenda. I ought to be pissed at Dave because he made me sell my little sporty car but at the same time, his advise allowed the best of times for my family even as most people are having trouble nowadays.
In 2009 I lost a fantastic job, consulting had nothing more than slim pickings and my family was still able to vacation in Alaska, buy another little sporty car and even entertain working for myself while my wife stayed at home enjoying our kids. Debt free and not a care in the world; thanks Dave!

6. Classical Podcast, Magnature – Music for thought?! As nice slow non boring classical music for white noise as I can stand.

7. Digital Planet – Nice short news on good geeky topics from the BBC. Proof that God has a sense of humor: one of the commentators, Gareth Mitchell, sounds a like Moss from the IT Crowd… Actually, that makes sense, awesome!

8. This American Life – Chicago’s Public Radio show about everyday stories from real people like us. Quality of topics is most consistent form most podcasts I listen to, even if the Christmas one was shit.

9. WNYC’ Radio Lab – A friend told me about this one when he heard This American Life playing on my laptop. Very similar, different angle and sometimes better. If you like one of these you will like the other.

10. tech5 – John C. Dvorak’s 5 minute take on most relevant and timely events of the day… Yes no more than a few minutes, the most succinct podcast on tech. I was both impressed by how nice keeping up with a short podcast was and by the fact that podcasts keep getting larger instead of shorter…

11. The Concert – Another classic music podcast. More variety of performances than the Magnature one, just as good with contrasting selection.

Hopefully you find something worth listening to :)

Written by mariotalavera

January 13, 2010 at 2:47 pm

Posted in Life, Miscellaneous

Jumping into the world of data

with 2 comments

A new year, a new job… Hope this is not a pattern.  Starting 2010 with a bang; I find myself as an analytical engineer at Inquira.  Inquira touts itself as the leader in intelligent knowledge solutions.  I believe this to be true but admit I had to look this up first.

In a nutshell, we provide the best content management system for big companies coupled with the best search engine for content retrieval.  Lastly, an analytical package gets bundled which enables the customer to improve the content managed.

OK, in a nutshell, we have a kick ass CRM with even better search (unmatched in the industry) and analytics to boot.  There.

As already stated, I am fortunate to have joined a small team of analytical engineers mainly concerned with business intelligence and all that that encompasses.  The key is, everyday I find out this BI covers more and more areas I hadn’t thought about.  This seems to be a very interesting opportunity and I hope to relay such sentiment in upcoming posts…

Written by mariotalavera

January 11, 2010 at 6:08 pm

Posted in Announcements, Career

Pick a book you like – A post about nothing…

leave a comment »

Pick a book you like.  Open a few tabs; search for it on Amazon.  Search for it on Barnes.  Compare prices for paper books, compare prices for digital versions…

For me, and the last five books I wanted to put on my list, prices match to the cent.  So much for price shopping these two :(

I could see each company’s bots working overtime on this and allowed by competitor; I could see competing companies sharing getaway vacation islands together as well…

Borders is only competitive with an online coupon.  But Barnes is almost as gracious with coupons as Borders is so, price shopping comes down to who’s coupon is in my inbox.

Coupon fight it is!

Written by mariotalavera

December 10, 2009 at 3:00 pm

Posted in Uncategorized

The Dip by Seth Godin

leave a comment »

A few months into this self employment gig, I read ‘The Dip’ by Seth Godin.  Ok, I read a few weeks ago; maybe I should have read it a few months ago.  The premise of the book is simple.  Its a short reflection on the fact that, like it or not, we can only excel at a few things (or one) and how this is indeed better than being a jack of all trades.

Godin postulates that, by attempting to be competent at many things, we end up being mediocre at all of them.  He describes briefly how we’re encouraged all through our school years to do well on everything we’re taught instead of being rewarded for excelling at one thing.  Later in life thou, we are most rewarded if we are the best at whatever we do.  While this may be true enough in the context of this book, a book from a fantastic marketing guy, most of my successful peers seem to excel at multiple things.  Godin spends some time explaining how not even second best is good enough in a connected society such as ours.  Instead, he encourages us to focus our time and energy doing those things where we can be better than anyone else for our particular situation.  This is not to say ‘be the best on one skill’ but more along the lines of be the best for the situation you’re in.  With this, I agree 100%.

These past few months, we’ve done coldfusion apps, rails apps, custom wordpress (php) sites and pondered or quoted flex, seo work and everything in between.  This is typical for an IT shop… think of that ‘generalist specialist’ angle to what we do.  I can’t say its been a waste of time, we have to pay bills and being self employed is lot of fun but, reading ‘The Dip’, I can’t help but wonder if choosing only one of these (application types) would’ve been better, worse or the same as trying to cast a wide net for profit.  The usual mode of operation here is to entertain most things if they guarantee fair compensation, prompt execution and payment.  Maybe I’m missing the point of book a bit since Godin talks about multiple careers paths, etc.  Guess this is nothing more than an interpretation but the fact is that the book is a nice quick and insightful read.

Godin, unlike me, makes a very eloquent argument in a short, concise and entertaining book worth the couple of hours reading time.  I could not do the book justice but reading it did have a profound effect on my take on work and definitely on how I spend my time.

Written by mariotalavera

November 16, 2009 at 8:29 pm

Posted in Book Review, Career

IT Self Employed in Orlando – The Good

leave a comment »

Choosing what we work on (for the most part) is one of the most joyful aspects of consulting and contracting.

So far, we’ve found out that there are plenty of opportunities for small companies like ours to make a living. There always seems to be a small shop or an entrepreneur that needs a killer application or website.  These are not always the most lucrative ones. But taking into account how interesting they are and how unpleasant it is to work on something that may get shelved at any moment, they are worthy of consideration.

Catering to these clients is a nice reminder that our skillset is not restricted only to doing ‘data-driven’ sites or applications.  Among the usual accounting applications, the real estate applications, we’re either putting quotes out or building sites for ‘inspecting stuff’ (details if we build it), political websites (upcoming), simplest of content management systems (and not so simple) and even online directories for niche markets that no one seems to have bothered to serve.

If you ever wonder, yes, there are plenty of applications you can profitably write that you feel passionately about but do not believe they are worth your time to do so.  Often, we forgo pursuing these endeavors assuming we can better spend the time doing something else.  We end up with our heads always forgetting these not so silly ideas we have only to shortly afterwards see someone else develop something similar and with much success and personal satisfaction.

Written by mariotalavera

October 27, 2009 at 3:01 pm

Posted in Career, Life

Atlassian Rocks, starter licensing for all their software!

leave a comment »

One of the more empowering tools working in IT is automation tools. Some of my time best spent in past few weeks has been devoted to deploying a WIKI, a SCM System (SVN), a Bug Tracking System (Mantis), etc.

Notice how all these are open source. At the same time, there are plenty of commercial alternatives to each of these. One Company whose products excel (I think) is Atlassian. My issue is that, while their offerings are fantastic; there is usually an open source alternative which, while not as feature rich, is more than feature complete to cater to my needs.

However, yesterday they started offering a starter licensing program where you can buy any of their products for ten dollars each. Looking at their details page, it is easy to see how much ass kicking one gets for $60.00.

Written by mariotalavera

October 7, 2009 at 6:41 pm