After some thought, I’ve decided to switch data sources from the ten million movie rating set to the one million movie rating set. As seen below, this dataset has a lot more interesting data, which will give us more dimensions to explore.
The exact same data scrubbing (same SQL as well) that I did on the other dataset a few posts ago applies here. Also, all the secondary supporting data generated (time and date dimensions) will fit just as well.
Unfortunately, we lose the ability to dig into the description tags applied to movies by reviewers. On the bright side, we gain information on gender, age range, location (zip code) and occupation. Clearly, the losses are outweighed by the information gained, which will make for a much stronger dataset to learn from.
As previously mentioned, all the side work done for the ten million rating set will be reused here, including the time and date dimensions, so none of the effort spent creating them is lost.