Mario Talavera Writes

- My Development Journal

Exploring Cohort Analysis – Part One


Simply put, cohort analysis is a technique for analyzing activity over time by a common characteristic.  Mostly used in sales and marketing, cohort analysis can be applied to tasks such as analyzing customer loyalty, customer acquisition cost, and marketing campaign effectiveness, and to explore many other aspects of sales.
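As a toy illustration of the idea (plain Python, with a handful of invented orders, not real data), assigning each customer to a cohort by the month of their first order looks like this:

```python
from collections import Counter

# Invented (customer, order_month) pairs for illustration only.
orders = [
    ("Alice", "2014-03"), ("Alice", "2014-01"),
    ("Bob", "2014-02"), ("Bob", "2014-03"),
    ("Carol", "2014-01"),
]

# A customer's cohort is the month of their first order.
cohort = {}
for customer, month in sorted(orders, key=lambda o: o[1]):
    cohort.setdefault(customer, month)

# How many customers joined in each month.
cohort_sizes = Counter(cohort.values())
print(cohort_sizes)  # Counter({'2014-01': 2, '2014-02': 1})
```

Once every customer is tagged with a cohort like this, any activity metric can be broken down by it.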

Dataset

I am using the Superstore Sales dataset created by Michael Martin, found here or here.  This Excel file contains three sheets, of which only the first one, Orders, will be used in this analysis.

Case

The store providing its sales data runs monthly advertising campaigns and wants to track what impact these campaigns have on the number of orders placed over time.  It wants to use this information to evaluate its different campaigns and improve its efforts.

Given the superstore sales data and the requirements, let's present the number of orders placed per customer join date.  Presenting the number of orders per join date will show the effectiveness of the advertising campaigns leading up to that date.

Tools

Any database server can be used to follow along.  The code used here targets Oracle but can easily be revised to work on another vendor's product, like MySQL.  For visualization purposes, Tableau can easily be replaced by LibreOffice or similar.
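As a small sketch of the kind of revision involved (using SQLite via Python's sqlite3 module, with a made-up two-row table), Oracle's to_char(order_date, 'YYYY-MM') becomes strftime('%Y-%m', order_date):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table superstore_sales (order_id int, order_date text, customer_name text)"
)
conn.executemany(
    "insert into superstore_sales values (?, ?, ?)",
    [(1, "2014-01-05", "Alice"), (2, "2014-02-10", "Alice")],
)

# strftime('%Y-%m', ...) is SQLite's equivalent of Oracle's to_char(..., 'YYYY-MM').
rows = conn.execute(
    """
    select order_id, strftime('%Y-%m', order_date) order_month, customer_name
    from superstore_sales
    order by 2, 3
    """
).fetchall()
print(rows)  # [(1, '2014-01', 'Alice'), (2, '2014-02', 'Alice')]
```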

Analysis

From the Orders sheet, we are only going to use the following columns: Order ID, Order Date and Customer Name.  For further analysis, however, let's preserve all the data in our system by saving the contents of the Orders sheet in a database table.

Let's appropriately call this table superstore_sales.  Also, let's run our analysis by month/year to match the store's advertising efforts.

1. First, let's fetch the orders, together with the order date and the customer name.

select
  a.order_id
  , to_char(a.order_date, 'YYYY-MM') order_date
  , a.customer_name customer
from
  superstore_sales a
group by
  a.order_id
  , a.order_date
  , a.customer_name
order by 2, 3
;

 

2. To this set, let's add the date each customer placed their first order in the store.


select
  a.order_id
  , to_char(a.order_date, 'YYYY-MM') order_date
  , a.customer_name customer
  , to_char(min(b.order_date), 'YYYY-MM') join_date
from
  superstore_sales a
inner join
  superstore_sales b on a.customer_name = b.customer_name
group by
  a.order_id
  , a.order_date
  , a.customer_name
order by 2, 3
;

 

3. Interested in the effect of each month's advertising campaign, let's create a campaign id for each month.  Basically, the join date becomes the campaign id.  Notice we dropped the customer name as well; at this point, it is only our join criterion.

Here we see, for each order placed, the monthly campaign to attribute the order to. Neat.


select
  a.order_id
  , to_char(a.order_date, 'YYYY-MM') order_date
  , to_char(min(b.order_date), 'YYYY-MM') campaign_id
from
  superstore_sales a
inner join
  superstore_sales b on a.customer_name = b.customer_name
group by
  a.order_id
  , a.order_date
  , a.customer_name
order by 2
;

 

4. Now, let's determine the number of months between each order's date and the date of the first order placed by that order's customer.

This is a mouthful; let's explain.  It is easier to see if stated backwards.

For each customer, calculate the number of months between their first order and every other order they have placed.

There, this makes a bit more sense. 

Now we have a set of all orders, by date and by campaign, with the number of months since the customer joined the store.
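Outside the database, the month difference just described can be sketched in plain Python.  This is a simplified stand-in for Oracle's months_between — it counts whole calendar months and ignores day fractions:

```python
from datetime import date

def month_diff(order_date: date, first_order: date) -> int:
    """Whole calendar months between a customer's first order and a later order."""
    return (order_date.year - first_order.year) * 12 + (
        order_date.month - first_order.month
    )

first = date(2011, 1, 7)   # hypothetical first order by a customer
later = date(2011, 4, 20)  # a later order by the same customer
print(month_diff(later, first))  # 3
```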


select
  a.order_id
  , to_char(a.order_date, 'YYYY-MM') order_date
  , round(months_between(a.order_date, min(b.order_date))) mon_diff
  , to_char(min(b.order_date), 'YYYY-MM') campaign_id
from
  superstore_sales a
inner join
  superstore_sales b on a.customer_name = b.customer_name
group by
  a.order_id
  , a.order_date
order by 2, 4
;

 

5. Next up, we simply count orders by campaign id and by the number of months since the customer joined the store.


select count(distinct order_id) order_count, mon_diff, campaign_id from (
  select
    a.order_id
    , round(months_between(a.order_date, min(b.order_date))) mon_diff
    , to_char(min(b.order_date), 'YYYY-MM') campaign_id
  from
    superstore_sales a
  inner join
    superstore_sales b on a.customer_name = b.customer_name
  group by
    a.order_id
    , a.order_date
)
group by mon_diff, campaign_id
order by campaign_id
;

 

This is enough to satisfy our case; not bad.  Now we have the set we need to analyze sales by campaign id over time.  A short dive into Tableau (try Tableau Public) and we come out with an elegant way to view a lot of information at once.
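One handy intermediate step, whatever the visualization tool, is to pivot the final result into a cohort matrix: one row per campaign id, one column per month offset.  A rough sketch in plain Python, with a few invented rows standing in for the query output:

```python
# (order_count, mon_diff, campaign_id) rows, shaped like the final query's output.
rows = [
    (5, 0, "2011-01"),
    (3, 1, "2011-01"),
    (2, 0, "2011-02"),
]

# Pivot into campaign_id -> {mon_diff: order_count}.
matrix = {}
for order_count, mon_diff, campaign_id in rows:
    matrix.setdefault(campaign_id, {})[mon_diff] = order_count

print(matrix["2011-01"])  # {0: 5, 1: 3}
```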

Let's go over the visualization together.

The y-axis represents the number of sales (order count), the x-axis represents how many months ago (months diff) these sales occurred, and each color represents a campaign (campaign id).

Using the first month (ago) as an example, we can tell how many orders can be attributed to each campaign over the life of the data set.  Skipping to the last month, 48, we see how many orders were placed for the campaign that ran back then.  Seeing only one color makes sense, since that is our first advertising campaign.

[Visualization: order count by months since joining, colored by campaign id]

Clicking on the visualization above takes you to the interactive version, where you can see the actual number of orders per campaign.  Try it.

That’s it; thanks for reading.  This seems like a very effective way to analyze a large amount of sales data for insights into advertising performance. 

This post only covers join date as the grouping characteristic; many other cases can be made where cohort analysis can be helpful. Similarly, many other visualizations can provide similar insight.

Maybe next time we can cover some of these; thanks for reading!

Written by mariotalavera

October 21, 2014 at 4:53 am

2014 Miracle Miles 15k Race


Another great, and wet, run!  Non-stop rain did not dampen the enthusiasm of the thousands of runners who met this year for the 5k and 15k Miracle Miles Event in Orlando.

Highlights for me:

  • First race-in-the-rain!
  • Second race without wearing headphones (kept in the fanny pack as a security blanket).  First one was Disney, by choice; more later.
  • Longest race without any walking.
  • Sub 9 min/mi :)
  • Lots of fun.

I even knew a bunch of runners at the event; neat.  Lastly, I feel happy about my performance.  First 15k, hence a reference for the distance and a time to beat for the next, hopefully PR-setting, 15k!


Building on my previous race analysis for the Oviedo 5K, I debated whether to use Tableau (rocks!) again or try Oracle’s Cloud Analytics.

Disclaimer – Happy Oracle Employee.

Even though it has a 30-day trial, I think free developer accounts would be more suitable for accelerating adoption.  I've inquired and await a response, but they know best, I guess.  Oracle does offer such a service for APEX; maybe it's a matter of time…

Maybe next time; Tableau to the rescue it is.


Written by mariotalavera

October 1, 2014 at 11:59 am

Posted in Analytics, Running, Tableau

Analyzing the 12th Annual Greater Oviedo 5K with Tableau


I had been meaning to write about this semi-recent 5K run in beautiful Oviedo at First Baptist Church.  At Raul's insistence, I decided to check this local favorite out.

Held this past May 24th, 2014, on a hot Saturday morning, the course was Florida-flat (42 ft of elevation gain) but had plenty of shade and that small-town feeling Oviedo is known for.

With a few weeks to 'train', I set out to ramp up the miles in order to reduce my time.  No medals for me yet, but I enjoyed the race sights, the crowd and the location.  Nothing like an early short run to set one up for a nice weekend.

For fun, I decided to play around with the results in Tableau and try out the OSX version of the software.  Feature-wise, both Windows and OSX versions are the same, with the Windows version being just a bit more stable with big sets of data.

Anyways, back to the race.  As mentioned before, this is a local Oviedo race.  Most of the runners come from Oviedo itself or a nearby neighborhood.


640 runners met for the race.  The group was split almost evenly between men and women, at 47% and 53% respectively.

The age groups looked a little more interesting.  Notice the difference in gender at 40-44.  More generally, from 25-44, girls clearly represent this race.

[Image: age groups by gender]

Other fun bits: the youngest runner was 6 years old and the oldest 84.  There were a bunch of mom/daughter and dad/son teams as well.  Good family times!

As for myself, I did OK.  I was able to place 7th in my age group and set a PR with a little bit to spare.


As far as other 40 year olds, I tell myself things look mighty impressive.


I am skipping the Garmin data in this post; I just wanted to play a bit with Tableau.  Maybe I'll combine run data from Garmin next time.  Garmin run data can be seen here.  The Tableau workbook can be found here.  Race data can be found here.

Oh yeah, make sure to check out the Greater Oviedo 5K; nice all around race.  Thanks for reading.

Written by mariotalavera

September 16, 2014 at 6:21 am

Posted in Analytics, Running, Tableau

A Savannah Runaround


As far as scenery goes, this is the most distracting run I’ve ever done. I found it impossible to focus on running.  Savannah is a beautiful city.  It is small, yet packed with life and activity from the touristy bits to family life and business.

For a better write-up about the Savannah squares than this post, read this: an eloquent intro to the squares.


Written by mariotalavera

June 29, 2014 at 3:21 pm

Posted in Running

Installing Oracle DB 12c in Linux OEL6 (In One Sitting)


Because I will have to do this again in no time, here I jot down the quickest way for me to install 12c.  This is the usual series of steps: follow along, run into an error, search for a fix, lather, rinse, repeat.

Step One – Install oracle-rdbms-server-12cR1-preinstall.

Source – http://www.oracle.com/technetwork/articles/servers-storage-admin/ginnydbinstallonlinux-488779.html.

Step Two – …gets us all the way here: [FATAL] PRVF-0002 : Could not retrieve local nodename

Source – http://www.krenger.ch/blog/fatal-prvf-0002-could-not-retrieve-local-nodename/

Step Three – …cannot write oradata; change to /home/oracle/oradata

Source – http://blog.ronnyegner-consulting.de/2009/09/14/oracle-11g-release-2-install-guide-%E2%80%93-grid-infrastructure-installation/

Step Four – Swapfile… I used 10000

Source – https://community.oracle.com/thread/2189824?tstart=0


Step Five – Then I see the installer is bent on putting the oradata directory in /home.  Just revise to suit your install.


Presto!


Written by mariotalavera

April 30, 2014 at 8:10 pm

Posted in Data Warehousing, SQL

Browsing H2 Database in Play 2 Application – News To Me


This one threw me for a loop.  It just so happens that both the H2 browser AND the Play app should run from the same console.  Otherwise, you end up running the app and its database as one instance, and a new, empty instance of 'a' database in another.

Do as I say and not as I do!

Image

1. From the shell, start the Play app.  Do NOT type 'play run', just 'play'.

2. Next, type 'h2-browser'.  The H2 Console should load in the browser.

3. Now, type ‘run’ (or ‘~run’ for compile on save) to start your app.

Bingo! You should now be able to see your database.

Written by mariotalavera

March 5, 2014 at 8:38 pm

Posted in Life

Disney Marathon 2014 Report


Written by mariotalavera

January 20, 2014 at 1:52 pm

Posted in Running
