Kaggle – Grupo Bimbo Preparing The Data

With the database from the last post in mind, we can now go over the information provided for this contest.  Most interesting to me, is the distribution of inventory delivered versus inventory returned. 


Above, we can see the number of units sold each week.  The green portion of the bar indicates the number of units consumed and the red portion indicates the number of units returned (unsold) from the previous week.


Here we can see the monetary amount for units sold per week, together with the monetary amount not sold from the units returned the from the previous week. 

Lets prepare the data that gets us here.

The Data and The Code

If you want to follow along, here is all you need.

From the competition’s main page, we are told that returned inventory is considered expired.  Whether this inventory is discarded or sold at a lesser price, is outside this problem’s domain.   Another important piece of information is that inventory delivery calculations, the amount of each product delivered, is estimated by delivery employees.  We would’ve expected the owner of the store to place the order instead.  Seems a bit odd that the expertise lies with the deliver sales employee instead.

A lot of information is provided for analysis, both categorical and quantitative.  Having briefly looked at this information in the previous post, I decided to try using R instead of a database now.  For the rest of this post, the tool of choice is the open source version of R-Studio .

So Much Data…

In order to save some time probing the data provided, I’ve been sampling the data instead of using the complete data set.  This is most easily done from the command line.

For each big file, test.csv and train.csv,  I’ve created sample files with 5% of their total records.  Unlike head or tail, shuf will shuffle the records returned.  I haven’t looked at how the command shuffles the returns but, if it works as intended, our analysis with this sample dataset should mirror the same for the big data provided.  We will come back to compare accuracies in due time.

$ head -1 test.csv > test_sample.csv && tail -n +2 test.csv | shuf -n 349963 >> test_sample.csv
$ head -1 train.csv > train_sample.csv && tail -n +2 train.csv | shuf -n 3709023 >> train_sample.csv

Whereas the test file is 295 MBs, the test_sample.csv version is 11.9 MBs.  More significantly, the train file is 3 GBs and the train_sample.csv one is 152 MBs.

This makes exploring the data a lot faster.  Lastly, each time we run these commands, we create a different sample.  Once we have a general idea of what to look deeper into, we can load the full size files for further analysis.

Sample files in hand, the first step is to load them into R-Studio.

# Setup session environment
# Loads train,test data
train <- read.csv("train_sample.csv", header = TRUE)
test <- read.csv("test_sample.csv", header = TRUE)
town_state <- read.csv("town_state.csv", header = TRUE)

Which results in 3 datasets in memory on our R session.  At this point, this is the equivalent of having imported these files as tables in a database.  The end result is as follows.

The test dataset with 349,963 objects with 7 variables.


The town_state dataset with 790 objects with 3 variables.


The train dataset 3,709,023 objects with 11 variables.


For reference, loading the original files takes me about 8 minutes and consumes amount of 5 GBs of RAM for the R session.  In contrast, loading the sample files takes about 5 seconds and only 387 MBs of memory.  Once we zero in on items of interest, we can always resample our source files and rerun analyses from the new samples files.  For our final analysis, however, we will validate our results against the original files.

Preparing Our Dataset

One of the first things we would like to do before analyzing our date is combine test and train data into one dataset.  In order to do this, we need to make both of these have the same variables.  We can do this by adding every variable that each set is missing that exists in the other set.

For example, the test set has the variable id which the train set lacks.  This is because the submission to Kaggle for predicting demand is in the form of id, prediction.  There is no need to have such variable in the train set.  Still, if we want to join these sets, we need to reach parity in the variables.

We can add this mock id variable to our train set in R like this.  Here we set the default value of 0 for all records in train for the id variable being added.

# train needs: ID.
train <-data.frame(id = rep(0, nrow(train)), train[,])

Similarly, the test set needs all the variables that exists in the train set.  Lets default these to 0 as well.

# test needs: venta_uni_hoy,venta_hoy,dev_uni_proxima,dev_proxima,demanda_uni_equil
test <-data.frame(Venta_uni_hoy = rep(0, nrow(test)), test[,])
test <-data.frame(Venta_hoy = rep(0, nrow(test)), test[,])
test <-data.frame(Dev_uni_proxima = rep(0, nrow(test)), test[,])
test <-data.frame(Dev_proxima = rep(0, nrow(test)), test[,])
test <-data.frame(Demanda_uni_equil = rep(0, nrow(test)), test[,])

Combining these sets now is painless.

# Smash baby
combined <- rbind(test,train)

Neat; this gives us a new dataset, combined, with 4,058,986 objects and 12 variables.  In the screenshot below, we can tell that the objects shown are from the test set since they have 0-values for all the variables we added.  If we were to look at objects originating in the train set, we would see 0-values for id instead.


The last file we have to look at is the town_state.csv file.  The town_state file is a lookup table between the delivery activity and the town and state of each activity via Agencia_id.  This information can prove very valuable in our analysis (or not).  Adding this information to our main dataset is easily done by ‘joining’ these two sets as follows.

# Join town_state and combined
combined <- merge(combined,town_state, by = "Agencia_ID")

Which leaves us with our combined dataset looking like this now.  14 variables in total.


Lets take a quick look at the structure of our dataset.




In here, we can see the objects in the set, together with their type and the first 10 values.  For our dataset, we see int, num and Factor.  Int and num are self-explanatory.  Factor is analogous to a Category (or dimension). 

Lets convert both Semana and Canal_ID to categories.

combined$Semana <- as.factor(combined$Semana)
combined$Canal_ID <- as.factor(combined$Canal_ID)


This will change the way R handles this information for analytical purposes, more like a category and less like a measurement.  Both Ruta_SAK and Cliente_ID can, depending on intent, be used as Factors but, with so many levels, I doubt we will get to use these for now.

From here, we can start looking at the data provided.  For example, we can see a weekly overview of units sold versus units returned and the amounts of money involved.

First lets copy our data frame, combined, to a new data frame to fiddle with. From this new data frame, lets remove all variables out of scope.

# New data frame
sampled_stacked_bar <- combined
# And trim as desired
sampled_stacked_bar$Agencia_ID <- NULL
sampled_stacked_bar$Demanda_uni_equil <- NULL
sampled_stacked_bar$Canal_ID <- NULL
sampled_stacked_bar$Ruta_SAK <- NULL
sampled_stacked_bar$Cliente_ID <- NULL
sampled_stacked_bar$Producto_ID <- NULL
sampled_stacked_bar$Town <- NULL
sampled_stacked_bar$State <- NULL
sampled_stacked_bar$id <- NULL

Lets take a look.


Now we can use this structure as the basis for aggregating all measurements over weeks.  Lets create a base structure aggregating all our measurements by week.

# Lets aggregate by Semana

agg_per_week <- ddply(sampled_stacked_bar, .(Semana), colwise(sum))

Our base structure looks like this.  Neat.


Looking At The Problem

From here, We can focus on weekly units as follows.

# Lets create a copy and remove variables we are not going to use

agg_units_per_week <- agg_per_week
agg_units_per_week$Dev_proxima <- NULL
agg_units_per_week$Venta_hoy <- NULL

And now we just have to manipulate our data structure so that we can reasonably chart it.  I choose to use stacked charts, hence the messy code that follows.

# Transposing our units table
trans_agg_units_per_week <- t(agg_units_per_week[,2:ncol(agg_units_per_week)])
# Set the column headings (we need these for our x-axis values)
colnames(trans_agg_units_per_week) <- c("3","4","5","6","7","8","9","10","11")
# Lets switch row order to display units sold before units returned
trans_agg_units_per_week <- trans_agg_units_per_week[nrow(trans_agg_units_per_week):1, ]
# And, lastly, plot Unit Amounts Overview plot
barplot(trans_agg_units_per_week, main= "Units Per Week",
        xlab = "Week", ylab = "Units",
        col=c("lightgreen", "brown1"), ylim=c(0,5000000),

If my calculations are correct, we ought to end with a chart like this.


We can do exactly the same for the Monetary amounts and end up with a similar chart.

# One for amounts
agg_amounts_per_week <- agg_per_week
agg_amounts_per_week$Dev_uni_proxima <- NULL
agg_amounts_per_week$Venta_uni_hoy <- NULL   # Transpose our amounts table trans_agg_amounts_per_week <- t(agg_amounts_per_week[,2:ncol(agg_amounts_per_week)]) # Set the column headings, we need these for our x-axis labels colnames(trans_agg_amounts_per_week) <- c("3","4","5","6","7","8","9","10","11") # Lets switch row order trans_agg_amounts_per_week <- trans_agg_amounts_per_week[nrow(trans_agg_amounts_per_week):1, ] # And plot Unit Amounts Overview plot barplot(trans_agg_amounts_per_week, main="$$$ Per Week", xlab = "Week", ylab = "$$$", col=c("lightgreen", "brown1"), ylim=c(0,40000000), legend=rownames(trans_agg_amounts_per_week))

Which gives us this.


We can now see the problem Bimbo has, they want to us to estimate those two weeks without values, 10 and 11, so that we minimize returns and money lost.


One thought on “Kaggle – Grupo Bimbo Preparing The Data

  1. Pingback: Kaggle – Grupo Bimbo First Prediction | Mario Talavera Writes

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s