Kaggle – Grupo Bimbo First Prediction

Having defined the problem in the previous post, I’ve decided to make a first prediction to address it.  Per the submission requirements, this means using the complete dataset to produce a csv file containing both the id of each ‘delivery’ and its predicted adjusted demand.

All the code for this post, as well as any others in this series, is over at my GitHub account.

As before, the starting point of all these exercises consists of importing our data into an R session.

# Setting working dir env
setwd("~/workspace/Bimbo/csv")

# Loading the train and test data from csv
train <- read.csv("train.csv", header = TRUE)
test <- read.csv("test.csv", header = TRUE)

Note the shift from the smaller sample set to the actual files provided by Kaggle.  The reason for this is the need to predict deliveries for every item in the test set (6,999,251 predictions).  Our previous sample test set had only about 5% of this.
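As a quick sanity check, the row counts can be confirmed directly (a minimal sketch; the counts in the comments follow from the figures quoted in this post):

# Confirm the sizes of the imported sets
nrow(test)   # 6,999,251 rows to predict
nrow(train)  # 74,180,464 rows (81,179,715 combined minus the test rows)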

Just like before, the next step is to create a combined set including all the records from each file imported.

# train needs: id
train <- data.frame(id = rep(0, nrow(train)), train)

# test needs: Venta_uni_hoy, Venta_hoy, Dev_uni_proxima, Dev_proxima, Demanda_uni_equil
test <- data.frame(Venta_uni_hoy = rep(0, nrow(test)), test)
test <- data.frame(Venta_hoy = rep(0, nrow(test)), test)
test <- data.frame(Dev_uni_proxima = rep(0, nrow(test)), test)
test <- data.frame(Dev_proxima = rep(0, nrow(test)), test)
test <- data.frame(Demanda_uni_equil = rep(0, nrow(test)), test)

combined <- rbind(test,train)

In this case, the combined data frame has 81,179,715 records versus the previous post’s 4,058,986.  That is a crazy amount of records to fiddle with.  Soon enough, careful planning for both performance and computer resources will play an increasingly important role in this series.
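As a taste of what that planning might look like, here is one common option (my suggestion, not something used in this post): the data.table package reads large csv files considerably faster than read.csv.

# Hypothetical alternative import using data.table (not used in this post)
library(data.table)
train <- as.data.frame(fread("train.csv"))
test <- as.data.frame(fread("test.csv"))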

Also, I’ve decided to join the town_state information again.  For this post, I am going to leverage this information to predict our deliveries. Hopefully, it will be worth the effort.

# New variable to order by
combined$order  <- 1:nrow(combined)   town_state <- read.csv("town_state.csv", header = TRUE, fileEncoding="UTF-8-BOM") # Lets change 'Queretaro de Arteaga' to 'QUERETARO' town_state$State <- as.character(town_state$State) town_state$State[town_state$State == "Queretaro de Arteaga"] <- "QUERETARO" # Join town_state to combined combined <- merge(combined,town_state, by = "Agencia_ID") combined <- combined[order(combined$order), ]   combined$Semana <- as.factor(combined$Semana)
combined$Canal_ID <- as.factor(combined$Canal_ID)
combined$State <- as.factor(combined$State)

A few things are going on in this snippet; let’s go over them.

After importing the file, I noticed that the number of states, 33, was small enough to use as a factor (category).  The only issue was that the R implementation of the method I wanted to try, Random Forest, supports at most 32 levels for a categorical predictor.  Some Google-fu leads me to believe that ‘Queretaro de Arteaga’ is the same place as ‘Queretaro’.
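A quick way to reproduce that check, run before the rename in the snippet above (my reconstruction of what the screenshot below shows):

# Count the distinct state names and the offending entries
length(unique(town_state$State))                 # 33 before the fix
sum(town_state$State == "Queretaro de Arteaga")  # 2 records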

[Image: the town_state records listing ‘Queretaro de Arteaga’]

Since the former only had 2 records, I just changed them to the latter, effectively ending up with 32 levels for our factor-to-be State.  Neat.

Lastly, I added an order variable to combined because merges change the order of records, and I need to preserve the original order to keep referring to the test and train subsets by position.
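Here is a toy illustration (hypothetical data) of why the order column is needed: merge() returns rows sorted by the join key, not by their original position.

# Small example showing merge() reordering rows and the fix
df <- data.frame(Agencia_ID = c(3, 1, 2), value = c("a", "b", "c"))
df$order <- 1:nrow(df)
key <- data.frame(Agencia_ID = 1:3, State = c("X", "Y", "Z"))
m <- merge(df, key, by = "Agencia_ID")  # rows come back sorted by Agencia_ID
m <- m[order(m$order), ]                # original row order restored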

Let’s do a quick check to see the results of this work so far.
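A minimal way to run that check (my guess at the command behind the screenshot below):

# Inspect the factors and confirm State now has 32 levels
str(combined[, c("Semana", "Canal_ID", "State")])
nlevels(combined$State)  # should be 32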

[Image: structure of combined showing the four factor variables]

At this point, I have 4 factors available.  Of these, Semana, Canal_ID and State have 32 levels or fewer.  These are the factors to be used to make my first prediction.

Actually, for this post, I attempt two predictions.  The first will use only Semana and Canal_ID.  For the second, I will add State to the mix.  This will allow me to compare the two and see what effect, if any, adding states has on the outcome.

Here is the setup for each and the running of the first one.

# Loading the Random Forest library
library(randomForest)

# Defining each prediction
var_list.01 <- c("Semana", "Canal_ID")
var_list.02 <- c("Semana", "Canal_ID", "State")

# Parameters for operation
active_var_list <- var_list.01
sample_size <- 500000 # 5000 for a quick test run
seed <- 1234
num_trees <- 1000

# After rbind(test, train), the train rows start at row 6999252 of combined
train_start <- 6999252

# The desired prediction is the adjusted demand
rf.predict <- as.factor(combined$Demanda_uni_equil[train_start:(train_start + sample_size - 1)])

# This is the train data; it must come from the same train rows as rf.predict
rf.train <- combined[train_start:(train_start + sample_size - 1), active_var_list]

# Setting a seed ensures the results are repeatable
set.seed(seed)

# Random forest call (the x/y interface takes no data argument)
combined.rf <- randomForest(x = rf.train, y = rf.predict, importance = TRUE, ntree = num_trees)

# Check please
combined.rf

Let’s look at the output of combined.rf.

[Image: printout of combined.rf showing the out-of-bag error estimate]

Based on our setup parameters, this reads as a 17.45% success rate in predicting inventory demand.  I guess with little work comes little improvement.  Still, this shows there is much work, and much potential, ahead.  The goal is to improve this number with a proper model and set of features.
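For reference, here is one way (a sketch, assuming the combined.rf object from above) to recover that success rate from the model’s out-of-bag predictions:

# Out-of-bag accuracy: fraction of OOB predictions matching the true labels
oob_accuracy <- mean(combined.rf$predicted == rf.predict)
oob_accuracy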

Using these two variables, I did my first submission to Kaggle.

# Creating a Kaggle submission. Predicting on the test rows of combined (rather
# than the raw test data frame) keeps the factor types consistent with training.
predict_demand <- predict(combined.rf, combined[1:nrow(test), active_var_list])
submit <- data.frame(id = test$id, Demanda_uni_equil = predict_demand)
write.csv(submit, file = "kaggleone.csv", row.names = FALSE)

After much processing by Kaggle, this was good enough for 1456th place out of 1559 teams.  As expected, this is bottom of the pack indeed.  Hey, it’s a start.

[Image: Kaggle leaderboard showing 1456th place out of 1559 teams]

Going back and adding State may improve my standings, but good progress has been made. Until next time.
