Having defined the problem in the previous post, I’ve decided to attempt a first prediction to address it. Per the submission requirements, we must use the complete dataset to produce a CSV file containing both the id of each ‘delivery’ and its predicted adjusted demand.
All the code for this post, as well as any others in this series, is over at my GitHub account.
As before, the starting point of all these exercises consists of importing our data into an R session.
# Setting working dir env
setwd("~/workspace/Bimbo/csv")

# Loading the train and test data from csv
train <- read.csv("train.csv", header = TRUE)
test <- read.csv("test.csv", header = TRUE)
Note the shift from the smaller sample set to the actual files provided by Kaggle. The reason for this is the need to predict deliveries for all the items in the test set (6,999,251 predictions); our previous sample test set had only about 5% of this.
Just like before, the next step is to create a combined set including all the records from each file imported.
# train needs: id
train <- data.frame(id = rep(0, nrow(train)), train[,])

# test needs: Venta_uni_hoy, Venta_hoy, Dev_uni_proxima, Dev_proxima, Demanda_uni_equil
test <- data.frame(Venta_uni_hoy = rep(0, nrow(test)), test[,])
test <- data.frame(Venta_hoy = rep(0, nrow(test)), test[,])
test <- data.frame(Dev_uni_proxima = rep(0, nrow(test)), test[,])
test <- data.frame(Dev_proxima = rep(0, nrow(test)), test[,])
test <- data.frame(Demanda_uni_equil = rep(0, nrow(test)), test[,])

combined <- rbind(test, train)
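A detail worth noting: rbind() on data frames aligns the second frame’s columns to the first by name, not by position, which is why it does not matter that the placeholder columns get prepended to test in a different order than they appear in train. A toy sketch (made-up frames standing in for the real train and test) illustrates this:

```r
# rbind() on data frames matches the second frame's columns to the
# first by name, so column position does not matter. Toy frames:
a <- data.frame(id = 0, Semana = 3, Demanda_uni_equil = 5)
b <- data.frame(Demanda_uni_equil = 9, id = 1, Semana = 8)
combined_toy <- rbind(a, b)

# combined_toy keeps a's column order: id, Semana, Demanda_uni_equil,
# and b's values land under the right names
names(combined_toy)
```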
In this case, the combined data frame has 81,179,715 records versus the previous post’s 4,058,986. That is a crazy amount of records to fiddle with. Soon enough, careful planning for both performance and computing resources will play an increasingly important role in this series.
Also, I’ve decided to join the town_state information again. For this post, I am going to leverage this information to predict our deliveries. Hopefully, it will be worth the effort.
# New variable to order by
combined$order <- 1:nrow(combined)

town_state <- read.csv("town_state.csv", header = TRUE, fileEncoding = "UTF-8-BOM")

# Let's change 'Queretaro de Arteaga' to 'QUERETARO'
town_state$State <- as.character(town_state$State)
town_state$State[town_state$State == "Queretaro de Arteaga"] <- "QUERETARO"

# Join town_state to combined, then restore the original row order
combined <- merge(combined, town_state, by = "Agencia_ID")
combined <- combined[order(combined$order), ]

# Convert the candidate predictors to factors
combined$Semana <- as.factor(combined$Semana)
combined$Canal_ID <- as.factor(combined$Canal_ID)
combined$State <- as.factor(combined$State)
A few things are going on in this snippet; let’s go over them.
After importing the file, I noticed that there was a small enough number of states, 33, to use as a factor (category). The only issue was that the R implementation of the method I wanted to try, Random Forest, supports at most 32 levels for a categorical predictor. Some Google-fu led me to believe that ‘Queretaro de Arteaga’ is the same place as ‘QUERETARO’.
Since the former had only 2 records, I just changed these to the latter, effectively ending up with 32 levels for our factor-to-be of State. Neat.
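The effect of the recode can be sanity-checked with a quick count of distinct values. A toy vector shows the idea (the real State column goes from 33 distinct values before the recode to 32 after):

```r
# Collapse the alternate spelling and count distinct states
# (toy data; the real column shrinks from 33 values to 32)
states <- c("QUERETARO", "Queretaro de Arteaga", "JALISCO", "JALISCO")
states[states == "Queretaro de Arteaga"] <- "QUERETARO"
length(unique(states))  # 2 distinct states remain in this toy example
```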
Lastly, I added and used an order variable on combined, because merges change the order of records and I need to preserve it so I can still refer to the sets (train and test) by position.
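The order trick is easy to see on a small example: merge() sorts its result by the join key, scrambling the original row order, and the pre-assigned index lets us undo that afterwards (toy frames with illustrative values):

```r
# merge() sorts the result by the 'by' column, so row order changes;
# tagging rows with an index beforehand lets us restore it (toy data)
left   <- data.frame(Agencia_ID = c(2, 1, 2), x = c(10, 20, 30))
lookup <- data.frame(Agencia_ID = c(1, 2),
                     Town = c("A", "B"), stringsAsFactors = FALSE)

left$order <- 1:nrow(left)
m <- merge(left, lookup, by = "Agencia_ID")
m <- m[order(m$order), ]
m$x  # back in the original order: 10, 20, 30
```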
Let’s do a quick check to see the results of the work so far.
At this point, I have 4 factors available. Of these, Semana, Canal_ID and State have 32 levels or fewer. These are the factors to be used to make my first prediction.
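One quick way to run that check is to ask each column whether it is a factor and, if so, how many levels it has. Sketched here on a toy frame; running the same two sapply() calls against combined gives the counts quoted above:

```r
# List factor columns and their level counts (toy frame standing in
# for 'combined'; the column names follow the post)
df <- data.frame(Semana = factor(c(3, 4, 5)),
                 Canal_ID = factor(c(1, 2, 4)),
                 State = factor(c("JALISCO", "QUERETARO", "JALISCO")),
                 Venta_hoy = c(25.1, 33.9, 12.0))

factor_cols <- sapply(df, is.factor)
sapply(df[factor_cols], nlevels)
```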
Actually, for this post, I will attempt two predictions. The first will use only Semana and Canal_ID. For the second, I will add State to the mix. This will let me compare the two and see what effect, if any, adding states has on the outcome.
Here is the setup for each and the running of the first one.
# Loading the Random Forest library
library(randomForest)

# Defining each prediction
var_list.01 <- c("Semana", "Canal_ID")
# var_list.02 <- c("Semana", "Canal_ID", "State")

# Parameters for operation
active_var_list <- var_list.01
sample_size <- 500000 # 5000
seed <- 1234
num_trees <- 1000

# The train records sit after the 6,999,251 test records in combined
train_rows <- 6999252:(6999252 + sample_size - 1)

# The desired prediction is the adjusted demand
rf.predict <- as.factor(combined$Demanda_uni_equil[train_rows])

# This is the train data (the same rows as the labels above,
# so features and labels stay aligned)
rf.train <- combined[train_rows, active_var_list]

# Setting a seed ensures the results are repeatable
set.seed(seed)

# Random forest call, using the x/y interface
combined.rf <- randomForest(x = rf.train, y = rf.predict,
                            importance = TRUE, ntree = num_trees)

# Check please
combined.rf
Let’s look at the output of combined.rf.
Based on our setup parameters, this reads as a 17.45% out-of-bag error rate in predicting inventory demand. I guess with little work comes little improvement. Still, it shows there is much work, and much potential, ahead. The goal is to bring this number down with a proper model and set of features.
Using these two variables, I did my first submission to Kaggle.
# Creating a Kaggle submission. Predict on the test portion of
# combined (its first 6,999,251 rows), which already has the factor
# columns; the raw 'test' frame does not.
predict_demand <- predict(combined.rf, combined[1:6999251, active_var_list])
submit <- data.frame(id = test$id, demanda_uni_equil = predict_demand)

# row.names = FALSE keeps the extra row-number column out of the file
write.csv(submit, file = "kaggleone.csv", row.names = FALSE)
After much processing by Kaggle, this was good enough for 1456th place out of 1559 teams. As expected, this is the bottom of the pack indeed. Hey, it’s a start.
Going back and adding State may improve my standing, but good progress has been made. Until next time.