For complete information on this competition, please go to the competition page, Maximize sales and minimize returns of bakery goods. In a nutshell, Grupo Bimbo, maker of the cookies from our childhood, presents an optimization problem with a lot of data, in the hope of delivering the right amount of inventory to meet demand without overestimating it.
My interest in this competition comes from a random email from Kaggle and a fondness for the cookies common in the lunchboxes of our youth. Having zero Kaggle experience, and just as little experience with the problem at hand, makes this an interesting problem to look at.
In short (the main page has all the details), the task is to develop a model that accurately predicts future product demand from the historical data provided. Inventory is delivered weekly to stores along delivery routes, and unsold inventory from the previous week is returned on the same visit. The ideal solution delivers just enough inventory each week that next to nothing is returned the following week.
With this in mind, the question is: What is product demand per week per store?
The challenges are various. For starters, these are perishable goods, and week-old products are bound to be past their best, if they are still good at all. Secondly, inventory is delivered over 45 thousand routes to one million stores in Mexico. Other challenges include a constantly changing product list and the difficulty of managing perishable goods while keeping them desirable. Lastly, the dataset is huge!
Data is provided in CSV files and includes primary information such as the train and test data and the sample submission format. Secondary data about products, agencies and clients is also provided. All data points are properly described and, aside from some variations in client names, the data seems to be very clean.
- cliente_tabla.csv, 20 MBs, 935,362 records. Client names. Can join to train and test files. Note that this file has duplicate clients (duplicate ids with different variations of the client's name). After cleaning up these duplicates, we end up with 930,500 records instead. That is almost 5 thousand entries trimmed from the file. Neat.
- producto_tabla.csv, 100 KBs, 2,592 records. Product names. Can join this one to train and test as well.
- sample_submission.csv, 67 MBs, 6,999,251 records. As the name says, this is the expected submission format. Joins to test, since it holds the values to be filled in for each test record.
- test.csv, 245 MBs, 6,999,251 records. This is the test set, the records to predict demand for.
- town_state.csv, 29 KBs, 790 records. The town and state of each agency. Can join to both test and train.
- train.csv, 3 GBs, 74,180,464 records. Training set of data.
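The client cleanup mentioned for cliente_tabla.csv can be sketched roughly as follows. This is only one possible approach, assuming that keeping the first name seen for each id is acceptable; the column names `Cliente_ID` and `NombreCliente` and the sample rows are assumptions made for illustration, not taken from the files described above.

```python
import csv
import io

def dedupe_clients(csv_file, id_col="Cliente_ID"):
    """Keep the first row seen for each client id and drop later name variants."""
    seen = {}
    for row in csv.DictReader(csv_file):
        # setdefault only stores the row if this id has not been seen yet
        seen.setdefault(row[id_col], row)
    return list(seen.values())

# Tiny made-up sample standing in for cliente_tabla.csv (hypothetical rows)
sample = io.StringIO(
    "Cliente_ID,NombreCliente\n"
    "1,LA TIENDITA\n"
    "1,TIENDITA LA\n"
    "2,ABARROTES LUPITA\n"
)
clients = dedupe_clients(sample)
```

For the real file, the `io.StringIO` sample would be swapped for an open file handle; the function itself does not change.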
Instead of jumping into using these files right away, let's put this data in a database for now. For preliminary exploration, and because this is the most puny laptop ever, this is worth the hassle. From here, exports to suit any need can be done without much trouble.
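As a rough sketch of how a 3 GB CSV can be loaded on a small machine, the rows can be read and inserted in chunks so the whole file never sits in memory. SQLite is an assumption here (the post does not say which database was used), and the table name, column names and sample rows are placeholders.

```python
import csv
import io
import itertools
import sqlite3

def load_csv(conn, table, csv_file, chunk_size=50_000):
    """Create `table` from the CSV header and insert rows in fixed-size chunks."""
    reader = csv.reader(csv_file)
    header = next(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" for _ in header)
    # Columns are created without types, so values are stored as text
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    total = 0
    while True:
        # Pull at most chunk_size rows from the reader at a time
        chunk = list(itertools.islice(reader, chunk_size))
        if not chunk:
            break
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', chunk)
        conn.commit()
        total += len(chunk)
    return total

# In-memory database and a made-up three-row sample standing in for train.csv
conn = sqlite3.connect(":memory:")
sample = io.StringIO("Semana,Cliente_ID,Producto_ID\n3,1,10\n3,2,10\n4,1,11\n")
loaded = load_csv(conn, "train", sample, chunk_size=2)
```

Committing once per chunk keeps memory use flat at the cost of some speed; a larger `chunk_size` trades the other way.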
I also edited the data for clarity as can be seen in the following mapping table.
Simply importing the files per table shown above and declaring all the possible joins between these tables, gives us a database schema as shown below.
A few queries against this database shed light on the nature of the data.
- 930,500 Clients. Of these, 9,663 show up in the test data set (the one to predict demand for) but do not exist in the train set. Interesting.
- 2,592 Distinct Products.
- 790 Agencies across 260 towns in 33 states in Mexico.
- Each of these agencies, also known as sales depots, contains several delivery routes.
- Each route serves multiple clients, delivering products and collecting returns.
- 9 Sales Channels. I do not know what these are.
- 9 weeks of sales data, broken into 7 weeks of train data and 2 weeks of test data.
- 3,603 routes in the train data, 2,608 routes in the test data. Some of these are bound to be the same. An interesting observation to follow up on.
- For the 7 weeks of train data, 1,799 different products were delivered across 552 agencies on 3,603 routes to 880,604 clients.
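Counts like "clients in test that do not exist in train", or the route overlap flagged above as a follow-up, come down to a set difference between the two tables. Below is a minimal sketch using SQLite's `EXCEPT`; the table layout, the column names `client_id` and `route_id`, and the sample rows are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature train and test tables
conn.executescript("""
CREATE TABLE train (client_id INTEGER, route_id INTEGER);
CREATE TABLE test  (client_id INTEGER, route_id INTEGER);
INSERT INTO train VALUES (1, 100), (2, 100), (3, 200);
INSERT INTO test  VALUES (2, 100), (4, 300), (5, 300);
""")

def only_in_test(conn, column):
    """Count distinct values of `column` that appear in test but never in train."""
    # EXCEPT returns the distinct rows of the first query missing from the second
    query = f"""
        SELECT COUNT(*) FROM (
            SELECT {column} FROM test
            EXCEPT
            SELECT {column} FROM train
        )
    """
    return conn.execute(query).fetchone()[0]

new_clients = only_in_test(conn, "client_id")
new_routes = only_in_test(conn, "route_id")
```

Swapping `EXCEPT` for `INTERSECT` would instead count the values the two sets share, which is the question left open for the routes.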
This looks like a serious logistical ordeal. Optimizing supply and minimizing returns would definitely have a positive effect on resources and finances. Neat.
There is lots of material left to examine and explore, yet for the next few posts I should focus on the competition at hand. This is my first time at Kaggle and I am intrigued by the whole thing.