I scored in the top 65% on the private leaderboard, which counts as the official score for this contest: of 1,969 teams, my team (just myself) ranked 1,261 with an RMSLE of 0.56330 as the measurement of accuracy.
On the public leaderboard, however, I ranked in the top 18% with an RMSLE of 0.45970.
This second score is much better because I was able to submit my model for scoring, up to three times a day, and refine it accordingly. At some point I was ranked in the top 11%, but this was short lived, lasting no more than a week or so before some statistics giants woke up and ate my lunch.
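RMSLE (root mean squared logarithmic error) is simple enough to sketch in a few lines. This is a minimal implementation of the general formula, not Kaggle's own scoring code:

```python
import math

def rmsle(actual, predicted):
    # Root Mean Squared Logarithmic Error: because it works on log1p of
    # the values, it penalizes *relative* error, so being off by 10 units
    # on a demand of 10 hurts far more than on a demand of 1,000.
    assert len(actual) == len(predicted)
    total = sum((math.log1p(p) - math.log1p(a)) ** 2
                for a, p in zip(actual, predicted))
    return math.sqrt(total / len(actual))

print(rmsle([3, 5, 2], [3, 5, 2]))  # perfect predictions -> 0.0
```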
Having defined the problem in the previous post, I’ve decided to attempt a first prediction to address it. Per the submission requirements, this requires us to use the complete dataset to supply a CSV file with both the id of each ‘delivery’ and its predicted adjusted demand.
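The shape of that submission file is easy to sketch. Here is a minimal example that writes a constant-prediction baseline; the column names match the competition's sample submission, but the ids and the constant value are stand-ins, not my actual model output:

```python
import csv

# id -> predicted adjusted demand; a constant guess stands in for a model.
predictions = {0: 4, 1: 4, 2: 4}

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "Demanda_uni_equil"])  # required header
    for row_id, demand in sorted(predictions.items()):
        writer.writerow([row_id, demand])
```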
With the database from the last post in mind, we can now go over the information provided for this contest. Most interesting to me is the distribution of inventory delivered versus inventory returned.
Above, we can see the number of units sold each week. The green portion of the bar indicates the number of units consumed and the red portion indicates the number of units returned (unsold) from the previous week.
Here we can see the monetary amount for units sold per week, together with the monetary amount of the unsold units returned from the previous week.
Let’s prepare the data that gets us here.
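The aggregation behind the charts above can be sketched like this. The column names (Semana for the week number, Venta_uni_hoy for units sold, Dev_uni_proxima for units returned the following week) come from the competition's raw data; the records here are tiny stand-ins for the millions of rows in the real file:

```python
from collections import defaultdict

# Stand-in records shaped like rows of the competition's train file.
rows = [
    {"Semana": 3, "Venta_uni_hoy": 10, "Dev_uni_proxima": 1},
    {"Semana": 3, "Venta_uni_hoy": 5,  "Dev_uni_proxima": 0},
    {"Semana": 4, "Venta_uni_hoy": 8,  "Dev_uni_proxima": 2},
]

# Sum units sold and units returned per week -- the green and red
# portions of the bars in the chart above.
sold = defaultdict(int)
returned = defaultdict(int)
for r in rows:
    sold[r["Semana"]] += r["Venta_uni_hoy"]
    returned[r["Semana"]] += r["Dev_uni_proxima"]

for week in sorted(sold):
    print(week, sold[week], returned[week])
```

The same loop, pointed at the monetary columns (Venta_hoy and Dev_proxima), produces the pesos chart.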
For complete information on this competition, please go to Maximize sales and minimize returns of bakery goods. In a nutshell, Grupo Bimbo, makers of the cookies from our childhood, presents an optimization problem with a lot of data in the hopes of delivering the right amount of inventory to meet, but not overestimate, demand.
My interest in this competition comes from a random email from Kaggle and a fondness for the cookies common in lunchboxes of our youth. Zero Kaggle experience, and equally little experience with the problem at hand, makes for an interesting challenge to look at.
Picking this topic up from the last post, I focused on enriching the data released. This will allow further exploration of this data.
Let’s use our previous schema as our starting point: the previous post produced a good foundation for the task at hand. The records from that post were stored in a table as shown in Figure 1.
Figure 1 – License plate table readings.
Browsing Hacker News, I recently found out about the City of Oakland releasing almost 3 million records of license plate reader data. The conversation there is way better than any blurb I could come up with. However, this is a neat opportunity to mine this data as an academic exercise.
From the source, they are hosting a list of CSV files with various bits of information. Common to all files, and of critical importance, are the date and time of each tag reading and its latitude and longitude. Supplemental information, such as the site of the reading and its source, is often given as well. Most worrisome is the fact that the data has not been cleansed: it includes the actual license tag for each reading instead of some anonymized ID. That tag would be the first thing to remove before this data is re-shared and used here.
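That cleanup step could look something like the following: replace each raw tag with a salted hash, so readings of the same plate stay linkable without exposing the plate itself. The field names and the sample reading are my own assumptions, not the actual CSV layout:

```python
import hashlib

# A secret salt prevents anyone from rebuilding the tag -> ID mapping by
# hashing every possible plate; it must be kept out of the shared data.
SALT = b"change-me-and-keep-secret"

def tag_to_id(tag: str) -> str:
    # Same tag always maps to the same short, non-reversible ID.
    return hashlib.sha256(SALT + tag.encode("utf-8")).hexdigest()[:12]

# Hypothetical reading with made-up field names.
reading = {"timestamp": "2014-05-01 08:30:00",
           "lat": 37.8044, "lon": -122.2712,
           "tag": "7ABC123"}
reading["tag"] = tag_to_id(reading["tag"])
```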
In search of public (and fast and low-cost) geocoding services, I’ve run into Texas A&M GeoServices.
I have only tested their reverse geocoding service, and it was all three of the above. It took no more than a cup of coffee to get addresses back from latitude and longitude pairs, and the information added looks very promising.
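The general shape of such a call is a GET request keyed by latitude, longitude, and an API key. The base URL and parameter names below are placeholders, not TAMU GeoServices' actual API; check their documentation and key terms before using the real endpoint:

```python
from urllib.parse import urlencode

# Placeholder endpoint -- NOT the real TAMU GeoServices URL.
BASE = "https://geoservices.example.edu/reverse"

def reverse_geocode_url(lat, lon, api_key="YOUR_KEY"):
    # Build the query string for one reverse-geocoding lookup.
    return BASE + "?" + urlencode({"lat": lat, "lon": lon, "apiKey": api_key})

print(reverse_geocode_url(37.8044, -122.2712))
```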
They even have a partnering program that could minimize the expense of using the service. Neat! Expect upcoming posts to attribute all geo data to them like so:
Geo-stuff provided by Texas A&M University GeoServices