Browsing Hacker News, I recently found out about the City of Oakland releasing almost 3 million records of license plate reader data. The conversation there is way better than any blurb I could come up with. However, this is a neat opportunity to mine this data as an academic exercise.
The source hosts a list of CSV files with various bits of information. Common to all files, and of critical importance, are the date and time of each tag reading and its latitude and longitude. Supplemental information, such as the site of the reading and its source, is often given as well. Most worrisome is the fact that the data has not been cleansed: it includes the actual license tag for each reading instead of some ID. Replacing those tags is the first thing to address before the data is re-shared or used here.
As previously mentioned, the data contains, at a minimum: license plate, timestamp and location (latitude and longitude). Secondary attributes such as source description, site name, etc. are often provided as well, depending on the file.
Combined, the 14 files provide about 2.8 million readings with the following attributes:
- License plate – This is the actual license plate.
- Timestamp – Date and time of the reading.
- Site name – Apparently, these are police department areas or sectors. From a previous post on Orlando crime, I learned that police departments divide their areas into sectors.
- Source description – This appears to be the police unit (car perhaps) that took said reading.
- Location – This is the combination of latitude and longitude.
Figure 1 – License plate reading exports from the City of Oakland.
Before this data can be useful, a few preparatory steps will give it far more longevity than it has as is. After combining all these files into one table, one of the first things to do is replace every license plate with a key. This is not a dragnet operation: we are not interested in who these people are, only in what we can learn from the data itself. Another must-do item is to divide the location information into two separate columns, one for latitude and the other for longitude.
- Combine files into one table
- Scrub license plate data and replace with key
- Separate the timestamp into, at least, date and time
- Derive both usable latitude and longitude from location
These preliminary steps make the data more usable and provide a solid base to build upon for analysis.
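As a rough illustration of the four preparatory steps, here is a minimal Python sketch using only the standard library. The column names (plate, timestamp, location), the timestamp format and the "(lat, lon)" location format are all assumptions made for the example; the actual exports may differ from file to file and would need their own mapping.

```python
import csv
import io
from datetime import datetime

# Assumed formats, not confirmed against the actual exports.
TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M:%S"   # e.g. "01/15/2014 23:05:00"


def parse_location(text):
    """Split an assumed "(lat, lon)" string into two floats."""
    lat, lon = text.strip().strip("()").split(",")
    return float(lat), float(lon)


def prepare(csv_texts):
    """Combine several CSV texts into one list of scrubbed rows."""
    plate_keys = {}   # plate -> surrogate key; never written to the output
    rows = []
    for text in csv_texts:
        for record in csv.DictReader(io.StringIO(text)):
            plate = record["plate"].strip().upper()
            # The same plate always gets the same key; new plates get the next number.
            key = plate_keys.setdefault(plate, len(plate_keys) + 1)
            stamp = datetime.strptime(record["timestamp"], TIMESTAMP_FORMAT)
            lat, lon = parse_location(record["location"])
            rows.append({
                "plate_key": key,
                "date": stamp.date().isoformat(),
                "time": stamp.time().isoformat(),
                "latitude": lat,
                "longitude": lon,
            })
    return rows
```

A sequential key keeps repeat sightings of the same vehicle linked without storing the plate itself. Hashing the plate would be another option, but plain hashes of short plate strings are easy to reverse by brute force, so the lookup table approach seemed safer for this sketch.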
Figure 2 – Consolidated license plate readings.
Inspecting The Data
Let’s poke around and see what stands out from the set.
Table 1 – Readings by the numbers.
Readings By Hour
Figure 3 shows the tag readings broken down by hour of day. I was hoping the readers wouldn’t work at night, or for some other anomaly that would make the figure more interesting. Still, it is clear there are two periods of high activity and two periods of low activity. Now we know when to drive in Oakland with an expired tag. According to this data, 11 PM is definitely the wrong time of day to do so.
Figure 3 – Tag readings by hour.
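An hourly breakdown like this one can be reproduced with a simple count. The sketch below assumes the timestamps have already been parsed into Python datetime objects; the function name is made up for illustration.

```python
from collections import Counter
from datetime import datetime


def readings_by_hour(timestamps):
    """Return a 24-item list with the number of readings for hours 0 through 23."""
    counts = Counter(stamp.hour for stamp in timestamps)
    # Fill in hours with no readings so quiet periods show up as zeros.
    return [counts.get(hour, 0) for hour in range(24)]
```

The busiest hour is then `max(range(24), key=hourly.__getitem__)`, where `hourly` is the list returned above.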
Readings By Weekday
Figure 4 clearly shows that weekends are the most likely days to get a ticket for an expired tag in Oakland.
Figure 4 – Readings by weekday.
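The weekday view follows the same pattern. This sketch assumes a list of Python date objects and spells out the day names in a fixed list, so the result does not depend on the machine's locale.

```python
from collections import Counter
from datetime import date

# date.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def readings_by_weekday(dates):
    """Return a dict of weekday name -> number of readings, Monday first."""
    counts = Counter(d.weekday() for d in dates)
    return {name: counts.get(index, 0) for index, name in enumerate(WEEKDAYS)}
```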
Readings By Month
Looking at Figure 5, it is clear this is but a slice of license plate reading activity. Look how many months in the set have no data at all.
Figure 5 – Readings by month.
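Gaps like these are easy to miss in a chart, so it can help to list them explicitly. The sketch below counts readings per (year, month) and then walks from the first month with data to the last, reporting any month with none. It assumes a list of Python date objects, and the function names are hypothetical.

```python
from collections import Counter
from datetime import date


def readings_by_month(dates):
    """Count readings per (year, month) pair."""
    return Counter((d.year, d.month) for d in dates)


def missing_months(dates):
    """List (year, month) pairs with no readings between the first and last month seen."""
    counts = readings_by_month(dates)
    if not counts:
        return []
    year, month = min(counts)
    last = max(counts)
    gaps = []
    while (year, month) <= last:
        if (year, month) not in counts:
            gaps.append((year, month))
        # Advance one calendar month, rolling over at year end.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return gaps
```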
Readings By Site Name and Readings By Source Description
As previously mentioned, ‘Site Name’ seems to be the general location where the reading took place. Source description seems to be the description of the unit that took the reading. Unfortunately, we do not have enough readings classified into enough site names or source descriptions. It is still interesting to see the values in the set, but I do not see what to gain from them at this point.
These license plate reading records do not cover the entirety of either Oakland’s geography or all the car tags registered in Oakland. We are taking their accuracy as a given, and the timespan, though inclusive of a reasonable sample, is but a slice of life and may not be an accurate representation of a complete day, week, etc. cycle. There is much that could be done to ‘validate’ this data set.
Keeping this in mind when referring to the data helps a lot. Instead of focusing on ‘Oakland License Plates’, it is more suitable to refer to license plate readings over time.
I’ve only scratched the surface of this data. This post provides a brief overview of the data collected. Regardless of the reason, releasing these types of datasets seems to be the new normal. We already expect to be tracked while driving, at work, at the mall, etc. How this data is used, shared and what is done with it remains fluid. The only thing we can be certain of is that more data will be collected in the future and that it will be used by parties other than those who originally collected it.
Let’s dive into this data in an upcoming post and see how far a regular person can go exploring it.