Oakland License Plate Readings Database

Picking this topic up from the last post, this time I focus on enriching the released data so it can be explored further.

Let's use our previous schema as the starting point; the previous post produced a good base for the task at hand. The records from that post were stored in a table as shown in Figure 1.

image1

Figure 1 – License plate table readings.

In the previous post, we replaced the actual car license plate (TAG) with a key because of privacy concerns.  This time, we are going to do the same operation for the location information in the table.  Reviewing our model from the last post, our database looks like Figure 2.

image

Figure 2 – Database Model with Tags, our first dimension.

The general idea is to fetch all the tags that show up in the readings table and insert them, without duplicates, into their own table. To this table (the tag dimension table) we then add a key. Lastly, we replace every occurrence of each tag in the readings table with its key.
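As a quick illustration, here is roughly what that looks like in SQL. The table and column names (Readings, Tag, Dim_Tag, Tag_Key) are placeholders, not necessarily the exact names in the schema from the previous post:

    -- Build the tag dimension from the distinct tags in the readings table
    CREATE TABLE Dim_Tag AS
      SELECT ROWNUM AS ID, Tag
        FROM (SELECT DISTINCT Tag FROM Readings);

    -- Add the new key column to the readings table
    ALTER TABLE Readings ADD (Tag_Key NUMBER);

    -- Point every reading at its tag key
    UPDATE Readings r
       SET r.Tag_Key = (SELECT d.ID FROM Dim_Tag d WHERE d.Tag = r.Tag);

    -- Once verified, the raw tag column can be dropped
    ALTER TABLE Readings DROP COLUMN Tag;

In this sketch the plate itself still lives in Dim_Tag; for the privacy goal from the previous post, that dimension (or at least its Tag column) would simply not be shared with the readings.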

Similarly, let's extract a location dimension from the readings data.

Location Dimension

The records provided carry location information in the form of latitude and longitude, each value with 6 decimal places. The more decimals, the more accurate the location. Here is a very nice overview on latitude and longitude applicable to our task.

From said post, we borrow the information in Table 1.

image

Table 1 – Coordinate precision by decimal places and scale.

As we can see, 6 decimal places gives a tag reading about 4 inches of precision. That is overkill for this post. Let's assume that a car usually parks in the vicinity of the driver's destination. With 4 decimal places, we have about 33 feet of accuracy; assuming the width of a house or small office is about 50 feet, 33 feet is still more precision than we need. Using 3 decimal places gives us a location accuracy of about 328 feet, which sounds more sensible for placing a car on a map. Keep in mind also that the more precision we preserve, the more taxing any calculations become, and in our case the extra precision would not necessarily produce a better output.

Even though we will use 3 decimal places, we do not want to lose the resolution provided. Because of this, we keep the original values in the main table and use 3 decimal places in the dimension table.

Location Dimension Creation
  1. Create a new table, name it Dim_Location, with the following initial columns: ID, Latitude, Longitude.
  2. Populate this new table with every unique latitude/longitude combination (rounded to 3 decimal places) from the tag readings table.
  3. Add a new location key column to the tag readings table.
  4. Populate this new column with the Dim_Location ID that corresponds to each reading's coordinates (a SQL sketch of these steps follows this list).
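A rough SQL sketch of the four steps, again with illustrative table and column names (Readings, Latitude, Longitude, Location_Key), might look like this:

    -- Steps 1 and 2: build the dimension from the distinct rounded coordinates
    CREATE TABLE Dim_Location AS
      SELECT ROWNUM AS ID, Latitude, Longitude
        FROM (SELECT DISTINCT ROUND(Latitude, 3)  AS Latitude,
                              ROUND(Longitude, 3) AS Longitude
                FROM Readings);

    -- Step 3: add the location key column to the readings table
    ALTER TABLE Readings ADD (Location_Key NUMBER);

    -- Step 4: match each reading to its rounded coordinates
    UPDATE Readings r
       SET r.Location_Key = (SELECT d.ID
                               FROM Dim_Location d
                              WHERE d.Latitude  = ROUND(r.Latitude, 3)
                                AND d.Longitude = ROUND(r.Longitude, 3));

Note that the full-precision latitude and longitude stay in the readings table, as discussed above; only the dimension works at 3 decimal places.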

image

Figure 3 – Newly created location dimension

Just a few more steps. Having latitude and longitude in our location dimension allows us to grab a physical address as well. There are many ways, some free, to do this. I used the excellent Texas A&M Geo Services, which has both free and paid tiers that would fit many an enthusiast's needs. It takes a while to go through almost 3 million records on the free tier but, after a few days, we end up with the following populated schema.

image

Figure 4 – Completed location dimension.
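If you want to fold the geocoded results back into the dimension yourself, something along these lines works. This assumes the geocoder's output was loaded into a staging table (called Geocode_Stage here) keyed by the same rounded coordinates; the address columns are illustrative, not the exact fields the service returns:

    -- Room for the address attributes returned by the geocoding service
    ALTER TABLE Dim_Location ADD (Street VARCHAR2(200),
                                  City   VARCHAR2(100),
                                  Zip    VARCHAR2(10));

    -- Fold the geocoded results into the dimension
    MERGE INTO Dim_Location d
    USING Geocode_Stage g
       ON (d.Latitude = g.Latitude AND d.Longitude = g.Longitude)
     WHEN MATCHED THEN
       UPDATE SET d.Street = g.Street,
                  d.City   = g.City,
                  d.Zip    = g.Zip;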

Our schema now provides us with a lot more information. We can see that Oakland pretty much blankets the city scanning for expired tags and the like. These records span 2010 through 2014 and cover most of Oakland.

image

Image 1 – Tag readings in Oakland

Breaking down the readings by year, we can see reading activity increasing over time. Maybe it is only a matter of time before a city can scan all of its streets in a single day.

image

2010

image

2011

image

2012

image

2013

image

2014

Image 2 – Tag readings per year

Let's now pick a day, the most active day to be exact, to see what type of reach a city has. At 21,828 readings, September 9, 2012 comes in as the busiest license plate scanning date we have. That is a lot of readings in one day.
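Finding that date is a simple aggregation once the timestamps live in a proper date column. A sketch, assuming the readings table has a Reading_Time column of type DATE:

    -- Count readings per calendar day and keep the busiest one
    SELECT *
      FROM (SELECT TRUNC(Reading_Time) AS Reading_Date,
                   COUNT(*)            AS Readings
              FROM Readings
             GROUP BY TRUNC(Reading_Time)
             ORDER BY Readings DESC)
     WHERE ROWNUM = 1;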

image

Image 3 – Tags read on September 9, 2012.  Color darkens as time advances.

Lastly, this is a 5 minute increment animation for this date. It seems that the Oakland Police Department has at least one dedicated patrol car scanning plates around the city. Most of the time, you can see two or three covering Oakland.

Image 4 – 24 hour license plate time span in 5 minute increments.

Imagine this happening in hundreds of police departments across the US and you have a legitimate business case for catching cars with expired tags. Follow Uber's lead by making these scanners cheap enough to put one in every car, and there would be truly nowhere to hide.

City Of Oakland Plate Reader Data

Browsing Hacker News, I recently found out about the City of Oakland releasing almost 3 million records of license plate reader data.  The conversation there is way better than any blurb I could come up with.  However, this is a neat opportunity to mine this data as an academic exercise. 

The source hosts a list of CSV files with various bits of information. Common to all files, and of critical importance, are the date and time of each tag reading and its latitude and longitude. Supplemental information such as the site of the reading and its source is often given as well. Most worrisome is the fact that the data has not been cleansed: it includes the actual license plate for each reading instead of some ID. That would be the first thing to address before the data is re-shared and used here.

Tools

The only software needed for this post is your database of choice (Oracle here) and a text editor (Notepad++).  Also, Tableau Public is used for visualizations.

Raw Data

As previously mentioned, the data contains, at a minimum: license plate, timestamp and location (latitude and longitude). Secondary attributes such as source description, site name, etc. are often provided as well, depending on the file.

The 14 files combined provide about 2.8 million readings with the following information:

  1. License plate – This is the actual license plate. 
  2. Timestamp – Date and time of the reading.
  3. Site name – Apparently, these are police department areas or sectors.  From a previous Orlando Crime Post, I learned that police departments divide their coverage areas into sectors.
  4. Source description – This appears to be the police unit (car perhaps) that took said reading.
  5. Location – This is the combination of latitude and longitude.

image

Figure 1 – License plate reading exports from the City Of Oakland.

Action Items

Before this data can be useful, a few preparatory steps will give it much more longevity than it has as delivered. After combining all the files into one table, one of the first things to do is replace all the license plates with keys. This is not a dragnet operation; we are not really interested in who these people are but in what we can learn from the data itself. Another must-do item is to divide the location information into two separate columns, one for latitude and the other for longitude.

  1. Combine the files into one table
  2. Scrub the license plate data and replace each plate with a key
  3. Separate, at least, the date and time
  4. Derive usable latitude and longitude columns from the location field

These preliminary steps make this data more usable and provide a solid base to build upon for analysis.
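A rough sketch of the location and timestamp items, assuming the raw CSV rows were first loaded into a staging table (Readings_Stage here) and that the location arrives as a single 'latitude, longitude' text value; names and the location format are assumptions:

    -- Consolidated readings table with the location split into two numeric columns
    CREATE TABLE Readings AS
      SELECT Tag,
             Reading_Time,
             Site_Name,
             Source_Description,
             TO_NUMBER(REGEXP_SUBSTR(Location, '[-0-9.]+', 1, 1)) AS Latitude,
             TO_NUMBER(REGEXP_SUBSTR(Location, '[-0-9.]+', 1, 2)) AS Longitude
        FROM Readings_Stage;

    -- Date and time of day can then be pulled apart as needed
    SELECT TRUNC(Reading_Time)                 AS Reading_Date,
           TO_CHAR(Reading_Time, 'HH24:MI:SS') AS Time_Of_Day
      FROM Readings;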

image

Figure 2 – Consolidated license plate readings.

Inspecting The Data

Let's poke around and see what stands out from the set.

image

Table 1 – Readings by the numbers.

Readings By Hour

Figure 3 shows the tag reading breakdown by hour of day.  I was hoping the readers wouldn't work at night, or that some other anomaly would make the figure more interesting.  Still, it is clear there are two periods of high activity and two periods of low activity, so now we know when to drive in Oakland with an expired tag.  According to this data, 11 PM is definitely the wrong time of day to do so.

image

Figure 3 – Tag readings by hour.
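The hourly breakdown behind this figure is a simple aggregation; a sketch, again assuming a Reading_Time date column:

    -- Readings bucketed by hour of day
    SELECT TO_CHAR(Reading_Time, 'HH24') AS Hour_Of_Day,
           COUNT(*)                      AS Readings
      FROM Readings
     GROUP BY TO_CHAR(Reading_Time, 'HH24')
     ORDER BY Hour_Of_Day;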

Readings by Weekday

Figure 4 shows that, clearly, the weekends are the most likely days to get a ticket for an expired tag in Oakland.

image

Figure 4 – Readings by weekday.

Readings By Month

Looking at Figure 5, it is clear this is but a slice of license plate reading activity.  Look at how many months in the set have no data at all.

image

Figure 5 – Readings by month.

Readings By Site Name and Readings By Source Description

As previously mentioned, 'Site Name' seems to be the general location where the reading took place, and 'Source Description' seems to describe the unit that took the reading.  Unfortunately, we do not have enough readings classified by site name or source description.  It is still interesting to see the values in the set, but I do not see what we gain from them at this point.

image

Data Disclaimer

These license plate reading records do not cover the entirety of Oakland's geography or all the car tags registered in Oakland.  We are taking their accuracy as a given, and the timespan, though inclusive of a reasonable sample, is but a slice of life and may not accurately represent a complete day/week/etc. cycle.  There is much that could be done to 'validate' this data set.

Keeping this in mind when referring to the data helps a lot.  Instead of talking about 'Oakland license plates', it is more suitable to refer to license plate readings over time.

Summary

I've only scratched the surface of this data; this post provides a brief overview of what was collected.  Regardless of the reason, releasing these types of datasets seems to be the new normal.  We already expect to be tracked while driving, at work, at the mall, etc.  How this data is used and shared, and what is done with it, remains fluid.  The only thing we can be certain of is that more data will be collected in the future and that it will be used by parties other than those who originally collected it.

Let's dive into this data in an upcoming post and see how far a regular person can go exploring it.

Happy Elasticsearch

So last night I found this on my lab machine…

elasticsearchsad

A silly node ran out of space.  Spinning up an extra node promptly made this a non-issue.

elasticsearchhappy

Crisis averted. 

It is so nice to work with open source tools built to handle failure gracefully.  A few years ago the above scenario would have prompted a weekend at the colo, to the dismay of my family and my sanity.  These are interesting times!

OBIEE Reporting From Transactional Databases

This blog post is about OBIEE reporting.  Specifically, it is about skipping the data warehouse and reporting from the transactional database instead.  Oracle's OBIEE, like most BI reporting tools, is designed to report from star/snowflake schemas as the underlying structures.  Additionally, OBIEE's metadata layer is very rich and well thought out, allowing for a great deal of flexibility.  Oracle's metadata tool (the Admin Tool) lets us leverage that flexibility to bridge the gap between an OLTP and an OLAP model.  I am not negating the need for a data warehouse; I am just wondering whether all BI reporting projects merit one.

So the question to ask is: Is OBIEE up to the task?

Continue reading

Flex OLAP Cube – updated

I've added the ability to consume URLs whose output is XML as a data source for Flex OLAP Cube. Check it out :)

Going through this exercise has helped me understand a bit about creating data cubes and the uses they serve. This is but one of the many interesting (to me) things I am exposed to at work. Doing this from scratch provides me with insight unattainable with a 'shrink-wrap' tool.

Note – I had originally fetched data from across the web but had to store the XML files locally for show and tell. Enjoy.

Movielens OLAP Cube Slicing And Dicing On Demand

I wanted to write the post 'Slicing Your Own OLAP Cube' but I am not there yet. From my last post, a recurring theme in my friends' comments was that the dimensions and measures that can be inspected were set in stone. I thought I was doing fairly well, but I can see their point. Having a cube, but having it sliced in a way you don't need, is kind of useless.

Continue reading