PCI – Chapter Four – Cleaning the Data

After retrieving all the movie files from Wikipedia, some cleanup was in order. I decided to remove, based on size, the files that did not point to the actual movie in question. This simple file check cut the number of files down considerably, to a little more than 430 movie files. While this does not make for an amazing representation of the movies in the database, it will be more than enough data for the sake of the exercises. Conveniently, this operation leaves us with only 21 MB of textual data to work with, a third of the original set retrieved.
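
For illustration, a size-based check like this could be sketched in a few lines of Python. This is only a rough sketch: the function name `remove_small_files`, the `movies` directory and the 10 KB cutoff in the usage note are hypothetical, not the exact values used for this cleanup.

```python
import os


def remove_small_files(directory, min_bytes):
    """Delete regular files in `directory` smaller than `min_bytes`.

    Returns the sorted list of file names that were removed.
    The size threshold is an assumption; pick it by inspecting
    a few of the pages that did not resolve to a real movie.
    """
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip subdirectories; only plain files are candidates.
        if os.path.isfile(path) and os.path.getsize(path) < min_bytes:
            os.remove(path)
            removed.append(name)
    return removed
```

Calling something like `remove_small_files("movies", 10 * 1024)` would then drop every file under 10 KB, on the assumption that very small files are stubs or disambiguation pages rather than real movie articles. It is worth running it on a copy of the data first, since the deletion is permanent.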
