Programming Collective Intelligence is a wonderful book, but being a first edition, it has quite a few errors that will easily throw off a Python newbie.
If only to keep me sane, this errata page at O’Reilly’s website is invaluable while working through the book. Especially the unconfirmed errata, which seem much more up to date.
I’ll be submitting errors as I go if they are not already listed. So far, I’ve run into two issues that weren’t there and submitted them. I hope they update the list frequently.
Hoping to find the error in my script, I am going over my steps for chapter four of the book Programming Collective Intelligence. So far, my code looks good as far as I can tell (I’m no expert).
I also tested all the dependencies for this chapter, mainly the installation of BeautifulSoup, an updated version of SQLite, and pysqlite.
BeautifulSoup’s and SQLite’s installations completed successfully (again). pysqlite also installed fine; however, testing the installation with an included script fails on my computer. I did not run this test the first time; I need to be more thorough. It seems there is a defect in the test script on the Mac.
If you’re running pysqlite-2.4.2 on a Mac, you can find more information here. This is not the cause of my earlier errors, though, since my script cannot even open the files to index them.
A post of confusion: my Python script does not read the files into the SQLite database.
I had to install SQLite 3 (my Mac came with a 2.x version) in order to install pysqlite-2.4.2. Maybe pysqlite is using the older SQLite version on my computer… I am following the book to a tee, not trying anything fancy in ColdFusion or SQL.
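One quick way to settle the version question is to ask the driver which SQLite library it actually linked against. With pysqlite-2.4.2 the import would be `from pysqlite2 import dbapi2 as sqlite`; the stdlib `sqlite3` module shown here exposes the same attribute:

```python
# Check which SQLite C library the Python driver actually links against.
# (pysqlite-2.4.2 would be imported as `from pysqlite2 import dbapi2 as sqlite`;
# the stdlib sqlite3 module exposes the same version attribute.)
import sqlite3

print(sqlite3.sqlite_version)  # version of the linked SQLite C library; should be 3.x
```

If this reports a 2.x version, the driver is picking up the old system library rather than the new install.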
Everything seems to be installed properly. When ‘searchengine.py’ runs from the terminal, it reports that it could not open any of the files in my directory. I tried feeding it all sorts of files, to no avail. Very frustrating…
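To rule out a permissions or path problem, a quick standalone check (my own diagnostic, not from the book) can confirm whether Python can open the files at all:

```python
import os

def check_readable(directory):
    """Return {filename: True/False} for whether each file opens and reads."""
    results = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path, 'rb') as f:
                f.read(64)  # just prove we can read a few bytes
            results[name] = True
        except OSError:
            results[name] = False
    return results
```

If this returns all True for the directory searchengine.py is pointed at, the problem is in how the script builds its paths, not in the files themselves.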
On the other hand, I am having a blast playing with SQLite. It is the coolest thing I’ve fiddled with today; very powerful and convenient.
FYI – SQLite is a ‘database in a file’. It’s very convenient for development, and there is much fanfare from people using it on production systems as well. It seems to come embedded in everything from Mail.app (yes, it does!) to cell phones and all sorts of electronic devices.
For lots of cool information on this, check out episode 26 of Leo Laporte’s FLOSS Weekly podcast (lately it is 😉). Also, Google Video has a great overview of SQLite (this is where I first saw it), though it’s a little dated.
Lastly – searching Google Video for ‘genre:educational _stuff_’ usually returns the most informative videos on topics of interest.
Next on PCI TODO: index my movie files using Python!
After retrieving all the movie files from Wikipedia, some cleanup was in order. I decided to remove, based on size, the files which did not point to the actual movie in question. A simple file-size check cut the set down considerably, to a little more than 430 movie files. While this does not make for an amazing data representation of the movies in the database, it will be more than enough data for the sake of the exercises. Conveniently, this operation leaves us with only 21 MB of textual data to work with, a third of the original set retrieved.
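The size-based cleanup was just a few lines of Python. A sketch of the idea (the 20 KB threshold here is illustrative; I tuned mine by eyeballing the stub pages):

```python
import os

def prune_small_files(directory, min_bytes=20 * 1024):
    """Delete files smaller than `min_bytes`; return the names removed."""
    removed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getsize(path) < min_bytes:
            os.remove(path)  # too small to be a real movie article
            removed.append(name)
    return removed
```

Running it over the scraped `pages` directory is what brought the set down to the ~430 real movie files.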
I’ve just completed the first step towards building the movie-pages search engine. Having spent the better part of an hour scraping Wikipedia, I’ve been able to save more than 67 MB of movie pages to my laptop.
DISCLAIMER – I am positive Wikipedia’s servers do not break a sweat fulfilling requests like these, but please bear in mind that abusing someone else’s servers is in bad taste.
I am going to have to scrub the data clean of all HTML tags, etc. Ideally, I will keep only the ‘plot’ section of each movie page. This will both keep the data size down and provide more relevant search results.
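BeautifulSoup will do the heavy lifting for the real scrub. Just to illustrate the idea, here is a dependency-free tag stripper using Python’s standard library parser (isolating only the ‘plot’ section would be layered on top of this):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text between tags, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style> blocks

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def strip_tags(html):
    """Return the page's visible text with markup removed and whitespace squashed."""
    parser = TextExtractor()
    parser.feed(html)
    return ' '.join(' '.join(parser.chunks).split())
```

For example, `strip_tags('<p>Plot: a <b>zombie</b> film.</p>')` leaves just the plain sentence behind.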
Here’s a code snippet (ColdFusion) for grabbing the pages from Wikipedia:
<cfset urlLink = "http://en.wikipedia.com/wiki/" />

<!--- datasource and table name are placeholders for my local setup --->
<cfquery name="getMovies" datasource="movies">
    SELECT itemId, Title
    FROM movies
    WHERE 1 = 1
    ORDER BY 1
</cfquery>

<cfoutput query="getMovies">
    <cfset movieStr = title />
    <!--- drop the trailing " (year)" part of the title, if present --->
    <cfif find("(", movieStr)>
        <cfset movieStr = left(movieStr, find("(", movieStr) - 2) />
    </cfif>
    <!--- Wikipedia URLs use underscores instead of spaces --->
    <cfset movieStr = replace(trim(movieStr), " ", "_", "all") />
    <cfset movieUrl = urlLink & movieStr />
    <cfhttp url="#movieUrl#" method="get" resolveurl="yes" />
    <cffile action="write" output="#cfhttp.FileContent#" file="#expandPath('pages')#/#title#.html" />
    <font color="red">#itemId# – #movieStr#</font><br/>
</cfoutput>
No, my code does not look like that in my IDE; there it is properly indented and spaced. I still do not know how to use WordPress 😦 This leaves us with a directory of about 1600 HTML files, one per movie.
At this point, my main focus will be to shrink the data to index as effectively as possible, hoping to end up with a clean set of one movie plot per page. This will be the source of the index of movie text (data) to search on eventually. Wow, that sounds like a lot of indexing, even if I can scrub away many of the useless parts of the pages.
Since I am starting with about 67 MB of text, it’s in my best interest to clean up as much as possible. Lots of scrubbing and parsing ahead. Let’s see how much of this textual data can be scrapped.
I am currently revisiting chapter four of Programming Collective Intelligence, in which the author builds a full-blown search engine. Many features of existing search engines are explored and tried. I think it would be neat to create a search engine for What Movie Now?
Doing this would require crawling some movie information repository for information on each movie in the set. This should be easily obtainable from Wikipedia, and it will be the basis for the material to both search and return as results.
Afterwards, I would need to index all the retrieved documents and store this information in a database.
The last step (and this is the interesting one) would be to write a query that returns a ranked list of documents based on the keywords supplied; i.e., a movie information search engine.
This resultset would be returned to the search request and, woohoo, a search engine is born. The method for doing this is not much different from ranking user ratings…
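As a toy preview of that last step, here is a frequency-only ranking query against a chapter-4-style schema. The table and column names are my assumption of the book’s layout, and the data is made up:

```python
# Toy ranked search: score each page by how often the query word appears.
# Schema (urllist / wordlist / wordlocation) is my assumption of the
# book's chapter-4 layout; the data below is invented for illustration.
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.executescript('''
    create table urllist(url text);
    create table wordlist(word text);
    create table wordlocation(urlid integer, wordid integer, location integer);
''')

# two pages; only the first mentions "zombie", twice
cur.execute("insert into urllist(url) values ('Night_of_the_Living_Dead')")
cur.execute("insert into urllist(url) values ('Casablanca')")
cur.execute("insert into wordlist(word) values ('zombie')")
for urlid, wordid, loc in [(1, 1, 4), (1, 1, 9)]:
    cur.execute('insert into wordlocation values (?, ?, ?)', (urlid, wordid, loc))

# rank pages by raw frequency of the query word
cur.execute('''
    select u.rowid, u.url, count(*) as score
    from urllist u, wordlocation w, wordlist l
    where w.urlid = u.rowid and w.wordid = l.rowid and l.word = 'zombie'
    group by u.rowid
    order by score desc
''')
rows = cur.fetchall()
print(rows)
```

Real ranking would layer on word location, distance between query words, and link-based scores, but the shape of the query stays the same.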
One cool thing the book points out is the number of metrics that can be gathered from what users search for, the results they get and, most importantly, which documents they click on. Very neat indeed.
I will be outlining the details of each step here. This is pretty much my to-do list; more later.
I’ve added a brief F.A.Q. to the site to provide some background information. Maybe it should have an ‘updated’ label somewhere stating the last time I added anything… Hopefully that will be often enough to warrant the label 😉