Programming Collective Intelligence is a wonderful book but, being a first edition, it has quite a few errors that can easily throw off a Python newbie.
If only to keep me sane, this page at O’Reilly’s website is invaluable while working through the book, especially the unconfirmed page, which seems a lot more up to date.
I’ll be submitting errors as I go if they are not there already. So far, I’ve run into two issues that were not listed and submitted them. I hope they update frequently.
Hoping to find an error in my script, I am going over my steps for chapter four of the book Programming Collective Intelligence. So far, my code looks good as far as I can tell (I’m no expert).
Also, I tested all dependencies for this chapter, mainly the installation of BeautifulSoup, an updated version of SQLite and pysqlite.
BeautifulSoup’s and SQLite’s installations completed successfully (again). pysqlite also installed fine; however, testing the installation with an included script fails on my computer. I did not run this test the first time; I need to be more thorough. It seems there is a defect in the test script on the Mac.
If you’re running pysqlite-2.4.2 on a Mac, you can find more information here. This is not the cause of my previous errors, however, since the problem there is that my script cannot open the files to index them.
A post of confusion: my Python script does not read the files ‘into’ the SQLite database.
I had to install SQLite3 (the Mac came with a 2.x version) in order to install pysqlite-2.4.2. Maybe pysqlite is using the older SQLite version on the computer… I am following the book to a T, not trying any fancy ColdFusion or SQL.
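One way to test that theory is to ask Python which SQLite library it is actually linked against. The sketch below assumes Python 3 and its standard-library sqlite3 module rather than the pysqlite2 package the book uses, and the function names are made up for illustration.

```python
import sqlite3


def linked_sqlite_version():
    """Version tuple of the SQLite C library this Python build is linked against."""
    return sqlite3.sqlite_version_info


def engine_reported_version():
    """Ask the SQLite engine itself, via SQL, which version is running."""
    con = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        return con.execute("SELECT sqlite_version()").fetchone()[0]
    finally:
        con.close()


if __name__ == "__main__":
    print("linked library:", ".".join(map(str, linked_sqlite_version())))
    print("engine reports:", engine_reported_version())
```

If the major version printed is 3 or higher, the old 2.x library is at least not the one this module is using.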
Everything seems to be installed properly. When ‘searchengine.py’ runs from the terminal, it reports that it could not open any of the files in my directory. I tried feeding it all sorts of files, to no avail. Very frustrating…
On the other hand, I am having a blast playing with SQLite. By the way, SQLite is the coolest thing I’ve fiddled with today. This is very powerful and convenient.
FYI – SQLite is a ‘db in a file’. It’s very convenient for development, and there is much fanfare from people using it on production systems as well. It seems to come embedded in everything from Mail.app (yes it is!) to cell phones and all sorts of electronic devices.
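To make the ‘db in a file’ idea concrete, here is a minimal sketch using Python 3’s standard-library sqlite3 module. The table name, column name and sample title are hypothetical, chosen only for illustration.

```python
import sqlite3


def store_and_list_titles(db_path, titles):
    """Create (or reuse) a single-file database at db_path, add titles, list them all."""
    con = sqlite3.connect(db_path)  # the whole database lives in this one file
    try:
        con.execute("CREATE TABLE IF NOT EXISTS movies (title TEXT)")
        con.executemany(
            "INSERT INTO movies (title) VALUES (?)",
            [(title,) for title in titles],
        )
        con.commit()
        rows = con.execute("SELECT title FROM movies ORDER BY title").fetchall()
    finally:
        con.close()
    return [row[0] for row in rows]
```

There is no server to start or configure: copying or deleting the file copies or deletes the database.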
For lots of cool information on this, you can go to Leo Laporte’s FLOSS Weekly (lately it is 😉) episode 26 podcast. Also, Google Video has this great overview of SQLite (this is where I first saw SQLite), though it is a little dated.
Lastly – searching on Google Video for ‘genre:educational _stuff_’ usually returns the most informative videos on topics of interest.
Next on PCI TODO: index my movie files using Python!
After retrieving all the movie files from Wikipedia, some cleanup was in order. I decided to remove, based on size, the files which did not point to the actual movie in question. This simple file check cut down the data size considerably, to a little more than 430 movie files. While this does not make for an amazing representation of the movies in the database, it will be more than enough data for the sake of the exercises. Conveniently, this operation leaves us with only 21 MB of textual data to work with, a third of the original set retrieved.
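A size-based check like this could be sketched in Python roughly as follows. The function name and the byte threshold are hypothetical (the actual cutoff used is not recorded here), and the sketch only lists candidates rather than deleting anything.

```python
import os


def files_smaller_than(directory, min_bytes):
    """Return paths of regular files in directory whose size is below min_bytes."""
    small = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip sub-directories; only plain files are candidates for removal
        if os.path.isfile(path) and os.path.getsize(path) < min_bytes:
            small.append(path)
    return small
```

Reviewing the returned list before calling os.remove on each path is a cheap safeguard against throwing away real movie pages.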
I’ve just completed the first step towards building the movie pages search engine. Having spent the better part of an hour scraping Wikipedia, I’ve been able to save more than 67 MB of movie pages to my laptop.
DISCLAIMER – Although I am positive Wikipedia’s servers do not break a sweat fulfilling requests like these, please bear in mind that abusing someone else’s servers is in bad taste.
I am going to have to scrub the data clean of all HTML tags, etc. Ideally, I will only keep the ‘plot’ of each movie in each page. This will both keep the data size down and provide more relevant search results.
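As a rough idea of what the tag scrubbing could look like, here is a minimal sketch that uses Python 3’s standard-library html.parser instead of the BeautifulSoup library the book relies on. The class and function names are hypothetical, and it does not attempt to isolate the ‘plot’ section; it only drops tags, scripts and styles.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_tags(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

For example, strip_tags("<p>A <b>hero</b> wins.</p>") gives "A hero wins."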
Here’s the code (ColdFusion) snippet for grabbing the pages from Wikipedia:
<cfset urlLink = "http://en.wikipedia.com/wiki/">
<cfquery name="getMovies"
SELECT itemId, Title
WHERE 1 = 1
ORDER BY 1
<cfset movieStr = title />
<cfset movieStr =
replace(left(title, find("(",title)-2), " ","_","all") />
<cfset movieStr =
trim(right(left(title, find("(",title)-2),len(left(title, find("(",title)-2))-find(",",title)-1)) & " "
<cfset movieUrl = variables.urlLink & movieStr />
<cfhttp url="#movieUrl#" method="get" resolveurl="yes">
<cffile action="write" output="#cfhttp.FileContent#" file="#expandPath("pages")#/#title#.html">
<font color="red">#itemId# – #movieStr#</font><br/>
No, my code does not look like that in my IDE; it is properly indented and spaced, etc. I still do not know how to use WordPress 😦. This leaves us with a directory of about 1600 HTML files, one per movie.
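For comparison, the title handling in the snippet above might look roughly like this in Python. This is only a sketch of my reading of that string manipulation: it assumes titles shaped like "Usual Suspects, The (1995)", moves only a trailing article (The, A, An) to the front, and uses a hypothetical function name.

```python
import re

# Matches names such as "Usual Suspects, The" (an assumed title shape)
TRAILING_ARTICLE = re.compile(r"^(?P<rest>.*), (?P<article>The|A|An)$")


def wiki_page_name(title):
    """Turn a catalogue-style movie title into a Wikipedia-style page name."""
    # Drop everything from the first "(" onwards, e.g. the " (1995)" suffix
    name = title.split("(", 1)[0].strip()
    match = TRAILING_ARTICLE.match(name)
    if match:
        # "Usual Suspects, The" becomes "The Usual Suspects"
        name = match.group("article") + " " + match.group("rest")
    # Wikipedia page names use underscores in place of spaces
    return name.replace(" ", "_")
```

With those assumptions, wiki_page_name("Usual Suspects, The (1995)") returns "The_Usual_Suspects".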
At this point, my main focus will be to decrease the size of the data to index as effectively as possible, hoping to end up with a clean set of pages, one per movie. This will eventually be the source of my index of movie text (data) to search on. Wow, that sounds like a lot of indexing, even if I can scrub a lot of the useless parts of the pages.
Since I am starting with about 67 MB of text, it’s in my best interest to clean up as much as possible. Lots of scrubbing and parsing ahead. Let’s see how much of this textual data can be scraped off.
I am currently revisiting chapter four of Programming Collective Intelligence, in which they build a full-blown search engine. Many features of existing search engines are explored and tried. I think it would be neat to create a search engine for What Movie Now?
Doing this would require crawling some movie information repository for information on each movie in the set. This should be easily obtainable from Wikipedia and will be the basis for the material to both search and return as search results.
Afterwards, I would need to index all the documents retrieved and store this information in a database.
The last step (and this is the interesting one) would be to write a query that returns a ranked list of documents based on the keywords supplied; in other words, a movie information search engine.
This result set would be returned in response to the search request and, woohoo, a search engine is born. The methods for doing this are not much different from ranking user ratings…
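The index-then-rank steps above can be sketched in a few lines of Python. This is a deliberately simple in-memory version that ranks by raw word frequency, not the database-backed engine from the book, and every name in it is hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each document id to a Counter of its word frequencies."""
    return {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}


def search(index, query):
    """Return (doc_id, score) pairs for documents containing every query word,
    best score first; the score is the summed frequency of the query words."""
    words = tokenize(query)
    scores = {}
    for doc_id, counts in index.items():
        if words and all(counts[word] > 0 for word in words):
            scores[doc_id] = sum(counts[word] for word in words)
    # Highest score first; ties broken by document id for stable output
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Raw frequency favours long documents, so a real engine would normalise the scores and combine several ranking signals.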
One cool thing the book points out is the number of metrics that can be gathered from what users search for, the results they get and, most importantly, which documents they click on. Very neat indeed.
I will be outlining the details of each step here. This is pretty much my to-do list; more later.
I’ve added a brief F.A.Q. on the site to provide some background information. Maybe this should have an ‘updated’ label somewhere stating the last time I added anything… This will hopefully be often enough to warrant a label 😉
Finally, I was able to secure low-cost hosting for trying out the skills from the book.
After tons of inconveniences, I’ve launched whatmovienow.com. This is a work in progress and I will try to add features based on the collective intelligence book as best I can. The site mostly employs the ranking algorithms from the book. It does not re-evaluate movies based on rankings from site visitors, as that would take too many resources. Updates should be a lot quicker now that the hard work is done :).
Getting the site off the ground has taken more time than I had intended.
First, I had to change from MySQL to MSSQL. I was going to use Dreamhost for hosting the MySQL database, but performance was very irregular and sluggish.
Second, I originally wrote the website in ColdFusion, using the Model-Glue framework. This worked fine on my local computer; however, the hosting provider had some restrictions which further delayed deployment. I ended up with two slightly different versions of the site, one for local dev and one for live :(. I intend to configure my local computer to better reflect the production server.
How apologetic… the only thing that matters is that the site is out and I can resume writing the blog.
To finalize my study of chapter two, I have decided to embark on creating a mini-scale site of movie recommendations based on both the movie information available at Movielens and the algorithms provided in Programming Collective Intelligence Book.
I have had a few complications getting the site hosted because of the intense pre-processing of data needed to revise movie rankings as the site accumulates rankings from visitors. So far, the movie dataset takes close to three hours of processing before I can use the statistical data to provide updated recommendations.
As an exercise, I will most likely have to cut back on some of the features and capabilities outlined in the book, but most of the good stuff I can represent in a website.
Lastly, I have already outlined Chapter Four – Searching and Ranking, and it definitely lends itself to another feature for this upcoming website. Very cool stuff.
Stay tuned as I plan the site and logistics; there should be more info here within a week or so 😉
I am a third of the way into the problems on chapter three and already thinking that I should have used Python for chapter two. Absolutely wonderful and efficient, no fluff whatsoever. Live and learn…
Chapter three starts off by mentioning that the techniques and practices involved are of the data-intensive (and heavily computational as well, I bet) type. Learning from the previous chapter, I am definitely not using SQL as my sole tool for the job…
This chapter focuses on data clustering which, as best I can describe it, means finding how much alike items in a data set are when there is not enough (known) information to make obvious comparisons or well-defined associations.
Having just finished the problems on Word Vectoring, I am still eager to find some relevant use of this skill that helps me understand the problem domain. Basically, I am parsing blog feeds for word frequency to infer which blogs are alike. Besides this being an awesome extension to chapter two, I haven’t come up with a typical-style app or tool that would make a good case (an exciting one for me) for doing so.
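The core of the word-frequency comparison can be sketched as below: count the words in each blog, line the counts up over a shared vocabulary, and compare the two vectors with a Pearson correlation. The function names are hypothetical, plain strings stand in for parsed feeds, and this is a simplified illustration rather than the book’s clustering code.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count how often each lower-cased word appears in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def pearson(v1, v2):
    """Pearson correlation of two equal-length numeric vectors (0.0 if undefined)."""
    n = len(v1)
    if n == 0:
        return 0.0
    mean1, mean2 = sum(v1) / n, sum(v2) / n
    num = sum((x - mean1) * (y - mean2) for x, y in zip(v1, v2))
    den = math.sqrt(
        sum((x - mean1) ** 2 for x in v1) * sum((y - mean2) ** 2 for y in v2)
    )
    return num / den if den else 0.0


def blog_similarity(text_a, text_b):
    """Correlate the word-frequency vectors of two texts over their joint vocabulary."""
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    vocabulary = sorted(set(counts_a) | set(counts_b))
    return pearson(
        [counts_a[word] for word in vocabulary],
        [counts_b[word] for word in vocabulary],
    )
```

A value near 1 means the two blogs favour the same words, while a value near -1 means their emphasis is reversed.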
Then again, I am only a third of the way into the chapter, so my perspective may change in the next few days… Hope so.
For now, the most important thing to note is that, if you only have a hammer, everything looks like a nail.
Giving Python a chance!