Programming Collective Intelligence is a wonderful book, but being a first edition, it has quite a few errors that will easily throw off a Python newbie.
If only to keep me sane, this errata page at O’Reilly’s website is invaluable while working through the book. Especially the unconfirmed errata, which seem much more up to date.
I’ll be submitting errors as I go if they are not already listed. So far, I’ve run into two issues that weren’t there and submitted them. I hope they update the list frequently.
Hoping to find the error in my script, I am going over my steps for chapter four of the book Programming Collective Intelligence. So far, my code looks good as far as I can tell (I’m no expert).
I also tested all the dependencies for this chapter, mainly the installation of BeautifulSoup, an updated version of SQLite, and pysqlite.
BeautifulSoup’s and SQLite’s installations completed successfully (again). pysqlite also installed fine; however, testing the installation with an included script fails on my computer. I did not run this test the first time; I need to be more thorough. It seems there is a defect in the test script on the Mac.
If you’re running pysqlite-2.4.2 on a Mac, you can find more information here. This is not the cause of my earlier errors, though, since my script cannot even open the files to index them.
A post of confusion: my Python script does not read the files into the SQLite database.
I had to install SQLite 3 (my Mac came with a 2.x version) in order to install pysqlite-2.4.2. Maybe pysqlite is using the older SQLite version on my computer… I am following the book to a tee, not trying anything fancy in ColdFusion or SQL.
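One quick way to settle the version question is to ask the driver which SQLite library it actually linked against. With pysqlite-2.4.2 the import would be `from pysqlite2 import dbapi2 as sqlite`; the stdlib `sqlite3` module shown here exposes the same attribute:

```python
# Check which SQLite C library the Python driver actually links against.
# (pysqlite-2.4.2 would be imported as `from pysqlite2 import dbapi2 as sqlite`;
# the stdlib sqlite3 module exposes the same version attribute.)
import sqlite3

print(sqlite3.sqlite_version)  # version of the linked SQLite C library; should be 3.x
```

If this reports a 2.x version, the driver is picking up the old system library rather than the new install.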
Everything seems to be installed properly. When ‘searchengine.py’ runs from the terminal, it reports that it could not open any of the files in my directory. I tried feeding it all sorts of files, to no avail. Very frustrating…
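To rule out a permissions or path problem, a quick standalone check (my own diagnostic, not from the book) can confirm whether Python can open the files at all:

```python
import os

def check_readable(directory):
    """Return {filename: True/False} for whether each file opens and reads."""
    results = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path, 'rb') as f:
                f.read(64)  # just prove we can read a few bytes
            results[name] = True
        except OSError:
            results[name] = False
    return results
```

If this returns all True for the directory searchengine.py is pointed at, the problem is in how the script builds its paths, not in the files themselves.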
On the other hand, I am having a blast playing with SQLite. It is the coolest thing I’ve fiddled with today; very powerful and convenient.
FYI – SQLite is a ‘database in a file’. It’s very convenient for development, and there is much fanfare from people using it on production systems as well. It seems to come embedded in everything from Mail.app (yes, it does!) to cell phones and all sorts of electronic devices.
For lots of cool information on this, check out episode 26 of Leo Laporte’s FLOSS Weekly podcast (lately it is 😉). Also, Google Video has a great overview of SQLite (this is where I first saw it), though it’s a little dated.
Lastly – searching Google Video for ‘genre:educational _stuff_’ usually returns the most informative videos on topics of interest.
Next on PCI TODO: index my movie files using Python!
After retrieving all the movie files from Wikipedia, some cleanup was in order. I decided to remove, based on size, the files which did not point to the actual movie in question. A simple file-size check cut the set down considerably, to a little more than 430 movie files. While this does not make for an amazing data representation of the movies in the database, it will be more than enough data for the sake of the exercises. Conveniently, this operation leaves us with only 21 MB of textual data to work with, a third of the original set retrieved.
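The size-based cleanup was just a few lines of Python. A sketch of the idea (the 20 KB threshold here is illustrative; I tuned mine by eyeballing the stub pages):

```python
import os

def prune_small_files(directory, min_bytes=20 * 1024):
    """Delete files smaller than `min_bytes`; return the names removed."""
    removed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getsize(path) < min_bytes:
            os.remove(path)  # too small to be a real movie article
            removed.append(name)
    return removed
```

Running it over the scraped `pages` directory is what brought the set down to the ~430 real movie files.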
I’ve just completed the first step towards building the movie-pages search engine. Having spent the better part of an hour scraping Wikipedia, I’ve been able to save more than 67 MB of movie pages to my laptop.
DISCLAIMER – I am positive Wikipedia’s servers do not break a sweat fulfilling requests like these, but please bear in mind that abusing someone else’s servers is in bad taste.
I am going to have to scrub the data clean of all HTML tags, etc. Ideally, I will keep only the ‘plot’ section of each movie page. This will both keep the data size down and provide more relevant search results.
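BeautifulSoup will do the heavy lifting for the real scrub. Just to illustrate the idea, here is a dependency-free tag stripper using Python’s standard library parser (isolating only the ‘plot’ section would be layered on top of this):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text between tags, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style> blocks

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def strip_tags(html):
    """Return the page's visible text with markup removed and whitespace squashed."""
    parser = TextExtractor()
    parser.feed(html)
    return ' '.join(' '.join(parser.chunks).split())
```

For example, `strip_tags('<p>Plot: a <b>zombie</b> film.</p>')` leaves just the plain sentence behind.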
Here’s a code snippet (ColdFusion) for grabbing the pages from Wikipedia:
<cfset urlLink = "http://en.wikipedia.com/wiki/" />

<!--- datasource and table name are placeholders for my local setup --->
<cfquery name="getMovies" datasource="movies">
    SELECT itemId, Title
    FROM movies
    WHERE 1 = 1
    ORDER BY 1
</cfquery>

<cfoutput query="getMovies">
    <cfset movieStr = title />
    <!--- drop the trailing " (year)" part of the title, if present --->
    <cfif find("(", movieStr)>
        <cfset movieStr = left(movieStr, find("(", movieStr) - 2) />
    </cfif>
    <!--- Wikipedia URLs use underscores instead of spaces --->
    <cfset movieStr = replace(trim(movieStr), " ", "_", "all") />
    <cfset movieUrl = urlLink & movieStr />
    <cfhttp url="#movieUrl#" method="get" resolveurl="yes" />
    <cffile action="write" output="#cfhttp.FileContent#" file="#expandPath('pages')#/#title#.html" />
    <font color="red">#itemId# – #movieStr#</font><br/>
</cfoutput>
No, my code does not look like that in my IDE; there it is properly indented and spaced. I still do not know how to use WordPress 😦 This leaves us with a directory of about 1600 HTML files, one per movie.
At this point, my main focus will be to shrink the data to index as effectively as possible, hoping to end up with a clean set of one movie plot per page. This will be the source of the index of movie text (data) to search on eventually. Wow, that sounds like a lot of indexing, even if I can scrub away many of the useless parts of the pages.
Since I am starting with about 67 MB of text, it’s in my best interest to clean up as much as possible. Lots of scrubbing and parsing ahead. Let’s see how much of this textual data can be scrapped.
I am currently revisiting chapter four of Programming Collective Intelligence, in which the author builds a full-blown search engine. Many features of existing search engines are explored and tried. I think it would be neat to create a search engine for What Movie Now?
Doing this would require crawling some movie information repository for information on each movie in the set. This should be easily obtainable from Wikipedia, and it will be the basis for the material to both search and return as results.
Afterwards, I would need to index all the retrieved documents and store this information in a database.
The last step (and this is the interesting one) would be to write a query that returns a ranked list of documents based on the keywords supplied; i.e., a movie information search engine.
This resultset would be returned to the search request and, woohoo, a search engine is born. The method for doing this is not much different from ranking user ratings…
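As a toy preview of that last step, here is a frequency-only ranking query against a chapter-4-style schema. The table and column names are my assumption of the book’s layout, and the data is made up:

```python
# Toy ranked search: score each page by how often the query word appears.
# Schema (urllist / wordlist / wordlocation) is my assumption of the
# book's chapter-4 layout; the data below is invented for illustration.
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.executescript('''
    create table urllist(url text);
    create table wordlist(word text);
    create table wordlocation(urlid integer, wordid integer, location integer);
''')

# two pages; only the first mentions "zombie", twice
cur.execute("insert into urllist(url) values ('Night_of_the_Living_Dead')")
cur.execute("insert into urllist(url) values ('Casablanca')")
cur.execute("insert into wordlist(word) values ('zombie')")
for urlid, wordid, loc in [(1, 1, 4), (1, 1, 9)]:
    cur.execute('insert into wordlocation values (?, ?, ?)', (urlid, wordid, loc))

# rank pages by raw frequency of the query word
cur.execute('''
    select u.rowid, u.url, count(*) as score
    from urllist u, wordlocation w, wordlist l
    where w.urlid = u.rowid and w.wordid = l.rowid and l.word = 'zombie'
    group by u.rowid
    order by score desc
''')
rows = cur.fetchall()
print(rows)
```

Real ranking would layer on word location, distance between query words, and link-based scores, but the shape of the query stays the same.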
One cool thing the book points out is the number of metrics that can be gathered from what users search for, the results they get and, most importantly, which documents they click on. Very neat indeed.
I will be outlining the details of each step here. This is pretty much my to-do list; more later.
I’ve added a brief F.A.Q. to the site to provide some background information. Maybe it should have an ‘updated’ label somewhere stating the last time I added anything… Hopefully that will be often enough to warrant the label 😉