5.04.2012

XKCD downloader or 'Ohai regular expressions'

An impromptu post finally made its way through the pipeline. I had been wanting to play around with regular expressions for a very long time but hadn't really gotten down to it, partly out of laziness and partly because it seemed a bit daunting; there was no way I was going to memorise those itsy bitsy characters just for the fun of it.

Finally, some incentive came along. I had been trying to go through some of the (extremely arcane) documentation on tldp.org whenever a dash of Linux enthusiasm crept in on an unsuspecting me (mostly at around midnight right before a university exam), and the site features a very nice selection of download options for each article. It would've been nice if other sites with a lot of textual content (spread across many pages listed as links on one page) came with that too.

Plus, trying to do something entirely unrelated to what you are supposed to be doing (or what you would be graduating in) is just plain fun.

Hence I started trying to create a script which would download all the web pages that the current page linked to, then those that the downloaded ones linked to, and so on (recursively, for a specified number of levels), and along the way test the waters of the very powerful stuff called regular expressions. I ended up writing it in Python as opposed to a BASH script because I was very rusty on BASH syntax and Python felt better.
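For the curious, the "follow links for a specified number of levels" idea can be sketched roughly like this. It is not the original script: it uses a queue rather than literal recursion, and `get_links` is a hypothetical stand-in for "download the page and pull out its links", here backed by a made-up dictionary so that nothing touches the network.

```python
from collections import deque

def crawl(start_url, get_links, max_depth):
    """Return every URL reachable from start_url within max_depth link hops.

    get_links is a hypothetical callable: given a URL, it returns the list of
    URLs that page links to (a real script would download and parse the page).
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # pages on the last level are kept, but their links are not followed
        for link in get_links(url):
            if link not in seen:  # never visit the same page twice
                seen.add(link)
                queue.append((link, depth + 1))
    return order

# A made-up five-page "site" stands in for the network.
fake_site = {
    "index.htm": ["a.htm", "b.htm"],
    "a.htm": ["c.htm", "index.htm"],
    "b.htm": [],
    "c.htm": ["d.htm"],
    "d.htm": [],
}
pages = crawl("index.htm", lambda url: fake_site.get(url, []), max_depth=2)
print(pages)  # ['index.htm', 'a.htm', 'b.htm', 'c.htm']
```

With `max_depth=2`, d.htm is three hops away and is left out; the `seen` set keeps the link from a.htm back to index.htm from causing a loop.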

Long story short - I did manage to make sense of a lot of regex usage, enough to feel somewhat confident of finding my way around the big bad world of characters galore. The downloader script did work like a charm for a lot of plain-HTML websites where you could swear that a link (you could be sure it was in blue) was put there by the good old "a href=blah/blah.htm" tag, which you could conveniently exploit with a simple regular expression similar to this:
[hH][rR][eE][Ff]="(.*\..?htm.?).*"
For many others, it would stumble. The web isn't what it used to be: faced with a bewildering array of Ruby on Rails, AJAX and other stuff that made me wish for the hundredth time that I had played with these things earlier (and also, while we're at it, graduated in Compsci), some tinkering was needed, and I was getting pretty bored of poring through page sources.
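One thing worth noting about the pattern above: its `.*` is greedy, so on a line holding several links it can swallow everything from the first href to the last. A slightly tighter variant might look like the sketch below. The pattern and the name `extract_links` are my own illustration, not what the script actually uses, and the sample HTML is made up.

```python
import re

# Assumed variant of the pattern in the post: case-insensitive, limited to a
# single double-quoted value, and matching .htm, .html, .shtml, .xhtml and so on.
HREF_RE = re.compile(r'href="([^"]*\.\w?html?)"', re.IGNORECASE)

def extract_links(page_source):
    """Return the href targets that look like HTML pages."""
    return HREF_RE.findall(page_source)

# Made-up sample line with three links, one of which is an image.
sample = '<a HREF="blah/blah.htm">one</a> <a href="x/y.html">two</a> <a href="pic.png">img</a>'
print(extract_links(sample))  # ['blah/blah.htm', 'x/y.html']
```

Using `[^"]*` instead of `.*` stops the match at the closing quote, so each link is captured on its own. It still only sees double-quoted hrefs, so pages that build their links with scripts will trip it up just the same.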

Around one o'clock is when another idea struck. Simple - Download XKCD comics. Now that was interesting and sure had incentives of its own. The good thing with XKCD is that the chap has been pretty much straightforward and consistent with his stuff, and viewing the source yielded all that I needed to know. The latest comic is displayed when one heads to xkcd.com, and the number of each comic is there in the permanent link to the comic (which is also part of each page). The path ahead was thus - open the latest page, extract the comic number. Any number running up to that one would serve as a valid suffix to "http://xkcd.com/" without the server screaming a 404 in your face, and "so a little wget magic is all that's necessary" to view any page. A little more regex within the source of each page and we were set.
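The plan above boils down to something like the following sketch. The regular expression and the sample fragment are assumptions about what the page source looks like (the real markup may differ), the function names are made up for illustration, and the actual fetching (wget, or `urllib.request.urlopen` in Python 3) is left out so the example runs offline.

```python
import re

BASE_URL = "http://xkcd.com/"  # from the post

# Assumed shape of the permanent link inside a page; the real markup may differ.
PERMALINK_RE = re.compile(r'xkcd\.com/(\d+)/?')

def comic_number(page_source):
    """Return the comic number found in the page's permanent link, or None."""
    match = PERMALINK_RE.search(page_source)
    return int(match.group(1)) if match else None

def page_urls(start, stop):
    """Build the page URL for every comic from start to stop, inclusive."""
    return [f"{BASE_URL}{n}/" for n in range(start, stop + 1)]

# Hypothetical fragment standing in for the downloaded front page.
sample_page = '<p>Permanent link to this comic: http://xkcd.com/1035/</p>'
latest = comic_number(sample_page)
print(latest)  # 1035
print(page_urls(latest - 2, latest))
```

A real run would download each of those URLs in turn and apply a second regular expression to find the image address and hover text; it would also do well to skip over any request that fails rather than stop.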

Download the script here
- Run the Python script: it tells you how many comics xkcd has reached currently.
- Enter the number of the comic to start downloading from, and the one to stop at.
- Twiddle your fingers and watch it download and save stuff into a sub-folder named xkcd_downloaded.
- Each comic is saved as an image and a text file with the same name.
- The txt file is the transcript (the text that pops up when you hover the mouse over the comic).
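The saving step in the list above could look roughly like this. It is a sketch only: `save_comic` is a made-up helper, the `.png` extension is hard-coded as an assumption (a real script would take it from the image URL), and placeholder bytes stand in for a downloaded image.

```python
import os
import tempfile

def save_comic(folder, name, image_bytes, transcript):
    """Write <name>.png and <name>.txt side by side inside folder."""
    os.makedirs(folder, exist_ok=True)  # create xkcd_downloaded on first use
    image_path = os.path.join(folder, name + ".png")  # extension is an assumption
    text_path = os.path.join(folder, name + ".txt")
    with open(image_path, "wb") as image_file:
        image_file.write(image_bytes)
    with open(text_path, "w", encoding="utf-8") as text_file:
        text_file.write(transcript)
    return image_path, text_path

# Demo in a temporary directory with placeholder data instead of a real download.
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "xkcd_downloaded")
    save_comic(folder, "example_comic", b"not a real image", "hover text")
    print(sorted(os.listdir(folder)))  # ['example_comic.png', 'example_comic.txt']
```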

So there. The entire xkcd comic set right inside a quiet unassuming folder.

There's one annoying bug in the code: If the comic transcript contains characters which are escaped by HTML (such as ampersand, single and double quotes, etc), in spite of my having taken care of them in the code, some transcript files still show up stuff like &#39; instead of a single quote. Suggestions on how to take care of this are welcome. Solved, thanks to spotting a very silly mistake.
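For anyone hitting the same problem today: one way to deal with escaped characters, assuming Python 3.4 or later, is to let the standard library's `html.unescape` do the work instead of replacing entities by hand. This is not necessarily how the script handles it; `clean_transcript` and the sample string are made up for illustration.

```python
import html

def clean_transcript(raw):
    """Turn HTML entities such as &#39; and &amp; back into plain characters."""
    return html.unescape(raw)

raw = "It&#39;s &quot;fine&quot; &amp; dandy"
print(clean_transcript(raw))  # It's "fine" & dandy
```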

Now the sad part. After dusting off, a casual Google search revealed that people had already written xkcd-downloader scripts. There are indeed quite a few of them online, but I am not sure if a lot of them do it as neatly as this :P

The good part was doing it for fun, learning a thing or two on the way.. and it happened :)


3 comments:

Akanksha Pandey said...

"The joy of trying to do something entirely unrelated to what you are supposed to be doing (or what you would be graduating in)".. Hah! This single-handedly explains why I would hardly ever get around to doing what you tried to do.

Though I'd have to say it was a real neat attempt.

Also, this makes me feel bad about not knowing Python. Sigh!

Vrijilesh Rai said...

Way too complicated!

Sriram said...

@akanksha: Haha you've lots of time before you graduate and sit idle and confused like I am doing right now, so it's fine :D

@vrij: Hey man, long time!