programmers against screen scraping

i made a new sticker.

it’s a humorous reference to the several hours i spent last night debugging ravingmadness’s haloscan screen scraping code (to which i have previously referred people in my moving from blogger to wordpress post) in order to help an individual import her haloscan comments into wordpress. except it turns out ravingmadness’s code is somewhat out of sync with the current haloscan template in some minor ways. so it took a while.

however, it appears that haloscan has fixed their caif comment export feature so that it no longer strips any html elements from the comment body, which it used to. i’m thinking it might be time to revisit the code i started writing to read the caif file directly. would be nice to get import-haloscan.php included with the standard wordpress distribution.

4 Comments

Brian

Jun 14, 2005 5:48pm

So I read Raving Madness’s post and I don’t see the phrase “screen scraping” in it, so where did you get it for your post? Typically the term is used for writing a program that reads its information from a virtual screen provided by some other application. For example, writing a program to access a mainframe database by reading the fields of an old 24×80 screen view of some original database application.

Brian

Jun 14, 2005 5:50pm

Also, for the uninformed, I being one of them: CAIF stands for Comment Archive and Interchange Format.

justin

Jun 14, 2005 7:21pm

wikipedia actually has an entry on screen scraping (who knew?) which acknowledges the origin of the term as you described but goes on to say:

“more recently, it often refers to parsing the HTML in generated web pages with programs designed to find particular patterns or parts of content.”

the term gained popularity with the advent of weblogs that did not have feeds. so people and some companies wrote screen scrapers that would attempt to extract the relevant content and reformat it as rss. ravingmadness’s import-haloscan code uses some php array functions to extract the comment code from html source. problem is, the html source has changed.
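to make the fragility concrete, here is a minimal sketch in python. it is not ravingmadness’s actual php, and the markup and class names are made up for illustration rather than taken from the real haloscan template. it grabs whatever sits between two marker strings, which is roughly how this style of scraper works:

```python
# hypothetical markup: these class names are invented for illustration,
# they are not copied from the real haloscan template.
OLD_PAGE = '<div class="comment">first!</div><div class="comment">nice <b>post</b></div>'
NEW_PAGE = '<div class="hs-comment">first!</div><div class="hs-comment">nice <b>post</b></div>'


def scrape_comments(html, open_marker='<div class="comment">', close_marker='</div>'):
    """Return every chunk of text found between the two marker strings."""
    comments = []
    pos = 0
    while True:
        # look for the next opening marker, starting where we left off
        start = html.find(open_marker, pos)
        if start == -1:
            break
        start += len(open_marker)
        # the comment body runs up to the next closing marker
        end = html.find(close_marker, start)
        if end == -1:
            break
        comments.append(html[start:end])
        pos = end + len(close_marker)
    return comments


old_result = scrape_comments(OLD_PAGE)  # template the scraper was written for
new_result = scrape_comments(NEW_PAGE)  # same comments, class name renamed

print(old_result)
print(new_result)
```

against the old markup this finds both comments. against the new markup it returns an empty list and raises no error, so a one-word template change makes the import quietly come up empty. that is the kind of thing that takes a few hours to track down.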

and if you hover over the first mention of caif, it’s enclosed in acronym tags. in firefox it should have a dotted underline.

justin

Jun 28, 2005 11:00am

on the fragility of screenscraping

