programmers against screen scraping

i made a new sticker.

it’s a humorous reference to the several hours i spent last night debugging ravingmadness’s haloscan screen scraping code (to which i have previously referred people in my moving from blogger to wordpress post) in order to help an individual import her haloscan comments into wordpress. except it turns out ravingmadness’s code is somewhat out of sync with the current haloscan template in some minor ways. so it took a while.

however, it appears that haloscan has fixed their caif comment export feature so that it no longer strips any html elements from the comment body, which it used to. i’m thinking it might be time to revisit the code i started writing to read the caif file directly. would be nice to get import-haloscan.php included with the standard wordpress distribution.



So I read Raving Madness’s post and I don’t see the phrase “screen scraping” in it, so where did you get it for your post? Typically the term refers to writing a program that reads its information from a virtual screen provided by some other application. For example, writing a program that accesses a mainframe database by reading the fields of an old 24×80 screen view of the original database application.


Also, for the uninformed (I being one of them), CAIF stands for Comment Archive and Interchange Format.

wikipedia actually has an entry on screen scraping (who knew?) which acknowledges the origin of the term as you described but goes on to say:

“more recently, it often refers to parsing the HTML in generated web pages with programs designed to find particular patterns or parts of content.”

the term gained popularity with the advent of weblogs that did not have feeds. so people and some companies wrote screen scrapers that would attempt to extract the relevant content and reformat it as rss. ravingmadness’s import-haloscan code uses some php array functions to extract the comment code from html source. problem is, the html source has changed.
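to make the fragility concrete, here is a minimal sketch of an html screen scraper. it is not ravingmadness’s php code; it is a small python illustration, and the tag and class names (`div`, `comment`, `comment-body`) are made up for the example rather than taken from haloscan’s real template. the scraper collects the text of every element it has been told to look for, and quietly returns nothing once the markup changes.

```python
from html.parser import HTMLParser


class CommentScraper(HTMLParser):
    """Collect the text inside every <div class="comment"> element.

    The tag and class name are hypothetical, not haloscan's real markup.
    """

    def __init__(self, tag="div", css_class="comment"):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0        # > 0 while we are inside a matching element
        self.comments = []    # one string per comment found

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Already inside a comment: track nested elements of the same tag.
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.depth = 1
            self.comments.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that appears inside a matching element.
        if self.depth:
            self.comments[-1] += data


def scrape_comments(html):
    """Return the stripped text of each comment found in the html string."""
    scraper = CommentScraper()
    scraper.feed(html)
    scraper.close()
    return [c.strip() for c in scraper.comments if c.strip()]


# The markup the scraper was written against (hypothetical).
old_template = (
    '<div class="comment">first!</div>'
    '<div class="comment">nice <b>post</b></div>'
)
# The same comments after a template change (also hypothetical).
new_template = (
    '<p class="comment-body">first!</p>'
    '<p class="comment-body">nice <b>post</b></p>'
)

print(scrape_comments(old_template))  # ['first!', 'nice post']
print(scrape_comments(new_template))  # []
```

the second call does not raise an error; it just finds no comments. that is what makes this kind of breakage tedious to debug, and why reading a proper export format is more appealing than scraping rendered pages.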

and if you hover over the first mention of caif, it’s enclosed in acronym tags. in firefox it should have a dotted underline.

