project: i’ve got 68 fairly structured html files (example). they contain a long, non-uniform header that needs to be junked, an <H1> title, a line with author(s), a date, some large chunks of text, and a footer that needs to be junked.
i want to write a quick program to loop through the files, parse them, and import them into a mysql table. this program will be one-time use only, but knowing how to write this kind of program will be useful in the near future. i feel like perl is the language of champions here, but i haven’t used perl in ages. java? php maybe?
i feel like i should time myself to see how long it will take me to figure this out in perl versus just doing a whole lotta cutting and pasting (4 chunks of information x 68 documents = repetitive stress injury).
update: so far i’ve spent one hour and i’ve figured out enough perl to loop through every file in the directory, spitting out every line in each file.
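for reference, a minimal sketch of that first step: walk a directory and spit out every line of every file. this is python rather than the perl itself, and the `iter_lines` name and the `*.html` pattern are assumptions made for illustration.

```python
import sys
from pathlib import Path


def iter_lines(directory, pattern="*.html"):
    """Yield (filename, line) for every line of every matching file.

    `pattern` is an assumption: it presumes the files end in .html.
    """
    for path in sorted(Path(directory).glob(pattern)):
        # errors="replace" keeps the loop going if a file has odd bytes in it
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield path.name, line.rstrip("\n")


if __name__ == "__main__":
    # directory comes from the command line, defaulting to the current one
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, line in iter_lines(target):
        print(f"{name}: {line}")
```

keeping the loop in a generator means the parsing step can later consume the same lines without changing how the files are read.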
final update: it took me another 3 hours to get a completely working solution. i had to figure out the perl DBI, which was actually the easiest part. mostly it just took time getting the regular expressions to do what i needed them to do. here’s the code. amazing, huh?
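a rough sketch of what the parse-and-insert step can look like, in python rather than the original perl. everything about the markup here is an assumption, since the real files aren't shown: the `class="author"`, `class="date"` and `class="footer"` markers are hypothetical, and so are the `articles` table and its columns. sqlite3 stands in for mysql so the sketch runs with no server; a mysql driver follows the same connect/execute shape but uses `%s` placeholders instead of `?`.

```python
import re
import sqlite3

# hypothetical markers: swap in whatever the real files actually use
TITLE_RE = re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)
AUTHOR_RE = re.compile(r'<p class="author">(.*?)</p>', re.IGNORECASE | re.DOTALL)
DATE_RE = re.compile(r'<p class="date">(.*?)</p>', re.IGNORECASE | re.DOTALL)
FOOTER_RE = re.compile(r'<div class="footer">', re.IGNORECASE)


def parse_article(html):
    """Return title/author/date/body as a dict, or None if a piece is missing."""
    title = TITLE_RE.search(html)
    if not title:
        return None
    # search only after the title, so the junk header can never match
    author = AUTHOR_RE.search(html, title.end())
    date = DATE_RE.search(html, title.end())
    if not (author and date):
        return None
    footer = FOOTER_RE.search(html, date.end())
    body_end = footer.start() if footer else len(html)
    return {
        "title": title.group(1).strip(),
        "author": author.group(1).strip(),
        "date": date.group(1).strip(),
        "body": html[date.end():body_end].strip(),
    }


def store(conn, article):
    """Insert one parsed article using a parameterized query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(title TEXT, author TEXT, pub_date TEXT, body TEXT)"
    )
    conn.execute(
        "INSERT INTO articles (title, author, pub_date, body) VALUES (?, ?, ?, ?)",
        (article["title"], article["author"], article["date"], article["body"]),
    )
    conn.commit()
```

returning `None` for a file that doesn't match makes it easy to list the stragglers and fix them by hand, which beats silently inserting half-empty rows.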