Monday, November 21, 2005

Excursions in sed

I'm messing with IMDB's plain text data files (the plan is to do some triple-store experiments), and decided it's about time I brushed up my knowledge of shell commands. First up was a bit of sed hacking.
Sed is a "stream editor", which means nothing unless you dream about pipes. Essentially it is a simple programming language for processing a text stream. The only part of sed I understand is the substitution operator, s/{pattern}/{replacement}/{flags}, which I think is also available in many high-level languages.
The substitution operator is fairly easy to understand: if part of the input matches the pattern, that part is replaced with the replacement. The greatest magic of s/// is that parts of the pattern can be grouped using (), and the part of the input that matches each group can be referred to in the replacement using the back-references \1, \2, ... \9. This sed tutorial has a much more complete overview.
The file I wanted to process is movies.list, an alphabetical listing of all the movies in IMDB with their year of production. Records are separated by newlines and the fields within a record by tabs, e.g.
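As a quick illustration of grouping and back-references, here is a minimal sketch that swaps two words. It assumes GNU sed (for the -r flag); the input string and the variable name swapped are made up for the example.

```shell
# Capture two lowercase words and emit them in the opposite order.
# -r turns on extended regexps so ( ) and + need no backslashes (GNU sed).
swapped=$(echo 'hello world' | sed -r 's/([a-z]+) ([a-z]+)/\2 \1/')
echo "$swapped"
```

Here \1 holds whatever the first group matched and \2 the second, so the replacement simply puts them back in reverse.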

Adventure of the Wrong Santa Claus, The (1912) 1912

Note: The year appears twice in each record. For now I treat the first occurrence, the one in parentheses, as part of the title.
To parse this using sed and s///, I need a regexp that matches the whole line, and groups the relevant fields. I also use @ instead of / as a delimiter so I can use / as a normal character, giving the pattern:

@([^\t]*)([\t]*)([0-9]{4,4})$@

  • [^\t]* - means zero or more characters which aren't tabs
  • [\t]* - means zero or more tabs
  • [0-9]{4,4}$ - means exactly 4 digits at the end of the line.
To output XML, I simply use \1 to refer to the title and \3 to refer to the year, giving the replacement text
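To check what each group captures, something like the following sketch can be run against the sample record. It assumes GNU sed, which understands \t inside a bracket expression; the variable names line, title and year are only for illustration, and the tab is written explicitly with printf.

```shell
# Build the sample record with a real tab between the title and the year.
line=$(printf 'Adventure of the Wrong Santa Claus, The (1912)\t1912')

# Replace the whole match with just group 1 (the title) or group 3 (the year).
title=$(printf '%s\n' "$line" | sed -r 's@([^\t]*)(\t*)([0-9]{4})$@\1@')
year=$(printf '%s\n' "$line" | sed -r 's@([^\t]*)(\t*)([0-9]{4})$@\3@')

echo "title: $title"
echo "year:  $year"
```

Group 2 only soaks up the tabs between the two fields, which is why it never shows up in the output.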

@<film>\n\t<title>\1</title>\n\t<year>\3</year>\n</film>@

Combining it all into a sed command gives

sed -r 's@([^\t]*)(\t*)([0-9]{4})$@<film>\n\t<title>\1</title>\n\t<year>\3</year>\n</film>@g' movies.list
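Before running this over the whole file, it can be tried on a single record. The sketch below feeds the sample line through the same substitution. It assumes GNU sed, which expands \n and \t in the replacement; the variable name xml is just for the example.

```shell
# Pipe one tab-separated record through the substitution and keep the result.
xml=$(printf 'Adventure of the Wrong Santa Claus, The (1912)\t1912\n' |
  sed -r 's@([^\t]*)(\t*)([0-9]{4})$@<film>\n\t<title>\1</title>\n\t<year>\3</year>\n</film>@g')
echo "$xml"
```

The result should be a four-line film element, with the title and year elements each indented by one tab.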

It turns out my system isn't handling UTF-8 very well, so after consulting Google and this to-the-point document, I added iconv to the pipeline. I finally end up with:

iconv -f latin1 -t utf-8 movies.list | sed -r 's@([^\t]*)(\t*)([0-9]{4})$@<film>\n\t<title>\1</title>\n\t<year>\3</year>\n</film>@g' -
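To see what the iconv step actually does, here is a small sketch that converts a single Latin-1 byte and dumps the result in hex. It assumes an iconv that accepts the latin1 alias (as the command above does) plus the standard od and tr tools; the variable name bytes is made up for the example.

```shell
# Octal 351 is byte 0xE9, which is "é" in Latin-1.
# After conversion to UTF-8 the same character takes two bytes.
bytes=$(printf '\351' | iconv -f latin1 -t utf-8 | od -An -tx1 | tr -d ' \n')
echo "$bytes"
```

The single input byte comes out as c3a9, the two-byte UTF-8 encoding of the same character.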
