Monday, November 21, 2005

Excursions in sed

I'm messing with IMDB's plain text data files (the plan is to do some triple-store experiments), and decided it's about time I brushed up my knowledge of shell commands. First up was a bit of sed hacking.
Sed is a "stream editor" which means nothing unless you dream about pipes, essentially it is a simple programming language which processes a text stream. I only understand one part of sed which is the substitution operator, s/{pattern}/{replacement}/{operators}, which I think is also available in many high-level languages.
The subsitution operator is fairly easy to understand, if the input matches the pattern, it is replaced with the replacement. The greatest magic of s/// is that the pattern can be grouped using (), and the part of the input taht matches each group can be referred to in the replacement using the special operators \1,\2,..\N. This sed tutorial has a much more complete overview.
The file I wanted to process is movies.list an alphabetical listing of all the movies in IMDB with their year of production. The list separates each record with a newline, and the fields are separated by tabs eg.

Adventure of the Wrong Santa Claus, The (1912) 1912

Note: The date is present twice in the data but for now it is assumed to be part of the title.
To parse this using sed and s///, I need a regexp that matches the whole line, and groups the relevant fields. I also use @ instead of / as a delimiter so I can use / as a normal character, giving the pattern:

@([^\t]*)([\t]*)([0-9]{4,4})$@

  • [^\t]* - means some number of characters which aren't tabs
  • [\t]* - means some number of tabs
  • [0-9]{4,4}$ - means exactly 4 digits at the end of the line.
To output xml, I simply use \1 to refer to the title and \3 the year, giving me a replacement text

@<film>\n\t<title>\1<\title>\n\t<year>\3</year>\n</film>@

Combining it all into a sed command gives

sed -r 's@([^\t]*)(\t*)([0-9]{4})$@<film>\n\t<title>\1\</title>\n\t<year>\3</year>\n</film>@g' movies.list

It turns out my system isn't handling utf-8 very well, so after consulting google, and this to the point document. I add iconv to the pipeline so I finally end up with.

iconv -f latin1 -t utf-8 movies.list | sed -r 's@([^\t]*)(\t*)([0-9]{4})$@<film>\n\t<title>\1</title>\n\t<year>\3</year>\n</film>@g' -

Thursday, November 17, 2005

TagClusters Take 2 - Minimum spanning sets

After my first experiment with clustering tags using a containment relation, I started thinking about pruning, that is removing tags from the hierarchy that were less useful. After thinking about possible utility functions for a while, I came up with a simple approach that requires reformulating the problem statement to: find the smallest set of tags that describes all of the tagged items.
When put this way the solution really just falls out of the problem, it seems a good greedy algorithm should be optimal (though I couldn't be bothered to prove it).

Algorithm
  1. Find all tags used and the frequency of their use.
  2. Add the most popular tag to the minimal set.
  3. Remove all the sites tagged with this tag, and repeat from 1, end when no sites are left.
Code & Results

Sadly the codebase I'm working with really needs to be taken aside and given a good refactoring. Hopefully I'll do it soon and add an update here. Otherwise you can take a look at the sketchy Demo Page and accompanying source code.

Although i haven't tried to formally prove it, intuitively and based on my experiments, it appears that calculating the minimum spanning set also builds a containment tree, so there is no need to combine the approaches.

For the next iteration of tag clusters I intend to extend the experiment to real folksonomic data (ie. an aggregate of all users tag sets).

Sunday, November 06, 2005

SPARQL and Web 2

Henry Story has just posted describing SPARQL as query language for Web 2.0. I think all his usage examples are good, but I think he's missed the point slightly. Henry suggests that Web 2.0 business will expose SPARQL endpoints over web services. This isn't going to happen for several reasons
  1. Economics: There is a lot of value stored in the databases Henry mentions and most companies will not want competitors/users to have unrestricted access to this data. Current web service APIs are designed so the expected value increase from user derived software, is likely to exceed the loss of the value in the data.
  2. Performance: Even if the data is completely open, and the economics doesn't come into play, performance is a major issue. SPARQL queries are designed to be written by Semantic Web engineers, much as SQL queries are designed to be written by database engineers. As an example, consider the following query

    PREFIX foaf: ...
    PREFIX dc: ...
    SELECT ?book
    WHERE { ?book dc:creator ?who
    ?who foaf:name "J. K. Rowling"
    }


    This query (if the WHERE clause is evaluated top to bottom) is highly inefficient, it first searches for all triples with property dc:creator, then filters those such that the dc:creator's foaf:name is "J. K. Rowling". A much more efficient query reverses the patterns in the WHERE clause. I believe automated query rewriting is beyond state of the art at the moment and will continue to be for the foreseeable future, especially when you consider the technical challenge of throwing inferencing into the mix, and the social challenge of open access (eg. consider the query "SELECT ?s ?p ?o WHERE { ?s ?p ?o}").
that's not to say I don't see SPARQL becoming an integral part of Web 2.0. I envisage that the next generation of back-end storage products, will be based on triples, inferencing, and rules. SPARQL will be the query language used to interface with the backend. At the web tier, services will continue to be built on RESTful principles, however more services will expose data as RDF, and publish schemas based on RDFS, OWL etc to enhance their meaning. At the client side aggregators, smushers, inferencers and provers will be fundamental building blocks, and high-level special purpose APIs written to interface with them (eg. BlogEd's RDF Javabeans classes). I think there is room for SPARQL again at this level, but it's likely to be too general and complex for your average application programmer.

Friday, November 04, 2005

Greasemonkey supported by google reader

I just noticed over on the google reader blog, a post about how the google-reader uses compression on its method names, and therefore breaks greasemonkey user scripts. What is really good to see is that they care about this, and provide a stable interface via the html user interface controls. It's nice to imagine a future where all sites are this friendly to the remix culture.

TagClusters - Creating a tag hierarchy

As part of my quest to eventually become a full internet citizen, I'm trying to take more of my software development public. As part of that I intend to a series on mining the vast open metadata reserves that can now be found on the internet. The first few parts of this series are on making sense from tagged data.

Suppose one has the following tagged items

Item A, Tags: code snippets, php, dom, xml
Item B, Tags: code_snippets, javascript, dom, xml
Item B, Tags: code_snippets, javascript
Item B, Tags: code_snippets, php

In order to browse the items, it would be useful to organise the tags in a hierarchy. The simplest (and most commonly used) method is to have a root node for each of the tags, and as children of each the root nodes, all the tags that are used in common with the root tag. For example, with the above tagged items, a partially expanded tree might look like.

+ code_snippets
+---+ php
+----+ dom
+----+ xml
+----javascript
+----+ dom
+----+ xml
+---- ...
+ javascript
+---+ code_snippets
+---+ ...
+ php
+ dom
+ xml

A better method would realise that all the items are tagged "code_snippets", and all the items are tagged either javascript or php, and that all items tagged dom are also tagged xml.
In more formal terms, a better hierarchy for tags uses containment as a partial ordering system.
This is the basis for TagClusters a javascript class for creating tag hierarchies. The above example rendered using TagClusters can be found here.
I've also tried it on a much more complex example - a snapshot of my del.icio.us bookmarks. While an improvement can be seen over the naive approach, it is still less than optimal.