Thursday, November 17, 2005

TagClusters Take 2 - Minimum spanning sets

After my first experiment with clustering tags using a containment relation, I started thinking about pruning, that is removing tags from the hierarchy that were less useful. After thinking about possible utility functions for a while, I came up with a simple approach that requires reformulating the problem statement to: find the smallest set of tags that describes all of the tagged items.
When put this way the solution really just falls out of the problem, it seems a good greedy algorithm should be optimal (though I couldn't be bothered to prove it).

Algorithm
  1. Find all tags used and the frequency of their use.
  2. Add the most popular tag to the minimal set.
  3. Remove all the sites tagged with this tag, and repeat from 1, end when no sites are left.
Code & Results

Sadly the codebase I'm working with really needs to be taken aside and given a good refactoring. Hopefully I'll do it soon and add an update here. Otherwise you can take a look at the sketchy Demo Page and accompanying source code.

Although i haven't tried to formally prove it, intuitively and based on my experiments, it appears that calculating the minimum spanning set also builds a containment tree, so there is no need to combine the approaches.

For the next iteration of tag clusters I intend to extend the experiment to real folksonomic data (ie. an aggregate of all users tag sets).

0 comments: