When put this way the solution really just falls out of the problem, it seems a good greedy algorithm should be optimal (though I couldn't be bothered to prove it).
Algorithm
- Find all tags used and the frequency of their use.
- Add the most popular tag to the minimal set.
- Remove all the sites tagged with this tag, and repeat from 1, end when no sites are left.
Sadly the codebase I'm working with really needs to be taken aside and given a good refactoring. Hopefully I'll do it soon and add an update here. Otherwise you can take a look at the sketchy Demo Page and accompanying source code.
Although i haven't tried to formally prove it, intuitively and based on my experiments, it appears that calculating the minimum spanning set also builds a containment tree, so there is no need to combine the approaches.
For the next iteration of tag clusters I intend to extend the experiment to real folksonomic data (ie. an aggregate of all users tag sets).
0 comments:
Post a Comment