November 07, 2005
The Memetic Web, so-called
There was an announcement for the Memetic Web broadcast today on the SIG-IRList, a moderated search and information-retrieval mailing list I subscribe to. The basic idea seems to be a compromise between the semantic web and tagging (Flickr, technorati, del.icio.us, et alia). Briefly: define a “meme” for your webpages (what they’re about, things they relate to, and so forth); in other words, a summary. Associate a globally-unique string (a meme id) with this summary, and paste it into your webpages. They suggest potentially linking these to various existing classification schemes (like product codes and zip codes, and even presumably things like MedLine keywords), creating sort of a flat projection of existing classification schemes along with the more standard randomness associated with open tagging. Because meme-strings would be globally unique, they could be simply linked back to their definitions, and correct some flaws inherent in simple tag models, but still not be as intimidating as the full-on, hard-core Semantic Web.
What’s a meme and what’s a string?
An initial criticism of the project is the implication (made by calling the unique ids “memes”) that this adds actual meaning to the search process. What’s really going on is simply adding a particular string to a page, which is treated like any other string in indexing and retrieval. Co-occurrence does not equate to meaning, and co-occurrence is what the current proposal implicitly relies upon by leveraging Google et alia as the retrieval mechanism.1 In other words, the distinction needs to be maintained between information useful for search (strings) and information useful for people (meme-strings and their associated definitions).
A related problem is that the meme-string has no semantic or syntactic connection to the defining document in and of itself. It’s simply present or absent. It neither relates to any particular section or aspect of the document, nor does it provide any indication of the relevance of its connection. This is a loss compared to more-standard keyword tags, which can (at least in theory) be mapped to related words in the text of the page they’re applied to.
This isn’t to say that containing a meme-string won’t be a decent approximation of meaning for a document, but it’s fairly questionable whether or not this will be any better on average than the words the page already contains (which is, after all, how people currently find things fairly successfully), or that it would be better than an open tagging system.
Whither aboutness?
This brings us along to the questions about the origins of the meme-strings in pages. A post on memetic.org makes the following note:
The microformat rel="tag" attribute added to a hyperlink provides metadata that the page, or blog post, is about whatever is described on the page linked to. (The example used is a technorati page.) Thus it is comparable to our memelink, which points to the meme’s aboutness page on the memography wiki.
This isn’t entirely true. Well, the claim that the two are comparable is true enough, in that one can compare them: memelinks are in a couple of senses the opposite of rel="tag". But the implication that the two methods perform comparable tasks strikes me as entirely incorrect.
We’ve always been able to assert things about our own pages simply by adding words to them, whether in <meta> tags or in the body text (as was done for the memography proof-of-concept test). Linking these keywords back to an authoritative site doesn’t seem to add much valuable information to a page in addition to simply having the meme-string present (both make identical assertions about the page).
In contrast, rel="tag" allows us to make aboutness assertions about remote pages, which makes it possible to add information to them which would otherwise not exist. The direction of the assertion is with the direction of the link, opposite to the orientation of the memelinks. Similarly, the authority to apply metadata with rel="tag" is opposite to that of memelinks; rel="tag" allows anyone to apply the data, while memelinks limit it to page authors.
When considering the authority to apply the metadata, meme-strings are equivalent to <meta> tag keywords, googlebombs, and other assorted tom-foolery.
“Perfect” precision and recall?
The email to the SIG-IRList which prompted this little article opened thusly:
I would like to briefly describe a new technique for finding information that has the potential for near perfect precision and recall.
Which is certainly a bold claim. It’s backed up by a small proof-of-concept test as well, claiming 100% precision and recall2 on a small test. To sum up, they created a meme-string, stuck it in three regularly-indexed web pages, waited a few days, and used Google to search for the meme-string.
This is obviously a fairly small test, and it would be overly-pedantic of me to criticise it simply on the basis of the results (and a likely decrease in precision and recall as the system is adopted has already been acknowledged), but the claim that it demonstrates perfect precision and recall raises some interesting questions about IR evaluation in general.
Key to both precision and recall measures is the concept of “relevance”; though the memographers never explicitly state what their criteria for the relevance judgement are, it appears to be “contains the meme-string ‘MEMOZIP-02138’.” This is severely flawed: relevance is usually defined in human terms as answering an information need, rather than simply containing the terms used in the search. It’s a fairly trivial task for a search system to return all the documents containing a particular meme-string, so the somewhat circular definition of relevance reduces the test to seeing how quickly Google can index the pages.
The test search string is MEMOZIP-02138, which corresponds to “a meme for the area in Cambridge, Massachusetts around Harvard University.”3 The 100% recall implies that the only three relevant results for this description are the initial three test websites: CMS Review, CMS Wiki, and skyBuilders.
So what happens if we assume a reasonable information need derived from the meme and apply a different relevance judgement? In other words, let’s look beyond the string and define our information need from the meaning inherent to the meme. A reasonable (and very loose) test might be “has an address in zip 02138”. Examples of documents incorrectly left out under this criterion might be the pages of Harvard University’s various schools and faculties, which all share the same zip code. A supplemental search could probably find any number of Harvard students and faculty who have webpages associated with that same zip code, all of which could be considered as relevant to the information need (and meme) as the test websites.
Under these conditions, precision remains 100% (since the query results don’t change, and the relevance of the returned documents doesn’t change), but recall will drop to at most 20% and possibly further. There are 12 separate addresses listed on the Harvard FAQ, each of which has a page of its own, so 3 relevant documents returned / 15 relevant documents in total = 20%. This score would drop further if we were to include any students, faculty, or cafes that might surround the university.4
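The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the document ids are made-up placeholders (not real URLs), and it assumes, as the calculation above does, that the three test sites still count as relevant under the zip-code judgement.

```python
# Hypothetical document ids standing in for the pages discussed above.
test_sites = {"cms-review", "cms-wiki", "skybuilders"}
harvard_pages = {f"harvard-address-{i}" for i in range(1, 13)}  # 12 FAQ addresses

def precision(returned, relevant):
    """Share of returned documents that are relevant."""
    return len(returned & relevant) / len(returned)

def recall(returned, relevant):
    """Share of relevant documents that were returned."""
    return len(returned & relevant) / len(relevant)

# A search for the meme-string returns only the pages that contain it.
returned = set(test_sites)

# Judgement 1: "contains the meme-string" (circular by construction).
relevant_by_string = set(test_sites)

# Judgement 2: "has an address in zip 02138" (assumed to include the test sites).
relevant_by_zip = test_sites | harvard_pages

for label, relevant in [("string", relevant_by_string), ("zip", relevant_by_zip)]:
    print(label, precision(returned, relevant), recall(returned, relevant))
```

The returned set never changes between the two runs; only the definition of “relevant” does, which is the whole point.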
If we want to be stricter about the relevance judgement, then it’s questionable whether or not the test sites would even be counted as relevant results, since their physical location isn’t particularly relevant to their content. This is an easily-fixed nit-pick for the test, but again raises some problematic questions for the future of the scheme: how will it be possible to judge how closely a page is related to a particular meme-definition?
So, the claim of “perfect” precision and recall isn’t well-founded. While this strikes me as a particularly egregious example of fudging the question to make your answers look good, a certain amount of hyperbole5 is to be expected in announcements about new tag-friendly technologies on the web, and memography is far from alone in having problems defining suitable relevance criteria. The problem is one that’s endemic to search and information retrieval - at least since it moved beyond the “match these keywords” days - and it’s a problem that will persist, since relevance is the sloppiest, least-definable, and most human element in the equation. It also happens to be the most important.
This is the consequence of the distinction mentioned above: the underlying search is unable to map from the (search-useful) string to the (human-useful) meme. This represents a fundamental weakness in the system: when relevance is defined relative to an actual information need rather than the presence of a particular string, recall will always suffer due to non-inclusion of the meme-strings in relevant documents. This is compounded by the other facet of the system previously discussed: the system relies on page authors to insert the meme-strings.
The bugbear of namespace collisions, and the Humpty-Dumpty meaning effect.
There is a fair amount of time given to namespace collisions in the memography system, which seems misplaced. Simple tags avoid the namespace collision in much the same way that language does: as words are disambiguated by the presence of other words, so tags are by other tags. Take “mouse” as a brief example. Is it the furry one, or the one with buttons? Does it occur with “computer” and “Logitech”, or possibly “mammal” and “cancer”? There seems to be very little net gain in the formalised naming scheme in terms of reducing namespace collisions compared to the effort of initially creating the meme-string, or finding the appropriate one to search on.
One of the arguable benefits of the memography system is that each meme-string has an authoritative (insofar as a Wiki can ever be authoritative) definition. This is, again, a bit of a false benefit. A meme-string on its own, as mentioned above, bears no syntactic or semantic relation to the document it’s associated with, and cannot easily be reconciled with any other information associated with the page; its entire meaning is defined in the Wiki. On the other hand, consider a set of tags applied via rel="tag" links: they can quite likely be correlated with words in both the linking and linked pages, and derive their meaning from the context. To paraphrase the titular egg, “when I use a tag it means precisely what I want it to mean, no more, no less.” So while the meaning of the meme-string is much more dictionary-like, it’s somewhat less useful than the contextual meanings of the rel="tag" set.
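As a rough illustration of disambiguation by co-occurring tags, here is a minimal Python sketch. The sense profiles are hypothetical: “computer”, “Logitech”, “mammal” and “cancer” come from the example above, the other words are my own filler, and no real tagging service is claimed to work this way.

```python
# Hypothetical sense profiles: tags assumed to co-occur with each sense of "mouse".
SENSES = {
    "mouse (device)": {"computer", "logitech", "usb", "keyboard"},
    "mouse (animal)": {"mammal", "cancer", "rodent", "lab"},
}

def guess_sense(tags, senses=SENSES):
    """Pick the sense whose profile overlaps most with the other tags.

    Returns None when nothing overlaps or when two senses tie.
    """
    others = {t.lower() for t in tags} - {"mouse"}
    scores = {sense: len(others & profile) for sense, profile in senses.items()}
    best = max(scores.values())
    winners = [sense for sense, n in scores.items() if n == best]
    return winners[0] if best > 0 and len(winners) == 1 else None

print(guess_sense(["mouse", "computer", "Logitech"]))
print(guess_sense(["mouse", "mammal", "cancer"]))
print(guess_sense(["mouse"]))
```

A lone “mouse” tag stays ambiguous (the function returns None), which is no worse than the situation a searcher faces before they know which meme-string to look for.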
Final Thoughts, Finally.
In the negative side of my mind, I don’t see much of an improvement on simple tags for the web at large in this system, since most of the problems it purports to solve either aren’t really problems (namespace collisions), or are superseded by the problems the system introduces (mostly that of appropriate selection). For smaller systems, it’s possible that this system might be of some benefit, but likely not more than tags with some moderation to correct misspellings.
In the positive side of my mind, if this system was widely adopted, and the tags remained relatively spam-free and accurately applied (none of which strike me as a safe bet), then there’s potential for some interesting things to happen. Oddly, this sounds a lot like what people say about the Semantic Web. Admittedly, this system is simpler than the Semantic Web, but is it simpler enough to gain popular support?
Finally, in the cynical side of my mind, I see a great lumping-together of buzzwords (tag, meme, wiki), with some hyperbolic and ill-founded claims about performance, and I get the creeping feeling that it will either be nought but sound and fury, or someone will become a millionaire because of it.
1. I realise that this isn’t necessarily the case, and that, as always, search technology will evolve to adjust to the information associated with webpages. Whether this will be to take advantage of it (as PageRank does with links), or to ignore it (for instance <meta name="keywords" content="spam, spam, spam, spammity, spam">), remains to be seen. For now, I’ll just deal with the memography idea as it’s stated: use today’s search engines as they already work.
2. Quick explanation of precision and recall.
- Precision is defined as the ratio of relevant documents returned to all documents returned for a particular search.
- Recall is defined as the ratio of relevant documents returned for a particular query to the total number of relevant documents in the search space.
So, if you search for ‘effects of marshmallow ingestion on siberian hamsters’ and get 100 pages, with 10 relevant pages, and you happen to know there are 20 relevant pages out there, you have recall of 50% (10/20) and precision of 10% (10/100). To get 100% in both measures requires that you return all the relevant pages, and nothing else.
3. As an interesting exercise, try out this excellent zip code visualisation map to get an idea of what other parts of the USA might also be associated to one degree or another with zip code 02138. Any website in those areas could be considered relevant to the query to one degree or another.
4. It’s obvious that in the absence of perfect precision and recall, there needs to be some sort of balance struck between precision and recall in evaluation. In reality this could vary not only for different tasks, but even for different queries (I might want higher recall for background research, but might want higher precision looking for a specific quotation). The F-measure is generally considered to be a good measure that takes this desired balance into account, if not ideally for every situation, then at least consistently. It’s defined as:
F = (2*R*P) / (R+P)
where R is recall and P is precision.
5. Given the level of hype that has persisted on the web since about 1994, maybe a neologism is called for: ultrabole, perhaps? überbole? muchobole? I particularly like the adjectival form “muchobolical” and “überbolism” which sounds like a political philosophy where you claim everything is great, even though it isn’t.
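For the curious, the F-measure formula above can be applied to the two sets of numbers used in this article. This is a minimal Python sketch; the function name is mine, and returning 0.0 when both inputs are zero is an assumed convention rather than part of the definition given here.

```python
def f_measure(recall, precision):
    """Balanced F-measure: F = (2*R*P) / (R+P)."""
    if recall + precision == 0:
        return 0.0  # assumed convention; the formula is undefined here
    return (2 * recall * precision) / (recall + precision)

# Hamster example: recall 10/20, precision 10/100.
print(round(f_measure(10 / 20, 10 / 100), 3))

# Zip-code relevance judgement: recall 3/15, precision 3/3.
print(round(f_measure(3 / 15, 3 / 3), 3))
```

The hamster search scores roughly 0.167, and the memography test under the zip-code judgement roughly 0.333. Because the measure punishes whichever of the two values is lower, perfect precision can’t paper over 20% recall.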
Posted by matt at November 7, 2005 05:53 PM