Sunday, August 31, 2008

Peter (Google) and Christine (the librarians)

I was very lucky to have my new wife with me at SciFoo for many reasons, not the least of which is that she is much better at socializing than I am and thus managed to introduce me to many people I would never normally have met.  One of those people was Christine Borgman, Professor & Presidential Chair in Information Studies at UCLA.  While I was struggling to explain my work on the Entity Describer project to her, she noticed Peter Norvig walk by and dragged him over to join our conversation.  I guess she must have known him from somewhere but I'm not sure where.  Anyway, I didn't realize this at the time, but Peter is head of research at Google.  Ahem... did I mention that SciFoo interactions could be somewhat intimidating?  So there I am, standing between two giants of modern information science trying to explain what it was I was doing there but mostly trying to get some insight into their respective thoughts on the organization of the world's information. It wasn't a long discussion, but here are the basics.
The main question I posed to them was whether, and how, semantic tagging (a la ED) is or might be useful.  On the surface, the answer from Peter was no and the answer from Christine was yes.  However, the truth of the matter is that they were really talking about supporting different functions for the end user - though this fundamental difference became a little lost during the conversation.  Google is principally focused on providing the best possible results, to the most people, given the least amount of information in the query - that is, keyword-based search of the entire Web.  Library science is typically much more concerned with giving people the capacity to make very specific requests using much more sophisticated queries that operate over much smaller collections of information (e.g. the Library of Congress).  The fact that there is some overlap in the information needs of the users of these two kinds of systems often invites combative comparison, but I think that, in reality, there is no need for combat because they simply differ too much in the functions they intend to provide.
My interpretation is that Google isn't really concerned with intentionally provided meta-data in the name of end-user, full-Web search because the scale they operate at renders any such indexing by one or even a number of parties almost laughably shallow - both in characterizing the nature of any particular item and in predicting its relevance to a query.  When you have literally millions of people passively voting on and indexing every item on the Web through their decisions to link to it or not, when you have very sophisticated algorithms for understanding the text in the pages generating and receiving those links, and when, to top it off, you record and process millions of people's behavior when faced with your search results, why should you care what some person or institution says the item is about?  The fact that Google (among other search engines) beat out the directory-based approach to finding information on the Web is a clear demonstration that automatic indexing and link-based relevance ranking do a better job than meta-data based classification - for the problem of Web-scale search.  Google clearly doesn't need human semantic indexers to succeed, though, as Peter said, they certainly use all of the information that exists.  If there happen to be good indexes online (as Connotea turned out to be for a fairly brief window), their algorithms will certainly find them and use them - if not, no worries, the algorithms will take advantage of the 'normal' data on the Web and do just fine, thank you.
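That "passive voting through links" idea is essentially the intuition behind PageRank.  Just to make it concrete, here is a toy power-iteration sketch over a made-up four-page link graph - an illustration of links-as-votes only, not anything like Google's actual ranking pipeline:

```python
# Toy PageRank by power iteration.  The link graph is hypothetical
# example data; this illustrates the "links as votes" intuition only.

links = {            # page -> list of pages it links to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}       # start with equal rank
    for _ in range(iterations):
        # every page keeps a small "teleport" share of rank...
        new = {p: (1.0 - damping) / n for p in pages}
        # ...and passes the rest, split evenly, to the pages it links to
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new[target] += share
        rank = new
    return rank

ranks = pagerank(links)
# "c" comes out on top: three of the four pages link to it.
print(sorted(ranks, key=ranks.get, reverse=True))
```

The point of the sketch is just that no page here declared what it is "about" - the ordering falls out of other pages' linking decisions.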
From the library-science professional perspective, this attitude is clearly annoying.  If human indexing isn't really necessary to find things, what is the point of a field that has devoted itself to creating effective ways for people to categorize things for retrieval?  For example, there is a lot of annoyance that the Google Books initiative seems to ignore most, if not all, of the meta-data already associated with the books that they are scanning and indexing.  This means that meta-data as basic as volume numbers is inaccessible for searching.  For the library professional trained both to search through and to construct careful, precise classification structures, the inability even to search for a specific volume of a book is infuriating - particularly knowing that it is well within Google's power to incorporate such abilities into their system.
So on the one side we have the perspective that there is still value in the careful, intentional use of meta-data in the search and retrieval process, while on the other side we are quite happy to let the intersection of algorithm and massive passive indexing do the work.  I guess, as is the usual answer, I'd suggest that both sides provide useful functions that are worth keeping and advancing.  A detailed classification system, either constructed intentionally through professional labor or semi-intentionally through the work of social taggers, provides functionality that is clearly different from what can be achieved by automatic indexing; however, it may not provide any help whatsoever in improving a full-Web-scale keyword-based search.  The essence of the power of intentional classification is the precision of the queries that it enables.  For example, if I want only version 3 of "The Devil's Rights and the Redemption" and that's it, or I want only those items that have been tagged with bioinformatics and to_read by Jaa, there is really no way (AFAIK) to accomplish this without the intentional recording and utilization of meta-data about those resources.
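To make that last example concrete, here is a minimal sketch of the kind of exact-match query that only intentional meta-data supports.  The records and the tagged_by helper are entirely hypothetical (this is not Connotea's or ED's actual API), but the shape of the query is the point:

```python
# Hypothetical tagged records, in the spirit of a social-tagging
# system like Connotea.  Names and data are made up for illustration.
records = [
    {"title": "Gene Ontology primer", "user": "Jaa",
     "tags": {"bioinformatics", "to_read"}},
    {"title": "Semantic Web overview", "user": "Jaa",
     "tags": {"semweb"}},
    {"title": "BLAST tutorial", "user": "someone_else",
     "tags": {"bioinformatics", "to_read"}},
]

def tagged_by(records, user, required_tags):
    """Return titles where `user` applied every tag in `required_tags`."""
    return [r["title"] for r in records
            if r["user"] == user and required_tags <= r["tags"]]

print(tagged_by(records, "Jaa", {"bioinformatics", "to_read"}))
# → ['Gene Ontology primer']
```

No amount of keyword search over the documents' full text can answer "which items did this particular person mark to_read" - that fact exists only in the deliberately recorded meta-data.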
So, though Peter and Google may have little direct use for ED and its semantic meta-data generating and consuming brethren emanating from the library and information sciences, there are still clearly meaningful applications of such work.  It just happens that providing effective search over the contents of the entire Web based on a string like 'Britney Spears' isn't really one of them.
I'm ok with that.