Tuesday, June 26, 2007

The main problem with LSIDs

As one of the authors of a paper that espoused LSIDs as a great idea and complained about their lack of adoption by the community, I feel obliged to come out and say that I've come to the conclusion that we should not be using them at all (for the time being).

The fundamental problem with LSIDs is the first two letters of the acronym. If they were simply The Identifier System, and thus not constrained from the beginning to operate in the comparatively small world of the life sciences, the proposal might have stood a chance of working. I think they were really just not ambitious enough. It really is a fantastic bit of work; this paper describing the need for and the implementation of the spec is one of the best works on data integration that I have come across. But the fact remains that, with a few exceptions, no one has adopted the spec because (IMHO) developers prefer to work with the most widely accepted standards (e.g. those from the W3C) so they don't end up having to re-implement everything when new standards replace the old ones. (There is also the issue of the need for registries, but I think that is secondary.)

So, until we get to the point where Sun provides a built-in Java class called LSID with methods like getMetaData and getData that I can call with the same ease as the HTTP get method, I think I'll wait with everyone else.
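To make that wish concrete, here is a minimal sketch (in Python rather than Java, for brevity) of the least such a built-in class would have to do: parse the identifier itself. The only thing taken from the spec is the URN layout, urn:lsid:authority:namespace:object with an optional trailing revision. The class name, field names, and the example.org identifier are hypothetical, and the getMetaData/getData resolution step is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LSID:
    """Hypothetical value class for an LSID; not part of any real library."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None

    @classmethod
    def parse(cls, text: str) -> "LSID":
        # Expected layout: urn:lsid:<authority>:<namespace>:<object>[:<revision>]
        parts = text.split(":")
        if len(parts) not in (5, 6):
            raise ValueError(f"not an LSID: {text!r}")
        # The "urn" and "lsid" prefixes are matched case-insensitively.
        if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
            raise ValueError(f"not an LSID: {text!r}")
        authority, namespace, object_id = parts[2:5]
        if not (authority and namespace and object_id):
            raise ValueError(f"LSID has an empty component: {text!r}")
        # An absent or empty sixth field means "no revision".
        revision = parts[5] if len(parts) == 6 and parts[5] else None
        return cls(authority, namespace, object_id, revision)

    def __str__(self) -> str:
        base = f"urn:lsid:{self.authority}:{self.namespace}:{self.object_id}"
        return f"{base}:{self.revision}" if self.revision else base


if __name__ == "__main__":
    # example.org is a placeholder authority, not a real LSID authority.
    print(LSID.parse("urn:lsid:example.org:proteins:P12345:2"))
```

Parsing is the easy part, of course. What the HTTP get method gives you for free is the resolution machinery behind the identifier, and that is exactly what is missing here.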


Mark Wilkinson said...

I agree with only a part of what you say, but think you aren't being ambitious enough. What we should be pushing for is that the LSID spec (or something very very similar to it) is re-branded and ADOPTED BY THE W3C!!

What worries me about NOT adopting a new identifier system as we move into the Semantic Web is that we start to hack and kludge our way to full functionality by adding novel behaviours on top of URLs, or start putting the "intelligence" of where to find data/metadata into redirects, PURL URLs, or other nasty, centralized, and IMO unsustainable architectures.

LSIDs solve a very distinct set of problems - separation of identity from location; separation of data from metadata; and multiple end-points/protocols for both data and metadata retrieval. As far as I can tell, NONE of the solutions that have been proposed in the discussions within the HCLS community have come close to addressing these three issues in anywhere near as elegant a way as the LSID spec does, and some of the proposals have been a bit worrisome (e.g. "just add a ? to the end of your URL if you want metadata"... where is THAT in the HTTP spec??). Even more odd, to me, is that all of this contorting and hand-wringing is only because people want to be able to stick a URI in their browser and see something at the end of it. Frankly, I just don't see the point of designing architectures around browsers!

One of the keynote talks at the WWW2007 meeting was from a Microsoft fellow (can't remember his name) who reminded us that, within the next 10 years, the interfaces into the Web will become ubiquitous in our lives. "The Browser" is going the way of the Dodo! Why are we so concerned about designing next-generation architectures around last-generation interfaces?

In the BioMoby project we use LSIDs extensively (and by the way, I have almost never found the need to plug one of them into my browser...). Here's one of the uses we have for them:

A Web Service is identified by an LSID. The Moby Central registry knows certain things about that service (its inputs, its outputs, its semantic type, its authorship), and through an hourly "ping" it knows if that service is visible/available or not. This information is available as getMetadata from Moby Central. In addition, however, the service provider knows things about their own service. They know what example inputs and outputs might be, they know system maintenance schedules, etc. All of these things can be provided as getMetadata from the service provider. As a consumer, I want to know about a service, so I go to the LSID authority and say "where can I get information about this service?", the authority says "you can go here (Moby) and here (provider)", I do so, and I can combine the knowledge both resources have about that service. THIS IS ALL PART OF THE LSID SPEC! No hacks, no kludges, no new consensus was required within the community.

I don't know about you, but as for me and my family, we are going to continue using LSIDs until someone comes up with a BETTER alternative!
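The BioMoby lookup described in this comment (ask the authority where metadata lives, visit each place, combine the answers) can be sketched in a few lines. This is an in-memory illustration only, under stated assumptions: the Authority class, the two metadata functions, the dictionary keys, and the example.org identifier are all hypothetical. Real metadata would come back over a network protocol, typically as RDF rather than a Python dict.

```python
from typing import Callable, Dict, List

# A metadata source takes an LSID string and returns what it knows about it.
MetadataSource = Callable[[str], Dict[str, str]]


def moby_central_metadata(lsid: str) -> Dict[str, str]:
    # Hypothetical stand-in for what the registry knows about a service.
    return {"input": "DNASequence", "output": "AminoAcidSequence", "available": "true"}


def provider_metadata(lsid: str) -> Dict[str, str]:
    # Hypothetical stand-in for what the service provider knows about it.
    return {"example_input": "ATGGCC", "maintenance": "Sundays 02:00 UTC"}


class Authority:
    """Hypothetical authority: maps an LSID to every place that has metadata."""

    def __init__(self) -> None:
        self._sources: Dict[str, List[MetadataSource]] = {}

    def register(self, lsid: str, source: MetadataSource) -> None:
        self._sources.setdefault(lsid, []).append(source)

    def get_metadata_sources(self, lsid: str) -> List[MetadataSource]:
        # Unknown identifiers simply have no sources.
        return list(self._sources.get(lsid, []))


def describe_service(authority: Authority, lsid: str) -> Dict[str, str]:
    """Ask the authority where to look, query each source, merge the results."""
    merged: Dict[str, str] = {}
    for source in authority.get_metadata_sources(lsid):
        merged.update(source(lsid))
    return merged


if __name__ == "__main__":
    authority = Authority()
    service = "urn:lsid:example.org:services:translate"  # placeholder LSID
    authority.register(service, moby_central_metadata)
    authority.register(service, provider_metadata)
    print(describe_service(authority, service))
```

The point of the design is that the consumer never hard-codes where the metadata lives; a third source could be registered with the authority later without any change to the client.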

Benjamin Good said...

Wow Mark, a longer comment than my post, you must care about this a lot :). I meant to convey basically the idea from your first paragraph: that we should focus on pushing for, but then working within, the most broadly accepted standards. (I did mention that I thought the spec designers weren't ambitious enough).

I'm not suggesting that LSIDs aren't, in fact, the greatest thing since sliced bread. What I am suggesting, however, is that for me, as the author of an RDF harvesting client program designed to work on the open Web and not in the closed worlds of the few projects that are bothering with LSIDs (like BioMoby), they are currently a (frustrating) dead end.

No more dead ends...

Morgan Langille said...

DOIs seem to be becoming more popular. I just noticed that PDB is now using them for each of their proteins. Is there anything that LSIDs do that DOIs can't?

Benjamin Good said...

Hi Morgan,

Didn't really want to get into this debate... just needed to vent some frustration stemming from problems I'm facing writing a semantic web crawler.

For a nice comparison of the LSID, DOI, Handle, and URL-based object identification strategies, see this entry on the Biodiversity Information Standards (TDWG) wiki.

Roderic Page said...

I find it interesting that the two digital objects you refer to in this post are both linked to by URLs, as opposed to giving their identifier. Both have DOIs (doi:10.1093/bib/bbl025 and doi:10.1093/bib/5.1.59, respectively).

DOIs are widely used, have considerable backing by the publishing industry, a commitment to persistence, industrial strength resolution (via Handles), and documented APIs for retrieving metadata via industry standards such as OpenURL.

I wonder whether the fact that even bioinformatics bloggers rarely use them tells us anything about how widely used LSIDs would become, given that they have nothing like the infrastructure that is behind DOIs.

Anonymous said...

With all due respect: DOIs are handles, which have a very nice protocol that could be used for semantic web purposes, but isn't. Bibliographic data isn't part of the handle metadata. Crossref, via a very restricted interface, gives you some bibliographic data, but only a dumbed-down subset - they don't want to compete with for-pay services that provide the whole story. And I'm not sure how the persistence story is any better with DOIs than with any other scheme under discussion.

The publishing industry doesn't always act in the public interest, and DOIs are tools of the publishing industry. I'm not sure about the handle system - it seems fairly neutral.