Saturday, September 19, 2009

Metadata passing in the night: Librarians, taxonomists, and the BHL project

In a recent post on iPhylo, Rod Page comments on the Biodiversity Heritage Library:

The more I look at BHL the more I think the resource is (a) wonderfully useful and (b) hampered by some dodgy metadata.

I'll gleefully embrace item (a) and greet (b) with a mournful shrug of the shoulders that says, "yes, I'm so sorry, we in the library field have failed our users in this regard."

Page goes on to document known metadata problems in the BHL, including this one:

Another issue is that of duplicates. Searching for publications on Rana grahamii, I found items 41040 and 45847. Although one item is treated as a book, and the other as a volume of the journal Records of the Indian Museum, these are the same thing.

You would think that after 200 years or so, librarians would know the difference between a monograph and a serial. Well, of course we do, but the problem is that too often, instead of adhering to standards that meet the common goal of access, libraries have cataloged materials to meet their local users' needs, or to provide access at varying degrees of granularity to meet local standards. This wasn't a problem when all metadata was local, but once we started to move into large-scale, collaborative metadata mashups (such as the BHL), all those individual aberrations from the standards (as well as the typos, non-standard rule application, etc.) led not exactly to the "train-wreck" that the Google Book Project is faced with, but maybe to something worse: a failure to serve the needs of a key user community of the BHL, the taxonomists.

In 2003, my colleagues Tom Garnett and Suzanne Pilsk (Smithsonian Libraries), Anna Weitzman (Smithsonian/Botany Department) and Chris Lyal (Natural History Museum, London) began work on the digitization of the Biologia Centrali-Americana. After working with Anna and Chris for a few months (through numerous meetings in a windowless conference room - Chris on speakerphone), it dawned on the library side of the group that for 150-plus years we'd been providing our users with great metadata; the only trouble was, it was nearly useless to them and the jobs they were trying to do. When we said "author" we meant the author of a bibliographic work; when they said "author" they meant the describer of a taxon (e.g. Homo sapiens L. - the L. is the author, Linnaeus). Library metadata didn't cover this; Library of Congress Subject Headings (LCSH) might be applied, but were far too broad or general (e.g. Frogs -- North America). We were breaking Ranganathan's Law #4 of Library Science: Save the Time of the User.

Of course, part of the problem was we really weren't talking to our users. At the same time that many of the great library thinkers were working (Cutter, Dewey, Ranganathan, Bowker, Poole, etc. - I'm stretching the timeline here, I know!), there were similar life science indexing projects. Charles Davies Sherborn was compiling the Index Animalium (an index to known animal species described 1758 to 1850). Sherborn lists described species as follows:

cucullatus Struthio, Linnaeus, Syst. Nat., ed. 10, 1758, 155.—[Didus ineptus, ed. 12.]

(Note: since this is a species index, the name is given species/genus, not the usual genus/species. FYI, this is the dodo, later reclassified as Raphus cucullatus.) Note how the bibliographic citation is done: Linnaeus, Syst. Nat., ed. 10, 1758. Could you find that in a library catalog? Unlikely. Taxonomists and librarians were (and are) using totally different rules to describe bibliographic things. Every taxonomist knows that Syst. Nat. is Systema Naturae, but library catalogs rarely capture all the common abbreviations used in taxonomic literature. (Of course taxonomists, being human - mostly at least - don't always follow their own rules assiduously; they make typos or just plain whack errors. Sherborn is often very inconsistent in his citations, e.g. sometimes referencing the publications of the United States Exploring Expedition by the individual volume authors and sometimes under Charles Wilkes, commander of the Expedition.)
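To make the gap concrete, here is a rough Python sketch of what it takes to turn a Sherborn-style entry into something a library catalog could match on. It is only an illustration: the regular expression is fitted to the single entry quoted above, the ABBREVIATIONS table and the names SherbornEntry and parse_entry are made up for this post, and real Index Animalium entries are far messier than this pattern allows.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup table: a real one would need thousands of entries,
# plus all the variant spellings taxonomists actually use.
ABBREVIATIONS = {"Syst. Nat.": "Systema Naturae"}

# Pattern fitted to one entry shape only:
#   species Genus, Author, Title, [ed. N,] Year, Page.
ENTRY = re.compile(
    r"^(?P<species>\S+) (?P<genus>\S+), "
    r"(?P<author>[^,]+), "
    r"(?P<title>.+?), "
    r"(?:ed\. (?P<edition>\d+), )?"
    r"(?P<year>\d{4}), "
    r"(?P<page>[^.]+)\."
)


@dataclass
class SherbornEntry:
    species: str
    genus: str
    author: str            # the taxon author, not a catalog "author"
    title: str             # abbreviation expanded where we know it
    edition: Optional[str]
    year: int
    page: str

    @property
    def binomial(self) -> str:
        # Flip Sherborn's species/genus order back to genus/species.
        return f"{self.genus} {self.species}"


def parse_entry(text: str) -> Optional[SherbornEntry]:
    """Return the parsed entry, or None if the text does not fit the pattern."""
    match = ENTRY.match(text)
    if match is None:
        return None
    raw_title = match.group("title")
    return SherbornEntry(
        species=match.group("species"),
        genus=match.group("genus"),
        author=match.group("author"),
        title=ABBREVIATIONS.get(raw_title, raw_title),
        edition=match.group("edition"),
        year=int(match.group("year")),
        page=match.group("page"),
    )


entry = parse_entry("cucullatus Struthio, Linnaeus, Syst. Nat., ed. 10, 1758, 155.")
print(entry.binomial, "|", entry.title, "|", entry.year)
```

Even in this toy version, the hard part is not the parsing but the lookup table: every abbreviation a taxonomist takes for granted has to be written down somewhere before a catalog search can use it.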

The BHL project has been a great experiment in bringing together librarians, informaticians, taxonomists, and computer scientists. It's great to be in yet another roomful (YAR) of librarians/scientists and hash out these problems. The really great thing is that now we're talking to each other, not going our own ways, tending our own gardens, but actually working together to solve these problems and build great projects.

