Wednesday, November 10, 2010

Sometimes less is more (when it comes to data)


2008-11-05-dscn6431
Originally uploaded by martin_kalfatovic
In a project I work on (the Biodiversity Heritage Library), we always say “BHL is important because it’s a complete (planned, at least) collection of a type of data (biodiversity literature)”. Generally speaking, we don’t have data to support this assumption, but a firm called Infochimps is designing metrics and data analysis that can quantify this assumption. Here’s an interesting post by Eric Hellman (on his "Go to Hellman" blog, which you should all read, by the way) about scaling data value. Eric has a longer, fuller discussion, but the key point in relation to BHL is the following:
Kromer has noticed that the price (or perhaps cost) of a partial data set follows a non-monotonic curve (see graphic). Small amounts of data are essentially free, but a peak value is reached when portions of the data set are extracted from the full data set. If we were discussing book metadata, for example, peak value might accrue for a set of the 100,000 top selling books.
There's much less value, according to Kromer, in having a large incomplete chunk of a data set. Data for 10,000,000 books, for example, would have less value than the 100,000 book data set, because it's not complete. Complete data sets become extremely expensive because of the logistics involved, and because of the value of having the complete set.
In the BHL context, substitute “Google” for the 10,000,000-book set and “BHL” for the 100,000-book set. The BHL data set, acquired at a higher unit cost than the Google data set, is of more “value” because of its coherence: operations on a small, coherent data set will return greater value than operations on a large, incoherent one. So the current ~85,000 BHL volumes online could be of more value than the entire Google Books set.
