I think it is useful to think of four sources of descriptive metadata in libraries. These are not mutually exclusive, and one of the interesting questions we have to address is how they will be mobilized effectively together.
I don’t have good names for these. How about: professional, contributed, programmatically promoted, and intentional?
The curatorial professions have made major investments in knowledge organization, through the development and application of cataloging rules, controlled vocabularies, authorities, gazetteers, and so on. One of our major challenges is releasing the value that has been created through those approaches in web environments. There is much to think about here, and many folks are thinking about it. Currently, these approaches do not tend to work well across silos; they are not made available as web resources themselves, so they cannot be part of the connected fabric of the web; they work with the other approaches I mention only in particular projects or services; their ‘relating’ power is underused; and higher-level services based on data mining or statistical analysis are limited. These issues are being addressed, but we are some way from routine systemwide application. I believe that these approaches will continue, within a reconfigured system, and we need to make that data work harder. My personal view is that the curatorial professions need to invest more in the shared production of resources which identify and describe authors, subjects, places, time periods, and works.
A major phenomenon of recent years has been the emergence of many sites which invite, aggregate and mine data contributed by users, and mobilize that data to rank, recommend and relate resources. These include, for example, Flickr, LibraryThing, and Connotea. These services have a different focus, and create real value in the way that they organize resources. They also have value in that they reveal relations between people. Libraries have begun to experiment with these approaches, but individual libraries may not have the scale to iron out local or personal idiosyncrasy or emphasis. This is another area which lends itself to shared attention. There are real advantages to be gained. So, for example, as we digitize photographic and other community collections, we will want to mobilize knowledge about those collections that does not exist within the library. Or, if you think about a service like WorldCat Identities, at some stage we will want to allow those ‘identities’ themselves to comment, augment, and amend. What this means is that we will have to get rather more sophisticated about managing assertions about resources from different sources.
We are handling more digital materials, where it is possible to programmatically identify and promote metadata from resources themselves or from groups of resources. We will also do more to mine collections, including collections of metadata, to discern patterns and relations. We are increasingly familiar with clustering, entity identification, automatic classification and other approaches. Look at the home page for books that Google is creating to see a resource created from mining Scholar, Google Book Search, and big Google to deliver a range of related materials.
I have used the term ‘intentional’ to refer to the data that we are collecting about use and usage. PageRank is based on aggregate linking choices. Amazon recommendations are based on aggregate purchase choices. We use holdings data, which aggregate the selection choices of libraries, in ranking algorithms. This type of data has emerged as a central factor in the major web presences as they seek to provide useful paths through massive amounts of data.
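To make the holdings example concrete, here is a minimal sketch of ranking by aggregate selection choices. It is only an illustration, not a description of how any production ranking works: the function name `rank_by_holdings`, the `(library, work)` pair format, and the sample identifiers are all assumptions made for this example, and a real ranking would combine holdings with other signals.

```python
from collections import Counter


def rank_by_holdings(holdings):
    """Rank works by how many distinct libraries hold them.

    `holdings` is an iterable of (library_id, work_id) pairs. This input
    shape is an assumption for the sketch, not a real data format.
    """
    # Deduplicate first so a library that records the same work twice
    # still counts as one selection choice.
    counts = Counter(work for _library, work in set(holdings))
    # Most widely held first; ties broken by work id for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data.
holdings = [
    ("lib-a", "work-1"),
    ("lib-b", "work-1"),
    ("lib-c", "work-1"),
    ("lib-a", "work-2"),
    ("lib-b", "work-2"),
    ("lib-a", "work-3"),
    ("lib-a", "work-3"),  # duplicate record, counted once
]

ranking = rank_by_holdings(holdings)
```

The point of the sketch is that no individual library states a preference; the ordering emerges from many independent selection decisions, in the same way that link and purchase choices do elsewhere.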
To repeat, these approaches are not mutually exclusive and will increasingly be deployed alongside each other. For example, authority lists may support programmatic identification of personal or place names in large text resources. The shared interests revealed in social networking applications may be abstracted into a form of intentional data to drive recommendations or ‘related work’ services. Patterns of association and interaction will develop between tags and subject headings. And so on.
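The first of these combinations, an authority list supporting programmatic identification of names in text, can be sketched very simply. This is a toy under stated assumptions: the `AUTHORITY` mapping, its headings, and the function `find_names` are made up for illustration and are not drawn from any real authority file or service. It also ignores disambiguation (which ‘Dublin’?), which is where much of the real work lies.

```python
import re

# Hypothetical authority list mapping variant forms to an authorized
# heading. The entries are illustrative only.
AUTHORITY = {
    "mark twain": "Twain, Mark, 1835-1910",
    "samuel clemens": "Twain, Mark, 1835-1910",
    "dublin": "Dublin (Ireland)",
}


def find_names(text, authority):
    """Return (offset, matched text, authorized heading) for each hit."""
    found = []
    for variant, heading in authority.items():
        # Whole-word, case-insensitive match on the variant form.
        pattern = r"\b" + re.escape(variant) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((match.start(), match.group(0), heading))
    # Sort by position so results follow the order of the text.
    return sorted(found)


text = "Samuel Clemens, better known as Mark Twain, lectured in Dublin."
matches = find_names(text, AUTHORITY)
```

Here the professionally created data (variant forms gathered under one heading) does the ‘relating’ work: two different strings in the text resolve to the same identity.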
Much of our discussion pits these approaches against each other. This seems like the wrong approach. Clearly there will always be choices about where one invests effort, especially as the network continues to reconfigure what we do, but the starting point should be how we create better services and what approaches support that, and not a ‘techeological’ position around one or other approach which confuses ideology and technology.