To 'extract, transform and load' or to federate

Lorcan 3 min read

One of the major questions for library systems is the role of metasearch or federation. I have written about this here (Metasearch: a boundary case) and here (Metasearch, Google and the rest).
The issue is that libraries have to manage a range of database resources whose legacy technical and business boundaries do not map well onto user preferences or behaviors. The approach has been to move away from presenting a fragmentary straggle of databases and toward bundling them in various ways in a metasearch application, sometimes in one big search, sometimes in smaller course or subject bundles. The issues here are well known, not least that libraries typically have limited control over the performance of the target databases.
As an alternative, a few libraries have explored consolidating locally loaded data. This can work very well, as it becomes easier to build additional services over a consolidated resource. However, this is too adventurous an undertaking for most libraries. Another approach is for a third party to consolidate, and this is what we have seen with Google Scholar, Scopus, WorldCat, and others.
More recently, recognizing the advantages of local consolidation, we have seen the emergence of a new class of library system that pulls together metadata from locally managed stores (e.g. digital repository, ILS, institutional repository, …) and offers an integrated search. Such a system may still have to work closely with a metasearch engine to integrate access to external databases. ILS vendors are moving in this direction, and through WorldCat Local, OCLC is also addressing this type of integration.
This is a discussion worth returning to, but that is not my purpose here. Rather, I wanted to point to an interesting treatment of similar issues from a different domain. Mike Stonebraker, database guru and contributor to the group blog The Database Column, has a post in which he contrasts two models of data integration: ETL (extract, transform and load) and federation. The focus is on enterprise systems. The ETL model will typically involve a centralized data warehouse and “for each operational system, they will employ some sort of ETL process to transform data instances into the global schema and then load them into the centralized warehouse”.
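To make the contrast concrete, here is a minimal sketch of the two models in Python. It is an illustration under stated assumptions, not Stonebraker's design: the two sources (`CATALOG`, `REPOSITORY`), their field names, the mapping functions and the tiny 'global schema' are all hypothetical. The ETL path transforms every record once, ahead of time, and searches the consolidated copy. The federated path queries each source in its own schema at request time and maps only the hits.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]

# Hypothetical source data, each in its own local schema (assumed for illustration).
CATALOG: List[Record] = [{"title": "Metasearch in practice", "author": "Smith, Jane"}]
REPOSITORY: List[Record] = [{"dc_title": "Beyond metasearch", "dc_creator": "Jane Smith"}]


def catalog_to_global(rec: Record) -> Record:
    """Map a catalog record into the (hypothetical) global schema."""
    return {"title": rec["title"], "creator": rec["author"], "source": "catalog"}


def repository_to_global(rec: Record) -> Record:
    """Map a repository record into the (hypothetical) global schema."""
    return {"title": rec["dc_title"], "creator": rec["dc_creator"], "source": "repository"}


# Each source: its records, the name of its native title field, and its mapping.
SOURCES: List[Tuple[List[Record], str, Callable[[Record], Record]]] = [
    (CATALOG, "title", catalog_to_global),
    (REPOSITORY, "dc_title", repository_to_global),
]


def etl_load() -> List[Record]:
    """ETL: transform every record into the global schema once, before any query."""
    return [to_global(rec) for records, _, to_global in SOURCES for rec in records]


def search_warehouse(warehouse: List[Record], term: str) -> List[Record]:
    """Search the consolidated copy; no source is contacted at query time."""
    return [rec for rec in warehouse if term.lower() in rec["title"].lower()]


def federated_search(term: str) -> List[Record]:
    """Federation: query each source in its native schema, then map only the hits."""
    hits: List[Record] = []
    for records, title_field, to_global in SOURCES:
        for rec in records:
            if term.lower() in rec[title_field].lower():
                hits.append(to_global(rec))
    return hits


warehouse = etl_load()
print(search_warehouse(warehouse, "metasearch"))
print(federated_search("metasearch"))
```

In this toy case both paths return the same records. The difference lies in where the work happens: the ETL path pays the transformation cost up front and can index the result, while the federated path depends on every source being available and responsive at query time.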
‘Extract, transform and load’ is a good characterization of what is involved in consolidation of library data, whether this is attempted locally or through third parties. One of the interesting questions is the sophistication of the ‘transform’. Think of author names, for example, or subjects, or other controlled data, and what would be involved to effectively merge data created within different regimes. What is the impact, for search or for faceted display, of limited or no transformation of these elements?
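As a rough illustration of that last question, the sketch below counts author facet values with and without a transform step. The sample names and the `normalize_author` rule are hypothetical and deliberately crude. Real reconciliation across cataloguing regimes would involve authority data and much more careful matching.

```python
from collections import Counter
from typing import List


def normalize_author(name: str) -> str:
    """Crude, illustrative rule: render a name as lowercase 'family, given'.

    Assumes a name is either 'Family, Given' or 'Given Family'.
    """
    name = " ".join(name.split())  # collapse stray whitespace
    if "," in name:
        family, given = [part.strip() for part in name.split(",", 1)]
    else:
        parts = name.split(" ")
        family, given = parts[-1], " ".join(parts[:-1])
    return f"{family}, {given}".strip(", ").lower()


# Hypothetical author strings as they might arrive from different sources.
authors: List[str] = ["Smith, Jane", "Jane Smith", "SMITH, JANE", "Doe, John"]

# No transform: every variant spelling becomes its own facet value.
raw_facets = Counter(authors)

# With a transform: variants collapse into a single facet value.
normalized_facets = Counter(normalize_author(a) for a in authors)

print(raw_facets)
print(normalized_facets)
```

Without the transform, the one author is split across three facet values, each with a count of one. With it, the three variants are counted together. The same rule would also wrongly merge two different people who share a name, which is part of why the sophistication of the 'transform' matters.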
Here are the headings Stonebraker uses for his discussion.

  • Data element “heat”: Hot data favors ETL
  • Indexing: Federation is harder to optimize
  • Resource management: Faster BI query responses for ETL shops
  • Complexity of the schema change: ETL approach performs less joins
  • Contention (concurrency control): Federation contention challenges
  • Timeliness: ETL approaches must deal with out-of-date data issues
  • Mapping: Federations can’t handle some transformations

BI is short for ‘business intelligence’. ‘Hot’ data is data that is accessed often.
Now, while it is clear that our environment is similar in many ways to the one discussed here, it would be interesting to do a similar analysis with our domain in mind to see where the differences lie. Of course, one issue is that most of the data under discussion here seems to be within institutional control.
Here is his conclusion:

In summary, virtually all enterprises use the ETL approach for data integration. The data federation market is, in contrast, quite small. The place where I see federations as most viable is when there are many, many data sources (e.g., more than 5,000 sources) and BI users utilize only a small number of them at any given time. In this extreme case, the average data element is accessed zero times before it is updated or deleted. In this instance, one is better off leaving the data where it originates. On the other — more common — hand, when most data elements get used several times, the ETL approach will continue to be preferred. [To ETL or federate … that is the question – The Database Column]
