This fascinating book demonstrates how you can build web applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in this book, you can write smart programs to access interesting datasets from other web sites, collect data from users of your own applications, and analyze and understand the data once you’ve found it. [Programming Collective Intelligence | O’Reilly Media]
I was struck by the foreword by Tim O’Reilly, where he discusses what he sees as distinctive about Web 2.0 as a concept (there is some overlap between the foreword and the blog entry from which I quote here).
When Time Magazine picked “You” as their Person of the Year for 2006, they cemented the idea that Web 2.0 is about “user generated content” — and that Wikipedia, YouTube, and MySpace are the heart of the Web 2.0 revolution. The true story is so much more complex than that. The content that users contribute explicitly to Web 2.0 sites is the small fraction that is visible above the surface. 80% of what matters is below, in the dark matter of implicitly-contributed data.
He talks about how Google’s invention of PageRank was in many ways the defining moment for Web 2.0, as it was one of the first applications to mobilize ‘implicitly-contributed’ data, in this case the ‘choices’ or ‘intentions’ implied by links to webpages.
No one would characterize Google as a “user generated content” company, yet they are clearly at the very heart of Web 2.0. That’s why I prefer the phrase “harnessing collective intelligence” as the touchstone of the revolution. A link is user-generated content, but PageRank is a technique for extracting intelligence from that content. So is Flickr’s “interestingness” algorithm, or Amazon’s “people who bought this product also bought…”, Last.Fm’s algorithms for “similar artist radio”, ebay’s reputation system, and Google’s AdSense. [Programming Collective Intelligence – O’Reilly Radar]
I defined Web 2.0 as “the design of systems that harness network effects to get better the more people use them.” Getting users to participate is the first step. Learning from those users and shaping your site based on what they do and pay attention to is the second step. [Programming Collective Intelligence – O’Reilly Radar]
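To make concrete the distinction O’Reilly draws between a link as content and PageRank as a technique for extracting intelligence from that content, here is a minimal sketch of the textbook power-iteration form of PageRank. It is an illustration only, not Google’s production algorithm and not code from the book; the function name `pagerank`, the damping factor of 0.85, the iteration count and the four-page toy graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by power iteration (illustrative sketch).

    `links` maps each page to the list of pages it links to.
    Returns a dict mapping each page to its score; scores sum to 1.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly across its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three pages link to "c", and "c" links to "a".
links = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

No one in this toy graph rates a page explicitly. The ordering falls out of the choices implied by who links to whom: "c" ranks highest because three pages point to it, and "a" comes next because the highly ranked "c" points to it.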
I have spoken about this ‘implicitly-contributed’ data as ‘intentional’ data, drawing on John Battelle’s notion of Google’s ‘database of intentions’, the progressively richer map of user choices it is amassing.
In general, I think that in libraries we are too preoccupied with the 20%, the ‘user generated content’, and not enough with network effects and the ‘dark matter of implicitly contributed data’ which drives ranking, relating and recommending on many sites.
Incidentally, I remember thinking that the Time cover story was a clumsy attention grab.