Everything is data

Dan Cohen offers an interesting perspective on Wikipedia, noting that the current debate over its authority misses much of its potential significance. He describes how a large, openly available knowledge base like Wikipedia is a valuable resource for emerging data-mining and search technologies.

Let me provide a brief example that I hope will show the value of having such a free resource when you are trying to scan, sort, and mine enormous corpora of text. Let’s say you have a billion unstructured, untagged, unsorted documents related to the American presidency in the last twenty years. How would you differentiate between documents that were about George H. W. Bush (Sr.) and George W. Bush (Jr.)? This is a tough information retrieval problem because both presidents are often referred to as just “George Bush” or “Bush.” Using data-mining algorithms such as Yahoo’s remarkable Term Extraction service, you could pull out of the Wikipedia entries for the two Bushes the most common words and phrases that were likely to show up in documents about each (e.g., “Berlin Wall” and “Barbara” vs. “September 11” and “Laura”). You would still run into some disambiguation problems (“Saddam Hussein,” “Iraq,” “Dick Cheney” would show up a lot for both), but this method is actually quite a powerful start to document categorization. [Dan Cohen – Digital Humanities Blog – The Wikipedia Story That’s Being Missed]
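The approach Cohen describes can be sketched in a few lines. Yahoo's Term Extraction service is no longer available, so the sketch below substitutes a much cruder heuristic: treat each Wikipedia entry as a term profile, keep only the terms unique to each profile (the "Barbara" vs. "Laura" signal), and score an unknown document against those distinctive terms. The toy profiles are hypothetical stand-ins for the full Wikipedia articles.

```python
from collections import Counter
import re

def terms(text):
    """Lowercase word tokens, ignoring very short words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 2]

def distinctive_terms(profiles):
    """For each label, keep terms that appear in no other profile --
    a crude stand-in for a real term-extraction service."""
    out = {}
    for label, text in profiles.items():
        others = set()
        for other, other_text in profiles.items():
            if other != label:
                others.update(terms(other_text))
        out[label] = set(terms(text)) - others
    return out

def classify(document, distinctive):
    """Score a document by counting hits against each label's distinctive terms."""
    counts = Counter(terms(document))
    scores = {label: sum(counts[t] for t in ts) for label, ts in distinctive.items()}
    return max(scores, key=scores.get)

# Toy stand-ins for the two Wikipedia entries; real use would fetch the articles.
profiles = {
    "Bush Sr.": "George H. W. Bush Barbara Berlin Wall Gulf War Quayle",
    "Bush Jr.": "George W. Bush Laura September 11 Afghanistan Cheney",
}
d = distinctive_terms(profiles)
print(classify("Bush spoke with Laura after September 11", d))  # → Bush Jr.
```

Shared terms like "George" and "Bush" cancel out automatically, which is exactly the disambiguation problem described above; the ambiguous names contribute nothing, and the distinctive associates decide the label.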

This is an interesting example of how what is processable will be processed where it can add value – for somebody.
Via if:book.
