If knowledge is power, then data sets are barrels of crude oil. Both oil and data need refinement to extract useful material. Without those steps, you have only a gummy mess that is as difficult to wade through as it is to clean off seabirds.
The process of extracting useful, accurate predictions from data is rarely as simple as fitting a linear regression or calculating a table of covariances. Data can come in myriad forms: numerical (weighs 154 lb.), categorical (drives a Lexus), or something more insidious, like a series of tagged images or a map of nodes. One could imagine a perplexed refinery receiving a mixture of shale oil, sandy oil, and conventional crude with the encouraging note, "Good luck, finish by next month!"
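A regression model only understands numbers, so mixed records like the ones above have to be encoded first. Here is a minimal sketch (field names and values are made up for illustration) of one-hot encoding a categorical field alongside a numerical one:

```python
# Hypothetical example: turning mixed records into numeric feature vectors.
# Numerical fields pass through; categorical fields become 0/1 flags.

records = [
    {"weight_lb": 154, "car": "Lexus"},
    {"weight_lb": 180, "car": "Ford"},
    {"weight_lb": 130, "car": "Lexus"},
]

# Collect the categories seen in the data, in a stable order.
categories = sorted({r["car"] for r in records})  # ["Ford", "Lexus"]

def encode(record):
    """Build a numeric feature vector: weight, then one flag per car make."""
    one_hot = [1.0 if record["car"] == c else 0.0 for c in categories]
    return [float(record["weight_lb"])] + one_hot

vectors = [encode(r) for r in records]
# vectors[0] is [154.0, 0.0, 1.0]: weight, Ford flag, Lexus flag.
```

Images and graphs, of course, resist any recipe this tidy, which is much of the refining challenge.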
Stretching the Pipeline
With the internet as your pipeline, anyone, anywhere can do the refining, distilling a model from the data using their particular skillset. The implications are tremendous—no matter the problem, specialized help is available and cheap!
Crowdsourcing solutions to technical problems through public contests has become very popular in recent years. While the million-dollar Netflix Prize for predicting movie ratings might be the most visible application of crowdsourced brilliance, it wasn't the first. In 2001, Eli Lilly launched Innocentive to farm out difficult chemical syntheses before "crowdsourcing" was even a word. It has since spun off and expanded its challenges and prizes; though you'll still find things with lots of aromatic rings, the company also trumpets successful solutions in oil spill recovery, water purification, ALS biomarkers (a $1 million prize), and even flashlights that make life without electricity a little easier.
Competing in an Innocentive challenge is a lot like answering a call for proposals—because it is. You don't know how you're doing until the judges look your submission over, but such is the nature of the problems, which are mostly matters of design. Only data and modeling challenges lend themselves to automatic evaluation and regularly updated leaderboards, and that's where Netflix, and later Kaggle, come in.
The Netflix Prize ran only once (the winning team beat Netflix's in-house algorithm by over 10%), but Kaggle, formed in 2010, has already closed 22 competitions and has six others running. It, too, has good results to show off: its users have beaten state-of-the-art models in every competition, including mapping dark matter for NASA and predicting HIV progression in patients.
Connecting Refined Thoughts
Running these contests isn't always easy. Researchers claimed to be able to de-anonymize parts of the Netflix Prize dataset in 2007, and Netflix was sued in 2009 over potential privacy breaches from the contest. Anonymity has continued to be elusive: Kaggle's IJCNN social network challenge asked users to predict missing links between nodes in an anonymized social network, but the winners managed to recover the links by matching the data to public Flickr accounts.
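Setting the privacy episode aside, the underlying link-prediction task can be sketched in miniature. A classic baseline (far simpler than any winning entry, with a made-up toy graph) scores a candidate edge by how many neighbors its endpoints already share:

```python
# Toy sketch of link prediction by common-neighbor counting — a standard
# baseline, not the method actual competitors used.

# Adjacency sets for a tiny undirected graph (nodes are hypothetical).
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

def common_neighbors(u, v):
    """Score a candidate edge (u, v) by the number of shared neighbors."""
    return len(graph[u] & graph[v])

# Rank node pairs that aren't already connected: more shared
# neighbors suggests a link is more likely to exist.
score = common_neighbors("b", "d")
# b and d both neighbor a and c, so the candidate edge scores 2.
```

The contest's twist was that no such honest heuristic was needed once the graph could be matched, node for node, against the public source it was drawn from.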
Though it wasn't the answer the sponsors were looking for, the de-anonymization was valuable and underscores two considerations. First, sensitive data must be handled very carefully, and anonymization is paramount (they should run a contest for that!). Second, while malicious behavior by participants is unlikely, politely mischievous behavior is likely, and it is important to decide, and state beforehand, whether such results are welcome.
While the small cash prizes won't fund the establishment of research groups, they do connect people with problems and tip the work-pastime balance. Whether it was the prize money, the statement of the problem's importance, or direct competition, the solvers needed only a little push to bring all of their resources to bear.
Lancelot Ware, the founder of the high-IQ society Mensa, once remarked that he was disappointed that Mensa members wasted so much time solving arbitrary puzzles. Perhaps this would assuage Ware's disappointment. The rest of us get cooler puzzles with real results, a little extra cash, and, most importantly, some refined fuel to power the advancement of mankind.