The Blogs: Dec 21, 2011

Mind Games: Crowdsourcing as a R&D Resource

Zachary Russ

If knowledge is power, then data sets are barrels of crude oil.  Both oil and data need refinement to extract useful material. Without those steps, you have only a gummy mess that is as difficult to wade through as it is to clean off seabirds.

The process of extracting useful, accurate predictions from data is rarely as simple as taking a linear regression or calculating a table of covariances. Data can come in myriad forms: numerical (weighs 154 lb.), categorical (drives a Lexus), or something more insidious, like a series of tagged images or a map of nodes.  One could imagine a perplexed refinery receiving a mixture of shale oil, sandy oil, and conventional crude with the encouraging note, "Good luck, finish by next month!"

Stretching the Pipeline
With the internet as your pipeline, anyone, anywhere can do the refining, distilling a model from the data using their particular skill set. The implications are tremendous: no matter the problem, specialized help is available and cheap!

Crowdsourcing solutions to technical problems through public contests has become very popular in recent years. While the million-dollar Netflix Prize for predicting movie ratings might be the most visible application of crowdsourced brilliance, it wasn't the first. In 2001, Eli Lilly launched InnoCentive to farm out difficult chemical syntheses, before "crowdsourcing" was even a word. InnoCentive has since spun off and expanded its challenges and prizes; though you'll still find things with lots of aromatic rings, the company also trumpets successful solutions in oil spill recovery, water purification, ALS biomarkers (a $1 million prize), and even flashlights that make life without electricity a little easier.

Competing in an InnoCentive challenge is a lot like answering a call for proposals, because it is one. You don't know how your submission is doing until the judges look it over, but such is the nature of the problems, which are mostly design problems. Only data and modeling challenges lend themselves to automatic evaluation and regularly updated leaderboards, and that's where Netflix, and later Kaggle, come in.

The Netflix Prize ran only once (the winning team beat the company's in-house algorithm by over 10%), but Kaggle, formed in 2010, has already closed 22 competitions and has six others running. Kaggle too has good results to show off: its users beat state-of-the-art models in every competition, including mapping dark matter for NASA and predicting HIV progression in patients.
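To make "automatic evaluation" concrete, here is a minimal sketch of how a contest server might score and rank submissions. The Netflix Prize ranked entries by root-mean-square error (RMSE) on held-out ratings; everything else here (the function names `rmse`, `leaderboard`, and `improvement_over`, the team names, and the toy ratings) is hypothetical and only for illustration, not any contest's actual scoring code.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between predicted and true values."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(actual))


def leaderboard(submissions, actual):
    """Score every submission and return (team, score) pairs, best first."""
    scores = {team: rmse(preds, actual) for team, preds in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1])


def improvement_over(baseline_score, score):
    """Fractional reduction in error relative to a baseline (0.10 means 10%)."""
    return (baseline_score - score) / baseline_score


# Hypothetical held-out ratings, kept secret by the organizer.
held_out = [4, 3, 5, 2]

# Hypothetical submissions; "in_house" stands in for the sponsor's own model.
entries = {
    "in_house": [3, 3, 3, 3],
    "team_a": [4, 3, 4, 2],
    "team_b": [5, 3, 5, 3],
}

board = leaderboard(entries, held_out)
baseline = dict(board)["in_house"]
for team, score in board:
    gain = improvement_over(baseline, score)
    print(f"{team}: RMSE={score:.4f}, improvement over in_house={gain:.1%}")
```

Because the score is computed mechanically from a hidden answer key, the ranking can be refreshed every time someone submits, which is what a design challenge judged by a panel cannot offer.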

Connecting Refined Thoughts
Running these contests isn't always easy. Researchers claimed to be able to de-anonymize parts of the Netflix Prize dataset in 2007, and Netflix was sued in 2009 over potential privacy breaches from the contest. Anonymity has continued to be elusive: Kaggle's IJCNN social network challenge asked users to predict connections between nodes in a social network from the rest of the network, but the winners managed to predict the links by matching the data to Flickr accounts.

Though it wasn't the answer the sponsors were looking for, the unmasking was valuable and highlights two considerations. First, sensitive data must be handled very carefully, and anonymization is paramount (they should have a contest for that!). Second, while malicious behavior by participants is unlikely, politely mischievous behavior is likely, so it is important to decide and state beforehand whether such results are desirable.

While the small cash prizes won't drive the establishment of research groups, they do connect people with problems and tip the work-pastime balance. The solvers just needed a little push to bring all of their resources to bear on the problem, whether because of the prizes, or the statement of importance, or direct competition.

Lancelot Ware, a co-founder of the high-IQ society Mensa, mentioned that he was disappointed that Mensa members wasted so much time solving arbitrary puzzles. Perhaps contests like these will assuage Ware's disappointment. For the rest of us, we have cooler puzzles with real results, a little extra cash, and, most importantly, some refined fuel to power the advancement of mankind.

GEN MAGAZINE

Genetic Engineering & Biotechnology News (GEN) has retained its position as the most widely read biotechnology publication around the globe since its launch in 1981. Published 21 times a year and with additional exclusive editorial content online, GEN's unique news and technology focus includes the entire bioproduct life cycle from early-stage R&D, to applied research including omics, biomarkers, as well as diagnostics, to bioprocessing and commercialization.
