Researchers from the Whitehead Institute needed just a computer, an Internet connection, and publicly accessible online resources to identify nearly 50 individuals who had submitted personal genetic material as participants in genomic studies.
“This is an important result that points out the potential for breaches of privacy in genomics studies,” research team leader Yaniv Erlich, Ph.D., principal investigator and Whitehead Fellow, said in a statement.
The group’s work is described in more detail in a study published this week in the journal Science. The study sought to show how the full names and identities of genomic research participants can be determined under certain circumstances, even when their genetic information is held in de-identified form within databases.
The study surfaces just three months after the Presidential Commission for the Study of Bioethical Issues urged that individual interests in privacy must be respected and secured in order to realize the promise of whole-genome sequencing in advancing clinical care and the greater public good. In a report, Privacy and Progress in Whole-Genome Sequencing, the commission recommended that Washington clearly define who may access whole-genome sequence data and how it may be used, ensure security for data, create a fully informed consent process, and join states in hammering out “a consistent floor of protections” for whole-genome sequence data.
In their study, Dr. Erlich and colleagues analyzed short tandem repeats on the Y chromosomes (Y-STRs) of men whose genetic material was collected by the Center for the Study of Human Polymorphisms (CEPH) and whose genomes were sequenced and made publicly available as part of the 1000 Genomes Project. Surnames correlate strongly with the DNA on the Y chromosome, researchers found, because both the Y chromosome and family surnames are transmitted from father to son.
Through surname inference, Erlich’s group was able to discover the family names of the men by submitting their Y-STRs to publicly accessible databases maintained by genealogists and genetic genealogy companies, which store the Y-STR data by surname. The team identified nearly 50 American male and female participants in CEPH, after validating their inferences with Internet record search engines, obituaries, genealogical websites, and public demographic data from the National Institute of General Medical Sciences’ (NIGMS) Human Genetic Cell Repository at New Jersey’s Coriell Institute.
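The matching step described above can be illustrated with a toy sketch. It is not the researchers' actual pipeline: the database records, the placeholder surnames, the repeat counts, and the mismatch threshold are all invented for illustration, and only the marker names (DYS19, DYS390, and so on) correspond to real Y-STR loci. The idea is simply to compare a query Y-STR profile against surname-labeled profiles and report the closest one, if any is close enough.

```python
from typing import Dict, List, Optional, Tuple

# A Y-STR haplotype maps a marker name to its repeat count.
Haplotype = Dict[str, int]

# Hypothetical genealogy records: surnames and repeat counts are invented.
TOY_DATABASE: List[Tuple[str, Haplotype]] = [
    ("Surname_A", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Surname_B", {"DYS19": 15, "DYS390": 22, "DYS391": 10, "DYS393": 14}),
    ("Surname_A", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}),
]


def count_mismatches(a: Haplotype, b: Haplotype) -> Optional[int]:
    """Count markers typed in both profiles whose repeat counts differ."""
    shared = a.keys() & b.keys()
    if not shared:
        return None  # nothing to compare
    return sum(1 for marker in shared if a[marker] != b[marker])


def infer_surname(
    query: Haplotype,
    database: List[Tuple[str, Haplotype]],
    max_mismatches: int = 1,  # assumed tolerance, chosen arbitrarily
) -> Optional[Tuple[str, int]]:
    """Return (surname, mismatches) for the closest record, or None."""
    best: Optional[Tuple[str, int]] = None
    for surname, haplotype in database:
        distance = count_mismatches(query, haplotype)
        if distance is None or distance > max_mismatches:
            continue
        if best is None or distance < best[1]:
            best = (surname, distance)
    return best


if __name__ == "__main__":
    query = {"DYS19": 15, "DYS390": 22, "DYS391": 10, "DYS393": 13}
    print(infer_surname(query, TOY_DATABASE))
```

A hit of this kind would be only a candidate surname. As the article notes, the team still had to validate its inferences against other public records before treating anyone as identified.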
Researchers concluded that the posting of genetic data from a single individual can reveal deep genealogical ties, as well as help identify distant relatives who may have no acquaintance with the person releasing genetic data.
Erlich shared his group’s findings with NIGMS and the National Human Genome Research Institute (NHGRI) before publication. The two agencies responded by shifting some demographic information from the publicly accessible portion of the NIGMS cell repository, hoping to help reduce the risk of future breaches. In an article published in the same issue of Science, NIGMS director Judith H. Greenberg and NHGRI director Eric D. Green called for a balance between the privacy rights of research participants and the benefits to society achieved by sharing of biomedical research data.
In the statement, Dr. Erlich declared that he had no intention of revealing the names of genetic study participants identified by his group, nor did he want to curtail public sharing of genetic information: “More knowledge empowers participants to weigh the risks and benefits and make more informed decisions when considering whether to share their own data. We also hope that this study will eventually result in better security algorithms, better policy guidelines, and better legislation to help mitigate some of the risks described.”