In life sciences laboratories, data is everywhere and nowhere—everywhere in the sense that different kinds of data accumulate on disparate platforms, and nowhere in the sense that pools of data often become stranded due to formatting issues.
Data from early-stage research may be entered into Excel spreadsheets and paper notebooks, while data from later-stage research may be entered into electronic laboratory notebooks (ELNs) and laboratory information management systems (LIMS). Integrating “early” data and “late” data into a cohesive picture can be a Herculean task, especially in projects that stretch over many years or encompass many laboratories. It can pull scientists into searches for already-gathered data, distracting them from their research for days at a time.
Data management burdens can be lightened by applying artificial intelligence (AI) and machine learning (ML) technologies. For example, these technologies can facilitate comprehensive searches for specific data sets even if the searches must contend with disparities in formats or original storage types. But first, these technologies need to win the confidence of scientists.
“Researchers have become habituated to struggling to find data,” says Emerson Huitt, founder and CEO of Snthesis. “It’s difficult for them to envision answering their questions quickly and getting insights into the research pipeline that can yield organizational impacts.
“We worked with a large customer who saw that power firsthand when we unified (and harmonized) six years of data from 50 scientists. Before, two PhDs searched through the data full time instead of performing any actual scientific work. Answering a question, like how many samples had ever been tested against a specific target, took weeks.”
The answer was returned by Snthesis in about 90 seconds. The customer, Huitt recalls, was “completely blown away.”
Harmonization is key
Snthesis has developed a platform called Bio. According to the company, the system is designed to ingest “all data.” It can work with 300-plus file formats and handwritten notes. (Excel workbooks and other new files can be dragged and dropped into the system.) Bio can also upload data automatically from ELNs and LIMS.
The platform also aggregates, harmonizes, and contextualizes data, even if data sources include thousands of spreadsheets created by different people over a period of years. Then the data may be subjected to comprehensive searches via the platform’s graphical query tools.
“We process the data and analyze the shapes of the data to define categories for the system to recognize—categories that may include the origin of the samples and who collected them,” Huitt explains. “And we work with our clients to identify and extract the things that are important to them.”
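Analyzing the “shape” of data to assign categories can be illustrated with a minimal sketch. The patterns and category names below are hypothetical examples, not Snthesis’s actual rules: the idea is simply that a column whose values all share a recognizable form can be tagged automatically.

```python
import re

# Hypothetical patterns for recognizing a column's "shape" from its values.
PATTERNS = {
    "date":      re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "sample_id": re.compile(r"^[A-Z]{2,4}-\d+$"),
    "numeric":   re.compile(r"^-?\d+(\.\d+)?$"),
}

def infer_category(values):
    """Return the first category whose pattern matches every value."""
    for category, pattern in PATTERNS.items():
        if all(pattern.match(v) for v in values):
            return category
    return "text"  # fall back to free text for anything unrecognized

column = ["AB-1021", "AB-1022", "XYZ-77"]
print(infer_category(column))  # sample_id
```

A production system would use far richer signals (value distributions, header text, file provenance), but even this toy version shows how categories such as “origin of the samples” could be proposed without anyone labeling the data by hand.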
That includes extracting and linking data based on semantic meaning throughout the files. Accuracy thresholds can be specified to narrow the search or expand it to related terms. A search for “headaches,” for example, could be expanded to include migraines.
“When implementing this solution, an organization doesn’t need to have its data properly structured,” asserts Joe Insinga, Snthesis’s chief growth officer. “The organization just needs a vision of how it wants data to be structured.”
Companies don’t have to do it alone, though. “We have a workshop on day one,” Insinga says. “We discuss what companies want the result to look like and what vernacular or data classifications they want the system to use.” Different teams, he continues, often have different names for the same things, or researchers have an inconsistent approach to labeling spreadsheet columns from week to week.
Specifically, during the workshop, this means identifying the types of data that are most relevant, documenting them and communicating them formally to analysis teams across the organization. “Often, this has never been done,” Insinga notes. When such is the case, the workshop can become a discovery process for the organization. Notably, the teams needn’t always agree. The system can deal with variations as long as those variations are input into the Snthesis platform.
“To minimize any issues, we enable the organization to connect disparate data sources and to unify them without manual work or correlation,” Huitt adds. “For example, we can match up the column labeling in an automated fashion. Natural language processing has come a long way in the past 20 years or so.” Natural language processing, a form of AI that allows computers to understand and generate human language, may be best known for its applications in facilitating web searches and analyzing various forms of electronic communication. But it is also being used to analyze records of various kinds in healthcare.
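Automated matching of inconsistent column labels can be approximated even with simple string similarity, before any heavier NLP is applied. The canonical schema below is a hypothetical stand-in for the vocabulary a client might settle on in the day-one workshop:

```python
from difflib import get_close_matches

# Hypothetical canonical schema agreed on during the workshop.
CANONICAL = ["sample_id", "target", "concentration", "collected_by"]

def harmonize(label):
    """Map a raw spreadsheet column label to the closest canonical name."""
    normalized = label.strip().lower().replace(" ", "_")
    match = get_close_matches(normalized, CANONICAL, n=1, cutoff=0.6)
    return match[0] if match else label  # leave unmatched labels for human review

for raw in ["Sample ID", "Concentraton", "collected by", "plate row"]:
    print(raw, "->", harmonize(raw))
```

Note how a typo (“Concentraton”) still resolves, while an unrecognized label (“plate row”) is passed through untouched rather than guessed at, which is the safer default when unifying thousands of spreadsheets.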
The Snthesis Bio platform leverages proprietary natural language processing models that extract the right data from spreadsheets and laboratory notes and then harmonize everything that is collected. “Then we implement those models, building self-management tools so clients can manage and catalogue their data,” Huitt remarks.
Snthesis isn’t the only company that is developing technology to harmonize and integrate laboratory data. As Huitt points out, the European Union has provided significant levels of funding to make data more interoperable, funding that is likely to stimulate additional development in Snthesis’s niche. Huitt remains confident, however, that Snthesis’s platform is more comprehensive and further along in development than the platforms developed by Snthesis’s competitors.
Notably, the Snthesis platform delivers data rather than conclusions. “Machine learning is good for prescriptive recommendations and describing data,” Huitt says. “But it isn’t yet reliable enough to be used for extracting conclusions from the data without accompanying human analysis.”
Apply big-tech magic
Huitt spent years building custom software tools that performed data management for customers who ranged from university laboratories to Fortune 500 companies. “I saw the same types of solutions being built over and over again,” he relates. “Customers believed they were so different that they couldn’t use off-the-shelf solutions, so they would spend millions of dollars building custom software. That’s very risky, especially for a fast-moving target like research data.
“Having worked in that space, as well as at the laboratory bench doing analysis and data management for a small team, I saw the problem from every angle.” He also reached out to big-tech colleagues about how other organizations—including Google and Facebook—make big data meaningful without touching each piece of data.
In December 2018, he formed Snthesis. Initially, the company was preoccupied with getting its technology “off the ground,” Huitt relates. Since then, the challenges have changed. According to Huitt, the company is now more focused “on growing, on building its brand, and on being able to continue to compete.” Complicating these tasks is the difficulty of recruiting talent in a competitive industry. What used to be a recruitment advantage for Snthesis—the company’s willingness to offer remote work arrangements—became an industry norm during the pandemic.
Snthesis is also recognizing that it needs more visibility. One visibility challenge is the way “Snthesis” underperforms as a search term. Search engines tend to change it to “Synthesis,” so search results pointing to Snthesis usually follow those pointing to soundalike companies. This challenge may be little more than a nuisance, but it does suggest that Snthesis is more focused on developing its technology than perfecting its marketing. That’s not uncommon among young companies with scientific founders.
“I should have spent more time building our brand from day one,” Huitt admits. He’s beginning to remedy that. The company hosted a webinar on GEN earlier this year, and Huitt spoke at his first industry conference as CEO in September. Huitt will also be participating in the Bio-Hackathon Europe this November, in Barcelona.
For now, Huitt and Snthesis are devoted to taming the industry’s data monster. Looking forward, he says his goal is “to completely bridge the gap between benchtop research and data analysis.” That entails a broader integration with LIMS and other data sources.