Researchers from the Broad Institute and Harvard University have developed a tool that can tackle large data sets in a way that no other software program can. Part of a suite of statistical tools called MINE, it can tease out multiple patterns hidden in health information from around the globe, statistics amassed from a season of major league baseball, data on the changing bacterial landscape of the gut, and much more.
From Facebook to physics to the global economy, the world is filled with data sets that could take a person hundreds of years to analyse by eye. Sophisticated computer programs can search these data sets with great speed, but fall short when researchers attempt to even-handedly detect different kinds of patterns in large data collections.
"There are massive data sets that we want to explore, and within them, there may be many relationships that we want to understand," said Broad Institute associate member Pardis Sabeti, senior author of the paper and an assistant professor at the Center for Systems Biology at Harvard University. "The human eye is the best way to find these relationships, but these data sets are so vast that we can’t do that. This toolkit gives us a way of mining the data to look for relationships."
The researchers tested their analytical toolkit on several large data sets, including one provided by Harvard colleague Peter Turnbaugh, who is interested in the trillions of micro-organisms that live in the gut. Working with Turnbaugh, the research team used MINE to make more than 22 million comparisons and homed in on a few hundred patterns of interest that had not been observed before.
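To give a sense of what screening millions of comparisons involves, here is a minimal sketch of scoring every pair of variables in a data set and keeping the highest-scoring pairs. It is an illustration only, not the MINE implementation: the function names, the toy data and the placeholder score (absolute Pearson correlation) are all assumptions made for this example.

```python
from itertools import combinations
from math import sqrt


def count_pairs(n_variables):
    """Number of distinct variable pairs among n_variables columns."""
    return n_variables * (n_variables - 1) // 2


def abs_pearson(xs, ys):
    """Placeholder score: absolute Pearson correlation (NOT the MINE statistic)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return abs(cov / sqrt(var_x * var_y))


def rank_pairs(columns, score, top_k):
    """Score every pair of named columns and return the top_k highest-scoring pairs."""
    scored = [
        (score(columns[a], columns[b]), a, b)
        for a, b in combinations(sorted(columns), 2)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Hypothetical toy data: three short columns standing in for a large data set.
toy = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [1, -1, 1, -1],
}

print(count_pairs(1000))  # 1000 variables already give 499500 pairs
for s, a, b in rank_pairs(toy, abs_pearson, top_k=2):
    print(f"{a} vs {b}: {s:.3f}")
```

The number of pairs grows roughly with the square of the number of variables, which is why a ranked shortlist, rather than inspection by eye, becomes necessary. Any other scoring function with the same two-argument shape could be passed in place of `abs_pearson`.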
"The goal of this statistic is to take data with a lot of different dimensions and many possible correlations and pick out the top ones," said Michael Mitzenmacher, a senior author of the paper and professor of computer science at Harvard University. "We view this as an exploration tool – it can find patterns and rank them in an equitable way."
One of the tool’s greatest strengths is that it can detect a wide range of patterns and characterise them according to a number of different parameters a researcher might be interested in. Other statistical tools work well for searching for a specific pattern in a large data set, but cannot score and compare different kinds of possible relationships. MINE, which stands for Maximal Information-based Nonparametric Exploration, is able to analyse a broad spectrum of patterns.
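The limitation of searching for one specific pattern can be seen in a small sketch. Pearson correlation, a standard measure tuned to straight-line relationships, scores a perfect parabola at roughly zero, while a crude mutual-information estimate over a fixed grid registers both the line and the parabola. The grid-based estimate is a simplified stand-in chosen for this example and is not the statistic used by MINE; the function names, the bin count and the synthetic data are assumptions.

```python
from collections import Counter
from math import log2, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient, sensitive to linear relationships."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def bin_index(value, lo, hi, bins):
    """Map a value to one of `bins` equal-width bins spanning [lo, hi]."""
    if hi == lo:
        return 0
    return min(int((value - lo) / (hi - lo) * bins), bins - 1)


def binned_mutual_information(xs, ys, bins=8):
    """Plug-in mutual information (in bits) on a fixed equal-width grid.

    Illustrative stand-in only; this is NOT the MINE statistic.
    """
    n = len(xs)
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    cells = [
        (bin_index(x, x_lo, x_hi, bins), bin_index(y, y_lo, y_hi, bins))
        for x, y in zip(xs, ys)
    ]
    joint = Counter(cells)
    x_counts = Counter(i for i, _ in cells)
    y_counts = Counter(j for _, j in cells)
    mi = 0.0
    for (i, j), count in joint.items():
        # p(i, j) * log2( p(i, j) / (p(i) * p(j)) ), written with raw counts
        mi += (count / n) * log2(count * n / (x_counts[i] * y_counts[j]))
    return mi


# Synthetic, noise-free data: evenly spaced x values in [-1, 1].
n = 201
xs = [-1 + 2 * i / (n - 1) for i in range(n)]
line = list(xs)                  # y = x
parabola = [x * x for x in xs]   # y = x^2

for name, ys in [("line", line), ("parabola", parabola)]:
    r = pearson(xs, ys)
    mi = binned_mutual_information(xs, ys)
    print(f"{name}: pearson={r:.3f}, binned MI={mi:.3f} bits")
```

On this data the correlation coefficient treats the parabola as if there were no relationship at all, whereas the grid-based score is clearly above zero for both shapes. Making such scores comparable across different kinds of relationships is the harder problem the researchers describe.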
"Standard methods will see one pattern as signal and others as noise," said David Reshef, a co-first author of the paper. Reshef is currently a graduate student in the Harvard-MIT Health Sciences and Technology program and also worked on this project as a graduate student in the department of statistics at the University of Oxford. "There can potentially be a variety of different types of relationships in a given data set. What’s exciting about our method is that it looks for any type of clear structure within the data, attempting to find all of them."
MINE attempts not only to identify any pattern within the data but also to capture different types of patterns equally well. "This ability to search for patterns in an equitable way offers tremendous exploratory potential in terms of searching for patterns without having to know ahead of time what to search for," said Reshef.
Broad Institute of Harvard and MIT