Big Data

Every so often, history reaches a turning point that clearly divides what life was like before from how different it is after.

Such is the case with the “Big Data” movement, which harnesses the power of computers and supercomputers to draw insights from massive amounts of information. That power will unlock mysteries across scientific and societal boundaries, in fields ranging from genomics, environmental engineering, and high-energy physics to law enforcement and security.

“Illinois is in a unique position amongst academic institutions because of its long history in computing and data,” said Bill Kramer, project director for Blue Waters, the National Science Foundation-funded sustained-petascale supercomputer at the National Center for Supercomputing Applications (NCSA). “The University has the critical mix of people who understand the methods, people who understand the physical resources, and people who understand the investigative goals.”

Blue Waters, one of the most powerful supercomputers in the world, became fully operational last spring. Although it is the fastest supercomputer on a university campus, it is just one piece of the computing puzzle. A world leader in computing and information technology, Illinois traces its computational roots back to the original vacuum-tube Illiac and through generations of mainframe supercomputers, as well as world-changing innovations such as PLATO and Mosaic.

$100 Million Grainger Engineering Breakthroughs Initiative

Big Data is a key element of the Grainger Engineering Breakthroughs Initiative. More than $20 million of the initiative’s $100 million funds Big Data initiatives and more than a dozen new senior faculty in the field.

Illinois’ approach to Big Data targets six areas:

  1. Sensing and context of data collection and generation
  2. Data storage and representation
  3. Data analysis, science and engineering
  4. Data inference and visualization
  5. Computing support for data analysis
  6. Data intensive applications

The definition of Big Data isn't universal. It's often thought of in terms of how to acquire data, store it, understand the forms it takes, use the right tools to analyze it, and make decisions based on the knowledge that’s revealed. However, for Engineering at Illinois, the approach to Big Data is even broader. The idea is that Big Data is more a philosophy than a tool in our arsenal. It's a new way of doing things, in which our search for solutions is heavily driven by data.

Computer engineering will provide the infrastructure. Communications and signal processing will allow us to gain understanding, and the area of decision control will give us the means to draw inferences. We are using these areas of expertise to make science-based, credible decisions that will have a significant impact on the world around us.

The Science of Big Data

“In the same manner in which mathematics is the science for dealing with reasoning, Big Data is going to be the way you think about the relationships and causal effects of observations in the world,” explained Roy Campbell, a professor of computer science. “You may have millions of data points, but you can still reason and use it in a methodical scientific approach. That’s where the science is going to emerge from in this discussion. Some of those principles come from machine learning, some from statistics, and some are based on visualizing results so you can deliver valuable insights to the people who can use that information.”

Campbell reports that the amount of data in the world is increasing by about 40 percent each year. In addition to being used to analyze the past, Big Data will increasingly be used to predict how society will behave in future situations.
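
The 40 percent annual growth figure compounds quickly. The following is a minimal sketch of the arithmetic, assuming the rate stays constant from year to year (an assumption made for illustration only; the function names are hypothetical).

```python
import math

ANNUAL_GROWTH = 0.40  # 40 percent per year, the figure Campbell cites


def growth_factor(years, rate=ANNUAL_GROWTH):
    """Return how many times larger the data volume is after `years` years.

    Assumes a constant annual growth rate, which is an illustrative
    simplification rather than a claim from the article.
    """
    return (1.0 + rate) ** years


def doubling_time(rate=ANNUAL_GROWTH):
    """Return the number of years needed for the data volume to double."""
    return math.log(2.0) / math.log(1.0 + rate)


if __name__ == "__main__":
    print(f"Doubling time: {doubling_time():.2f} years")
    for years in (1, 5, 10):
        print(f"After {years:2d} years: {growth_factor(years):.1f}x the starting volume")
```

Under that assumption, the volume of data doubles roughly every two years and grows about 29-fold in a decade.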

Unstructured Data

Kramer points to three kinds of data—structured data that comes from simulation, observational data that comes largely from experiments and tools like telescopes or particle accelerators, and unstructured data. Unstructured data is the untidy compilation of video and text that we produce daily as a society, inside and outside of scientific research.

Dan Roth has been at the forefront of research in natural language understanding and machine learning. Roth and his team have developed tools that can analyze human language, categorize it, parse it semantically, and “wikify” it—disambiguate it and map snippets of text to the relevant Wikipedia pages. These tools are used by numerous researchers and some commercial companies to access text in more sophisticated ways than a keyword search, which is used by search engines such as Google.
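
To illustrate why disambiguation goes beyond keyword search, here is a toy sketch of linking an ambiguous mention to a Wikipedia page by context overlap. It is not Roth's system: the candidate table, the context words, and the function name are all hypothetical and were made up for this illustration.

```python
# Hypothetical candidate table: each ambiguous mention maps to possible
# Wikipedia page titles, each paired with a few hand-picked context words.
CANDIDATES = {
    "jaguar": {
        "Jaguar": {"cat", "animal", "jungle", "predator"},
        "Jaguar Cars": {"car", "engine", "drive", "luxury"},
    },
    "mosaic": {
        "Mosaic": {"tile", "art", "glass", "roman"},
        "Mosaic (web browser)": {"browser", "web", "ncsa", "internet"},
    },
}


def wikify(mention, sentence):
    """Pick the candidate page whose context words best overlap the sentence.

    Returns None for mentions that are not in the toy table. If no context
    word matches, the first listed candidate is returned.
    """
    candidates = CANDIDATES.get(mention.lower())
    if not candidates:
        return None
    # Crude tokenization: lowercase, drop periods and commas, split on spaces.
    words = set(sentence.lower().replace(".", " ").replace(",", " ").split())
    # Score each page by how many of its context words appear in the sentence.
    return max(candidates, key=lambda title: len(candidates[title] & words))


if __name__ == "__main__":
    print(wikify("Mosaic", "NCSA released the Mosaic browser for the web."))
```

A plain keyword search for "Mosaic" would match both pages. The sketch uses the surrounding words to choose between them, which is the basic intuition behind disambiguation.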

“Until 20 years ago, most of the data the corporations and agencies had was in databases,” Roth noted. “Today, between 85 and 95 percent is textual data. Therefore, you have to be able to figure out how to deal with unstructured data.” Medicine is among the fields that stand to benefit: more than a million articles are written each year, and far more textual electronic records are generated by caregivers.

“For example: A physician goes to see a patient and wants to know something about their medical history, but that history is 110 pages long,” Roth explained. “The doctor doesn’t have time to look at the 110 pages to determine what is relevant. We are developing automatic tools that will allow people to access the data in a more sensible way.” As researchers develop new ways to extract information, Roth is also leading an effort to assess the credibility and trustworthiness of the data we are all exposed to on the web, and of the sources that provide it.

Societal Impact

As these tools develop, data scientists will be able to improve quality of life further, for example by predicting weather events and analyzing traffic patterns.

“The upsides are compelling for all sorts of reasons,” Campbell explained. “We can actually determine what gets in the way of having a great life. It is the engineering motto in some sense: ‘If you are really going to improve society, then you’re going to have to find the truths underlying all the different parts of society and give it to everyone’.”

One of the most important areas in which Big Data will have a huge impact is the healthcare industry. For instance, Steven Lumetta, a University of Illinois professor of electrical and computer engineering, predicts that more and more people will base medical choices on genetic results. Because processing huge amounts of genetic data reliably is difficult, Illinois researchers launched the CompGen Initiative in the fall of 2013. It is a four-year, $2.6 million research effort to build the next-generation computational platform for genomics.

Pete Nardulli, a professor of political science and the director of the Cline Center for Democracy at Illinois, has been using data science tools to help extract information on civil strife.

“Our main interest is in societal development and the role institutions play,” Nardulli said. “These days civil strife is such an important factor in development.” The Center has assembled a news repository of tens of millions of articles from every country of the world from the end of World War II to today on state repression, political violence, terrorism, strikes, coups, and the like. Nardulli and his team are wrapping up a two-year project with Roth and NCSA, whose natural language processing tools have taken the operation to a whole new level of sophistication.

“Most of the content that we need to advance knowledge is in the form of unstructured data,” Nardulli noted. “Without these Big Data tools, social sciences will fall behind and become stagnant because they can’t possibly come to grips with the amount of data that’s available. I think that Illinois can be a leader in these efforts.”

“Computer science is no longer an area that only looks inward,” Roth added. “There is a huge change in the field and a focus on developing technologies that have societal impact. At Illinois, because we are so diverse and know how to collaborate, we have a significant opportunity for leading the data science revolution.”