Making Sense of Unstructured Data
Editor’s Note: Every so often there comes a point in history where you can mark what life was like before and how different it is after. Such is the case with the “Big Data” movement, which directs the power of high-performance computers and supercomputers at massive amounts of information to unlock mysteries across scientific and societal domains, from genomics, environmental engineering, and high-energy physics to law enforcement and security. This is one in a series of articles demonstrating how the University of Illinois is a leader in the field of Big Data.
Suppose you want to learn as much as you can about demonstrations that happened in New York around December 5 in the early nineties. The natural approach is to search newspaper archives with a keyword engine like Google: you type in “December 5” and “demonstration” and get some results. What happens, however, if an article used the phrase “the week after Thanksgiving” or “the first week of December” to reference the date, and did not use the word “demonstration” but rather “street protest,” “civil unrest,” or any of hundreds of other options? You wouldn’t be able to satisfy your information need.
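The gap the article describes can be made concrete with a toy sketch. The documents and synonym lists below are invented for illustration; they simply show why verbatim matching misses paraphrased dates and event words, and why even a naive paraphrase expansion changes the result.

```python
# Hypothetical mini-archive; the texts are invented for illustration.
docs = [
    "A street protest erupted in New York the week after Thanksgiving.",
    "A demonstration was held in New York on December 5.",
]

def keyword_search(docs, keywords):
    """Return documents containing every keyword verbatim (case-insensitive)."""
    return [d for d in docs if all(k.lower() in d.lower() for k in keywords)]

# Exact matching finds only the second document, even though the first
# describes the same kind of event around the same date.
print(keyword_search(docs, ["December 5", "demonstration"]))

# One naive remedy: expand each query term with hand-listed paraphrases.
synonyms = {
    "demonstration": ["demonstration", "street protest", "civil unrest"],
    "December 5": ["December 5", "week after Thanksgiving",
                   "first week of December"],
}

def expanded_search(docs, keywords):
    """Match a document if some paraphrase of every keyword appears in it."""
    return [
        d for d in docs
        if all(any(s.lower() in d.lower() for s in synonyms.get(k, [k]))
               for k in keywords)
    ]

print(expanded_search(docs, ["December 5", "demonstration"]))  # finds both
```

Hand-listed synonyms obviously don’t scale, which is exactly why the statistical language-understanding tools discussed below matter.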
“Between 85 and 95 percent of the data today is textual data,” notes Dan Roth, a professor of computer science at Illinois. “The real challenge here is to access it and ‘understand’ it so that we can do something with the information in the textual data. Today we use keyword search to access the data. Hopefully everyone realizes that it’s not enough and wants more. This cuts across scientific communities, agencies and corporations. They all have a lot of textual data and need to be able to access it in a way that requires some level of understanding of the meaning in the text: context-driven access.”
Roth is a prime example of the resources, in both expertise and computing power, that make the University of Illinois a leader in the emerging field of Big Data. Because textual data has been a driving force in the field, Roth’s research is at the heart of where it is heading.
“For instance, in the past, the way political science was done was by hiring 10 graduate students to read articles and summarize them, but of course this is not scalable,” adds Roth. “The technology that we are developing will help us understand the text better.”
Roth’s team has developed some of the most commonly used tools in textual analytics and information extraction. He is a pioneer in developing statistical machine learning techniques in natural language processing.
His Cognitive Computation Group has developed, among other products, a wikifier: a tool into which a user can drop text, and it will identify the key entities and concepts, disambiguate them, map them to the relevant Wikipedia pages, and provide additional information on these key concepts in the text. For instance, if a writer uses the name “Armstrong,” the wikifier will know from its context whether the writer means Lance Armstrong, Louis Armstrong or Neil Armstrong.
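The disambiguation step can be illustrated with a toy sketch. The candidate profiles below are hand-written for illustration; a real wikifier learns statistical models over far richer features, but the core idea of scoring candidates against the surrounding context is the same.

```python
# Hypothetical word profiles for each candidate entity (invented for
# illustration; a real system would derive these from Wikipedia content).
CANDIDATES = {
    "Lance Armstrong": {"cyclist", "tour", "france", "cycling", "race"},
    "Louis Armstrong": {"jazz", "trumpet", "musician", "singer", "orleans"},
    "Neil Armstrong": {"astronaut", "moon", "apollo", "nasa", "lunar"},
}

def disambiguate(mention, context):
    """Pick the candidate whose profile overlaps most with the context words.

    `mention` is unused in this toy version; a real wikifier would use it
    to generate the candidate set in the first place.
    """
    words = set(context.lower().split())
    return max(CANDIDATES, key=lambda cand: len(CANDIDATES[cand] & words))

print(disambiguate("Armstrong", "Armstrong played trumpet with his jazz band"))
# → Louis Armstrong (overlapping context words: "trumpet", "jazz")
print(disambiguate("Armstrong", "Armstrong walked on the moon during apollo 11"))
# → Neil Armstrong (overlapping context words: "moon", "apollo")
```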
“If you think about it from the perspective of accessing information, it’s a level of processing of the text that reflects some understanding that the reader may not have had. We can help them access information this way,” Roth said.
Roth is also developing tools that will allow users to access information that’s relevant to them by automating language understanding. He is confident that these kinds of tools will be pushed into other domains and developed in the context of scientific data, for example, where the number of new articles appearing each year is so large that newcomers and even experts find it difficult to keep up.
“Until 20 years ago, most of the data that corporations and agencies had was in databases; today it is in text,” Roth explained. “Therefore you have to be able to figure out how to deal with unstructured data.”
Perhaps no industry will benefit more than the medical domain, which produces millions of documents each year.
“Say a physician goes to see a new patient and wants to know something about their medical history, but it’s 110 pages long,” explained Roth. “That physician doesn’t have the time to look at all 110 pages and figure out what is relevant. We need to be able to develop these automatic tools that will allow people to access the data in a more sensible way.”
That is, if a patient actually consults a physician. Today more and more people are making medical decisions based on articles they read online. But how can people distinguish trustworthy sources from those that aren’t as reliable? One of Roth’s research efforts looks at information sources and the content they generate, developing an algorithmic framework for estimating a distribution over the trustworthiness of sources and the credibility of the information.
“There has been a lot of work on extracting information from text, and my group has also made a lot of progress in this direction, but we also need to ask ourselves: can we believe the data?” Roth asks. “In today’s world, because there is so much data out there, there also has to be an effort devoted to associating credibility with the content and with the sources that generated it. It’s our job as experts who understand the technology to tell people not to believe everything that is written on the web.
“Beyond the algorithmic framework, we are developing user interfaces that will allow people to see credibility information alongside the content they retrieve,” he adds. “In some cases, maybe there are two perspectives, but one of them dominates in terms of count. The algorithm can help people know that there are multiple views on a given topic, and also provide levels of credibility for these opinions.”
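The mutual-reinforcement idea behind this line of work can be sketched with a standard fact-finding iteration from the truth-discovery literature: a claim is believable if trusted sources assert it, and a source is trustworthy if its claims are believable. This is a generic sketch, not Roth’s actual framework, and the sources and claims below are invented for illustration.

```python
# Hypothetical sources and the (conflicting) claims they assert.
claims_by_source = {
    "medical_journal": {"drug X treats condition Y"},
    "health_agency":   {"drug X treats condition Y"},
    "news_site":       {"drug X treats condition Y", "drug X cures Y overnight"},
    "anon_blog":       {"drug X cures Y overnight"},
}

def fact_find(claims_by_source, iterations=20):
    """Alternate between scoring claims and scoring sources until stable."""
    sources = list(claims_by_source)
    claims = {c for cs in claims_by_source.values() for c in cs}
    trust = {s: 1.0 for s in sources}  # start with uniform trust
    for _ in range(iterations):
        # A claim's belief is the total trust of the sources asserting it.
        belief = {c: sum(trust[s] for s in sources if c in claims_by_source[s])
                  for c in claims}
        # A source's trust is the average belief of the claims it asserts.
        trust = {s: sum(belief[c] for c in claims_by_source[s])
                    / len(claims_by_source[s])
                 for s in sources}
        # Normalize so scores stay comparable across iterations.
        top = max(trust.values())
        trust = {s: t / top for s, t in trust.items()}
    return trust, belief

trust, belief = fact_find(claims_by_source)
print(sorted(trust.items(), key=lambda kv: -kv[1]))
```

Because three sources back the moderate claim and only one source backs the overnight-cure claim alone, the iteration ends up trusting the journal and agency most and the anonymous blog least, and it ranks the widely supported claim as more believable.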
Roth is intentional about sharing his expertise with others. He offers an annual data science summer institute, which has attracted students from across the country to learn the foundations of data science and apply them to societal issues, such as assisting the local police department.
He has been working with groups both inside and outside the university, such as Pete Nardulli, a professor of political science at Illinois and the director of the Cline Center for Democracy. The Center has assembled a news repository of tens of millions of articles from every country in the world, spanning the end of World War II to today and covering state repression, political violence, terrorism, strikes, coups, and more. Nardulli and his team are wrapping up a two-year project with Roth and NCSA, whose natural language processing tools have taken the operation to a whole new level of sophistication.
“Most of the content that we need to advance knowledge is in the form of unstructured data,” Nardulli noted. “Without these tools, social sciences will fall behind and become stagnant because they can’t possibly come to grips with the amount of data that’s available. I think that Illinois can be a leader in these efforts.”
“One of the cool things here is that there is a chance for societal impact,” concludes Roth. “Computer science is no longer an area that looks inward. There is a huge chance of using these technologies to make a real difference. At Illinois, because we are so diverse and know how to collaborate, we have a big chance of leading the data science revolution.”