Big-Data Analytics in Resource-constrained Regime: Statistical Limits and Computational Challenges

Addressing the Problem

Statistical inference involving massive datasets is increasingly the norm in modern data-analytic practice. Traditional approaches to statistical inference assume a data-constrained regime in which it is expensive to collect samples or data points. In modern big-data analytics, both the amount of data available and the speed at which such datasets are collected are growing at extreme rates. To cope with the statistical and computational challenges posed by such datasets, we need a new paradigm for statistical inference in the resource-constrained regime, where data are abundant but time and memory are increasingly scarce. The key questions that remain unanswered are: when dealing with big data, how do resource constraints affect the performance of inference and learning, and how can we strike the optimal tradeoff between statistical performance and resource efficiency?

Traditionally, theoretical computer scientists have focused on the computational and space complexity of worst-case problems rather than problems of a statistical nature. In contrast, statisticians are mostly concerned with performing inference tasks optimally under various statistical models from a decision-theoretic perspective, without addressing constraints on computational and storage resources. However, the ever-growing amount of data compels us to take the computational and space complexity of inference and learning into account. To date, the connection between statistical inference and complexity theory for big data has remained a largely untapped area, although it has recently garnered a surge of attention from communities such as computer science, engineering, and statistics. There is a pressing need to bridge the gap between these two viewpoints on statistical inference with massive datasets.

Research Goals

Combining statistical and computational perspectives, we plan to undertake a systematic study of inference and learning problems in big-data analytics. Beyond the information-theoretic determination of the fundamental limits, the major innovation lies in identifying the optimal statistical performance achievable under resource constraints such as computational and space complexity, and in designing approximation algorithms with efficient implementations and provable guarantees, thereby quantifying the optimal tradeoff between statistical performance and resource utilization.

Current Activities

  • Determine the fundamental statistical limits of various inference problems via information-theoretic methods (as benchmarks for designing and comparing computationally efficient procedures);
  • Determine the complexity-theoretic performance limits of efficient inference algorithms by establishing connections to hardness results in theoretical computer science via rigorous reduction arguments;
  • Understand the underlying reason for the gap between the statistical and computational limits by studying the spectral structure of network data. To bridge the gap, one direction, aligned with recent trends in machine learning, is to identify low-dimensional structure that can be leveraged to circumvent the computational barrier (a toy spectral sketch appears after this list). Another direction is to develop scalable and efficient approximation algorithms that achieve the optimal tradeoff among statistical performance, computational complexity, and sample size;
  • Understand the principles underlying fast streaming algorithms with optimal space complexity, and develop a new framework for designing such algorithms, leveraging known approaches to provide guidelines for memory-efficient streaming algorithms (see the heavy-hitters sketch after this list); and
  • Apply the group’s results to large, publicly available datasets such as the collection hosted by the Stanford Network Analysis Project (SNAP); a sketch of streaming over a SNAP edge list is given after this list.
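
For the bullet on the gap between statistical and computational limits, the following is a minimal, purely illustrative sketch (not the group's own method) of how spectral structure exposes the low-dimensional signal in network data: the sign of the eigenvector associated with the second-largest eigenvalue of the adjacency matrix of a two-community stochastic block model recovers the planted partition. The values of n, p_in, and p_out are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_in, p_out = 200, 0.10, 0.02            # community size and edge probabilities (toy values)
labels = np.repeat([0, 1], n)               # planted community labels

# Sample a symmetric adjacency matrix from the two-community planted-partition model.
same_block = labels[:, None] == labels[None, :]
probs = np.where(same_block, p_in, p_out)
upper = np.triu(rng.random((2 * n, 2 * n)) < probs, k=1)
A = (upper | upper.T).astype(float)

# The eigenvector of the second-largest eigenvalue carries the community structure;
# thresholding its sign estimates the partition when the spectral gap is large enough.
_, eigvecs = np.linalg.eigh(A)              # eigenvalues returned in ascending order
second = eigvecs[:, -2]
estimate = (second > 0).astype(int)

# Report agreement up to the inherent sign/relabeling ambiguity.
agreement = max(np.mean(estimate == labels), np.mean(estimate != labels))
print(f"fraction of nodes labeled correctly: {agreement:.2f}")
```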
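
For the bullet on memory-efficient streaming, the sketch below shows one well-known space-bounded streaming algorithm, the Misra-Gries heavy-hitters summary, rather than anything developed by the group. It processes the stream in a single pass with at most k - 1 counters, independent of the stream length m, and underestimates any item's true frequency by at most m/k; the example stream and the choice k = 3 are illustrative only.

```python
def misra_gries(stream, k):
    """One-pass heavy-hitters summary using at most k - 1 counters (O(k) memory)."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Toy stream: "a" occurs 5 times out of 9, so it is guaranteed to survive in the
# summary (any item occurring more than m/k = 3 times is retained).
stream = ["a", "b", "a", "c", "a", "d", "a", "b", "a"]
print(misra_gries(stream, k=3))   # e.g. {'a': 3}; the count is low by at most m/k
```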
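
For the last bullet, the following is a hypothetical sketch of a single-pass degree computation over a SNAP-style edge list (whitespace-separated source/destination pairs, with '#' comment lines), keeping only per-node counters in memory. The filename "ca-GrQc.txt" is a placeholder for whichever SNAP dataset is downloaded.

```python
from collections import Counter

def degree_counts(path):
    """Single pass over an edge-list file, keeping only per-node degree counters."""
    degrees = Counter()
    with open(path) as f:
        for line in f:
            if line.startswith("#"):          # skip comment/header lines
                continue
            src, dst = line.split()
            degrees[src] += 1
            degrees[dst] += 1
    return degrees

# Placeholder filename: substitute any edge-list dataset downloaded from SNAP.
degrees = degree_counts("ca-GrQc.txt")
print("nodes:", len(degrees), "max degree:", max(degrees.values()))
```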