This lecture was presented at the Remote Sensing, Uncertainty Quantification and a Theory of Data Systems Workshop - Cahill Center, California Institute of Technology
CLIM Program: Remote Sensing Workshop, Some Ideas on Theory of Data Systems - Ansu Chatterjee, Feb 12, 2018
1. Some ideas on Theory of Data Systems
Ansu Chatterjee
School of Statistics,
University of Minnesota
February 12, 2018
2. What’s contained below
This is partially based on papers by Chandrasekaran, De Domenico, Smith, and Johnson, and on other talks we heard earlier today.
Contains a lot of ramblings.
Important disclaimer: I’m not an expert on most things
discussed below!
3. The main idea
Chandrasekaran (and Jordan, Soh, Berthet) proposed using a bit of both of these considerations (the statistical and the computational views described below): there is a cost for time/storage, there is also a quality constraint on the resulting estimate, and both depend on the sample size.
De Domenico brought in multi-layered networks: the different data repositories, computational resources, algorithms, and models can form the layers of such a network.
Smith introduced different blocking methods.
Crichton brought out the different levels of abstraction and complexity that underlie a data-analytic task.
Johnson brought together all the above elements.
4. The main idea
The way we do Statistics now: (i) data is a resource, (ii) the more data we use, the better our estimates get, (iii) there is no cost associated with obtaining, pre-processing, or storing data, nor with the time required to access the data and to run the algorithm.
The way algorithms are evaluated in (theoretical) computer science: (i) time and storage for the implementation are the important considerations, (ii) the quality and usage of the output do not matter, (iii) data is a constraint (the more data, the greater the demand on storage and time).
Constrained view of what is optimal.
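One schematic way to write the combined view, in the spirit of the Chandrasekaran-Jordan time-data tradeoff (the notation below, namely the loss ℓ, the class of procedures A_n, and the budget functions T and S, is mine, not from the slides): minimize the worst-case statistical risk over procedures that respect a computational budget, with both sides depending on the sample size n.

  \min_{\hat{\theta} \in \mathcal{A}_n} \; \sup_{\theta \in \Theta} \, \mathbb{E}_{\theta}\, \ell\big(\hat{\theta}(X_1, \dots, X_n), \theta\big)
  \quad \text{subject to} \quad \mathrm{time}(\hat{\theta}) \le T(n), \quad \mathrm{space}(\hat{\theta}) \le S(n)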
5. The goals may converge
Example
Consider i.i.d. observations X_1, . . . , X_n from N_p(0, Σ).
The goal is to estimate (λ_1, P_1), the largest eigenvalue and the corresponding eigenvector of Σ.
This can be done in O(p^2) steps, more or less (depending on assumptions and skill).
That might be a futile exercise: in high-dimensional settings where p grows with n, the largest sample eigenvalue is in general not consistent for the largest population eigenvalue.
Solution: Make assumptions about Σ, which may reduce the computational complexity and make the computation meaningful.
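A small numerical sketch (my own illustration, not from the slides) of both points above: power iteration recovers (λ_1, P_1) of the sample covariance with O(p^2) work per step, yet when p is comparable to n the estimate can be far from the truth even for Σ = I.

    import numpy as np

    rng = np.random.default_rng(0)

    n, p = 200, 200                    # high-dimensional regime: p comparable to n
    X = rng.standard_normal((n, p))    # i.i.d. rows from N_p(0, I_p), so the true lambda_1 is 1
    S = X.T @ X / n                    # sample covariance; forming it costs O(n p^2)

    # Power iteration: each step is one matrix-vector product, O(p^2)
    v = rng.standard_normal(p)
    v /= np.linalg.norm(v)
    for _ in range(200):
        v = S @ v
        v /= np.linalg.norm(v)
    lambda1_hat = float(v @ S @ v)

    # With p/n near 1, lambda1_hat concentrates near (1 + sqrt(p/n))^2 = 4,
    # far from the true value 1: fast computation, inconsistent answer.
    print(lambda1_hat)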
6. Many goals in both computation and statistics
CS: Traditionally, we consider time and storage
requirements as important quantifiers of complexity.
CS: The distributed nature of data, and distributed
computing, generates additional quantifiers. (Think today’s
talks, think MapReduce and beyond.)
Statistics: Robustness (to assumptions, the model, and the quality of the data) is an additional quantifier of desirable statistical properties.
Statistics: High-dimensional data requires strong
assumptions.
7. Statistics is forgiving
Example
An estimator of the mean is X̄_n.
Anything wrong with X̄_n + 42/n? Absolutely nothing (up to first-order asymptotics).
We do not need estimators to be more than O(n^{-α}) precise.
Statistical inference often does not depend on being
ultra-precise, and shouldn’t be considered trustworthy if it
does (takes us back to robustness).
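A quick simulation (my own, not from the talk) of why the 42/n perturbation is invisible to first-order asymptotics: relative to the O(n^{-1/2}) sampling error of X̄_n, the perturbation shrinks like 42/sqrt(n).

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma = 0.0, 1.0
    for n in (10**2, 10**4, 10**6):
        x = rng.normal(mu, sigma, size=n)
        xbar = x.mean()
        perturbed = xbar + 42.0 / n
        se = sigma / np.sqrt(n)                  # standard error of the sample mean
        print(n, abs(perturbed - xbar) / se)     # equals 42 / sqrt(n), which -> 0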
(As Venkat pointed out, and George Box before him:) Statistical models are approximations anyway: "Essentially, all models are wrong, but some are useful." (G. E. P. Box)
True not just for statistical models, but for all other sciences as well.
8. There is more to it
Example
Facebook, allegedly, has a data center in Lapland:
There are benefits to Facebook from reduced cooling costs.
There are environmental costs to Lapland.
There are possible economic benefits to Sweden.
A cost-benefit analysis, the way economists do it, may be useful here. (Real options?)
9. Don’t forget the stakeholders
Example
All tweets are stored (hopefully):
Is that necessary? Informative?
For example: all tweets of the president may be personally
important to him.
They may be important to future historians.
There are different kinds of stakeholders for each
computation/statistics exercise. Often, the core issue is
one of inference.
My personal view: Think of streaming data as the template,
and consider careful preferential sampling and online
updating. (Leads to non-regular asymptotics!)
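A minimal sketch of what "streaming data as the template" could look like in code (my own illustration; the function name, the Welford-style running mean, and the Efraimidis-Spirakis weighted reservoir are my choices, not from the talk): a single pass over the stream that maintains an online estimate while sampling the stream preferentially.

    import heapq
    import random

    def stream_summary(stream, k=100, weight=lambda x: 1.0):
        """One pass over a data stream: online mean update plus a size-k
        weighted reservoir sample (Efraimidis-Spirakis A-Res scheme)."""
        n, mean = 0, 0.0
        reservoir = []                                   # min-heap of (key, item)
        for x in stream:
            n += 1
            mean += (x - mean) / n                       # online (Welford-style) mean update
            key = random.random() ** (1.0 / weight(x))   # larger weight => larger key
            if len(reservoir) < k:
                heapq.heappush(reservoir, (key, x))
            elif key > reservoir[0][0]:
                heapq.heapreplace(reservoir, (key, x))
        return n, mean, [item for _, item in reservoir]

    # Example: keep a sample that preferentially retains large observations
    n, mean, sample = stream_summary((random.gauss(0, 1) for _ in range(10**5)),
                                     k=50, weight=lambda x: 1.0 + abs(x))

The unequal inclusion probabilities induced by the weights are one reason the resulting asymptotics stop being the usual regular ones.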