Stochastic Systems Group  

Bounds on the Sample Complexity of Fitting Low Dimensional Manifolds to High Dimensional Data
Hariharan Narayanan
LIDS, MIT
We are increasingly confronted, in speech signals, images, geneexpression data, and other sources, with very high dimensional data. It is of considerable interest to know whether such data lie in the vicinity of a low dimensional manifold, since if this is the case, existing algorithms for nonlinear dimensionality reduction (such as Locally Linear Embedding, ISOMAP, Laplacian Eigenmaps, Hessian Eigenmaps, Diffusion maps etc) can be used to recover an approximation to such a manifold, leading to a better understanding of the topology, structure and symmetries of the data, economical storage and faster algorithms.
In this talk, we focus on the sample complexity of finding a submanifold whose complexity (measured in terms of parameters like its volume, curvature and covering number) is bounded above by predetermined constants, that approximately minimizes the expected squared distance to a random data point, among all such manifolds. One special case is the kmeans problem, where the manifold of interest is a set of k points and another is curve fitting, where the manifold of interest is a curve. Our main result is that if data is drawn i.i.d from a probability distribution supported on a unit ball in a high dimensional space R^m, the sample complexity for finding an epsilonoptimal submanifold (in terms of least squares) from a family of submanifolds whose ``covering numbers" at a scale epsilon are bounded above by U(epsilon), depends at most linearly on U(epsilon), and more importantly, has no dependence on the ambient dimension m.
This is based on work with Sanjoy Mitter.
Problems with this site should be emailed to jonesb@mit.edu