Reports

Nonlinear Dimensionality Reduction by Locally Linear Embedding

Science
22 Dec 2000
Vol 290, Issue 5500
pp. 2323-2326

Abstract

Many areas of science depend on exploratory data analysis and visualization. The need to analyze large amounts of multivariate data raises the fundamental problem of dimensionality reduction: how to discover compact representations of high-dimensional data. Here, we introduce locally linear embedding (LLE), an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs. Unlike clustering methods for local dimensionality reduction, LLE maps its inputs into a single global coordinate system of lower dimensionality, and its optimizations do not involve local minima. By exploiting the local symmetries of linear reconstructions, LLE is able to learn the global structure of nonlinear manifolds, such as those generated by images of faces or documents of text.


REFERENCES AND NOTES

1
M. L. Littman, D. F. Swayne, N. Dean, A. Buja, in Computing Science and Statistics: Proceedings of the 24th Symposium on the Interface, H. J. N. Newton, Ed. (Interface Foundation of North America, Fairfax Station, VA, 1992), pp. 208–217.
2
T. Cox, M. Cox, Multidimensional Scaling (Chapman & Hall, London, 1994).
3
Y. Takane, F. W. Young, Psychometrika 42, 7 (1977).
4
J. Tenenbaum, in Advances in Neural Information Processing Systems 10, M. Jordan, M. Kearns, S. Solla, Eds. (MIT Press, Cambridge, MA, 1998), pp. 682–688.
5
The set of neighbors for each data point can be assigned in a variety of ways: by choosing the K nearest neighbors in Euclidean distance, by considering all data points within a ball of fixed radius, or by using prior knowledge. Note that, for a fixed number of neighbors, the maximum number of embedding dimensions LLE can be expected to recover is strictly less than the number of neighbors.
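A minimal NumPy sketch of the first of these options (K nearest neighbors under Euclidean distance); the function name and array layout are illustrative rather than taken from the paper:

```python
import numpy as np

def nearest_neighbors(X, K):
    """Return, for each row of X (one data point per row), the indices
    of its K nearest neighbors under Euclidean distance."""
    # Pairwise squared distances via |a - b|^2 = |a|^2 - 2 a.b + |b|^2.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] - 2.0 * (X @ X.T) + sq[None, :]
    np.fill_diagonal(d2, np.inf)          # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :K]  # indices of the K closest points per row
```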
6
For certain applications, one might also constrain the weights to be positive, thus requiring the reconstruction of each data point to lie within the convex hull of its neighbors.
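One way to impose this optional constraint is sketched below, assuming SciPy's non-negative least-squares routine; the sum-to-one condition is enforced only approximately, through a heavily weighted extra row, which is a common workaround and not part of the original formulation:

```python
import numpy as np
from scipy.optimize import nnls

def convex_weights(x, neighbors, penalty=1e4):
    """Nonnegative reconstruction weights for one point x (shape (D,)) from its
    neighbors (shape (K, D)); the sum-to-one constraint is enforced softly by
    appending a heavily weighted row of ones to the least-squares system."""
    K = neighbors.shape[0]
    A = np.vstack([neighbors.T, penalty * np.ones(K)])
    b = np.concatenate([x, [penalty]])
    w, _ = nnls(A, b)      # minimize |A w - b| subject to w >= 0
    return w / w.sum()     # renormalize so the weights sum exactly to one
```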
7
Fits: The constrained weights that best reconstruct each data point from its neighbors can be computed in closed form. Consider a particular data point x⃗ with neighbors η⃗j and sum-to-one reconstruction weights wj. The reconstruction error |x⃗ − Σj wjη⃗j|², where the sum runs over the K neighbors, is minimized in three steps. First, evaluate inner products between neighbors to compute the neighborhood correlation matrix, Cjk = η⃗j · η⃗k, and its matrix inverse, C⁻¹. Second, compute the Lagrange multiplier, λ = α/β, that enforces the sum-to-one constraint, where α = 1 − Σjk C⁻¹jk (x⃗ · η⃗k) and β = Σjk C⁻¹jk. Third, compute the reconstruction weights: wj = Σk C⁻¹jk (x⃗ · η⃗k + λ). If the correlation matrix C is nearly singular, it can be conditioned (before inversion) by adding a small multiple of the identity matrix. This amounts to penalizing large weights that exploit correlations beyond some level of precision in the data sampling process.
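The three steps above translate directly into NumPy; in the sketch below the regularization constant and function name are illustrative, and a linear solve stands in for the explicit matrix inverse:

```python
import numpy as np

def reconstruction_weights(x, neighbors, reg=1e-3):
    """Closed-form, sum-to-one reconstruction weights for one data point.

    x         -- the data point, shape (D,)
    neighbors -- its K neighbors, shape (K, D)
    reg       -- small multiple of the identity added to condition C
    """
    K = neighbors.shape[0]
    # Step 1: neighborhood correlation matrix C_jk = eta_j . eta_k,
    # conditioned by a small multiple of the identity in case it is nearly singular.
    C = neighbors @ neighbors.T
    C = C + reg * np.trace(C) * np.eye(K)
    # Apply C^{-1} to the vector of inner products x . eta_k and to the ones vector
    # (solves are used instead of forming the inverse explicitly).
    p = neighbors @ x                      # p_k = x . eta_k
    Cinv_p = np.linalg.solve(C, p)
    Cinv_1 = np.linalg.solve(C, np.ones(K))
    # Step 2: Lagrange multiplier lambda = alpha / beta enforcing sum-to-one.
    alpha = 1.0 - Cinv_p.sum()
    beta = Cinv_1.sum()
    lam = alpha / beta
    # Step 3: weights w_j = sum_k C^{-1}_jk (x . eta_k + lambda).
    return Cinv_p + lam * Cinv_1
```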
8
Indeed, LLE does not require the original data to be described in a single coordinate system, only that each data point be located in relation to its neighbors.
9
The embedding vectors Y⃗i are found by minimizing the cost function Φ(Y) = Σi |Y⃗i − Σj WijY⃗j|² over Y⃗i with fixed weights Wij. This optimization is performed subject to constraints that make the problem well posed. It is clear that the coordinates Y⃗i can be translated by a constant displacement without affecting the cost, Φ(Y). We remove this degree of freedom by requiring the coordinates to be centered on the origin: Σi Y⃗i = 0⃗. Also, to avoid degenerate solutions, we constrain the embedding vectors to have unit covariance, with outer products that satisfy (1/N) Σi Y⃗iY⃗iᵀ = I, where I is the d × d identity matrix. Now the cost defines a quadratic form, Φ(Y) = Σij Mij (Y⃗i · Y⃗j), involving inner products of the embedding vectors and the symmetric N × N matrix

Mij = δij − Wij − Wji + Σk WkiWkj,

where δij is 1 if i = j and 0 otherwise. The optimal embedding, up to a global rotation of the embedding space, is found by computing the bottom d + 1 eigenvectors of this matrix (24). The bottom eigenvector of this matrix, which we discard, is the unit vector with all equal components; it represents a free translation mode of eigenvalue zero. (Discarding it enforces the constraint that the embeddings have zero mean.) The remaining d eigenvectors form the d embedding coordinates found by LLE. Note that the matrix M can be stored and manipulated as the sparse matrix (I − W)ᵀ(I − W), giving substantial computational savings for large values of N. Moreover, its bottom d + 1 eigenvectors (those corresponding to its smallest d + 1 eigenvalues) can be found efficiently without performing a full matrix diagonalization (25).
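A sketch of this eigenvector computation with SciPy's sparse routines, assuming the reconstruction weights have already been assembled into a sparse N × N matrix W whose rows sum to one; shift-invert mode is one common way to reach the smallest eigenvalues without a full diagonalization:

```python
import numpy as np
from scipy.sparse import eye, csr_matrix
from scipy.sparse.linalg import eigsh

def embed(W, d):
    """Bottom-eigenvector embedding step of LLE.

    W -- sparse (N, N) matrix of reconstruction weights, each row summing to one
    d -- number of embedding dimensions
    """
    N = W.shape[0]
    A = eye(N, format="csr") - csr_matrix(W)
    M = A.T @ A                        # M = (I - W)^T (I - W), sparse and symmetric
    # Bottom d + 1 eigenvectors; sigma=0 with which="LM" requests the eigenvalues
    # closest to zero (shift-invert mode), avoiding a full diagonalization.
    vals, vecs = eigsh(M, k=d + 1, sigma=0.0, which="LM")
    vecs = vecs[:, np.argsort(vals)]
    # Discard the constant bottom eigenvector (the free translation mode) and scale
    # the remaining d coordinates so the embedding has unit covariance.
    return vecs[:, 1:d + 1] * np.sqrt(N)
```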
10
Manifold: Data points in Fig. 1B (N = 2000) were sampled from the manifold (D = 3) shown in Fig. 1A. Nearest neighbors (K = 20) were determined by Euclidean distance. This particular manifold was introduced by Tenenbaum (4), who showed that its global structure could be learned by the Isomap algorithm.
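For readers who want to run a qualitatively similar experiment, the sketch below uses scikit-learn's LLE implementation and its swiss-roll generator as a stand-in for the surface of Fig. 1 (the exact manifold of the figure is not reproduced here); the parameters mirror those quoted in this note:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# N = 2000 points sampled from a curved two-dimensional surface embedded in D = 3.
X, color = make_swiss_roll(n_samples=2000, random_state=0)

# Standard LLE with K = 20 neighbors, mapping down to d = 2 dimensions.
lle = LocallyLinearEmbedding(n_neighbors=20, n_components=2, method="standard")
Y = lle.fit_transform(X)   # Y has shape (2000, 2)
```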
11
Faces: Multiple photographs (N = 2000) of the same face were digitized as 20 × 28 grayscale images. Each image was treated by LLE as a data vector with D = 560 elements corresponding to raw pixel intensities. Nearest neighbors (K = 12) were determined by Euclidean distance in pixel space.
12
Words: Word-document counts were tabulated for N = 5000 words from D = 31,000 articles in Grolier's Encyclopedia (26). Nearest neighbors (K = 20) were determined by dot products between count vectors normalized to unit length.
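A minimal sketch of this neighbor rule for a generic word-document count matrix (one row per word, one column per article); variable names are illustrative, and in practice the counts would be stored sparsely:

```python
import numpy as np

def neighbors_by_cosine(counts, K):
    """Nearest neighbors by dot products between unit-length count vectors.

    counts -- array of shape (N_words, N_documents) of word-document counts
    K      -- number of neighbors per word
    """
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    unit = counts / np.maximum(norms, 1e-12)   # normalize each count vector to unit length
    sim = unit @ unit.T                        # dot products = cosine similarities
    np.fill_diagonal(sim, -np.inf)             # a word is not its own neighbor
    return np.argsort(-sim, axis=1)[:, :K]     # indices of the K most similar words
```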
13
D. DeMers, G. W. Cottrell, in Advances in Neural Information Processing Systems 5, D. Hanson, J. Cowan, L. Giles, Eds. (Kaufmann, San Mateo, CA, 1993), pp. 580–587.
14
M. Kramer, AIChE J. 37, 233 (1991).
15
T. Kohonen, Self-Organization and Associative Memory (Springer-Verlag, Berlin, 1988).
16
C. Bishop, M. Svensen, C. Williams, Neural Comput. 10, 215 (1998).
17
H. Klock, J. Buhmann, Pattern Recognition 33, 651 (1999).
18
T. J. Hastie, W. Stuetzle, J. Am. Stat. Assoc. 84, 502 (1989).
19
D. J. Donnell, A. Buja, W. Stuetzle, Ann. Stat. 22, 1635 (1994).
20
T. Martinetz, K. Schulten, Neural Networks 7, 507 (1994).
21
D. Beymer, T. Poggio, Science 272, 1905 (1996).
22
Although in all the examples considered here, the data had a single connected component, it is possible to formulate LLE for data that lies on several disjoint manifolds, possibly of different underlying dimensionality. Suppose we form a graph by connecting each data point to its neighbors. The number of connected components (27) can be detected by examining powers of its adjacency matrix. Different connected components of the data are essentially decoupled in the eigenvector problem for LLE. Thus, they are best interpreted as lying on distinct manifolds, and are best analyzed separately by LLE.
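The check can be sketched with SciPy's graph routines, which use graph traversal (in the spirit of (27)) rather than explicit powers of the adjacency matrix; the neighbor-index array is assumed to come from the nearest-neighbor step:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def components_of_neighborhood_graph(neighbor_idx):
    """Label the connected components of the graph joining each point to its neighbors.

    neighbor_idx -- array of shape (N, K); row i holds the indices of point i's neighbors
    """
    N, K = neighbor_idx.shape
    rows = np.repeat(np.arange(N), K)
    cols = neighbor_idx.ravel()
    adjacency = csr_matrix((np.ones(N * K), (rows, cols)), shape=(N, N))
    # Treat the graph as undirected: i ~ j whenever either lists the other as a neighbor.
    n_components, labels = connected_components(adjacency, directed=False)
    return n_components, labels
```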
23
If neighbors correspond to nearby observations in time, then the reconstruction weights can be computed online (as the data itself is being collected) and the embedding can be found by diagonalizing a sparse banded matrix.
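A possible sketch of the banded diagonalization mentioned here, assuming temporal neighborhoods that make W, and hence M = (I − W)ᵀ(I − W), banded; the band-storage conversion and SciPy's banded symmetric eigensolver are one way to realize this and are not part of the original text:

```python
import numpy as np
from scipy.linalg import eig_banded

def embed_banded(M, bandwidth, d):
    """Embedding step when M = (I - W)^T (I - W) is banded (temporal neighborhoods).

    M         -- symmetric (N, N) array, zero outside the given bandwidth
    bandwidth -- number of nonzero diagonals below the main diagonal
    d         -- number of embedding dimensions
    """
    N = M.shape[0]
    # Pack the lower band into the (bandwidth + 1, N) storage that eig_banded expects:
    # band[i - j, j] = M[i, j] for j <= i <= j + bandwidth.
    band = np.zeros((bandwidth + 1, N))
    for offset in range(bandwidth + 1):
        band[offset, : N - offset] = np.diagonal(M, -offset)
    # Compute only the d + 1 smallest eigenpairs and discard the constant
    # bottom eigenvector (the free translation mode), as in the dense case.
    vals, vecs = eig_banded(band, lower=True, select="i", select_range=(0, d))
    return vecs[:, 1 : d + 1]
```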
24
R. A. Horn, C. R. Johnson, Matrix Analysis (Cambridge Univ. Press, Cambridge, 1990).
25
Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, H. van der Vorst, Eds., Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide (Society for Industrial and Applied Mathematics, Philadelphia, PA, 2000).
26
D. D. Lee, H. S. Seung, Nature 401, 788 (1999).
27
R. Tarjan, Data Structures and Network Algorithms, CBMS 44 (Society for Industrial and Applied Mathematics, Philadelphia, PA, 1983).
28
I. T. Jolliffe, Principal Component Analysis (Springer-Verlag, New York, 1989).
29
N. Kambhatla, T. K. Leen, Neural Comput. 9, 1493 (1997).
30
We thank G. Hinton and M. Revow for sharing their unpublished work (at the University of Toronto) on segmentation and pose estimation that motivated us to “think globally, fit locally”; J. Tenenbaum (Stanford University) for many stimulating discussions about his work (4) and for sharing his code for the Isomap algorithm; D. D. Lee (Bell Labs) and B. Frey (University of Waterloo) for making available word and face data from previous work (26); and C. Brody, A. Buja, P. Dayan, Z. Ghahramani, G. Hinton, T. Jaakkola, D. Lee, F. Pereira, and M. Sahani for helpful comments. S.T.R. acknowledges the support of the Gatsby Charitable Foundation, the U.S. National Science Foundation, and the Natural Sciences and Engineering Research Council of Canada.


Submission history

Received: 7 August 2000
Accepted: 17 November 2000
Published in print: 22 December 2000


Authors

Affiliations

Sam T. Roweis
Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London WC1N 3AR, UK.
Lawrence K. Saul
AT&T Labs—Research, 180 Park Avenue, Florham Park, NJ 07932, USA.
E-mail: [email protected] (S.T.R.); [email protected] (L.K.S.)
