Abstract

A language-independent means of gauging topical similarity in unrestricted text is described. The method combines information derived from n-grams (consecutive sequences of n characters) with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents. No prior information about document content or language is required. Context, as it applies to document similarity, can be accommodated by a well-defined procedure. When an existing document is used as an exemplar, the completeness and accuracy with which topically related documents are retrieved are comparable to those of the best existing systems. The results of a formal evaluation are discussed, and examples are given using documents in English and Japanese.
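
The general idea of pairing character n-grams with a vector-space comparison can be illustrated with a minimal sketch. It is not the article's implementation: the function names `ngram_profile` and `cosine_similarity`, the default n-gram length of 5, and the choice of cosine similarity as the vector comparison are all assumptions made for illustration, and the sketch omits the context-handling procedure the abstract mentions.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text: str, n: int = 5) -> Counter:
    """Count every run of n consecutive characters in text.

    The default n = 5 is an illustrative assumption, not a value
    taken from the abstract.
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse n-gram count vectors.

    Cosine is one common vector-space measure; the abstract does not
    say which measure the article uses.
    """
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Python strings are sequences of Unicode code points, so the same
    # code runs unchanged on English or Japanese text.
    doc_a = ngram_profile("text retrieval with character n-grams")
    doc_b = ngram_profile("retrieval of text using character n-grams")
    print(cosine_similarity(doc_a, doc_b))
```

Because the profile is built from raw character windows, the sketch needs no tokenizer, stop list, or language identification step, which is the property the abstract emphasizes.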

Published In

Science
Volume 267 | Issue 5199
10 February 1995

Authors

Marc Damashek
Department of Defense, Fort George G. Meade, MD 20755-6000, USA.

Cited by
  1. Performance of Text Retrieval Systems, Science 268 (5216): 1417-1418 (1995). /doi/10.1126/science.268.5216.1417-c