Information Sorting and Retrieval by Language or Topic

Technical Description:

This technique is a simple, fast, and general method of sorting and retrieving machine-readable text according to language, topic, or both. The method is independent of the particular languages or topics of interest and relies for guidance solely upon examples (e.g., existing documents or fragments) provided by the user. It employs no dictionaries, keywords, stoplists, stemming, syntax, semantics, or grammar; nevertheless, it can distinguish among closely related topics (previously considered inseparable) in any language, and it can do so even in text containing a great many errors (typically 10-15% of all characters). The technique can be implemented quickly in software on any computer system, from microprocessor to supercomputer, and can also be implemented in inexpensive hardware. It scales directly to very large data sets (millions of documents). The technique is covered by U.S. Patent No. 5,418,951.
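
The description above does not spell out the mechanism, so the following is only an illustrative sketch of one approach that fits the stated constraints (guided by examples alone, with no dictionaries, stemming, or grammar): compare character n-gram frequency profiles by cosine similarity. The function names, the choice of n = 3, and the sample texts are assumptions made for this example, not details taken from the description.

```python
import math
from collections import Counter


def ngram_profile(text, n=3):
    """Return relative frequencies of overlapping character n-grams."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(w * q.get(gram, 0.0) for gram, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def classify(text, examples, n=3):
    """Return the label of the example whose profile is closest to text.

    `examples` maps a user-chosen label to sample text; the samples are
    the only guidance the sketch receives (an assumption of this example).
    """
    profile = ngram_profile(text, n)
    return max(
        examples,
        key=lambda label: cosine(profile, ngram_profile(examples[label], n)),
    )


# Hypothetical example documents supplied by the user.
examples = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and then the dog sleeps in the sun",
    "spanish": "el rápido zorro marrón salta sobre el perro perezoso "
               "y luego el perro duerme al sol",
}

label = classify("the dog is sleeping in the sun", examples)
```

Because the profiles are built from overlapping character sequences rather than whole words, a single character error disturbs only the few n-grams that span it, which is one plausible reason an approach of this kind can tolerate noisy text.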

Commercial Application:

  • Language- and topic-independent sorting and retrieval of documents satisfying dynamic criteria defined only by existing documents.
  • Clustering of topically related documents, with no prior knowledge of the languages or topics that may be present. If desired, this activity can automatically generate document selectors.
  • Specialized sorting tasks, such as identification of duplicate or near-duplicate documents in a large set.
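
As a rough illustration of the last application, near-duplicate documents can be flagged by comparing sets of overlapping character n-grams ("shingles") with Jaccard similarity. This is a standard stand-in technique, not a method confirmed by the description; the function names, the shingle length of 5, the 0.6 threshold, and the sample documents are all assumptions chosen for the example.

```python
from itertools import combinations


def shingles(text, n=5):
    """Return the set of overlapping character n-grams in text."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicates(docs, threshold=0.6, n=5):
    """Return name pairs whose shingle sets overlap at or above threshold.

    `docs` maps a document name to its text. Every pair is compared, so
    this sketch is quadratic in the number of documents.
    """
    sets = {name: shingles(text, n) for name, text in docs.items()}
    return [
        (x, y)
        for x, y in combinations(sorted(sets), 2)
        if jaccard(sets[x], sets[y]) >= threshold
    ]


# Hypothetical document set: "a" and "b" differ by one character.
docs = {
    "a": "The committee will meet on Tuesday to review the budget proposal.",
    "b": "The committee will meet on Tuesday to review the budget proposals.",
    "c": "Heavy rain is expected across the northern coast this weekend.",
}

pairs = near_duplicates(docs)
```

The all-pairs comparison keeps the sketch short but would not scale to millions of documents as written; a larger collection would need some form of indexing or hashing to avoid comparing every pair.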

Released: 1993

Reference Number: Acq.

If you are interested in exploring this technology further, please call 443-445-7159 or express your interest in writing to:

National Security Agency
NSA Technology Transfer Program
9800 Savage Road, Suite 6541
Fort George G. Meade, Maryland 20755-6541


Date Posted: Jan 15, 2009 | Last Modified: Jan 15, 2009 | Last Reviewed: Jan 15, 2009