Classification of Machine-Printed and Handwritten Text for Document Images
Technical Challenge: The proliferation of inexpensive computers, scanners, and fax machines has made document images commonplace in nearly every office around the world. Document images can contain a wide variety of information, including tables, graphs, drawings, charts, and text characters. These documents may be written in many languages and may contain handwritten or machine-printed information. Automated techniques have been developed to process large volumes of document images, but these techniques are limited. For example, a character recognition algorithm may recognize only machine-printed English characters. Reliable algorithms to automatically process all types and varieties of document images are needed.
Description: The purpose of this technology is to discriminate between handwritten and machine-printed document images across a diverse set of documents. A document image is input, and the system automatically determines whether the document was written by hand or produced by some type of electronic printer or typewriter. The technology is independent of the language used and of the character font style. When handwritten images are processed by character recognition, the resulting text contains numerous errors, which greatly reduce the performance of information retrieval applications. This technology could be run before the character recognition process so that only machine-printed documents are passed to the character recognition engine. Doing so would not only increase the accuracy of information retrieval but also reduce the computational resources required.
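The pre-filtering arrangement described above can be sketched as a simple gate placed in front of a character recognition engine. This is only an illustrative sketch: the listing does not disclose how the classifier works, so the classifier and the recognition engine are passed in as caller-supplied functions, and the names `Document`, `gate_for_ocr`, `classify`, and `run_ocr` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Labels a classifier is assumed to return (hypothetical names).
MACHINE_PRINTED = "machine-printed"
HANDWRITTEN = "handwritten"


@dataclass
class Document:
    doc_id: str
    image: bytes  # raw image data; the format does not matter for this sketch


def gate_for_ocr(
    documents: Iterable[Document],
    classify: Callable[[Document], str],
    run_ocr: Callable[[Document], str],
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Run character recognition only on documents classified as machine-printed.

    Returns (recognized, skipped): recognized is a list of (doc_id, text)
    pairs, and skipped is a list of doc_ids that were not sent to recognition.
    """
    recognized: List[Tuple[str, str]] = []
    skipped: List[str] = []
    for doc in documents:
        if classify(doc) == MACHINE_PRINTED:
            # Only machine-printed documents reach the recognition engine.
            recognized.append((doc.doc_id, run_ocr(doc)))
        else:
            # Handwritten documents are held back, saving recognition time
            # and keeping error-prone text out of the retrieval index.
            skipped.append(doc.doc_id)
    return recognized, skipped
```

Because the classifier and the recognition engine are ordinary function arguments, any implementation of either can be substituted without changing the gate itself.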
Demonstration Capability: This capability can currently be demonstrated on internal NSA computers, but could easily be exported to other unclassified systems for demonstrations.
Potential Commercial Application(s): Companies involved in automated document processing may be interested in this technology. Xerox and Ricoh both have significant commercial efforts in this area. In addition, companies that develop character recognition engines for document images, such as Caere and Xerox, may be interested as well.
Patent Status: Issued: United States Patent Number 7,072,514 (Updated)
Reference Number: 1272
If you are interested in exploring this technology further, please express your interest in writing to the:
National Security Agency
Date Posted: Jan 15, 2009 | Last Modified: Jan 15, 2009 | Last Reviewed: Jan 15, 2009