What is Optical Character Recognition (OCR)?

Image: Scanning a historic newspaper with OCR software (Getty Images)

Optical Character Recognition (OCR) is software that converts a printed, typed, or handwritten document into digital text that computers can read, so no one has to type or enter the text manually. OCR is most often used on scanned documents in PDF format, but it can also create a computer-readable version of the text within an image file.

What is OCR?

OCR, also referred to as text recognition, is software technology that transforms characters (also called glyphs) such as numbers, letters, and punctuation from printed or written documents into an electronic form that computers and other software programs can recognize and read more easily. Some OCR programs do this as a document is scanned or photographed with a digital camera; others can apply the process to documents that were previously scanned or photographed without OCR. OCR lets users search within PDF documents, edit text, and reformat documents.

What is OCR Used For?

For quick, everyday scanning needs, OCR may not be a big deal. If you do a large amount of scanning, however, being able to search within PDFs to find the exact one you need can save quite a bit of time, which makes OCR functionality in your scanner program more important. Here are some other things OCR helps with:

  • Automated data processing and data entry (Example: Job applicant tracking systems for resumes)
  • Making scanned books searchable
  • Converting handwritten scans to computer-readable text
  • Making documents more usable by reader programs that assist visually impaired users
  • Preserving historic documents and newspapers, while also making them searchable
  • Data extraction and transfer to accounting programs (Example: Receipts and invoices)
  • Indexing documents for use by search engines
  • Recognition of vehicle license plates by speed camera and red-light camera software
  • Supplying text to speech synthesizers, which can then read printed material aloud; theoretical physicist Stephen Hawking was perhaps the best-known user of a speech synthesizer program

Why Use OCR?

Why not just take a picture? Because a picture is only an image: you can't edit anything in it or search its text. Scanning the document and running OCR software turns that file into something you can edit and search.
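To make the difference concrete, here is a minimal Python sketch of what becomes possible once OCR has turned scans into text. The file names and the recognized text are invented for illustration, and `search` is a hypothetical helper, not part of any real OCR product; an actual OCR program would supply the text.

```python
# Hypothetical OCR output: file name -> text recognized from that scan.
# These strings are made up; a real OCR program would produce them.
recognized_text = {
    "invoice_0412.pdf": "Invoice 0412  Acme Hardware  Total due: $118.40",
    "lease_2019.pdf": "Residential lease agreement between landlord and tenant",
    "receipt_cafe.pdf": "Corner Cafe  2 coffees  Total: $7.50",
}


def search(documents, query):
    """Return sorted names of documents whose recognized text contains the query.

    The match is a plain case-insensitive substring test.
    """
    needle = query.lower()
    return sorted(
        name for name, text in documents.items() if needle in text.lower()
    )


# Find every scan that mentions a total.
print(search(recognized_text, "total"))
```

With plain image files there would be no text to pass to a function like this, which is why a photo alone can't be searched.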

History of OCR

While the earliest use of text recognition dates to 1914, the widespread development and use of OCR-related technologies began in earnest in the 1950s, specifically with the creation of highly simplified fonts that were easier to convert to digitally readable text. The first of these simplified fonts was created by David Shepard and is commonly known as Farrington B (also written 7B). It is still used in the financial industry as the standard font on credit cards and debit cards. In the 1960s, postal services in several countries, including the United States, Great Britain, Canada, and Germany, began using OCR technology to speed up mail sorting. OCR is still the core technology used to sort mail for postal services around the world. In 2000, knowledge of the limits and capabilities of OCR technology was used to develop the CAPTCHA programs that stop bots and spammers.

Over the decades, OCR has grown more accurate and more sophisticated due to advancements in related technology areas such as artificial intelligence, machine learning, and computer vision. Today, OCR software uses pattern recognition, feature detection, and text mining to transform documents faster and more accurately than ever before.
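As a rough illustration of the pattern recognition idea, the Python sketch below matches a small "scanned" character against stored templates and picks the closest one. It is a toy under stated assumptions: the 3x5 bitmaps, the three-character alphabet, and the function names are all made up for this example, and real OCR software uses far more sophisticated methods.

```python
# Toy 3x5 bitmaps for a few glyphs; '#' is ink, '.' is blank paper.
# These templates are illustrative assumptions, not a real OCR font.
TEMPLATES = {
    "0": ["###",
          "#.#",
          "#.#",
          "#.#",
          "###"],
    "1": [".#.",
          "##.",
          ".#.",
          ".#.",
          "###"],
    "7": ["###",
          "..#",
          ".#.",
          ".#.",
          ".#."],
}


def distance(glyph, template):
    """Count the pixels where a scanned glyph and a template disagree."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the character whose template differs from the glyph in the fewest pixels."""
    return min(TEMPLATES, key=lambda ch: distance(glyph, TEMPLATES[ch]))


# A "scanned" 7 with one noisy pixel (extra ink in the bottom-left corner).
noisy_seven = ["###",
               "..#",
               ".#.",
               ".#.",
               "##."]

print(recognize(noisy_seven))
```

Because the noisy glyph differs from the "7" template by a single pixel and from the others by several, the closest-match rule still identifies it. Tolerating small scanning defects this way is the basic appeal of pattern matching.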