Google Uses OCR to Index the Text from Scanned Documents

Justin Kerr

Just when we thought search couldn’t get any better, Evin Levey, a product manager at Google, has blogged about a new feature that could have a dramatic impact on your search results. Scanned documents have been appearing in Google’s search results for quite some time now, but for the most part they weren’t at the top of your list, regardless of how relevant they may have been. The reason is simple: when the search engine ran into an Adobe PDF file that had been scanned as an image, it wasn’t able to read any of the contents other than what was contained within the meta tag. The article may well have been the definitive source on the topic you were searching for, but until now the search engine had no way of knowing what was in the document or of sorting out keywords in any automated fashion. On Thursday this all changed, and it appears the search engine has successfully implemented a form of optical character recognition (OCR) that can index the text for easy searching. This adds significant power to Google’s ability to catalog things such as books, which are commonly archived as images in PDF format.

Since millions of books are available under Creative Commons licenses and scanning projects have been actively publishing these works to the web, the ability to search their text will unlock countless additional sources of information. Care to try out some examples of the new feature?

Examples are courtesy of the Google Blog.

[repairing aluminum wiring]
[spin lock performance]
[Mumps and Severe Neutropenia]

[Steady success in a volatile world]
