How to convert document to text without printing/scanning?





infinisa
Hello Folks

Every now and again I need to convert a document in the form of an image file into text - typically when someone e-mails me a scanned document.

The only way I know to do this is to print the document out and then scan it back in as text, which seems to me to be rather daft!

Surely there must be some software out there that does the conversion directly, without the need to print. If you know of a solution, please tell me!

Thanks
badai
the software that you use to scan, can't it accept image input? usually all OCR can. besides, the scanned document itself is an image.
infinisa
Hello badai

Thanks for your reply.

The answer to your question:
Quote:
the software that you use to scan, can't it accept image input? usually all OCR can.
is no. The scanning software I use came with my HP multifunction (psc 2410), and unfortunately will only scan from this device. In fact, the program won't even start if the device isn't switched on.

As for your comment:
Quote:
the scanned document itself is an image
this is of course true. When I scan a document myself, I can choose to scan it as text if I want. But when I receive a document that someone else scanned (usually by e-mail), I don't have that option available. That's when I usually have to print the document and scan it back in again as text.
badai
i use FineReader, but it's not free software.
infinisa
Hello again badai

Thanks for this info.

I guess you mean the software by ABBYY.

I notice that besides the "ABBYY FineReader 9.0 Professional Edition" (139 Euro), they also sell the "ABBYY PDF Transformer 2.0 Pro" (85.00 Euro). Maybe this would be sufficient for my purpose?

Of course, I would MUCH rather get some free software to do the job!
orno
If you have microsoft office installed, you could use the bundled Microsoft Office Document Imaging utility (google it)

If you are dealing with non-confidential documents, you could always try an online OCR service such as WeOCR
Animal
I believe the latest version of Adobe Reader allows you to copy images as text. The documents would need to be either originally in PDF format, or re-converted to that though.
infinisa
Thanks orno & Animal

I shall certainly give these a try!
infinisa
Hello orno

Quote:
If you have microsoft office installed, you could use the bundled Microsoft Office Document Imaging utility (google it)


Thanks very much for the suggestion about using Microsoft Office Document Imaging (MODI) – it does just what I need, and it’s really cool! And to think I’ve been using MS Office for years without knowing it even existed!

As it’s not obvious to the uninitiated how to use MODI, here is a quick guide for all you guys out there.

1. Do I have Microsoft Office Document Imaging (MODI) installed?

MODI is a component of MS Office (at least in versions 2003 & 2007).
Unfortunately, it’s not installed by default (i.e. if you choose “Type of installation: Typical” when installing MS Office).
To see if it is installed in your computer, look in the Programs Menu for Microsoft Office: Microsoft Office Tools. If you see the following items, MODI is installed; otherwise it’s not:
• Microsoft Office Document Imaging
• Microsoft Office Document Scanning

2. How do I install MODI?

For this, you won’t need the installation CD, unless you chose to delete the installation files when you installed MS Office.

In the Control Panel, select “Microsoft Office (whatever) Edition 2003” or “2007 Microsoft Office system”, depending on your version of MS Office, and press Change.

Choose “Add or Remove Features”.
For Office 2003 only, check the “Choose advanced Customization of applications” box.
In the list of components, go to “Office Tools”, expand it (by pressing the + sign).
You’ll now see the item “Microsoft Office Document Imaging”: Change the option to “Run ALL from My Computer”, so it changes colour from grey to white.
Now press Update or Continue and the components will be installed.

2. How do I use MODI to scan for text in a document?

First of all, MODI only accepts two file types: .mdi & .tif(f).
Actually, there’s a good reason for this.
When a file is in .tif(f) format (Tagged Image File Format), scanned text is added to the document itself, which allows you to search for text within the document (when the document is opened in MODI), or when searching documents via Office or Windows.
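
Since MODI only takes .mdi and .tif(f) files, it can help to check that a file really is a TIFF and not, say, a renamed .jpg. Here is a minimal Python sketch, assuming you have Python installed; the function names are made up for illustration. It only looks at the first four bytes, which for classic TIFF are "II" or "MM" followed by the number 42 (the newer BigTIFF variant is not covered).

```python
# The two byte-order signatures of classic TIFF files:
# "II" + 42 (little-endian) and "MM" + 42 (big-endian).
TIFF_SIGNATURES = (b"II*\x00", b"MM\x00*")


def looks_like_tiff(header: bytes) -> bool:
    """Return True if the first four bytes match a TIFF signature."""
    return header[:4] in TIFF_SIGNATURES


def is_tiff_file(path: str) -> bool:
    """Hypothetical helper: read the first four bytes of a file and test them."""
    with open(path, "rb") as handle:
        return looks_like_tiff(handle.read(4))
```

Calling is_tiff_file on a file that is really a JPEG with a .tif extension should return False.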

Probably the file you want to scan for text isn’t in .tif format, so the first step is to convert it.
We’ll consider two cases: .jpg and .pdf:

2a. How do I convert a .jpg file to .tif format?

This one’s easy. Just open it in Paint, and use Save As to save in .tif format.
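
If you have a lot of .jpg files, doing this by hand in Paint gets tedious. As an alternative (not part of the Paint method above), here is a small Python sketch that assumes the third-party Pillow library is installed; the function names are hypothetical. It writes a .tif next to the original file.

```python
from pathlib import Path


def tiff_path_for(source: str) -> Path:
    """Return the source path with its extension replaced by .tif."""
    return Path(source).with_suffix(".tif")


def convert_to_tiff(source: str) -> Path:
    """Re-save an image as TIFF. Assumes Pillow is installed (imported lazily)."""
    from PIL import Image  # third-party: pip install Pillow

    target = tiff_path_for(source)
    with Image.open(source) as image:
        image.save(target, format="TIFF")
    return target
```

For example, convert_to_tiff("scan.jpg") would write scan.tif in the same folder.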

2b. How do I convert a .pdf file to .tif format?

This one requires much more work (maybe someone out there has an easier method?)
First, you need to go to www.pdf995.com and download and install some (free) software:
• Free Converter (pre-requisite for Pdf995)
• Pdf995
• PdfEdit995

Now open your PDF document, and “print” it to the PDF995 virtual printer, to create another PDF file. This may seem a rather dumb thing to do, but you have to do it for the next step.

Open PdfEdit995 (look in the Programs Menu under Software995), and go to the Image tab.
In the Extraction pane, choose Format tiff24nc, and click on “Convert the last document printed to image/s”.

The resulting .tif file is placed in My Documents\Pdf995, and opens in its default program.
Don’t be scared by its huge size – it’ll get much smaller in the next step.
If you wish, change its name and move it to a different folder.
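
The format name tiff24nc is also the name of a Ghostscript output device, so if you happen to have Ghostscript installed, a similar conversion could probably be scripted directly. This is a sketch under that assumption, not a description of what PdfEdit995 does internally. The function names, the 300 dpi default and the "gs" executable name are my own choices (on Windows the console program is usually called gswin32c or gswin64c).

```python
import subprocess


def ghostscript_tiff_command(pdf_path, tiff_path, dpi=300, executable="gs"):
    """Build the argument list for rendering a PDF to a 24-bit colour TIFF."""
    return [
        executable,
        "-dNOPAUSE",            # don't wait for a key press after each page
        "-dBATCH",              # exit when the input has been processed
        "-dSAFER",              # restrict file access while rendering
        "-sDEVICE=tiff24nc",    # same format name as in the Extraction pane
        f"-r{dpi}",             # resolution in dots per inch (assumed default)
        f"-sOutputFile={tiff_path}",
        pdf_path,
    ]


def pdf_to_tiff(pdf_path, tiff_path, dpi=300, executable="gs"):
    """Run Ghostscript; raises CalledProcessError if the conversion fails."""
    subprocess.run(
        ghostscript_tiff_command(pdf_path, tiff_path, dpi, executable),
        check=True,
    )
```

For example: pdf_to_tiff("fax.pdf", "fax.tif", executable="gswin32c").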

2c. How do I use MODI to scan for text in a .tif document?

Now you’ve got your .tif file, you can finally open Microsoft Office Document Imaging (look in the Programs Menu under Microsoft Office: Microsoft Office Tools).

Open your .tif document
For a more comfortable view, in the View menu deactivate the "Thumbnail Pane", and in the Zoom submenu choose "Page Width".

Now execute the text recognition function, either by clicking “Recognize Text Using OCR” in the Tools menu, or by pressing the 9th button on the toolbar.

Apparently nothing happens, but actually the scanned text has been added to the document, so the first thing to do is to save it (and you’ll see that if you started with a pdf file, the .tif file is now actually much smaller).

The easiest way to see that the file contains the scanned text is to try a text search, and it should work.

Finally, you can get hold of the scanned text as a separate (.htm) document either by clicking “Send Text to Word” in the Tools menu, or by pressing the 10th button on the toolbar. You can choose which folder to save to in the next dialogue, and choose whether or not to include any pictures.
The generated .htm document opens in Word.
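
For anyone without MS Office, the same "image in, text out" step can be sketched with a different tool: the free Tesseract OCR engine, driven from Python. This is a swapped-in technique, not MODI, and it assumes the tesseract command-line program is installed and on your PATH. The function names and the English language default are my own.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, language="eng"):
    """Build the argument list: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", image_path, output_base, "-l", language]


def ocr_to_text(image_path, output_base, language="eng"):
    """Run Tesseract and return the text it wrote to OUTPUTBASE.txt."""
    subprocess.run(
        tesseract_command(image_path, output_base, language),
        check=True,  # raise an error if Tesseract reports a failure
    )
    return Path(output_base + ".txt").read_text(encoding="utf-8")
```

For example, ocr_to_text("letter.tif", "letter") would leave the recognised text in letter.txt and also return it as a string.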

And that’s it – Have fun!
infinisa
Hello again orno

Quote:
If you are dealing with non-confidential documents, you could always try an online OCR service such as WeOCR


Thanks for the suggestion about using the WeOCR Server – it also does just what I need, and it’s pretty good too.

Here are some notes on this service for interested Frihosters (I’m assuming you’ve seen my last post about Microsoft Office Document Imaging):

Advantages compared with Microsoft Office Document Imaging:
- You don’t need to have MS Office installed to use it
- You don’t have to install anything
- It accepts .jpg (and .bmp) files directly

Disadvantages compared with Microsoft Office Document Imaging:
- The quality of the result isn’t as good
- You can’t do a text search on the original document (in .tif format) as you can in Microsoft Office Document Imaging

In any case, it doesn’t accept .pdf files directly, so you’ll need to convert these to .jpg format first.
You can do this using PdfEdit995 as described in my previous post.

Hope this is useful.
infinisa
Hello Animal

Quote:
I believe the latest version of Adobe Reader allows you to copy images as text. The documents would need to be either originally in PDF format, or re-converted to that though.

You’re quite right, Adobe Reader allows you to save the document’s text by clicking “Save as Text” in the File menu.

The problem is, this is only any use if the document was created as a text document; it doesn’t work if the document is just an image (e.g. a fax saved in PDF format). In this case, you need to scan the document for text, which you can do using the methods suggested by orno (Microsoft Office Document Imaging or the WeOCR Server) and discussed in detail in my last two posts.
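
A rough way to guess in advance which kind of PDF you have is to look for font entries in the file, since an image-only PDF normally contains none. The Python sketch below is a crude heuristic with made-up function names, not a reliable test: PDFs that store their page data in compressed object streams can hide the fonts and be wrongly reported as needing OCR.

```python
def pdf_mentions_fonts(pdf_bytes: bytes) -> bool:
    """Crude check: is this PDF data, and does it contain a /Font entry?"""
    return pdf_bytes.startswith(b"%PDF-") and b"/Font" in pdf_bytes


def probably_needs_ocr(path: str) -> bool:
    """Hypothetical helper: True if no font entry is visible in the file."""
    with open(path, "rb") as handle:
        return not pdf_mentions_fonts(handle.read())
```

If probably_needs_ocr returns True, "Save as Text" is unlikely to give you anything useful.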

Thanks anyway.
infinisa
mastertech wrote:
you can find a lot of tools to convert from image or pdf to txt
http://www.freewaregenius.com/2011/11/01/how-to-extract-text-from-images-a-comparison-of-free-ocr-tools/

Hi mastertech

Thanks for this. It seems a very thorough survey. I shall take a good look when I can.

Meanwhile (and it's been almost 4 years since my original post!), I've just started using Tracker Software's PDF-Tools 4.0. I had to pay USD45 for a license, but it paid for itself on the first day.

I needed to scan a bunch of pages to a single PDF file, and was fed up with my scanner's software, which obliges me to scan each page to a separate jpg file (which I then had to convert to PDFs and join into a single PDF file, both steps using PDFEdit995).

With this new software, I do it all in a single step.
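
For anyone who would rather script the old two-step routine, joining separately scanned pages into one PDF can also be sketched in Python. This is not how PDF-Tools works. It assumes the third-party Pillow library is installed, and the function names are hypothetical. The sort key puts page2.jpg before page10.jpg.

```python
import re


def natural_key(name):
    """Sort key that compares the digit runs in a file name as numbers."""
    return [
        int(part) if part.isdigit() else part.lower()
        for part in re.split(r"(\d+)", name)
    ]


def merge_scans_to_pdf(image_paths, pdf_path):
    """Write the images, in natural order, as pages of a single PDF."""
    from PIL import Image  # third-party: pip install Pillow

    ordered = sorted(image_paths, key=natural_key)
    if not ordered:
        raise ValueError("no images to merge")
    # PDF pages are written from RGB images, so convert each scan first.
    pages = [Image.open(path).convert("RGB") for path in ordered]
    pages[0].save(pdf_path, "PDF", save_all=True, append_images=pages[1:])
```

For example: merge_scans_to_pdf(["page1.jpg", "page2.jpg", "page10.jpg"], "scan.pdf").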

It also includes a function that converts from PDF to Word. I've only just tested this, and it seems to do a fair job with plain text (which is the easy part), but gets rather lost with, say, equations. So I'm still looking for the Holy Grail on this one! Maybe I'll find a better (and free!) solution in the review you suggested.
badai
i'm always amazed at the ability of some people to spend hours and hours digging up old topics.

something should be done, especially if the problem has been solved.
sonam
badai wrote:
i'm always amazed at the ability of some people to spend hours and hours digging up old topics.

something should be done, especially if the problem has been solved.


He, he, he... A lot of members don't read the date and just post wherever they find an answer.

Sonam
mikelilin
Quote:
If you have microsoft office installed, you could use the bundled Microsoft Office Document Imaging utility (google it)

If you are dealing with non-confidential documents, you could always try an online OCR service such as WeOCR


Much appreciated, this free online OCR is good to use.
© 2005-2011 Frihost, forums powered by phpBB.