Sep. 06, 2010

Tesseract-OCR boxfile AJAX editor

While working on a side project that uses tesseract-ocr, I ran into a situation where it was extremely cumbersome to train the program for new environments and character sets via the command line. So, I developed a web interface for it called "Tesseract OCR Chopper"; without further ado:

Live test drive: https://pp19dd.com/tesseract-ocr-chopper/ (In beta...)

Live test drive with a preloaded sample image.

* You can use this sample file: https://pp19dd.com/wp-content/uploads/2010/09/sample1.png

Update March 2012: Interface has been updated for version 3 format (added page number to coordinates line.)
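
For reference, here is a minimal sketch of the two boxfile line layouts. It assumes the usual Tesseract column order (character, left, bottom, right, top, then the page number in version 3); the names Box, parse_box_line and format_box_line are made up for illustration and are not part of this tool.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: Optional[int] = None  # None: version 2 line, no page column


def parse_box_line(line: str) -> Box:
    """Parse one boxfile line in either the version 2 or version 3 layout."""
    parts = line.split()
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected boxfile line: {line!r}")
    char = parts[0]
    numbers = [int(p) for p in parts[1:]]
    return Box(char, *numbers)


def format_box_line(box: Box, version: int = 3) -> str:
    """Serialize a box; version 3 appends the page number (0 if unknown)."""
    fields = [box.char, box.left, box.bottom, box.right, box.top]
    if version >= 3:
        fields.append(0 if box.page is None else box.page)
    return " ".join(str(f) for f in fields)
```

A version 2 line parsed this way and written back with version=3 simply gains a trailing 0.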

Not that the program (tesseract-OCR) itself is at fault, but there wasn't a convenient interface for editing its boxfiles. There are Linux tools for the project, but they are a bit difficult (for some) to port to other operating systems. Lastly, I felt that this should be the new era of operating-system agnosticism.

The project included something called "PHP boxfile reader" but I found that the program - while perfectly functional - could use some updating.

Problems with boxfilereader.php (included in tesseract-ocr files):

  • You must generate and provide boxfile to it via command line (can't interface with /usr/bin/tesseract)
  • Limited to GD-supported file formats
  • Interface clumsy - can't make box adjustments, results are spit out as an alert()
  • In short, too much manual labor; still, better than nothing

As for the overall process of training: the input files had to be in uncompressed TIFF unless you could easily compile libtiff into it. And then, you'd still have to get input files into TIFF format. I'll let Google search results for "tesseract libtiff" speak for themselves.
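
The manual routine described above (convert the image to uncompressed TIFF, then generate a boxfile on the command line) can be sketched as two commands. This is only an illustration: it assumes ImageMagick's convert and a Tesseract 3-style "batch.nochop makebox" invocation, and the helper name build_commands is hypothetical. The exact parameters used by the live site may differ.

```python
import shlex
from pathlib import Path


def build_commands(upload, workdir="."):
    """Return two argv lists: ImageMagick conversion, then Tesseract box generation.

    Nothing is executed here; the lists can be handed to subprocess.run().
    """
    upload = Path(upload)
    base = Path(workdir) / upload.stem       # output base name without extension
    tiff = str(base) + ".tif"                # appended, so dotted names survive
    convert_cmd = ["convert", str(upload), "-compress", "None", tiff]
    makebox_cmd = ["tesseract", tiff, str(base), "batch.nochop", "makebox"]
    return convert_cmd, makebox_cmd


if __name__ == "__main__":
    for cmd in build_commands("sample1.png"):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Running each list with subprocess.run(cmd, check=True) would require both programs to be installed and on the PATH.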

My Version of the Program

So, I coupled imagemagick with some fancy web 2.0 and AJAX (I thought the 2.0 meme went out of fashion, but, apparently it didn't.)  The end result is something that sounds like a biker club: "Tesseract OCR Chopper."

The immediate improvements are mostly interface based (GUI and process):

  • It can interface directly with /usr/bin/tesseract and generate the boxfile
  • Character boxes can be click-dragged into position - or resized
  • Via sticky-select, you can select multiple characters quickly and hit the DEL key
  • All coordinate updates are instantly reflected in the boxfile (right-hand side)
  • Can upload files in any ImageMagick-supported format (TIF, PNG, JPG, etc.)
  • When you either drag or click a box, keyboard focus moves to it immediately so you can retype (correct) its character
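
One detail behind the drag-and-update behavior: boxfile coordinates are measured from the bottom-left corner of the image, while a browser positions elements from the top-left. A rough sketch of that conversion follows. The function names are hypothetical, and whether box edges are treated as inclusive or exclusive is an assumption here that can shift results by one pixel.

```python
def screen_to_box(x, y, width, height, image_height):
    """Convert a top-left-origin rectangle into boxfile (left, bottom, right, top)."""
    left = x
    right = x + width
    top = image_height - y                # distance of the upper edge from the image bottom
    bottom = image_height - (y + height)  # distance of the lower edge from the image bottom
    return left, bottom, right, top


def box_to_screen(left, bottom, right, top, image_height):
    """Inverse of screen_to_box: returns (x, y, width, height) from the top-left."""
    return left, image_height - top, right - left, top - bottom
```

For example, on a 100-pixel-tall image a 30x40 rectangle dragged to (10, 20) corresponds to the boxfile values 10 40 40 80.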

Bad quality, but, you can see the dynamic nature of the program in this video:

To be honest, there are some problems with this setup, namely performance with larger character sets. I ran this program with the included tesseract English OCR sample files and, unless you're using Chrome, the interface was slow and would occasionally choke the JavaScript interpreter. On my machine I found that 1,000 characters was a good performance limit, so this demo is soft-limited to that amount. Future versions (yes, I will continue development on this) will likely have a way to paginate between thousands of characters.
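
As a rough idea of what that pagination could look like, here is a small sketch that splits a list of boxes into pages of 1,000 and labels them 1-1000, 1001-2000, and so on. It is a guess at one possible approach rather than the planned implementation, and the names paginate and page_label are hypothetical.

```python
def paginate(boxes, page_size=1000):
    """Split a list of boxes into consecutive pages of at most page_size items."""
    return [boxes[i:i + page_size] for i in range(0, len(boxes), page_size)]


def page_label(index, page_size, total):
    """Human-readable range for the page at a zero-based index, e.g. '1001-2000'."""
    first = index * page_size + 1
    last = min((index + 1) * page_size, total)
    return f"{first}-{last}"
```

With 2,500 characters this yields three pages, the last one labeled 2001-2500, so the interface would only ever need to draw one page of boxes at a time.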

Firefox held its own but until the new release comes out it will be much slower in comparison. Having said that, I've only tested this program with Firefox.

But, don't let me tell you that. Give the program a live test drive. If you need a sample file to upload, use this: https://pp19dd.com/wp-content/uploads/2010/09/sample1.png

I'd love some feedback on this; tentatively my feature list will go along these lines:

  • Pagination between larger character sets (1-1000, 1001-2000, etc)
  • Switching the interface to the HTML5 <canvas> element
  • Ability to actually chop the image and explode the characters; tesseract is limited in that it requires nicely spaced characters to train its algorithms for a new font set
  • Ability to remote-scrape website images (instead of just uploading images)
  • Creating a wizard for the entire train process

36 responses to “Tesseract-OCR boxfile AJAX editor”

  1. sriranga(77yrsold) says:

    whether it is supported for Indic (Indian lang) I am interested to use for Kannada script

  2. Dino says:

    As far as I know, it should be supported - the boxfile editor should be using UTF-8. Training the OCR engine, on the other hand, is a separate problem.

  3. Mike says:

    Amazing utility! I stumbled upon this via a youtube video looking for some shots of tesseract in action. It looks very cool, and offered as a web service makes it an enticing platform independent tool. Good luck on the progress!

  4. Matt says:

    this is so coooooooolll...

    thanx , saved me a lot of time !!!

  5. Melissa says:

    Exactly what I have been looking for! Segmentation of the image text is the biggest problem Tesseract has if you could figure out how to do that with Image Magick you would have 90% of the project complete. With that said that is what I am attempting to do today.

  6. Robert says:

    Seems nice! I tried a few other box editors, but this one seems very easy! Especially the placing of the 'boxes'.

    Thanks.

  7. Robert says:

    Hello,

    I think there is a 0 (zero) missing as a last column (see http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3) that is if you want to use it as a version 3 box file.

    Cheers,
    Robert

  8. Dino says:

    @ Robert: thanks for the heads up. I've updated the last column as an output option.

  9. Derek says:

    Hey -- this looks like a great tool, but I'm trying to train Tesseract for a language it doesn't yet support (Georgian), and so I really need the ability to use my preexisting trained data files, otherwise I have to start from scratch every time I perform training. Do you have the source available somewhere so that I can run a local copy? Thanks!

  10. Dino says:

    @Derek, the web interface that I provide is written in JavaScript; though I didn't package the code to be downloaded, its sources can be copied and pasted (they are not minified.) The only invisible part is a file upload mechanism and transformation to TIFF.

    Though that small snippet of server-side code is unavailable, it's also fairly useless for the tesseract exercise. If you view log for any of the scans (blue button, top right), there are sections that show verbatim command line parameters that go to both imagemagick and tesseract. Hope this helps.

  11. Siyad says:

    Please give me traineddata for the font 'Brush Script MT'.

  12. Siyad M says:

    Can i upload pdf file and convert into tiff?

  13. Dino says:

    To be honest, I didn't know until you asked -- and the answer in this setup appears to be no. I'm sure you can automate ripping individual PDF pages and converting them to another format for tesseract-ocr, but not through this web page.

  14. Jang Dong Gun says:

    I would like to see a offline version, can you make a offline version ?

    I think online use regularly may abuse server resource.

  15. Dino says:

    @Jang Dong Gun - There are a number of offline versions of the boxfile editors out there, found on the tesseract-ocr wiki page. This one is a proof of concept more than anything, but, 99% of its code is in JavaScript, which is visible to everyone. The 1% is simply a conversion routine from images to TIFF via imagemagick.

  16. Pierre says:

    Many thanks for this tool! We're preparing a typographic workshop in Seoul that will use Tesseract, and it will be super useful! See http://typojanchi.org/2013/program_en/#contents

  17. Dino says:

    @Pierre: thanks for the feedback and thanks for sharing; glad you got some use out of this.

  18. peiman says:

    hello
    can you publish your box file?
    i see your file worked better than the official provided file!!

    regards

  19. Kelvin says:

    Amazing tool! But I'm having trouble locating the converted image and box file. Could someone please guide me?

  20. Chris says:

    Looks promising. How do I save the results of the editing? How do I get a traineddata file out of this?
    Thanks!!

  21. Dino says:

    @Chris, this is just one step in the chain. The boxfile readout on the right is what you'd dump into a file and proceed with.

  22. Daniel says:

    I think that the width of the bounding boxes of the characters is correct but the left upper point is off by 1 pixel.
    Tested it for saving Tesseract 2.0 format

  23. Roger says:

    Why doesn't anybody tell me what to do with the box file once it is created? The google training document is horrendously complicated. All I want to do is get tesseract to be more accurate. How about a dummy guide for casual users? Anyone?

  24. James Battersby says:

    Hi,

    I've been trying to upload a TIFF to this but I keep getting an empty response? Any ideas? Tried on Macbook/Chrome and Windows 7/Firefox.

  25. Jake Drew says:

    Roger,
    To answer your question: "Why doesn't anybody tell me what to do with the box file once it is created?"

    See the "Make Box Files" here:

    https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

  26. Nel says:

    Hi Dino,
    I'm developing a App that uses tessaract.
    Maybe you can help with some tips.

    You said that yours tessaract-OCR doesn't work very well with large pictures, but is because the training or is because the tessaract works worse with large pics (Like 13mp) ?

  27. Nel says:

    Hi Dino,
    I'm developing a App that uses tesseract.
    Maybe you can help with some tips.

    You said that yours tesseract-OCR doesn't work very well with large pictures, but is because the training or is because the tesseract works worse with large pics (Like 13mp) ?

  28. satheesh kumar selvaraj says:

    could any one please tell me once i edited the box files how can i use these box files in tesseract

  29. Dino says:

    @Nel: sorry, what I meant to say was that my training interface doesn't work well with very large images or dense characters, since it's web-based and inefficient (until I get off my lazy behind and make that version 2 I talked about years ago). Tesseract, on the other hand, should work fine.

  30. Dino says:

    @Satheesh: like Jake Drew mentioned above, see "Make Box Files" at https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

  31. Nel says:

    @Dino , thanks for the reply!
    I'm using your interface to train my tesseract, maybe you can give me some tips.
    - Is it better to use the same character with different sizes/densities?
    - For my app, some pictures can put some noise on the paper; is it better to account for this in training, or is it better that I train tesseract without any noise?

    @Satheesh: I hope i'm not responding too late.
    https://code.google.com/p/tesseract-ocr/wiki/AddOns#Tesseract_box_editors_and_traning_tools

    On this page you have several tools to train tesseract.
    In my case, i use this cycle:
    -> Pass the image on this site [Thanks Dino!]
    -> Create the box in command line [See the link about trainning that Dino gives you]
    -> Create the .tessdata with Serak tesseract Trainer V0.4.

    You can follow this tutorial:
    http://peepswrite.wordpress.com/2013/05/26/training-tesseract-3-02/

    Once again, thanks Dino!

  32. Prabu says:

    My tesseract engine(3.2) is unable to read my image(no output) whereas while uploading in yours gives me better output(70-80%). You have mentioned the usage of image magic scripts, could you please tell me the script that you are using. thanks in advance:)

  33. Dino says:

    @Prabu - if you click the view log button and switch to the imagemagick tab, it'll show the exact command lines used for image processing. I'm using ImageMagick 6.5.4-7 and LIBTIFF version 3.9.4.

  34. Federico says:

    "Live test drive with a preloaded sample image." is not working anymore.

    @Dino
    I'm using your chopper 😛 how do i save the result?

  35. Frank says:

    Thanks a lot for the tools, I'd like to ask you, in this web app, are you running a trained version of Tesseract or an off the shelves untrained version? And is the original c++ implementation or the tesseract.js version?
    Thanks in advance,
    Frank

  36. Ian says:

    Hi Dino,

    Thanks for making this awesome tool. When I upload .tif images, in all cases either:
    a) After I click upload image (I am uploading .tif files), it appears to run successfully but no image appears.
    b) The loaded image appears but no box file is generated.

    I am using Google Chrome + a Mac (if that helps).

    Thanks,
    Ian

Posts on this blog solely represent my personal opinions and technical experience.

© 2009-2017 Edin (Dino) Beslagic