While working on a side project that uses tesseract-ocr, I found it extremely cumbersome to train the program for new environments and character sets from the command line. So I developed a web interface for it called "Tesseract OCR Chopper". Without further ado:
Live test drive: https://pp19dd.com/tesseract-ocr-chopper/ (In beta...)
Live test drive with a preloaded sample image.
* You can use this sample file: https://pp19dd.com/wp-content/uploads/2010/09/sample1.png
Update, March 2012: The interface has been updated for the version 3 boxfile format (a page number was added to each coordinates line).
The program (tesseract-ocr) itself is not at fault, but there wasn't a convenient interface for editing its boxfiles. There are Linux tools for the project, but they are a bit difficult (for some) to port to other operating systems. Lastly, I felt that this should be the new era of operating-system agnosticism.
The project included something called "PHP boxfile reader", but I found that the program, while perfectly functional, could use some updating.
Problems with boxfilereader.php (included in tesseract-ocr files):
- You must generate the boxfile yourself on the command line and provide it to the script (it can't interface with /usr/bin/tesseract)
- Limited to GD-supported file formats
- The interface is clumsy: you can't make box adjustments, and results are spit out in an alert()
- In short, too much manual labor; still, better than nothing
As for the overall training process: the input files had to be uncompressed TIFFs unless you could easily compile libtiff into tesseract. And even then, you'd still have to get your input files into TIFF format. I'll let the Google search results for "tesseract libtiff" speak for themselves.
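As a rough sketch of the two command-line steps involved, assuming ImageMagick's `convert` and a Tesseract 3 `tesseract` binary are on the PATH. The helper names and file names are made up for illustration, and this is not the code the web interface actually runs:

```python
import subprocess
from pathlib import Path


def tiff_command(source: str, tiff: str) -> list[str]:
    """ImageMagick call that writes an uncompressed TIFF."""
    return ["convert", source, "-compress", "none", tiff]


def makebox_command(tiff: str) -> list[str]:
    """Tesseract 3 call that writes <base>.box for the given TIFF."""
    base = str(Path(tiff).with_suffix(""))
    return ["tesseract", tiff, base, "batch.nochop", "makebox"]


def generate_boxfile(source: str) -> Path:
    """Convert `source` to TIFF, run tesseract on it, return the boxfile path."""
    tiff = str(Path(source).with_suffix(".tif"))
    subprocess.run(tiff_command(source, tiff), check=True)
    subprocess.run(makebox_command(tiff), check=True)
    return Path(tiff).with_suffix(".box")
```

The command builders are kept separate from `generate_boxfile` so the exact arguments can be inspected (or adjusted for a different install) without running anything.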
My Version of the Program
So I coupled ImageMagick with some fancy web 2.0 and AJAX (I thought the 2.0 meme had gone out of fashion, but apparently it hasn't). The end result is something that sounds like a biker club: "Tesseract OCR Chopper."
The immediate improvements are mostly interface-based (GUI and process):
- It can interface directly with /usr/bin/tesseract and generate the boxfile
- Character boxes can be click-dragged into position or resized
- Via sticky-select, you can select multiple characters quickly and hit the DEL key
- All coordinate updates are instantly reflected in the boxfile (right-hand side)
- You can upload files in any format ImageMagick supports (TIF, PNG, JPG, etc.)
- When you either drag or click a box, the keyboard is instantly focused on it so you can retype (correct) its character
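To make the boxfile side of this concrete, here is a minimal sketch of what a drag has to do to one boxfile line. It assumes the Tesseract 3 line layout of `character left bottom right top page`, with coordinates measured from the bottom-left corner of the image. The `Box`, `parse_line` and `drag` names are illustrative only; the actual interface does this in the browser.

```python
from dataclasses import dataclass


@dataclass
class Box:
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int = 0

    def to_line(self) -> str:
        """Serialize back to a version 3 boxfile line."""
        return f"{self.char} {self.left} {self.bottom} {self.right} {self.top} {self.page}"


def parse_line(line: str) -> Box:
    """Parse one boxfile line; older lines without a page number get page 0."""
    parts = line.split()
    numbers = [int(n) for n in parts[1:]]
    page = numbers[4] if len(numbers) > 4 else 0
    return Box(parts[0], numbers[0], numbers[1], numbers[2], numbers[3], page)


def drag(box: Box, dx: int, dy: int) -> Box:
    """Move a box by a mouse drag given in screen pixels (y grows downward).

    Boxfile y coordinates grow upward from the bottom of the image,
    so the vertical offset is subtracted rather than added.
    """
    return Box(box.char, box.left + dx, box.bottom - dy,
               box.right + dx, box.top - dy, box.page)
```

For example, `drag(parse_line("T 10 20 30 60 0"), 5, 3).to_line()` gives `"T 15 17 35 57 0"`: the box moves 5 pixels right and 3 pixels down on screen.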
The quality is bad, but you can see the dynamic nature of the program in this video:
Firefox held its own, but until the new release comes out it will be much slower in comparison. Having said that, I've only tested this program with Firefox.
But don't take my word for it. Give the program a live test drive. If you need a sample file to upload, use this:
- Sample file: https://pp19dd.com/wp-content/uploads/2010/09/sample1.png
- Live test drive: https://pp19dd.com/tesseract-ocr-chopper/
- Live test drive with a preloaded sample image.
I'd love some feedback on this. Tentatively, my feature list goes along these lines:
- Pagination between larger character sets (1-1000, 1001-2000, etc)
- Switching the interface to the HTML5 <canvas> element
- Ability to actually chop the image and explode the characters; tesseract is limited in that it requires nicely spaced characters to train its algorithms for a new font set
- Ability to remote-scrape website images (instead of just uploading images)
- Creating a wizard for the entire training process
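Two of the items above, pagination and chopping, mostly come down to arithmetic, so here is a hedged sketch of how they might look. The function names are hypothetical and none of this is implemented yet. The crop conversion assumes boxfile coordinates are measured from the bottom-left of the image, while most image tools crop from the top-left.

```python
def page_ranges(total: int, page_size: int = 1000) -> list[tuple[int, int]]:
    """1-based, inclusive (first, last) box numbers for each page."""
    return [(start, min(start + page_size - 1, total))
            for start in range(1, total + 1, page_size)]


def crop_rect(left: int, bottom: int, right: int, top: int,
              image_height: int) -> tuple[int, int, int, int]:
    """Convert boxfile coordinates to an (x, y, width, height) crop
    rectangle whose origin is the top-left corner of the image."""
    return (left, image_height - top, right - left, top - bottom)
```

With 2,500 boxes, `page_ranges(2500)` yields `[(1, 1000), (1001, 2000), (2001, 2500)]`, and a box of `10 20 30 60` on a 100-pixel-tall image becomes the crop rectangle `(10, 40, 20, 40)`.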