OCR from JPG
- reinaldocrespo
- Posts: 918
- Joined: Thu Nov 17, 2005 5:49 pm
- Location: Fort Lauderdale, FL
OCR from JPG
Hello Everyone;
I've been using the free, open source Tesseract libraries to perform OCR on JPG documents. Typically there is a freestanding scanner somewhere in the facility into which people feed tons of documents. The scanner swallows these documents like the Cookie Monster and saves them as .jpg files in a directory that an FWH program reads from. Using Tesseract, the FWH app reads each .jpg, performs OCR, and detects an "Encounter" number printed somewhere on the document; this number is the unique key to a record in a .dbf table. The "Encounter" number is detected using a regular expression. Then both the .jpg and the OCR'ed text are saved to memo fields on the corresponding record.
All this works pretty well, except in about 10% of the documents, where Tesseract doesn't OCR with the needed precision and the "Encounter" number can't be completely extracted. I've tried the Tesseract training tools with some success, but there is still a percentage of documents where the OCR isn't good enough for no apparent reason.
Is there anyone here using any other OCR library with better results?
I don't need a scanning library. I already have that, and it isn't needed for this use: the scanner "knows" how to scan into .jpg and save to a directory. What I need is to be able to read a .jpg and convert it to text with a higher degree of accuracy.
Oh, and one last thing: the font on these documents isn't always the same, and I have no way of controlling that.
Thank you,
Reinaldo.
- Antonio Linares
- Site Admin
- Posts: 37481
- Joined: Thu Oct 06, 2005 5:47 pm
- Location: Spain
Re: OCR from JPG
Reinaldo,
> This "Encounter" number is detected using a regular expression
Which regular expression do you use?
Could you provide a working example so we can test it here?
- reinaldocrespo
- Posts: 918
- Joined: Thu Nov 17, 2005 5:49 pm
- Location: Fort Lauderdale, FL
Re: OCR from JPG
Antonio;
I'm not sure the regular expression will make any difference. In general, the encounter number always begins with "EN", followed by the last two digits of the year (currently "16"), then a dash "-", and then up to eight digits.
It is possible for older documents from 2015 to be scanned, so my regular expression provides for that. Sometimes transcriptionists type a lowercase letter L or an uppercase letter i instead of the number 1 because it is faster to type, so my regular expression provides for that. Sometimes they leave one or more blank spaces instead of the dash, or right after "EN", so my regular expression provides for that too. In the end my regular expression is a little hard to follow because of these exceptions, but without them the regex would look like this:
Code: Select all
"EN16-[0-9]{8}"
Here is the code that finds the encounter number in the OCR'ed text, which is stored in the variable cText, using my regular expression that provides for the exceptions mentioned above:
Code: Select all
LOCAL aParsed := hb_regExAll( "EN ?[*]?[1,l,I,\]] ?[5,S,6]-?(.)*([0-9, ])* ", cText,,,,.t. )
LOCAL cLogFile := "UnIndexedReports_" + STRTRAN( DTOC( Date() ), "/", "" ) + ".log"
//if no regex matching elements then abort.
IF EMPTY( aParsed ) .OR. LEN( IFNIL( aParsed, {} ) ) != 1
   SaveAsUnIndexed( cFileName, cText, "Can't find any legit reg-ex encounter pattern on OCRed text" )
   Logfile( cLogFile, { "Can't find any reg-ex encounter pattern on OCRed text", cFileName, cText } )
   RETURN .F.
ENDIF
My regular expression will recognize an Encounter number in any of these:
EN1S-00023382 (here someone typed S instead of 5)
EN15  00000001 (here they leave two blank spaces instead of a dash)
EN16 00000001 (here they leave one blank space instead of a dash)
EN 16 00000001 (here they leave a blank space after EN and another after 16 and no dash)
ENl6 00000001 (here they typed the lower case letter L instead of number 1)
ENlS 00000001
EN16.00000001
These are things transcriptionists do to type faster or just because....
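As an illustration of the tolerant matching described above, here is a minimal sketch in C using POSIX `regex.h`. It is not the regular expression from this post: the pattern is a simplified stand-in, and the function name `find_encounter` and the canonical `ENyy-digits` output form are assumptions made for the example. It accepts l or I typed for 1, S typed for 5, and spaces or a dot in place of the dash, then writes the number back in a normalized form.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Finds the first encounter number in OCR text and writes it to out in the
   form "ENyy-digits". Returns 1 on success, 0 if nothing matches.
   Hypothetical helper; the pattern is a simplified assumption. */
int find_encounter(const char *text, char *out, size_t out_size)
{
    /* group 1: first year digit (1, or l / I typed in its place)
       group 2: second year digit (a digit, or S typed for 5)
       group 3: one to eight digits after optional spaces, dot or dash */
    static const char *pattern =
        "EN *([1lI]) ?([0-9S])[ .-]*([0-9]{1,8})";
    regex_t re;
    regmatch_t m[4];
    int found = 0;
    int written;
    int ndigits;
    char y2;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;

    if (regexec(&re, text, 4, m, 0) == 0) {
        y2 = text[m[2].rm_so];
        if (y2 == 'S')
            y2 = '5';
        ndigits = (int)(m[3].rm_eo - m[3].rm_so);
        /* group 1 only ever matches 1, l or I, so it is written as 1 */
        written = snprintf(out, out_size, "EN1%c-%.*s",
                           y2, ndigits, text + m[3].rm_so);
        found = (written > 0 && (size_t)written < out_size);
    }
    regfree(&re);
    return found;
}
```

With this sketch, `find_encounter("EN 16 00000001", buf, sizeof buf)` fills `buf` with `EN16-00000001`, while text where the digits themselves were lost during OCR produces no match at all.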
Using regular expressions to find a pattern is really easy and takes very little programming. But that is not where I need help. My question is about OCR engines that we can use with FWH other than Tesseract. I'm actually hoping someone can share another OCR library as I really wish I didn't have to pay a lot of money for it.
Reinaldo.
- reinaldocrespo
- Posts: 918
- Joined: Thu Nov 17, 2005 5:49 pm
- Location: Fort Lauderdale, FL
Re: OCR from JPG
...oh and in case the question was how to OCR using Tesseract, then here is that code:
Code: Select all
...
handle := TessBaseAPICreate() //Using Tesseract to OCR image
IF TessBaseAPIInit3( handle, NIL, "eng" ) != 0 //abort if the English traineddata file can't be found locally.
   Logfile( "Trace.log", { "Tesseract OCR engine can't be found.", handle } )
   TessBaseAPIEnd( handle )
   TessBaseAPIDelete( handle )
   handle := NIL
   RETURN NIL
ENDIF
...
img := pixRead( cFile )
TessBaseAPISetImage2( handle, img )
nRet := TessBaseAPIRecognize( handle, Nil )
cText := STRTRAN( TessBaseAPIGetUTF8Text( handle ), CHR( 10 ), CRLF )
...
#pragma BEGINDUMP
#include <hbapi.h>
#include <capi.h>
#include <allheaders.h>

HB_FUNC( PIXREAD ){
   PIX * pPix = pixRead( hb_parc( 1 ) );
   hb_retptr( pPix );
}

HB_FUNC( TESSVERSION ){
   const char * zsTest = TessVersion();
   hb_retc( zsTest );
}

HB_FUNC( TESSBASEAPICREATE ) {
   TessBaseAPI * pHandle = TessBaseAPICreate();
   hb_retptr( pHandle );
}

HB_FUNC( TESSBASEAPIINIT3 ){
   hb_retni( TessBaseAPIInit3( ( TessBaseAPI * ) hb_parptr( 1 ), hb_parc( 2 ), hb_parc( 3 ) ) );
}

HB_FUNC( TESSBASEAPISETIMAGE2 ){
   TessBaseAPISetImage2( ( TessBaseAPI * ) hb_parptr( 1 ), ( const PIX * ) hb_parptr( 2 ) );
}

HB_FUNC( TESSBASEAPIRECOGNIZE ){
   hb_retni( TessBaseAPIRecognize( ( TessBaseAPI * ) hb_parptr( 1 ), ( ETEXT_DESC * ) hb_parptr( 2 ) ) );
}

HB_FUNC( TESSBASEAPIGETUTF8TEXT ){
   /* Tesseract allocates this buffer; hb_retc() copies it into a Harbour
      string, so the original is released here to avoid a leak. */
   char * pText = TessBaseAPIGetUTF8Text( ( TessBaseAPI * ) hb_parptr( 1 ) );
   if( pText )
   {
      hb_retc( pText );
      TessDeleteText( pText );
   }
   else
      hb_retc( "" );
}

HB_FUNC( TESSDELETETEXT ){
   /* Intentionally does nothing. hb_parc() points to a Harbour-owned buffer,
      which must never be passed to TessDeleteText(). The Tesseract buffer is
      already released in TESSBASEAPIGETUTF8TEXT above; this function is kept
      only so existing callers still link. */
}

HB_FUNC( TESSBASEAPIEND ){
   TessBaseAPIEnd( ( TessBaseAPI * ) hb_parptr( 1 ) );
}

HB_FUNC( TESSBASEAPIDELETE ){
   TessBaseAPIDelete( ( TessBaseAPI * ) hb_parptr( 1 ) );
}

HB_FUNC( PIXDESTROY ){
   PIX * pPix = ( PIX * ) hb_parptr( 1 );
   pixDestroy( &pPix );
}
#pragma ENDDUMP
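One point worth double-checking in wrappers like these is buffer ownership: `TessBaseAPIGetUTF8Text()` returns a buffer that the caller is expected to release with `TessDeleteText()`, while `hb_retc()` copies the string into Harbour's own memory. The sketch below shows the copy-then-release pattern in plain C. It does not call Tesseract or Harbour; `lib_get_text`, `lib_delete_text`, `wrap_get_text` and `live_buffers` are hypothetical stand-ins used only to make the idea testable.

```c
#include <stdlib.h>
#include <string.h>

/* Counts stand-in library buffers that have not been released yet. */
static int live_buffers = 0;

/* Stand-in for a library call that returns a buffer the caller must
   release with the library's own delete function. */
char *lib_get_text(void)
{
    const char *s = "EN16-00000001\n";
    char *p = malloc(strlen(s) + 1);
    if (p) {
        strcpy(p, s);
        live_buffers++;
    }
    return p;
}

/* Stand-in for the library's matching release function. */
void lib_delete_text(char *p)
{
    if (p) {
        free(p);
        live_buffers--;
    }
}

/* Wrapper: copy the text into caller-owned storage, then release the
   library buffer right away. Returns the number of bytes copied. */
size_t wrap_get_text(char *dst, size_t dst_size)
{
    char *src = lib_get_text();
    size_t n = 0;

    if (dst_size == 0) {
        lib_delete_text(src);
        return 0;
    }
    if (src) {
        n = strlen(src);
        if (n >= dst_size)
            n = dst_size - 1;   /* truncate to fit the destination */
        memcpy(dst, src, n);
    }
    dst[n] = '\0';
    lib_delete_text(src);       /* only the library's own pointer is freed */
    return n;
}
```

The design point is that the release function only ever sees the pointer the library handed out, never the caller's copy.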
- Antonio Linares
- Site Admin
- Posts: 37481
- Joined: Thu Oct 06, 2005 5:47 pm
- Location: Spain
Re: OCR from JPG
Reinaldo,
> My regular expression will recognize an Encounter number in any of these:
> EN1S-00023382 (here someone typed S instead of 5)
> EN15  00000001 (here they leave two blank spaces instead of a dash)
> EN16 00000001 (here they leave one blank space instead of a dash)
> EN 16 00000001 (here they leave a blank space after EN and another after 16 and no dash)
> ENl6 00000001 (here they typed the lower case letter L instead of number 1)
> ENlS 00000001
> EN16.00000001
For the ones that failed, which format was used?
- reinaldocrespo
- Posts: 918
- Joined: Thu Nov 17, 2005 5:49 pm
- Location: Fort Lauderdale, FL
Re: OCR from JPG
The format does not matter. The problem is the OCR engine. For example:
When the text is EN16-00000001, it might come back as garbage:
EN16 ©#µ
No regular expression will understand that, because some of the original text on the .jpg was not correctly converted into text.
But, again, that only happens about 10% of the time. It has nothing to do with the regular expression. It has everything to do with the OCR engine. I'm afraid I'm not getting my point across. I'm looking for ways to accurately convert text on a .jpg file into plain text. I'm currently doing it with Tesseract and I'm looking for alternatives.
Reinaldo.
- reinaldocrespo
- Posts: 918
- Joined: Thu Nov 17, 2005 5:49 pm
- Location: Fort Lauderdale, FL
Re: OCR from JPG
Hello Antonio;
By comparing Tesseract's output from the command line against the output from the API, I was able to see that the command line results are much better. After careful examination I learned that the default page segmentation mode (PSM) for tesseract from the command line is 3, while from the API it is 6. For the type of documents I'm processing, a PSM of 3 yields much better results. However, when I change the page segmentation mode through the API, the program freezes in TessBaseAPIRecognize() (see the comments in the code).
To build with Harbour you will need the Tesseract and Leptonica DLLs, and you must build the .lib files to link with your program. You can download Tesseract from here: https://sourceforge.net/projects/tesseract-ocr/
Here is my code, followed by the wrapper functions it uses:
Code: Select all
handle := TessBaseAPICreate()
//abort if the English traineddata file can't be found locally.
IF TessBaseAPIInit3( handle, NIL, "eng" ) != 0
   TessBaseAPIDelete( handle ) //release the handle before aborting
   RETURN NIL
ENDIF
...
//page segmentation mode can be set via API call TessBaseAPISetPageSegMode(), or by
//setting variable "tessedit_pageseg_mode", or by reading from config file. Possible values include:
//1 - Automatic page segmentation with OSD
//3 - Fully automatic page segmentation, but no OSD (command line default)
//6 - Assume a single uniform block of text (API default)
//TessBaseAPIReadConfigFile( handle, "tessapi_config" )
//TessBaseAPISetVariable( handle, "tessedit_pageseg_mode", "3" )
TessBaseAPISetPageSegMode( handle, 3 )
//print all tesseract ocr engine internal variables to file tesseract.log on cur dir.
IF lDebug ; TessBaseAPIPrintVariablesToFile( handle, "tesseract.log" ) ; ENDIF
//Open input image with leptonica library API pixRead
IF lDebug ; logfile( "trace.log", { "pixread file", cfile } ) ; ENDIF
img := pixRead( ALLTRIM( cPath ) + cFile )
IF lDebug ; logfile( "trace.log", { "TessBaseAPISetImage2", cfile } ) ; ENDIF
TessBaseAPISetImage2( handle, img )
//Recognize is called from GetUTF8Text but it doesn't hurt to call before and
//makes debugging easier. Program freezes when executing TessBaseAPIRecognize() only
//when PageSegMode is changed above.
IF lDebug ; logfile( "trace.log", { "TessBaseAPIRecognize ", cfile } ) ; ENDIF
//program freezes here but only when pageSeg_Mode is changed.
IF TessBaseAPIRecognize( handle, Nil ) <> 0 ; LOOP ; ENDIF
//if TessBaseAPIRecognize above is commented then program will freeze when executing
//TessBaseAPIGetUTF8Text(). Recognize is called internally from GetUTF8Text so we know the
//problem is at Recognize.
IF lDebug ; logfile( "trace.log", { "TessBaseAPIGetUTF8Text", cfile } ) ; ENDIF
cText := STRTRAN( TessBaseAPIGetUTF8Text( handle ), CHR( 10 ), CRLF )
...
//cText is a Harbour string (a copy), so it must not be passed to TessDeleteText().
//The buffer returned by Tesseract has to be released inside the C wrapper instead.
pixDestroy( img )
...
TessBaseAPIEnd( handle )
TessBaseAPIDelete( handle )
Code: Select all
//-------------------------------------------------------------------------------------
/**/
#pragma BEGINDUMP
#include <hbapi.h>
#include <capi.h>
#include <allheaders.h>

HB_FUNC( PIXREAD ){
   PIX * pPix = pixRead( hb_parc( 1 ) );
   hb_retptr( pPix );
}

HB_FUNC( TESSBASEAPISETPAGESEGMODE ){
   TessBaseAPISetPageSegMode( ( TessBaseAPI * ) hb_parptr( 1 ), hb_parni( 2 ) );
}

HB_FUNC( TESSBASEAPIGETPAGESEGMODE ){
   hb_retni( TessBaseAPIGetPageSegMode( ( TessBaseAPI * ) hb_parptr( 1 ) ) );
}

HB_FUNC( TESSBASEAPISETVARIABLE ) {
   TessBaseAPISetVariable( ( TessBaseAPI * ) hb_parptr( 1 ), hb_parc( 2 ), hb_parc( 3 ) );
}

HB_FUNC( TESSBASEAPIPRINTVARIABLESTOFILE ) {
   TessBaseAPIPrintVariablesToFile( ( TessBaseAPI * ) hb_parptr( 1 ), hb_parc( 2 ) );
}

HB_FUNC( TESSBASEAPIREADCONFIGFILE ) {
   TessBaseAPIReadConfigFile( ( TessBaseAPI * ) hb_parptr( 1 ), hb_parc( 2 ) );
}

HB_FUNC( TESSBASEAPIGETHOCRTEXT ) {
   /* Same ownership rule as GetUTF8Text: copy with hb_retc(), then release. */
   char * pText = TessBaseAPIGetHOCRText( ( TessBaseAPI * ) hb_parptr( 1 ), 0 );
   if( pText )
   {
      hb_retc( pText );
      TessDeleteText( pText );
   }
   else
      hb_retc( "" );
}

HB_FUNC( TESSVERSION ){
   const char * zsTest = TessVersion();
   hb_retc( zsTest );
}

HB_FUNC( TESSBASEAPICREATE ) {
   /* TessBaseAPICreate() takes no arguments. */
   TessBaseAPI * pHandle = TessBaseAPICreate();
   hb_retptr( pHandle );
}

HB_FUNC( TESSBASEAPIINIT3 ){
   hb_retni( TessBaseAPIInit3( ( TessBaseAPI * ) hb_parptr( 1 ), hb_parc( 2 ), hb_parc( 3 ) ) );
}

HB_FUNC( TESSBASEAPISETIMAGE2 ){
   TessBaseAPISetImage2( ( TessBaseAPI * ) hb_parptr( 1 ), ( const PIX * ) hb_parptr( 2 ) );
}

HB_FUNC( TESSBASEAPIRECOGNIZE ){
   hb_retni( TessBaseAPIRecognize( ( TessBaseAPI * ) hb_parptr( 1 ), ( ETEXT_DESC * ) hb_parptr( 2 ) ) );
}

HB_FUNC( TESSBASEAPIGETUTF8TEXT ){
   /* Tesseract allocates this buffer; hb_retc() copies it into a Harbour
      string, so the original is released here to avoid a leak. */
   char * pText = TessBaseAPIGetUTF8Text( ( TessBaseAPI * ) hb_parptr( 1 ) );
   if( pText )
   {
      hb_retc( pText );
      TessDeleteText( pText );
   }
   else
      hb_retc( "" );
}

HB_FUNC( TESSDELETETEXT ){
   /* Intentionally does nothing. hb_parc() points to a Harbour-owned buffer,
      which must never be passed to TessDeleteText(). The Tesseract buffer is
      already released in TESSBASEAPIGETUTF8TEXT above; this function is kept
      only so existing callers still link. */
}

HB_FUNC( TESSBASEAPIEND ){
   TessBaseAPIEnd( ( TessBaseAPI * ) hb_parptr( 1 ) );
}

HB_FUNC( TESSBASEAPIDELETE ){
   TessBaseAPIDelete( ( TessBaseAPI * ) hb_parptr( 1 ) );
}

HB_FUNC( PIXDESTROY ){
   PIX * pPix = ( PIX * ) hb_parptr( 1 );
   pixDestroy( &pPix );
}
#pragma ENDDUMP
My hope is that you can test on your side and see if you can reproduce the problem and maybe help find what's happening. Thank you.
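When comparing command line runs with API runs, it can help to log which page segmentation mode is actually in effect, for example the value returned by TessBaseAPIGetPageSegMode(). The following is a small sketch in plain C that maps the mode numbers 0 to 10 to short descriptions paraphrased from the tesseract command line help. The names `psm_name`, `PSM_CLI_DEFAULT` and `PSM_API_DEFAULT` are made up for this example, and newer Tesseract releases define additional modes that are not listed here.

```c
#include <stddef.h>

/* Defaults as described in this thread: 3 on the command line, 6 in the API. */
enum { PSM_CLI_DEFAULT = 3, PSM_API_DEFAULT = 6 };

/* Short descriptions for page segmentation modes 0..10. */
static const char *const PSM_NAMES[] = {
    "orientation and script detection (OSD) only",       /* 0 */
    "automatic page segmentation with OSD",              /* 1 */
    "automatic page segmentation, no OSD, no OCR",       /* 2 */
    "fully automatic page segmentation, no OSD",         /* 3 */
    "single column of text of variable sizes",           /* 4 */
    "single uniform block of vertically aligned text",   /* 5 */
    "single uniform block of text",                      /* 6 */
    "single text line",                                  /* 7 */
    "single word",                                       /* 8 */
    "single word in a circle",                           /* 9 */
    "single character"                                   /* 10 */
};

/* Returns a description for a mode number, or "unknown" when the number
   is outside the range covered by this table. */
const char *psm_name(int mode)
{
    size_t count = sizeof PSM_NAMES / sizeof PSM_NAMES[0];

    if (mode < 0 || (size_t)mode >= count)
        return "unknown";
    return PSM_NAMES[mode];
}
```

Writing `psm_name( mode )` to the trace log next to each file name makes it easy to confirm whether a run used mode 3 or silently stayed on mode 6.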
- Antonio Linares
- Site Admin
- Posts: 37481
- Joined: Thu Oct 06, 2005 5:47 pm
- Location: Spain
Re: OCR from JPG
Reinaldo,
Have you reported it in their forums?
I am reviewing your source code and it seems fine, but I am not familiar with this software at all.
- reinaldocrespo
- Posts: 918
- Joined: Thu Nov 17, 2005 5:49 pm
- Location: Fort Lauderdale, FL
Re: OCR from JPG
Antonio;
I understand. No problem [no pasa nada]. I have been trying the Google group as well as Stack Overflow. So far I haven't gotten much back.
The more I work with Tesseract, the more I realize how good and flexible this product is. Being able to convert any image file to text is a powerful tool for many purposes.
I will post again when/if I solve the problem.
Thank you.
- Antonio Linares
- Site Admin
- Posts: 37481
- Joined: Thu Oct 06, 2005 5:47 pm
- Location: Spain
Re: OCR from JPG
Reinaldo,
Have you checked the memory consumption?
- reinaldocrespo
- Posts: 918
- Joined: Thu Nov 17, 2005 5:49 pm
- Location: Fort Lauderdale, FL
Re: OCR from JPG
Thank you for the suggestion. I did that, and I also checked CPU consumption, but everything seems to check out just fine. I'm reading through the C++ source code of the command line utility in hopes of finding something. I will update this thread with any new information.
Thank you,