OCR from JPG

Post by **reinaldocrespo** » Fri Sep 30, 2016 5:17 pm

Hello Everyone;

I've been using the free, open-source Tesseract libraries to perform OCR on JPG documents. Typically there is a free-standing scanner somewhere in the facility into which people feed tons of documents. The scanner swallows these documents like the Cookie Monster. It stores them as .jpg files in a directory that a FWH program reads from. Using Tesseract, the FWH app reads each .jpg, runs OCR and detects an "Encounter" number printed somewhere on the document, which is a unique key to a record in a .dbf table. This "Encounter" number is detected using a regular expression. Then the .jpg and the extracted text are both saved to memo fields on the corresponding record.

All this works pretty well, except in 10% of the documents, where Tesseract doesn't OCR with the needed precision and the "Encounter" number can't be completely extracted. I've tried Tesseract's training tools with some success, but there is still a percentage of documents where the OCR isn't good enough for no apparent reason.

Is there anyone here using any other OCR library with better results?

I don't need a scanning library. I have that already and it is not needed for this use. The scanner "knows" how to scan into .jpg and save to a directory. What I need is to read a .jpg and convert it to text with a higher degree of accuracy.

Oh, and one last thing; the font on these documents isn't always the same and I have no way of controlling that.

Thank you,

Reinaldo.

Post by **Antonio Linares** » Sat Oct 01, 2016 7:54 am

Reinaldo,

> This "Encounter" number is detected using a regular expression

What regular expression do you use?

Could you provide a working example to test it here?

Post by **reinaldocrespo** » Sat Oct 01, 2016 5:15 pm

Antonio;

I'm not sure if the regular expression will make any difference. In general the encounter number always begins with "EN" followed by the last two digits of the year, currently that would be "16", then followed by a dash "-", and then followed by up to eight digits.

It is possible for older documents from 2015 to be scanned, so my regular expression provides for that. Sometimes transcriptionists will type the lower case letter L or an upper case letter i instead of the number 1, as that is faster when typing, so my regular expression provides for that. Sometimes transcriptionists will leave one or more blank spaces instead of a dash just after "EN", so my regular expression provides for that too. In the end my regular expression is a little hard to follow because of these exceptions, but if not for the exceptions the regex would be like this:

Code:

"EN16-[0-9]{8}"

Here is the code that finds the encounter number in the OCR'ed text, which is stored in the variable cText, using my regular expression that provides for the exceptions mentioned above:

Code:

   LOCAL aParsed   := hb_regExAll( "EN ?[*]?[1,l,I,\]] ?[5,S,6]-?(.)*([0-9, ])* ", cText,,,,.t. )
   LOCAL cLogFile  := "UnIndexedReports_" + STRTRAN( DTOC( Date() ), "/", "" ) + ".log"


   //if no regex matching elements then abort.
   IF EMPTY( aParsed ) .OR. LEN( IFNIL( aParsed, {} ) ) != 1  
      SaveAsUnIndexed( cFileName, cText, "Can't find any legit reg-ex encounter pattern on OCRed text" )
      Logfile( cLogFile, { "Can't find any reg-ex encounter pattern on OCRed text", cFileName, cText } )
      RETURN .F.
   ENDIF

My regular expression will recognize an Encounter number in any of these:

EN1S-00023382 (here someone typed S instead of 5)
EN15  00000001 (here they leave two blank spaces instead of a dash)
EN16 00000001 (here they leave one blank space instead of a dash)
EN 16 00000001 (here they leave a blank space after EN and another after 16 and no dash)
ENl6 00000001 (here they typed the lower case letter L instead of number 1)
ENlS 00000001
EN16.00000001

These are things transcriptionists do to type faster or just because....
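Once a candidate has been found, the variants listed above can be folded back into one canonical key before the .dbf lookup. The following is only a minimal sketch in plain C, not the code used in the FWH app: the names `as_digit` and `normalize_encounter` are made up for illustration, the character substitutions (l, I and ] for 1, S for 5) are taken from the character classes in the regular expression above, and the output format "ENyy-digits" without zero padding is an assumption.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Map characters commonly typed in place of a digit back to that digit.
   Returns 0 when the character cannot stand for a digit. */
static char as_digit(char c)
{
    if (isdigit((unsigned char)c)) return c;
    if (c == 'l' || c == 'I' || c == ']') return '1';
    if (c == 'S') return '5';
    return 0;
}

/* Parse a candidate that starts at "EN" and write the canonical form
   "ENyy-digits" into out. Returns 1 on success, 0 otherwise. */
int normalize_encounter(const char *s, char *out, size_t out_size)
{
    char year[3] = {0};
    char num[9] = {0};
    size_t n = 0;
    int i;

    if (s[0] != 'E' || s[1] != 'N') return 0;
    s += 2;
    while (*s == ' ') s++;                  /* blank after "EN", as in "EN 16 ..." */

    for (i = 0; i < 2; i++) {               /* two-character year, with substitutions */
        year[i] = as_digit(*s);
        if (!year[i]) return 0;
        s++;
    }

    /* separator variants: dash, one or more blanks, or a period */
    while (*s == ' ' || *s == '-' || *s == '.') s++;

    /* up to eight real digits; stop at the first non-digit */
    while (n < 8 && isdigit((unsigned char)*s)) num[n++] = *s++;
    if (n == 0) return 0;

    /* succeed only if the result fits in the caller's buffer */
    return snprintf(out, out_size, "EN%s-%s", year, num) < (int)out_size;
}
```

Called with a 16-byte buffer, `normalize_encounter("ENlS 00000001", buf, sizeof buf)` fills `buf` with `EN15-00000001`, while a candidate whose digits were lost in OCR is rejected with a return value of 0.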

Using regular expressions to find a pattern is really easy and takes very little programming. But that is not where I need help. My question is about OCR engines that we can use with FWH other than Tesseract. I'm actually hoping someone can share another OCR library as I really wish I didn't have to pay a lot of money for it.

Reinaldo.

Post by **reinaldocrespo** » Sat Oct 01, 2016 5:39 pm

...oh and in case the question was how to OCR using Tesseract, then here is that code:

Code:

...
   handle := TessBaseAPICreate()       //Using Tesseract to OCR image

   IF TessBaseAPIInit3( handle, NIL, "eng" ) != 0     //abort if the English traineddata file can't be found locally.
      Logfile( "Trace.log", { "Tesseract OCR engine can't be found.", handle } )
      TessBaseAPIEnd( handle )
      TessBaseAPIDelete( handle )
      handle := NIL 
      RETURN NIL 
   ENDIF 

...
         img := pixRead( cFile )
         TessBaseAPISetImage2( handle, img )
         nRet := TessBaseAPIRecognize( handle, Nil )
         cText := STRTRAN( TessBaseAPIGetUTF8Text( handle ), CHR( 10 ), CRLF )

...
#pragma BEGINDUMP
#include <hbapi.h>
#include <capi.h>
#include <allheaders.h>

HB_FUNC( PIXREAD ){
  PIX * pPix = pixRead( hb_parc( 1 ) );

  hb_retptr( pPix );
}


HB_FUNC( TESSVERSION ){
  const char * zsTest = TessVersion();
  hb_retc( zsTest );
}


HB_FUNC( TESSBASEAPICREATE ) {
  TessBaseAPI * pHandle = TessBaseAPICreate();
  hb_retptr( pHandle );
}

HB_FUNC( TESSBASEAPIINIT3 ){
  hb_retni( TessBaseAPIInit3( ( TessBaseAPI * ) hb_parptr( 1 ), hb_parc( 2 ), hb_parc( 3 ) ) );
}

HB_FUNC( TESSBASEAPISETIMAGE2 ){
  TessBaseAPISetImage2(  ( TessBaseAPI * ) hb_parptr( 1 ), ( const PIX * ) hb_parptr( 2 ) );
}

HB_FUNC( TESSBASEAPIRECOGNIZE ){
  hb_retni( TessBaseAPIRecognize( ( TessBaseAPI * ) hb_parptr( 1 ), (ETEXT_DESC*) hb_parptr( 2 ) ) );
}

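HB_FUNC( TESSBASEAPIGETUTF8TEXT ){
  /* hb_retc() copies the string, so the buffer allocated by Tesseract is freed here. */
  char * pText = TessBaseAPIGetUTF8Text( ( TessBaseAPI * ) hb_parptr( 1 ) );
  hb_retc( pText );
  if( pText )
    TessDeleteText( pText );
}

HB_FUNC( TESSDELETETEXT ){
  /* Intentionally a no-op: hb_parc() points to Harbour's own copy of the string,
     which must never be passed to TessDeleteText(). Kept so existing callers still link. */
}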

HB_FUNC( TESSBASEAPIEND ){
  TessBaseAPIEnd( ( TessBaseAPI * ) hb_parptr( 1 ) );
}

HB_FUNC( TESSBASEAPIDELETE ){
  TessBaseAPIDelete( ( TessBaseAPI * ) hb_parptr( 1 ) );
}

HB_FUNC( PIXDESTROY ){
  PIX * pPix =  ( PIX * ) hb_parptr( 1 );
  pixDestroy( &pPix );
}


#pragma ENDDUMP

Post by **Antonio Linares** » Sat Oct 01, 2016 9:22 pm

Reinaldo,

> My regular expression will recognize an Encounter number in any of these:
>
> EN1S-00023382 (here someone typed S instead of 5)
> EN15  00000001 (here they leave two blank spaces instead of a dash)
> EN16 00000001 (here they leave one blank space instead of a dash)
> EN 16 00000001 (here they leave a blank space after EN and another after 16 and no dash)
> ENl6 00000001 (here they typed the lower case letter L instead of number 1)
> ENlS 00000001
> EN16.00000001

What format did the ones that failed use?

Post by **reinaldocrespo** » Sat Oct 01, 2016 10:40 pm

The format does not matter. The problem is the OCR engine. For example:

When the text is EN16-00000001 it might come back as garbage:

EN16 ©#µ

No regular expression will understand that because some of the original text on the jpg was not correctly translated into text.
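To make that failure mode concrete, here is a small sketch in C that uses POSIX `regex.h` to look for a strictly formatted encounter number in OCR output. It is an illustration only and not part of the FWH app: the function name `find_strict_encounter` is hypothetical, and the pattern assumes any two-digit year followed by a dash and exactly eight digits.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Look for the first strictly formatted encounter number in text:
   "EN", a two-digit year, a dash and eight digits.
   On success the match is copied into out and 1 is returned; otherwise 0. */
int find_strict_encounter(const char *text, char *out, size_t out_size)
{
    regex_t re;
    regmatch_t m[1];
    int found = 0;

    /* POSIX extended syntax is needed for the {n} repetition counts */
    if (regcomp(&re, "EN[0-9]{2}-[0-9]{8}", REG_EXTENDED) != 0)
        return 0;

    if (regexec(&re, text, 1, m, 0) == 0) {
        size_t len = (size_t)(m[0].rm_eo - m[0].rm_so);
        if (len < out_size) {               /* leave room for the terminator */
            memcpy(out, text + m[0].rm_so, len);
            out[len] = '\0';
            found = 1;
        }
    }

    regfree(&re);
    return found;
}
```

For clean OCR output the function returns 1 and copies the key. For the garbled text shown above it returns 0, because the digits never made it into the text, so no pattern (strict or tolerant) has anything to match.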

But, again, that only happens 10% of the time. It has nothing to do with the regular expression. It has everything to do with the OCR engine. I'm afraid I'm not getting my point across. I'm looking for ways to convert text on a .jpg file into plain text accurately. I'm currently doing it with Tesseract and I'm looking for alternatives.

Reinaldo.

Post by **reinaldocrespo** » Thu Nov 17, 2016 10:55 pm

Hello Antonio;

By comparing the output from Tesseract's command line with the output from the API, I was able to see that the command-line results are much better. After careful examination I learned that the default page segmentation mode (PSM) for Tesseract from the command line is 3, while from the API it is 6. For the type of documents I'm processing, a PSM of 3 yields much better results.
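For reference, the numeric modes can be decoded with a small lookup table. This is a sketch, not part of the Tesseract API: the helper name `psm_description` is hypothetical, and the descriptions for modes 0 to 10 follow the Tesseract 3.x command-line help as I recall it, so they are worth checking against the installed version.

```c
#include <stddef.h>

/* Descriptions of page segmentation modes 0-10, indexed by mode number. */
static const char *const PSM_NAMES[] = {
    "Orientation and script detection (OSD) only",
    "Automatic page segmentation with OSD",
    "Automatic page segmentation, but no OSD, or OCR",
    "Fully automatic page segmentation, but no OSD",
    "Assume a single column of text of variable sizes",
    "Assume a single uniform block of vertically aligned text",
    "Assume a single uniform block of text",
    "Treat the image as a single text line",
    "Treat the image as a single word",
    "Treat the image as a single word in a circle",
    "Treat the image as a single character"
};

/* Returns the description for a mode number, or NULL if it is out of range. */
const char *psm_description(int mode)
{
    size_t count = sizeof PSM_NAMES / sizeof PSM_NAMES[0];

    if (mode < 0 || (size_t)mode >= count)
        return NULL;
    return PSM_NAMES[mode];
}
```

Read this way, the command-line default (3) lets Tesseract work out the page layout itself, while the API default (6) assumes the whole image is one uniform block of text, which would explain the difference on mixed-layout scans.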

Here is my code:

Code:

   handle := TessBaseAPICreate() 

   //abort if the English traineddata file can't be found locally.
   IF TessBaseAPIInit3( handle, NIL, "eng" ) != 0     
       RETURN NIL 
   ENDIF

...
         //page segmentation mode can be set via API call TessBaseAPISetPageSegMode(), or by 
         //setting variable "tessedit_pageseg_mode", or by reading from config file. Possible values:
         //1 -Automatic page segmentation with OSD 
         //3 -Fully automatic page segmentation, but no OSD

         //TessBaseAPIReadConfigFile( handle, "tessapi_config" )
         //TessBaseAPISetVariable( handle, "tessedit_pageseg_mode", "3" )
         TessBaseAPISetPageSegMode( handle, 3 ) 

         //print all tesseract ocr engine internal variables to file tesseract.log on cur dir.
         IF lDebug ; TessBaseAPIPrintVariablesToFile( handle, "tesseract.log" )  ;ENDIF


         //Open input image with leptonica library API pixRead
         IF lDebug ; logfile( "trace.log", { "pixread file", cfile } ) ;ENDIF
         img := pixRead( ALLTRIM( cPath ) + cFile )

         IF lDebug ; logfile( "trace.log", { "TessBaseAPISetImage2", cfile } ) ;ENDIF
         TessBaseAPISetImage2( handle, img )

         //Recognize is called from GetUTF8Text but it doesn't hurt to call before and 
         //makes debugging easier.  Program freezes when executing TessBaseAPIRecognize() only 
         //when PageSegMode is changed above.
         IF lDebug ; logfile( "trace.log", { "TessBaseAPIRecognize ", cfile } ) ;ENDIF
         //program freezes here but only when pageSeg_Mode is changed.
         IF TessBaseAPIRecognize( handle, Nil ) <> 0  ; LOOP   ;ENDIF    

         //if TessBaseAPIRecognize above is commented then program will freeze when executing 
         //TessBaseAPIGetUTF8Text().  Recognize is called internally from GetUTF8Text so we know the 
         //problem is at Recognize.
         IF lDebug ; logfile( "trace.log", { "TessBaseAPIGetUTF8Text", cfile } ) ;ENDIF
         cText := STRTRAN( TessBaseAPIGetUTF8Text( handle ), CHR( 10 ), CRLF )
...
         //No TessDeleteText( cText ) here: cText is Harbour's own copy of the text, not the buffer that Tesseract allocated.
         pixDestroy( img )

...
   TessBaseAPIEnd( handle )
   TessBaseAPIDelete( handle )

This is my code for the wrapper functions being used:

Code:

//-------------------------------------------------------------------------------------
/**/
#pragma BEGINDUMP
#include <hbapi.h>
#include <capi.h>
#include <allheaders.h>

HB_FUNC( PIXREAD ){
  PIX * pPix = pixRead( hb_parc( 1 ) );

  hb_retptr( pPix );
}


HB_FUNC( TESSBASEAPISETPAGESEGMODE ){
  TessBaseAPISetPageSegMode( ( TessBaseAPI * ) hb_parptr( 1 ), hb_parni( 2 ) );
}


HB_FUNC( TESSBASEAPIGETPAGESEGMODE ){
  hb_retni( TessBaseAPIGetPageSegMode( ( TessBaseAPI * ) hb_parptr( 1 ) ) );
}



HB_FUNC( TESSBASEAPISETVARIABLE ) {
   TessBaseAPISetVariable( ( TessBaseAPI * ) hb_parptr( 1 ), hb_parc( 2 ), hb_parc( 3 ) ) ;
}



HB_FUNC( TESSBASEAPIPRINTVARIABLESTOFILE ) {
   TessBaseAPIPrintVariablesToFile( ( TessBaseAPI * ) hb_parptr( 1 ), hb_parc( 2 ) ) ;
}


HB_FUNC( TESSBASEAPIREADCONFIGFILE ) {

   TessBaseAPIReadConfigFile( ( TessBaseAPI * ) hb_parptr( 1 ), hb_parc( 2 ) ) ;

}


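HB_FUNC( TESSBASEAPIGETHOCRTEXT ) {
  /* hb_retc() copies the string, so the buffer allocated by Tesseract is freed here. */
  char * pText = TessBaseAPIGetHOCRText( ( TessBaseAPI * ) hb_parptr( 1 ), 0 );
  hb_retc( pText );
  if( pText )
    TessDeleteText( pText );
}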


HB_FUNC( TESSVERSION ){
  const char * zsTest = TessVersion();
  hb_retc( zsTest );
}


HB_FUNC( TESSBASEAPICREATE ) {
  TessBaseAPI * pHandle = TessBaseAPICreate();  /* the C API constructor takes no arguments */
  hb_retptr( pHandle );
}

HB_FUNC( TESSBASEAPIINIT3 ){
  hb_retni( TessBaseAPIInit3( ( TessBaseAPI * ) hb_parptr( 1 ), hb_parc( 2 ), hb_parc( 3 ) ) );
}

HB_FUNC( TESSBASEAPISETIMAGE2 ){
  TessBaseAPISetImage2(  ( TessBaseAPI * ) hb_parptr( 1 ), ( const PIX * ) hb_parptr( 2 ) );
}

HB_FUNC( TESSBASEAPIRECOGNIZE ){
  hb_retni( TessBaseAPIRecognize( ( TessBaseAPI * ) hb_parptr( 1 ), (ETEXT_DESC*) hb_parptr( 2 ) ) );
}

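HB_FUNC( TESSBASEAPIGETUTF8TEXT ){
  /* hb_retc() copies the string, so the buffer allocated by Tesseract is freed here. */
  char * pText = TessBaseAPIGetUTF8Text( ( TessBaseAPI * ) hb_parptr( 1 ) );
  hb_retc( pText );
  if( pText )
    TessDeleteText( pText );
}

HB_FUNC( TESSDELETETEXT ){
  /* Intentionally a no-op: hb_parc() points to Harbour's own copy of the string,
     which must never be passed to TessDeleteText(). Kept so existing callers still link. */
}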

HB_FUNC( TESSBASEAPIEND ){
  TessBaseAPIEnd( ( TessBaseAPI * ) hb_parptr( 1 ) );
}

HB_FUNC( TESSBASEAPIDELETE ){
  TessBaseAPIDelete( ( TessBaseAPI * ) hb_parptr( 1 ) );
}

HB_FUNC( PIXDESTROY ){
  PIX * pPix =  ( PIX * ) hb_parptr( 1 );
  pixDestroy( &pPix );
}


#pragma ENDDUMP

To build with Harbour you will need the Tesseract and Leptonica DLLs, and you will need to build the .lib files to link with your program. You can download Tesseract from here: https://sourceforge.net/projects/tesseract-ocr/

My hope is that you can test on your side and see if you can reproduce the problem and maybe help find what's happening. Thank you.

Post by **Antonio Linares** » Fri Nov 18, 2016 7:22 pm

Reinaldo,

Have you reported it in their forums?

I am reviewing your source code and it seems fine, but I am not familiar with this software at all.

Post by **reinaldocrespo** » Fri Nov 18, 2016 7:26 pm

Antonio;

I understand. No problem. I have been trying the Google group as well as StackOverflow. So far I haven't gotten much back.

The more I work with Tesseract the more I realize how good and flexible this product is. Being able to convert to text any image file is a powerful tool for many purposes.

I will post again when/if I solve the problem.

Thank you.

Post by **Antonio Linares** » Fri Nov 18, 2016 7:32 pm

Reinaldo,

Have you checked the memory consumption?

Post by **reinaldocrespo** » Fri Nov 18, 2016 8:14 pm

Thank you for the suggestion. I did that, and I also checked CPU consumption, but it all checks out fine. I'm reading through the C++ source code for the command-line utility in hopes of finding something. I will update this thread with any new information.

Thank you,

FiveTech Software tech support forums
