Text Extraction – PCLTool SDK vs. Simple PCL Parsers / PCL Emulators

Posted on June 13, 2012

Simple PCL Parsers can handle these situations.

1. Text found in legible ASCII strings that use spaces for positioning.


2. Text found in legible ASCII strings that use relative x,y positioning to place them over a form overlay.

Simple PCL Parsers can’t handle these situations.

3. Partially legible ASCII text that is micro-justified, with each character positioned by absolute x,y coordinates.

4. Illegible text mapped to a temporary download font of a complete symbol set, but offset by +3 or -27 character cells. These files came mostly from Windows 95 printer drivers. We have a function that can automatically add or subtract character cell offsets to properly map the text from these types of files.

5. Illegible text mapped to a custom download font of a complete character set with a custom symbol set such as EBCDIC. These PCL files can come from mainframe systems whose non-PCL print streams are converted by an AFP/IPDS-to-PCL protocol converter. The converter translates the EBCDIC font into an HP font with a custom symbol set because there is no HP EBCDIC symbol set. We use our EBCDIC.cxx file to remap the EBCDIC-mapped text to the HP PC-8 character set.
Simple PCL Parsers are ineffective in these situations.
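
Situations 4 and 5 both come down to remapping character codes before the text becomes legible. Here is a minimal Python sketch of the two ideas. It is not PCLTool SDK code: the function names are hypothetical, the direction of the offset is an assumption, and Python's built-in cp037 codec stands in for whichever EBCDIC variant a given converter actually used.

```python
def undo_cell_offset(codes: bytes, offset: int) -> str:
    """Shift each character code back by the offset the driver applied."""
    # Assumption: the driver stored each character at (true code + offset),
    # so subtracting the offset (modulo 256) recovers the true code.
    return bytes((c - offset) % 256 for c in codes).decode("latin-1")


def ebcdic_to_text(codes: bytes, codec: str = "cp037") -> str:
    """Decode EBCDIC-mapped codes; cp037 is one common EBCDIC code page."""
    return codes.decode(codec)


# "Khoor" is "Hello" with every code shifted by +3.
print(undo_cell_offset(b"Khoor", 3))    # Hello
# "-JQQT" is "Hello" with every code shifted by -27.
print(undo_cell_offset(b"-JQQT", -27))  # Hello
# 0xC8 0x85 0x93 0x93 0x96 spells "Hello" in cp037.
print(ebcdic_to_text(bytes([0xC8, 0x85, 0x93, 0x93, 0x96])))  # Hello
```

In practice the offset would have to be detected per font, for example by trying the known offsets and checking which one yields legible words.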

Simple PCL Emulators can render the PCL but can’t successfully extract the text.
6. Illegible text mapped to a foreign language symbol set. HP has hundreds of symbol sets.


7. Text is in a PCL6 (PCL XL) print stream.


8. Illegible Double-Byte Text from Chinese, Japanese and Korean systems.
Most of our competitors can handle this, but not if it's PCL6.

9. Illegible "scrambled" text mapped to a temporary download font with characters assigned to font cells in the order in which they are used in the file. We call this "scrambled" text, and these characters can be downloaded several times in the same PCL file into different cell locations. We then produce a "missed.pcl" file that we view to produce a Character Descriptor Recognition (.CDR) file, which tells our product to map the shape of an "A" at 12 pts. to the ASCII character "A" wherever it appears on the page.

Only PageTech’s PCLTool SDK can handle these difficult situations.

10. Illegible "scrambled" text with different temporary fonts using the same download font ID number. This happens when Windows print streams are concatenated into one large batch file. We usually need to run the font optimization routine on these files first because they can be 90% font resources. We then produce a "missed.pcl" file that we view to produce a Character Descriptor Recognition (.CDR) file, which tells our product to map the shape of an "A" at 12 pts. to the ASCII character "A" wherever it appears on the page.

Note: This type of file is mostly generated by insurance companies.
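
The idea behind recognizing "scrambled" text can be illustrated with a small Python sketch. Everything in it is hypothetical: the dictionary `cdr` merely stands in for a .CDR file, the glyph bitmaps are toy data, and exact hash matching is a simplifying assumption rather than a description of how the product compares shapes. The point is that the lookup key is the glyph's shape and size, not the cell code it happens to occupy.

```python
import hashlib


def shape_key(point_size, bitmap):
    """Identify a glyph by its point size and a hash of its bitmap."""
    return (point_size, hashlib.sha1(bitmap).hexdigest())


def decode_run(codes, font_cells, point_size, cdr, missed):
    """Map cell codes to text via glyph shapes; record unknown shapes."""
    out = []
    for code in codes:
        key = shape_key(point_size, font_cells[code])
        ch = cdr.get(key)
        if ch is None:
            missed.add(key)  # would end up in something like "missed.pcl"
            ch = "?"
        out.append(ch)
    return "".join(out)


# Toy glyph bitmaps (hypothetical data).
GLYPH_A = b"\x18\x24\x7e\x42"
GLYPH_B = b"\x7c\x42\x7c\x42"
GLYPH_C = b"\x3c\x40\x40\x3c"

# Stand-in for a .CDR file: only "A" and "B" at 12 pts. are known so far.
cdr = {shape_key(12, GLYPH_A): "A", shape_key(12, GLYPH_B): "B"}

# Two download fonts put the same shapes into different cells.
font_1 = {0x21: GLYPH_B, 0x22: GLYPH_A}
font_2 = {0x21: GLYPH_A, 0x22: GLYPH_C, 0x23: GLYPH_B}

missed = set()
print(decode_run([0x21, 0x22, 0x21], font_1, 12, cdr, missed))  # BAB
print(decode_run([0x21, 0x22, 0x23], font_2, 12, cdr, missed))  # A?B
print(len(missed))  # 1
```

Because the key ignores the cell code, the same table still works when a batch file reuses one download font ID for several different temporary fonts.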

11. Mixed ASCII text and rasterized text characters. Three numbers in the middle of a social security number could be raster data, with the characters that precede and follow being ASCII text. We can define a RECT area on the page for each field of text with the option to treat it as a raster text field if no legible text is found. We then produce a "missed.pcl" file that we view to produce a Character Descriptor Recognition (.CDR) file, which tells our product to map the shape of an "A" at 12 pts. to the ASCII character "A" wherever it appears on the page.

Note: This type of file occurs when the developer of an application that generates claim forms goes out of their way to make it difficult to extract text from its print files. They do this to protect their claim form processing revenue stream from claim form pirates who capture the PCL, extract the text and migrate it into EDI systems to process the claims at a lower cost.

12. Raster text of one monospaced font, as in legacy bank statements. We can establish a grid to overlay on the rendered page in memory in order to trace outlines around all the glyphs in the grid. We then produce a "missed.pcl" file that we view to produce a Character Descriptor Recognition (.CDR) file, which tells our product to map the shape of an "A" at 12 pts. to the ASCII character "A" wherever it appears on the page.

Note: This type of file is usually generated from mainframe systems that merge monospaced statement text with pages of cancelled check images from legacy check processing systems.
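
The grid approach for monospaced raster text can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the SDK's method: the page is a list of strings where "#" is ink, the cell size is known in advance, and exact cell matching replaces the outline tracing described above. The names `slice_grid`, `read_page` and `shapes` are hypothetical.

```python
def slice_grid(page, cell_w, cell_h):
    """Yield (row, col, cell) for every grid cell of a rendered page."""
    rows = len(page) // cell_h
    cols = len(page[0]) // cell_w
    for r in range(rows):
        for c in range(cols):
            band = page[r * cell_h:(r + 1) * cell_h]
            cell = tuple(line[c * cell_w:(c + 1) * cell_w] for line in band)
            yield r, c, cell


def read_page(page, cell_w, cell_h, shapes):
    """Look up each cell's shape; unknown shapes become '?'."""
    lines = {}
    for r, _c, cell in slice_grid(page, cell_w, cell_h):
        lines.setdefault(r, []).append(shapes.get(cell, "?"))
    return ["".join(chars) for _r, chars in sorted(lines.items())]


# Toy 3x3 glyphs (hypothetical data).
ONE = (".#.", ".#.", ".#.")
SEVEN = ("###", "..#", "..#")
BLANK = ("...", "...", "...")
shapes = {ONE: "1", SEVEN: "7", BLANK: " "}

# One text row rendered as a bitmap: "1", "7", blank, "1".
page = [
    ".#.###....#.",
    ".#...#....#.",
    ".#...#....#.",
]
print(read_page(page, 3, 3, shapes))  # ['17 1']
```

A fixed grid only works because every character occupies the same cell size, which is why raster text in mixed point sizes needs more than one grid.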

13. Raster text of two or more fonts that cannot be placed within the same OCR grid. The raster text is not all the same point size or stroke weight, which requires two grids to be established. We then produce a "missed.pcl" file that we view to produce a Character Descriptor Recognition (.CDR) file, which tells our product to map the shape of an "A" at 12 pts. to the ASCII character "A" wherever it appears on the page.