Text Extraction from PCL with Searchable Text and Indexes

Posted on February 7, 2012

Document archiving requirements often include full-text extraction, both to produce fully searchable PDFs and to supply extracted data to content aggregation applications.

PageTech is the leading software development company that provides custom, extensible and fast solutions for applications requiring the manipulation, optimization, transformation or re-purposing of complex PCL print streams.

What is Complex PCL?

There are hundreds of PCL parsers that can extract text from very simple PCL. The question is… do you have very simple PCL? If not, then PCLTool SDK is the only product that can extract text from complex and problematic PCL.
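To make "very simple PCL" concrete, here is a minimal sketch of the kind of naive extractor that works only on such files. It is not PageTech code. It assumes the general PCL 5 escape-sequence shapes (two-character sequences and parameterized sequences ending in an uppercase terminator), decodes the remaining bytes as Latin-1 regardless of the active symbol set, and ignores cursor positioning entirely. The function name `naive_pcl_text` is hypothetical.

```python
import re

# General PCL 5 escape-sequence shapes (an assumption of this sketch):
#   two-character:  ESC + one byte in 0x30-0x7E              e.g. ESC E (reset)
#   parameterized:  ESC + byte in 0x21-0x2F + optional group
#                   byte, then value/parameter pairs ending
#                   in an uppercase terminator (0x40-0x5E)   e.g. ESC & l 1 O
ESCAPE = re.compile(
    rb"\x1b(?:"
    rb"[\x21-\x2f][\x60-\x7e]?(?:[-+]?[0-9.]*[\x60-\x7e])*[-+]?[0-9.]*[\x40-\x5e]"
    rb"|[\x30-\x7e]"
    rb")"
)


def naive_pcl_text(data: bytes) -> str:
    """Strip escape sequences and return whatever printable text is left."""
    stripped = ESCAPE.sub(b"", data)
    # Assumption: bytes map directly to Latin-1; real PCL depends on the symbol set.
    text = stripped.decode("latin-1")
    # Keep newlines and printable characters; drop CR, FF and other controls.
    return "".join(ch for ch in text if ch == "\n" or ch.isprintable())


sample = b"\x1bE\x1b&l1OHello \x1b(s0p12HWorld\r\n\x1bE"
print(repr(naive_pcl_text(sample)))
```

A sketch like this breaks as soon as an escape sequence is followed by a binary payload (raster rows, downloaded soft fonts), when text is placed out of reading order with cursor-positioning commands, or when the symbol set is not Latin-1 compatible. Those are presumably the kinds of input this article means by complex PCL.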

PCLTool SDK can extract text from applications that generate legacy or complex PCL print streams. It can deconstruct old-style, mainframe-generated bank statements containing cancelled check images into text plus an individual TIFF image of each cancelled check. With the disassembled statement text and image objects, you can use output management/mail optimization software to:

  • Redesign the print stream for a more up-to-date look and feel
  • Add POSTNET or other barcodes
  • Apply address corrections
  • Apply ZIP+4 barcodes
  • Apply OMR marks for inserters, folders, etc.

PageTech .TNX File Format

Our Text Object Extraction (.TNX) file format is more than text parsed from a PCL file. It contains every text object found on the interpreter's internal logical display list just before each page is imaged. A .TNX file can be generated as a byproduct of the conversion process by either of the following methods:

  • Use our sample TNXDemo.tpt script that uses the Conversion=None JobParam.
  • Use the Convert -> Extract Text function in either PCLTool.exe or PCLWorks.exe to extract all the text information available in the file.

Both programs use TNXDumpG.exe to extract and format the text.

Not only do we capture the text, we also capture the absolute positioning, current symbol set, font and font metrics used in the file. All of this information is written to our proprietary .TNX format. We encourage everyone to test our many text extraction solutions by downloading the PCLTool SDK Live Evaluation and reviewing the Text Extraction Methods page for examples.
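To show why position data matters, the sketch below models a positioned text object and rebuilds reading-order lines from it. This is an illustration only: the `TextObject` fields, the `to_lines` helper and the tolerance value are hypothetical and do not reflect the actual layout of the proprietary .TNX format.

```python
from dataclasses import dataclass


@dataclass
class TextObject:
    # Hypothetical fields; the real .TNX record layout is not shown in this post.
    x: float          # horizontal position on the page (units assumed)
    y: float          # vertical position on the page (units assumed)
    text: str
    font: str
    symbol_set: str


def to_lines(objects, y_tolerance=2.0):
    """Group text objects into lines by vertical position, then order by x."""
    lines = []
    for obj in sorted(objects, key=lambda o: (o.y, o.x)):
        # Same line if the object sits within the tolerance of the line's first object.
        if lines and abs(obj.y - lines[-1][0].y) <= y_tolerance:
            lines[-1].append(obj)
        else:
            lines.append([obj])
    return [
        " ".join(o.text for o in sorted(line, key=lambda o: o.x))
        for line in lines
    ]


page = [
    TextObject(300, 100, "Balance", "Courier", "8U"),
    TextObject(50, 100, "Account", "Courier", "8U"),
    TextObject(50, 150, "12345", "Courier", "8U"),
]
print(to_lines(page))
```

Because print streams can emit text in any order, sorting on captured coordinates rather than on stream order is what makes the reconstructed lines readable.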

In addition, a free PCL analysis is available through our Technical Support Form. Send us your zipped PCL files, a description of your application workflow and your text extraction requirements, and we will respond with a detailed recommendation on the best solution for your unique PCL application issues.

How do Clients Take Advantage of PCLTool SDK Text Extraction Functionality?

1.) Systems Integrators and MIS Departments usually end their search for a PCL transformation tool when they find the first product that can simply convert a complete multi-document PCL file into a single PDF file. Unfortunately, they usually end up purchasing multiple tools and often have to write custom code to decollate multi-document data streams into individual PDF files with the correct file-naming convention and external rapid batch indexes.

We see many clients try to extract the text AFTER the PCL has been converted. The best chance of retrieving searchable text is while the data is still in its NATIVE PCL file format. This is why PCLTool SDK extracts text during the conversion process and can easily optimize, transform and re-purpose the extracted text into many other formats.

2.) Service Bureaus sometimes want to extract all the text from legacy applications in order to reconstruct it using various third-party output management/mail management tools to update the look of the document, add color, graphics, OMR marks, ZIP+4 barcodes, etc. Not only can PCLTool SDK extract all the text objects for this purpose, it can also extract all the raster objects (e.g., cancelled check images, graphs, etc.) for these custom applications.

3.) Medical/Pharmacy/Laboratory Imaging Solution providers who need to capture the serial or parallel printer output from an ophthalmologic, EKG, pharmacy or other device can use PCLTool SDK to capture and redirect the print stream for conversion and/or text extraction. We can also convert the difficult PCL3GUI format, which is used to print to the HP DeskJets that most of these devices rely on. HP has never provided technical documentation for the PCL3GUI format, so it’s rarely supported by PCL transformation tools from other vendors.
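The decollation step mentioned in item 1 (splitting a multi-document stream into individual files with a naming convention and a batch index) can be sketched roughly as follows. This is not PCLTool SDK code. It assumes the per-page text has already been extracted, that each document starts on a page containing "Page 1 of N", and that an "Account: <digits>" field is the index key. The patterns, function names and file-naming scheme are all hypothetical.

```python
import re

# Hypothetical markers; real statements need patterns matched to their layout.
FIRST_PAGE = re.compile(r"Page\s+1\s+of\s+\d+")
ACCOUNT = re.compile(r"Account:\s*(\d+)")


def decollate(pages):
    """Split a list of per-page text into documents at each first-page marker."""
    docs = []
    for number, text in enumerate(pages, start=1):
        if FIRST_PAGE.search(text) or not docs:
            docs.append({"first_page": number, "pages": []})
        docs[-1]["pages"].append(text)
    return docs


def build_index(docs):
    """Produce one index row per document, including an output file name."""
    rows = []
    for seq, doc in enumerate(docs, start=1):
        match = ACCOUNT.search(doc["pages"][0])
        account = match.group(1) if match else "UNKNOWN"
        rows.append({
            "file": f"{account}_{seq:04d}.pdf",   # hypothetical naming convention
            "account": account,
            "first_page": doc["first_page"],
            "page_count": len(doc["pages"]),
        })
    return rows


pages = [
    "Account: 1001  Page 1 of 2",
    "Page 2 of 2",
    "Account: 2002  Page 1 of 1",
]
for row in build_index(decollate(pages)):
    print(row)
```

The index rows could then be written out as a CSV or similar batch-index file for an archive system. Doing this reliably depends on having accurate text for every page, which is why extraction from the native PCL matters.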
