Whitening PDF Documents with Java

 

The Internet Archive is a wonderful resource for old books, many of which are available as PDF files. One minor quibble I have with this fabulous library is that a lot of the material was scanned in color, resulting in pages with a nasty yellow tint (reflecting their age). However, I'm mostly interested in textbooks, which were usually published in black and white. In those cases, a color scan is unnecessary, and actually something of a drawback, since a PDF reader can take noticeably longer to render such pages.

[Yellowed Document]

The three programs available at the bottom of this page (GetPage.java, WhitenPage.java, and WhitenDoc.java) address this issue by converting a PDF file to grayscale and letting you adjust its brightness and contrast. The resulting document has 'cleaner' looking pages that also load faster.

[Whitened Document]

 

Whitening in Four Stages

Stage 1. Get a Page

GetPage.java is used to extract a page from a PDF file as a PNG image. For example,

[Use GetPage]

 

Stage 2. Manipulation in ImageJ

This image (elements20-17.png) is loaded into the ImageJ application and converted to 8-bit grayscale using its "Image > Type > 8-bit" menu item. Then its brightness and contrast are manipulated using the dialog displayed by the "Image > Adjust > Brightness/Contrast" item. The Minimum and Maximum sliders should be adjusted until the text is dark and the background is clean, and their values (displayed underneath the graph) noted down for stages 3 and 4. In the picture below, the settings are 19 and 137.

[Use ImageJ]

There's no need to save any changes to the image when the application is closed.
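
To give a feel for what the two numbers mean, here is a minimal sketch of a linear contrast stretch on 8-bit gray levels: everything at or below the minimum becomes black, everything at or above the maximum becomes white, and the levels in between are spread over the full 0-255 range. The class name ContrastStretch and the method buildTable() are my own illustrative names, and the rounding is an assumption; ImageJ's internal arithmetic may differ slightly.

```java
// Illustrative sketch only: ContrastStretch and buildTable() are
// hypothetical names, not part of ImageJ or of the programs on this page.
class ContrastStretch {

    // Build a 256-entry lookup table for 8-bit gray levels.
    // Levels <= min map to 0 (black), levels >= max map to 255 (white),
    // and levels in between are stretched linearly.
    static int[] buildTable(int min, int max) {
        if (min < 0 || max > 255 || min >= max) {
            throw new IllegalArgumentException("need 0 <= min < max <= 255");
        }
        int[] table = new int[256];
        for (int v = 0; v < 256; v++) {
            if (v <= min) {
                table[v] = 0;
            } else if (v >= max) {
                table[v] = 255;
            } else {
                // Assumed rounding; the real tool may truncate instead.
                table[v] = (int) Math.round(255.0 * (v - min) / (max - min));
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // The settings from the picture above: minimum 19, maximum 137.
        int[] table = buildTable(19, 137);
        for (int v : new int[] {0, 19, 78, 137, 255}) {
            System.out.println(v + " -> " + table[v]);
        }
    }
}
```

With the settings 19 and 137, any yellowish-gray background lighter than level 137 is pushed to pure white, which is where the 'whitening' comes from.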

 

Stage 3. Whiten a Page

WhitenPage.java is used to try out the ImageJ minimum and maximum settings on a single page from a PDF file before converting the entire document. For example,

[Use WhitenPage]

The resulting page is saved as a PDF file (elements20-W17.pdf) which can be examined in any PDF reader.
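
The per-page image processing can be sketched using only the standard java.awt classes. This is not the code in WhitenPage.java (which uses PDFBox and ImageJ); it is a rough stand-in, under the assumption that the page has already been rendered to a BufferedImage. The class GrayWhiten and its methods toGray() and stretch() are hypothetical names of my own.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;

// Illustrative sketch only: GrayWhiten is a hypothetical class, not the
// PDFBox/ImageJ code used by WhitenPage.java.
class GrayWhiten {

    // Convert any image to 8-bit grayscale by drawing it into a
    // TYPE_BYTE_GRAY image of the same size.
    static BufferedImage toGray(BufferedImage src) {
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        g.drawImage(src, 0, 0, null);
        g.dispose();
        return gray;
    }

    // Apply the minimum/maximum stretch in place to a grayscale image:
    // levels <= min become 0, levels >= max become 255, and the rest are
    // scaled linearly (assumed rounding).
    static void stretch(BufferedImage gray, int min, int max) {
        if (min < 0 || max > 255 || min >= max) {
            throw new IllegalArgumentException("need 0 <= min < max <= 255");
        }
        WritableRaster raster = gray.getRaster();
        for (int y = 0; y < gray.getHeight(); y++) {
            for (int x = 0; x < gray.getWidth(); x++) {
                int v = raster.getSample(x, y, 0);
                int out;
                if (v <= min) {
                    out = 0;
                } else if (v >= max) {
                    out = 255;
                } else {
                    out = (int) Math.round(255.0 * (v - min) / (max - min));
                }
                raster.setSample(x, y, 0, out);
            }
        }
    }
}
```

A page image would be passed through toGray() and then stretch(gray, 19, 137) before being written back into a PDF page.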

 

Stage 4. Whiten a Document

WhitenDoc.java applies the ImageJ minimum and maximum settings to every page in a document. For example,

[Use WhitenDoc]

This step can be time-consuming. For example, the conversion of a 1,000-page encyclopedia took nearly 10 minutes.

 

Downloads

My code uses Apache PDFBox for PDF manipulation and ImageJ for image manipulation. These libraries can be downloaded from their websites, or from below.

 

Dr. Andrew Davison
E-mail: ad@coe.psu.ac.th