Pdf Headaches

MarjaE · October 15, 2018, 9:22pm

Really, pdfs shouldn’t be the standard that they are. But they are. Some of the frustrating failure modes I’ve encountered before trying to repair pdfs:

They have passwords. Which my Mac goes right through, but the Kindle can’t load.
They have jpeg2000 images. Which save space. But my Mac takes much more time to try to go through, and my Kindle skips. Which is especially frustrating when it’s a scanned book and every single page is a jpeg2000 image.
They have background images. Which may look great on the intended device, but obscure the text on other devices.
They lack text layers.
They have text layers, but these are gibberish in another text encoding from the actual text, or these are full of ligatures, or both.

I’ve tried various ways to repair pdfs but often run into other problems afterwards:

They lose text from certain pages, because it was raster text rather than text text.
They lose text from other pages, because of encoding errors.
They lose scale and only part of the original page shows on the new page.
They end up with white text on a white background. I can correct this in cpdf, but it’s one more step.

Is it so much to want to be able to take a pdf file, drop it on an app, and get a readable pdf file?

I have a particularly bad migraine, so I don’t want to elaborate my solutions just yet. I have found a workable solution for most scanned pdfs, but there has to be a better way for pdf-born-pdfs.

MarjaE · October 15, 2018, 10:39pm

For scanned pdfs, I use willus’s k2pdfopt.

http://www.willus.com/k2pdfopt/

I’ve set up custom versions to simply convert everything to a compatible format: -ui -mode copy

And to convert it for my e-reader: -ui -mode copy -dev dx

It can also ocr pdfs using tesseract, but it crashes when overworked, so I prefer to run k2pdfopt as above and then run ocrmypdf on the results. And yes this works better than to run ocrmypdf and then run k2pdfopt on the results.

Output goes in the same folder as the input.

P.S. Major limitations are that it has to convert any text or vector to raster graphics, and Mac users can’t just drop files on the app, we have to open the app once for each file.
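Since the Mac app has to be opened once per file, a shell loop may take some of the tedium out. This is a minimal sketch, assuming k2pdfopt is installed at ~/Applications/k2pdfopt and can be run from a terminal with the flags listed above plus -x and -o. The names `K2PDFOPT`, `k2_output_name`, and `k2_copy_all` and the `_k2copy` suffix are made up for illustration.

```shell
#!/bin/bash
# Sketch only. The k2pdfopt path assumes a ~/Applications install,
# and k2_output_name / k2_copy_all are hypothetical helper names.
K2PDFOPT="${K2PDFOPT:-$HOME/Applications/k2pdfopt}"

# "dir/book.pdf" -> "dir/book_k2copy.pdf" (the suffix is an arbitrary choice)
k2_output_name() {
    printf '%s_k2copy.pdf\n' "${1%.pdf}"
}

# Run k2pdfopt once per file so several pdfs can be queued in one go.
k2_copy_all() {
    for f in "$@"; do
        "$K2PDFOPT" -ui -mode copy -x -o "$(k2_output_name "$f")" "$f" ||
            echo "k2pdfopt failed on: $f" >&2
    done
}
```

Something like `k2_copy_all ~/Scans/*.pdf` would then queue a whole folder. For the e-reader version, `-dev dx` would go after `-mode copy`.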

For ocr, I run ocrmypdf -l [language] --force-ocr --output-type pdfa-1 [input.pdf] [output.pdf]

Unfortunately, I have to type each term each time. I haven’t figured out a way to run it in automator or to use pre-set settings.

Output goes in the home folder.
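One possible way to avoid retyping the terms is a small shell function that carries the settings. This is a sketch, not a tested Automator recipe: `ocr_pdf`, `ocr_output_name`, and `OCR_LANG` are hypothetical names, the `-ocr` suffix is an arbitrary choice, and the default language `eng` is an assumption.

```shell
#!/bin/bash
# Hypothetical wrapper so the ocrmypdf flags need not be retyped.
# OCR_LANG and the "-ocr" suffix are assumptions, not ocrmypdf defaults.
OCR_LANG="${OCR_LANG:-eng}"

# "book.pdf" -> "book-ocr.pdf", kept next to the input file
ocr_output_name() {
    printf '%s-ocr.pdf\n' "${1%.pdf}"
}

# Usage: ocr_pdf input.pdf [language]
ocr_pdf() {
    ocr_input="$1"
    ocr_lang="${2:-$OCR_LANG}"
    ocrmypdf -l "$ocr_lang" --force-ocr --output-type pdfa-1 \
        "$ocr_input" "$(ocr_output_name "$ocr_input")"
}
```

With the functions in ~/.bash_profile, `ocr_pdf scan.pdf deu` would run the same command as above. Inside Automator's Run Shell Script action, the full path to ocrmypdf may be needed, since Automator does not necessarily see the same PATH as the terminal.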

MarjaE · October 16, 2018, 2:32am

For pdf-born-pdfs, I tend to use a combination of ghostscript and cpdf. Fortunately, my brother coded an automator adaptation of one of the required scripts, and I’ve modified it for the others.

I started with this trick here:

http://www.spoonylife.org/level-3/converting-a-pdf-from-version-1-4-to-1-5-1-6-etc

So to make the pdf 1.4-compatible without compression, you can run:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=[output.pdf] [input.pdf]

And with compression of images:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=[output.pdf] [input.pdf]

Unfortunately, this doesn’t remove background images, and can leave a gray box around other images, adding to the problem. And the compression can obscure pages which were rasterized in the original.

So to remove all the images:

gs -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERVECTOR -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=[output.pdf] [input.pdf]

Unfortunately, this can yield all the bugs I’ve discussed as side-effects. Running it through the simple 1.4 filter first can help, but it adds another step to the process. Running through cpdf blacktext can help with white-on-white text. Running through cpdf split, then 1.4, then 1.4 text-only, then cpdf merge usually keeps from losing text or losing scale, but bloats the resulting file. Running through 1.4, 1.4 text-only afterwards doesn’t bring the size back down, but does reintroduce the rendering bugs.

P.S. -dFILTERVECTOR avoids a gray box around each image, and doesn’t trigger all the errors of -dFILTERIMAGE. But it leaves the background images.
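The split, 1.4, 1.4 text-only, merge sequence described above could be chained roughly like this. It is a sketch under assumptions: gs and cpdf are on the PATH, the function names and the temporary folder layout are invented for illustration, and it does not address the file-size bloat.

```shell
#!/bin/bash
# Sketch of: cpdf split -> gs 1.4 -> gs 1.4 text-only -> cpdf merge.
# Function names and the temporary folder layout are assumptions.

# Plain 1.4 conversion. Usage: gs_pdf14 input.pdf output.pdf
gs_pdf14() {
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
       -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$2" "$1"
}

# 1.4 conversion that drops raster and vector images.
# Usage: gs_pdf14_text_only input.pdf output.pdf
gs_pdf14_text_only() {
    gs -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERVECTOR -dCompatibilityLevel=1.4 \
       -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$2" "$1"
}

# Usage: text_only_per_page input.pdf output.pdf
text_only_per_page() {
    tp_input="$1"
    tp_output="$2"
    tp_work="$(mktemp -d)" || return 1
    # One file per page: page0001.pdf, page0002.pdf, ...
    cpdf -split "$tp_input" -o "$tp_work/page%%%%.pdf" || return 1
    # The glob is expanded once, so files created inside the loop are not revisited
    for tp_page in "$tp_work"/page*.pdf; do
        gs_pdf14 "$tp_page" "${tp_page%.pdf}-14.pdf" &&
        gs_pdf14_text_only "${tp_page%.pdf}-14.pdf" "${tp_page%.pdf}-text.pdf" ||
            return 1
    done
    # Zero-padded names keep the glob in page order
    cpdf -merge "$tp_work"/page*-text.pdf -o "$tp_output"
    tp_status=$?
    rm -rf "$tp_work"
    return $tp_status
}
```

A `cpdf -blacktext` pass on the merged file could follow if white-on-white text shows up.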

MarjaE · October 16, 2018, 7:09pm

Ghostscript documentation is a contradictory mess. Filed a bug report asking for clearer documentation on the scaling issue.

https://bugs.ghostscript.com/show_bug.cgi?id=699968

MarjaE · October 17, 2018, 3:10am

-dFILTERVECTOR sometimes creates similar scaling bugs to -dFILTERIMAGE.

-dDEVICEWIDTHPOINTS=800 -dDEVICEHEIGHTPOINTS=1180 -dFIXEDMEDIA -dPDFFitPage -dPDFSETTINGS=/screen can’t fix these scaling bugs.

mutool clean -d -l -g looks good at first, but mutool clean -d -l -g [input.pdf] and then gs -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERVECTOR -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=[output.pdf] [out.pdf] loses about half the pages.

cpdf -scale-to-fit "135mm 200mm" -blacktext -squeeze -clean [input.pdf] -o [out.pdf] and then gs -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERVECTOR -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=[output.pdf] [out.pdf] also loses about half the pages. Removing -squeeze and -clean does not help.

Automator’s preset pdf tools don’t look like they contain anything relevant.

MarjaE · November 11, 2018, 3:47am

While k2pdfopt is the best option for scanned pdfs, I’d prefer to avoid the choppy text rendering for pdf-born-pdfs. I’m also concerned about the massive file sizes given my Kindle’s limited disk size.

Ghostscript doesn’t work with buggy pdfs.

I’ve yet to find pdf repair tools which can correct buggy pdfs. Or correct them and convert them to 1.4, and possibly strip vector and/or raster images.

I’m thinking that my best option may be a pdf to epub tool, after all, but I’m not sure which one. It’d have to work on the Mac. It’d have to preserve tables and convert columns to continuous text without merging text from 2 columns.

Has anyone tried this: https://www.lightenpdf.com/pdf-to-epub-mac.html

MarjaE · November 11, 2018, 9:04pm

I’ve been trying the demo. One of the pdfs which has caused me so much trouble in Ghostscript is still causing me trouble here.

Whatever I do, when I check the output, it “could not be opened.”

Other problems involve tables and paragraph breaks. It’s better than anything else I’ve tried, but far from ideal.

MarjaE · November 12, 2018, 2:52am

I’ve now tried several other demos. None of which work. It’s possible that the Aiseesoft apps might work, but they’re expensive, and lack demos.

MarjaE · November 20, 2018, 5:59pm

I’ve installed the Lighten one, PDF to EPUB+.

It works much better than the others. It recognizes text flow, and usually recognizes tables. It has trouble with kerning and paragraphs.

It silently fails if file names include &, and crashes or freezes on about 5% of Bundle of Holding or DriveThruRPG books. Unfortunately there’s a lot of overlap between its crashes and Ghostscript’s failures. I’m currently trying to get it to crash on a free download so I can send it for diagnosis…

LockeCJ · November 20, 2018, 9:43pm

Have you tried Calibre? It’s available for Mac. I don’t begin to claim to understand all of the issues that you’re having, but it may be worth a look if you haven’t already tried it.

MarjaE · November 20, 2018, 11:44pm

Yes. I use Calibre. It’s wonderful for organizing my library, converting epub to mobi or vice-versa, or transferring to other devices. It’s useless at converting pdf to epub. It loses tables.

MarjaE · November 21, 2018, 3:35am

It’s frustrating that the pdf format enables so much balderdash:

Malformed pdfs which only show in certain readers.
Malformed pdfs which lose content after processing-- which is why I’m not just using Ghostscript for this.
Passwords so they don’t show on certain devices.
Jpeg 2000 images.

And it makes it hard to tell which balderdash is causing which problems.

MarjaE · November 21, 2018, 6:23am

Okay, working solution:

1. Remove any special characters from file name.
2. Open PDF in an ebook-reading app.
3. Export PDF as PDF.
4. Drag new PDF into PDF to EPUB+.
5. Export new PDF as EPUB.
6. Import EPUB into Calibre, type up metadata, etc.

Simple! /s

Now I’m wondering if steps 1. through 3. can get Ghostscript conversions to work too…

No they can’t…
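The first step, removing special characters from the file name, is the easiest one to script. A minimal sketch: `sanitize_name` and `safe_copy` are hypothetical helpers, and the set of "safe" characters (letters, digits, dot, underscore, hyphen) is an assumption that is stricter than just avoiding &.

```shell
#!/bin/bash
# Hypothetical helper: replace every run of characters outside
# letters, digits, dot, underscore, and hyphen with a single underscore.
sanitize_name() {
    printf '%s' "$1" | LC_ALL=C tr -cs 'A-Za-z0-9._-' '_'
}

# Copy a file to a sanitized name in the same folder and print the new path.
# If the name is already safe, nothing is copied. An existing file with the
# sanitized name would be overwritten.
safe_copy() {
    sc_src="$1"
    sc_dir="$(dirname "$sc_src")"
    sc_name="$(basename "$sc_src")"
    sc_safe="$(sanitize_name "$sc_name")"
    if [ "$sc_name" != "$sc_safe" ]; then
        cp "$sc_src" "$sc_dir/$sc_safe" || return 1
    fi
    printf '%s\n' "$sc_dir/$sc_safe"
}
```

For example, `sanitize_name 'Swords & Sorcery.pdf'` prints `Swords_Sorcery.pdf`.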

MarjaE · September 3, 2020, 9:30pm

I’ve written a script to help with this. Written for the bash shell in the MacOS Automator so it may require tweaks for other software.

The idea is to split each pdf in 3 parts and then splice them back together-- the cover, which I’ve rasterized, the images from each page, again rasterized, and the text from each page, blackened and inserted after the images. This makes it easier for me to read the text, and makes it easier for the Kindle to handle the images regardless how they’ve been constructed. It breaks tables of contents.

I’ve also written a varient with -dev dx after each k2pdfopt -mode copy, and with different output file names, for a grayscale output optimized for the Kindle Dx.

By default K2 increases contrast, so if you prefer not to, that’s another tweak.

It requires Ghostscript, Cpdf, K2pdfopt, and Qpdf. Cpdf should be free for non-commercial use, but I’d still prefer an open source alternative to it, and it’s no longer available via Homebrew.

I’ve installed k2pdfopt to ~/Applications and I’ve installed the others using Homebrew.

Each app seems to have slightly inconsistent standards for standard output and standard input. In the end, I instructed each one to export a set filename to a “Splice” folder, or import a set filename from there. I’ve been able to run the whole sequence that way, first splitting, then processing, and then splicing the pdf back together.

I haven’t replaced all the older code where it used backticks instead of $(); maybe eventually.
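For reference, the two command-substitution styles do the same job in the output-name step; `$()` tends to be easier to nest and quote. A small sketch, where `spliced_name` is a hypothetical helper and not part of the script:

```shell
#!/bin/bash
# Hypothetical helper mirroring the script's output-name step.
spliced_name() {
    # Older backtick style would be: sn_base=`basename "$1" .pdf`
    sn_base="$(basename "$1" .pdf)"
    printf '%s-SplicedColor.pdf\n' "$sn_base"
}
```

Note that `basename` drops the folder, so the spliced file ends up in whatever folder the shell is currently running in.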

for f in "$@"
do
    # Copy and rasterize the 1st page of the source pdf using k2pdfopt
    ~/Applications/k2pdfopt -ui -mode copy -p 1 -x -o "/Users/Marja/Splice/RGBCover_copy.pdf" "$f"
    # Copy text from the source pdf using Ghostscript, then turn the text black using Cpdf
    # The color conversion strategy should help with the 2nd stage if I switch to Ghostscript
    # - and -_ indicate standard output and input
    # Due to compatibility issues, dumping to ~/Splice/Text.pdf
    /usr/local/bin/gs -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERVECTOR -dCompatibilityLevel=1.4 -sColorConversionStrategy=RGB -sstdout=%sstderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile="/Users/Marja/Splice/Text.pdf" "$f" &&
    /usr/local/bin/cpdf "/Users/Marja/Splice/Text.pdf" -blacktext -o "/Users/Marja/Splice/Blacktext.pdf"
    # Copy images from the same source pdf using Ghostscript, then rasterize them using K2pdfopt
    # Due to compatibility issues, dumping to ~/Splice/Images.pdf
    /usr/local/bin/gs -sDEVICE=pdfimage24 -dFILTERTEXT -dCompatibilityLevel=1.4 \
        -g800x1080 -r150 -dPDFFitPage \
        -sstdout=%sstderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile="/Users/Marja/Splice/Images.pdf" "$f" &&
    ~/Applications/k2pdfopt -ui -mode copy -x -o "/Users/Marja/Splice/RGBImages_copy.pdf" "/Users/Marja/Splice/Images.pdf" &&
    # Splice files using qpdf
    suffix="-SplicedColor.pdf"
    base=$(basename "$f" .pdf)
    outputfile="$base$suffix"
    /usr/local/bin/qpdf --collate "/Users/Marja/Splice/RGBCover_copy.pdf" --pages "/Users/Marja/Splice/RGBCover_copy.pdf" "/Users/Marja/Splice/RGBImages_copy.pdf" "/Users/Marja/Splice/Blacktext.pdf" -- "$outputfile"
done

MarjaE · September 14, 2021, 12:44am

Or if I don’t want to split images and text to alternate pages, I can run Ghostscript with -r72. For some reason this removes the gray rectangles.