
"conversion error" with PDF plugin.


  • "conversion error" with PDF plugin.

    I'm trying to compare data sheets with the PDF addon. I used to just get a text comparison. After updating to the latest BC3 version and installing the plugin, all I get is a conversion error when I try to compare PDF files. Did this break, or is it that the PDF format has been updated once again...

  • #2
    PDF support is built in to BC3. You don't need to install anything to compare PDF files.

    To compare a PDF file in BC3, open it in the Text Compare.

    Check the "Tools > File Formats" dialog. The file format for PDF files is included with BC3 and should be listed in the bottom half of the format list. Check the top of the list for any new file formats you might have added that map to the *.PDF file extension. If there are any new PDF settings at the top of the list, uncheck the box next to the format to disable it.
    Chris K Scooter Software



    • #3
      PDF documents not converting for Text Comparison

      I am having the same problem - PDFs don't convert.
      Email Support advised that I check the Security of the documents in Adobe Reader.
      "Page Extraction" is set to not permitted, and I can't change it.
      The documents were created in Nuance Power PDF 2.x and when they are opened in that application, there is no Security restriction for page extraction.

      Workarounds I have tried include:

      1. Copy the document and try to compare on that - Support said that would get rid of the 'lock'. Didn't work.
      2. Print to XPS and use those documents instead of the originals - didn't work
      3. Print to Microsoft PDF - didn't work either

      Questions:
      1. When I open the documents in Nuance Power PDF v2.x and open “Security, Manage Security”, the Page Extraction shows as “Allowed”.
      So, why are they showing as “not allowed” in Adobe, and is this really the reason they aren’t converting?
      Given that the copy/paste of the file to create a new copy didn’t work…and given that the copy was made from a “Save As” copy of the original in Nuance Power PDF, the workaround isn’t happening.
      I thought I could change the Security in Nuance Power PDF, but they are already open…everything is ‘allowed’.


      I can't contact Nuance support because I can't remember my username and their website has no mechanism for recovering it. I will have to call Monday and hope they don't try to charge me.

      Other workarounds I've looked at include using 3rd party s/w such as CuteFTP, but I'm reluctant to muck around anymore until I really understand what the problem is. Support suggests Linux, but I don't have it installed on my Windows machine and do not know how to use it.

      Support quotes:

      "Beyond Compare can open any file that has "Extractable text: True" when you check its properties, using Acrobat Reader. Or try our Linux product if that is available."

      "Open the pdf in a reader, such as Adobe Acrobat and check File -> Properties -> Security.
      The crucial property is Page Extraction. For example, in the screenshot below, the text cannot be extracted by Beyond Compare's 3rd party tool"

      "Are the pdf extractable? Many are locked and Beyond Compare for Windows or MacOS cannot touch the text inside. This means the third party module in our pipeline that converts .pdf to .txt failed. This is almost always because the owners of the pdf have locked it with a password. Nothing can be done except to find a copy of the file that is not locked. This problem has not been reported on Linux versions of our software as far as I know."

      NEED HELP QUICKLY!



      • #4
        Hello,

        BC4 uses a command line utility called PDF2Text, while Adobe itself is the official PDF application. This sounds like a bug with Nuance Power PDF if Adobe, PDF2Text, and Microsoft Printing all see the Security set to prevent extraction. It looks like Nuance is the only utility reporting that extraction is allowed.

        To clarify, if you open the pdf file in Adobe Acrobat or Reader, and use the File menu -> Save As (or Export) -> as Plain Text (.txt) what error message is shown? If the official Adobe application blocks this, then I would expect most other applications to also fail, including BC4. If this process works, then we should get a copy of the problem PDFs emailed into support along with your BCSupport.zip (Help menu -> Support; Export) so we can test directly against them. If you can reply to your original email thread, and include a link to this forum thread, it would allow us to join these reports and better track the issue (what we've done and what we can still try to do).
        Aaron P Scooter Software



        • #5
          I should add that when contacting Nuance, they are likely going to be concerned if you can show the behavior issue in an official Adobe product, so you'll still want to run the above Save As Text test in Adobe Reader to see if extraction is allowed and what error is presented.
          Aaron P Scooter Software



          • #6
            Is there another Third Party extraction tool that ignores the Page Extraction setting? I am finding that MOST PDF files do not allow page extraction, and it's a real nuisance. I will need to turn off the PDF format recognition and just compare those files as if they are binary, which only tells me if they're equal, and offers no intelligence about what the differences are.



            • #7
              Hello,

              I'm not familiar with any specific software to circumvent the PDF security settings. In general, it would be better to contact the provider generating the PDFs with the security settings enabled, as that is not default behavior and they are enabling it on purpose.
              Aaron P Scooter Software



              • #8
                Hi Aaron,

                Thanks for your reply. I think that the page extraction prohibition is a default in most PDF generators, and most people who create PDF files have no clue what the settings are, and never even think of adjusting them. In fact, I am the one who produced some of the PDF files that I am having problems with, and I have no idea how to correct them. A few of these files were produced by scanning hundreds of pages of text, and the individual page scans are now long gone, as well as the paper that produced them. I don't even know what software I used to spool all these pages into PDF form, it was so long ago... Anyway, given that there are hundreds of thousands of people spewing PDF files today, I think it's impossible to get them all to learn and apply a best practice for making files that can be compared via BC! :-) That being said, I still love your software. Adobe is at fault for claiming that they are open, but never has a company been so closed! I mean, come on, "page extraction" as a disallowed access right while printing and viewing are still allowed? Totally ridiculous. I think that Adobe does silly things simply because their SW engineers have their management supine over a barrel, and because "They Can." Sorry about my ranting and raving.
                -JB



                • #9
                  FYI, since printing is still allowed from a PDF file that disallows page extraction, I tried printing to another PDF file using "PrimoPDF". I guess you can think of this as a PDF "proxy" when one PDF file feeds into another. Unfortunately, it copied the settings over from the previous file, so the file printed as PDF still had page extraction disabled. Not only is page extraction a stupid attribute, but it's one that's communicable. I think I need to contact the CDC.



                  • #10
                    Thanks, and you are right, I was mixing up Page Extraction with Security Method. Page Extraction shouldn't block BC4's Text Compare from extracting and viewing the text of the PDF. What error are you seeing, and are there any other Properties set on the PDF (check Document Properties from within Adobe)?
                    Aaron P Scooter Software



                    • #11
                      Hi Aaron, thanks for asking for more info. I don't know how to expand the Error message in BC4 to get more information about it. Please see the attachments:
                      1) Security properties of left side document.
                      2) Security properties of right side document.
                      3) BC4 window showing the error.
                      You can use the filenames shown in the headings to infer where you can download your own copies of the documents if you wish to reproduce the error yourself. My version of BC4 is 4.2.9, build 23626, 64-bit, running on Windows 7. FYI, in "Tools-->File Formats", I loaded Factory Defaults. Thanks! -JB
                      Attached Files



                      • #12
                        Hello,

                        PdfToText.exe is the conversion program we're using in the background. You can run the command line directly to test:
                        pdftotext.exe -enc UTF-8 -table -nopgbrk "c:\temp\sourcefile.pdf" "c:\temp\output.txt"

                        Are there errors on the command line?

                        Also, from the screenshot, it looks like those PDFs (at least) start with a picture. The Text Compare can only compare text data within the PDF files (selectable and exportable text). You can also use Adobe's File menu -> Save As Text to see which text data is in the PDF. If the files are entirely scanned pictures (of text), then the output is actually empty, which would return a conversion error.
                        Aaron P Scooter Software
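
For anyone who wants to repeat the command-line test above on several files, here is a minimal Python sketch. It assumes a pdftotext executable is on the PATH and accepts the same options quoted in the post; the function names and the example paths are only illustrative and are not part of Beyond Compare.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(source_pdf, output_txt, exe="pdftotext"):
    """Return the argument list matching the command quoted in the post."""
    return [exe, "-enc", "UTF-8", "-table", "-nopgbrk",
            str(source_pdf), str(output_txt)]


def run_pdftotext(source_pdf, output_txt, exe="pdftotext"):
    """Run the converter; return (exit_code, stderr_text, output_size_bytes).

    Raises FileNotFoundError if the executable cannot be found.
    """
    output_path = Path(output_txt)
    result = subprocess.run(
        build_pdftotext_command(source_pdf, output_path, exe),
        capture_output=True,
        text=True,
    )
    # A missing output file is reported as size 0, same as an empty one.
    size = output_path.stat().st_size if output_path.exists() else 0
    return result.returncode, result.stderr.strip(), size


if __name__ == "__main__":
    # Hypothetical paths, mirroring the example in the post.
    code, errors, size = run_pdftotext(r"c:\temp\sourcefile.pdf",
                                       r"c:\temp\output.txt")
    print("exit code:", code)
    print("stderr:", errors or "(none)")
    print("output size in bytes:", size)
```

An exit code of 0 together with an output size of 0 would suggest that the tool ran cleanly but found no selectable text, for example in a PDF made entirely of scanned pages.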



                        • #13
                          Hi Aaron,

                          I think you nailed it on the head! I ran the command using the exact options you specified, and there were no errors, with a ZERO length file produced as the output.

                          I recall now that you actually told me once that PDF comparisons are text only. For some reason I thought that BC would perform an image comparison of the embedded images. Now that I think about it, that's actually very difficult because you'd have to first find all the bounding boxes inside the whole PDF file, then properly wrap each one in a valid PDF envelope, and render them to a temp directory. THEN finally compare, and present them back within the BC4 viewer, interleaved with any text, while preserving the original page order. This sounds like a long and arduous development process!

                          May I suggest that in a future update, you return the message "NO TEXT IN PDF" instead of Conversion Error?

                          RESULTS:

                          [Wed 02/06/2019 16:50] (E:\HTDocs\MTHS1969.COM\ODRANOEL_1969\PDF): fortune

                          -----------------------------------------------------------------------------
                          Tell a man that there are 300 billion stars in the universe, and he'll believe
                          you.... Tell him that a bench has wet paint upon it and he'll have to touch it
                          to be sure.
                          -----------------------------------------------------------------------------

                          [Wed 02/06/2019 16:50] (E:\HTDocs\MTHS1969.COM\ODRANOEL_1969\PDF): dir
                          Volume in drive E is E_Corsair
                          Volume Serial Number is 66D2-A0AD

                          Directory of E:\HTDocs\MTHS1969.COM\ODRANOEL_1969\PDF

                          02/06/2019 04:49 PM <DIR> .
                          02/06/2019 04:49 PM <DIR> ..
                          07/29/2018 09:00 AM 555,244,475 Odranoel_1969_Yearbook.pdf
                          07/29/2018 09:07 AM 1,105,923,489 Odranoel_1969_Yearbook_HiRes.pdf
                          07/29/2018 08:53 AM 15,505,345 Odranoel_1969_Yearbook_LowRes.pdf
                          3 File(s) 1,676,673,309 bytes
                          2 Dir(s) 1,243,772,272,640 bytes free

                          [Wed 02/06/2019 16:50] (E:\HTDocs\MTHS1969.COM\ODRANOEL_1969\PDF): pdftotext -enc UTF-8 -table -nopgbrk Odranoel_1969_Yearbook.pdf Odranoel_1969_Yearbook.txt

                          [Wed 02/06/2019 16:51] (E:\HTDocs\MTHS1969.COM\ODRANOEL_1969\PDF): type Odranoel_1969_Yearbook.txt

                          [Wed 02/06/2019 16:51] (E:\HTDocs\MTHS1969.COM\ODRANOEL_1969\PDF): dir
                          Volume in drive E is E_Corsair
                          Volume Serial Number is 66D2-A0AD

                          Directory of E:\HTDocs\MTHS1969.COM\ODRANOEL_1969\PDF

                          02/06/2019 04:51 PM <DIR> .
                          02/06/2019 04:51 PM <DIR> ..
                          07/29/2018 09:00 AM 555,244,475 Odranoel_1969_Yearbook.pdf
                          02/06/2019 04:51 PM 0 Odranoel_1969_Yearbook.txt
                          07/29/2018 09:07 AM 1,105,923,489 Odranoel_1969_Yearbook_HiRes.pdf
                          07/29/2018 08:53 AM 15,505,345 Odranoel_1969_Yearbook_LowRes.pdf
                          4 File(s) 1,676,673,309 bytes
                          2 Dir(s) 1,243,772,272,640 bytes free

                          [Wed 02/06/2019 16:51] (E:\HTDocs\MTHS1969.COM\ODRANOEL_1969\PDF): fortune

                          -----------------------------------------------------------------------------
                          I have as much authority as the Pope, I just don't have as many people who
                          believe it.
                          -- George Carlin
                          -----------------------------------------------------------------------------



                          THANKS!

                          -John



                          • #14
                            Thanks. It's a little tricky to handle the 0 size output without an error code, since the file could legitimately appear empty, but improving this feedback is something on our wishlist.
                            Aaron P Scooter Software
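
To illustrate the distinction discussed above, here is a rough Python sketch of how a wrapper script might separate a real converter failure from an empty result. This is not how Beyond Compare works internally; the function name and the message strings are hypothetical, and treating whitespace-only output as "no text" is an assumption.

```python
def classify_conversion(exit_code, output_text):
    """Label the result of a PDF-to-text conversion.

    exit_code   -- the converter's process exit code (0 assumed to mean success)
    output_text -- the contents of the generated text file
    """
    if exit_code != 0:
        # The converter itself reported a failure.
        return "conversion error"
    if not output_text.strip():
        # The converter succeeded but produced nothing readable, as
        # happens with PDFs that contain only scanned images.
        return "no text in PDF"
    return "ok"


if __name__ == "__main__":
    print(classify_conversion(0, ""))        # empty output, clean exit
    print(classify_conversion(1, ""))        # converter failed
    print(classify_conversion(0, "Page 1"))  # text was extracted
```

The empty case is reported as its own message rather than as an error because, as noted above, an empty result can be legitimate.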

