Largest file size supported by binary comparison

  • grindax
    Fanatic
    • Feb 2011
    • 173

    What is the largest file size that can be compared using binary comparison?

    I tried comparing 2 identical zip files (each one containing 2 files: a small text file and a 30 GB data file).

    When opening the comparison and trying to force a CRC comparison or a Rules-based comparison, BC would simply indicate instantly that all the files are a binary match. It cannot possibly know this, because the comparison takes no time at all, and one of the zip files is on an internal drive while the other is on an external USB drive. Force refreshing gives the same result: I'm instantly told that they're binary matches.

    But then if I try to force a Binary Comparison (instead of CRC), then it actually starts doing some work. But after a few minutes, I just get an error and the comparison process fails. So I guess there is some kind of limit.

    I'm running v4.1.9 64-bit on a Windows 10 system with 8GB RAM. In settings, I have configured the buffer size for binary compare to 33554432.
  • Aaron
    Team Scooter
    • Oct 2007
    • 15997

    #2
    Hello,

    There's no known upper limit, and we use smaller blocks when scanning, so the amount of RAM used doesn't increase with the file size. A Rules-based scan might fail depending on the file extension, as certain session types have upper limits on the file size. CRC and Binary scans should complete successfully after a scan period. Note: a binary scan can return unequal very quickly if the sizes are different or if it finds a difference early in the file. An equal scan, however, will take longer because it has to scan the entire file. On my own machine, I quickly generated and tested 30 GB and 40 GB files without seeing the same problem.

    If you reboot your machine and power cycle the external hdd, does this impact the issue? If you compare a pair of test folders that are only on the external drive, do they show the same issue? Or a sample pair of files that exist only on the local drive?
    Aaron P Scooter Software
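
The block-wise scanning described in this reply can be illustrated with a minimal Python sketch. This is not Beyond Compare's code; the function name `files_identical` and the default block size (32 MiB, the same value as the 33554432-byte buffer setting mentioned in the first post) are assumptions chosen for illustration. It shows why memory use stays flat regardless of file size, why an unequal result can come back quickly, and why an equal result requires reading both files to the end.

```python
import os

def files_identical(path_a, path_b, block_size=32 * 1024 * 1024):
    """Compare two files block by block; memory use is bounded by block_size."""
    # Files of different sizes can never be a binary match, so stop immediately.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if block_a != block_b:
                return False  # the first differing block ends the scan early
            if not block_a:
                return True   # both files exhausted without finding a difference
```

Only two blocks are held in memory at any time, so a 30 GB file costs no more RAM than a 30 MB one.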

    • grindax
      Fanatic
      • Feb 2011
      • 173

      #3
      Hi Aaron

      Internal vs external drive makes no difference, and rebooting makes no difference.

      It is still the case that CRC Comparison instantly says the files inside the ZIP "folder" are binary matches. And Binary Comparison still fails after a few minutes.

      30/11/2016 18:36:32 Load comparison: F:\DellE7470_Factory_USB_Recovery_Media_20161114.zip <-> D:\DellE7470_Factory_USB_Recovery_Media_20161114.zip
      30/11/2016 18:40:36 Unable to retrieve D:\DellE7470_Factory_USB_Recovery_Media_20161114.zip\DellE7470_Factory_USB_Recovery_Media_20161114.bin: Invalid size or check sum of file
      30/11/2016 18:40:36 Unable to retrieve F:\DellE7470_Factory_USB_Recovery_Media_20161114.zip\DellE7470_Factory_USB_Recovery_Media_20161114.bin: Invalid size or check sum of file
      30/11/2016 18:40:36 Background content comparison completed in 4 minutes, 5 seconds


      I believe it is the ZIP file itself that is causing problems for Beyond Compare. Let me tell you more about the file.

      I created a system restore USB flash drive for my laptop, so that it can be restored to factory settings by booting from the USB flash drive. That process copied about 10GB of uncompressible data to the 32GB USB flash drive. I then used a popular utility program called ImageUSB to create an image file of that USB flash drive. The file it created is 32GB in size, which I guess is mostly empty blocks. I then used WinRAR to create a zip file of that 32GB mostly-empty file, together with a small text file describing the image, which resulted in a 10GB zip file.

      So maybe you can replicate this scenario on your side? Perhaps partially fill up a USB flash drive, then use ImageUSB to create an image of it (which will include empty blocks), and then use WinRAR to create a zip file containing that image file. Then make 2 copies of that ZIP file and try to compare them in Beyond Compare.

      • Aaron
        Team Scooter
        • Oct 2007
        • 15997

        #4
        Ah hah! I missed "zip" in the first post. Sorry about that. I've created the large archive with very large content and I'm seeing the same behavior you are. I'll make a tracker entry to investigate. If you use the right-click Compare Contents command to perform a foreground scan, how does this work for you?
        Aaron P Scooter Software

        • grindax
          Fanatic
          • Feb 2011
          • 173

          #5
          Originally posted by Aaron
          If you use the right-click Compare Contents command to perform a foreground scan, how does this work for you?
          It works like I described:
          CRC Comparison instantly says the files are binary matches. And Binary Comparison fails after a few minutes (see the log file contents I posted in the previous message).

          When you say you are seeing the same behavior, do you mean that you see both of these incorrect things happen?

          • Aaron
            Team Scooter
            • Oct 2007
            • 15997

            #6
            Hello,

            Chatting with a dev, I've learned something new: when working with a .zip the CRC values are stored as part of the zip and we use those for both CRC and Rules-based scan results (if CRC equal). This will return nearly instantly. A binary scan will ignore the stored CRC and will take a long time to scan the files. This is all intended behavior; does your scenario differ in any way?
            Aaron P Scooter Software
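
To see why a comparison based on stored CRC values returns almost instantly, here is a minimal sketch using Python's standard `zipfile` module. It is not how Beyond Compare is implemented; the function names `stored_crcs` and `metadata_matches` are hypothetical. It only shows that the CRC-32 and uncompressed size of each member can be read from the archive's directory without decompressing any data.

```python
import zipfile

def stored_crcs(zip_path):
    """Return {member name: (stored CRC-32, uncompressed size)} from the zip's directory."""
    with zipfile.ZipFile(zip_path) as zf:
        # infolist() uses only the archive's directory; no member data is decompressed.
        return {info.filename: (info.CRC, info.file_size) for info in zf.infolist()}

def metadata_matches(zip_a, zip_b):
    """True if both archives list the same members with the same stored CRC and size."""
    return stored_crcs(zip_a) == stored_crcs(zip_b)
```

Because the member data is never read, the cost is the same for a 30 GB member as for a 30-byte one, and the result says nothing about whether the compressed data still matches its recorded CRC.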

            • grindax
              Fanatic
              • Feb 2011
              • 173

              #7
              OK, but are you seeing the problem where the Binary Comparison fails after a few minutes, like I do where it fails after 4 minutes and 5 seconds in my example?

              Concerning the other problem, i.e. the CRC Comparison, I think it's a very bad idea to rely on the CRC in the ZIP when trying to compare the actual contents. Sometimes the contents of a zip file are corrupt, and then the content won't match the CRC saved in the metadata of the zip. Just the other day I was extracting some RAR files, and WinRAR warned me when extracting that some of the files did not match their stored CRC values. Somehow, the contents of the archive had become corrupt. That is why archive tools usually also have a Test function where they compare the contents to the stored checksums.

              When using Beyond Compare to compare the actual contents of an archive, the expectation is that it is comparing the actual contents by calculating CRC values. Surely it should be doing pretty much the same thing that WinRAR/Winzip do when extracting/testing, i.e. check the actual contents to find out whether there's any corruption/difference in the actual data.

              Currently, if using Beyond Compare's CRC comparison to compare a good ZIP with a corrupt ZIP, it will report that they're identical because of course the metadata will still match.
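
The kind of "Test" check described in this post can be sketched with Python's standard `zipfile` module, whose `ZipFile.testzip()` method reads every member and checks it against the CRC stored in the archive. This is only an illustration of the idea and makes no claim about how WinRAR, WinZip, or Beyond Compare work internally; the wrapper name `first_corrupt_member` is hypothetical.

```python
import zipfile

def first_corrupt_member(zip_path):
    """Return the name of the first member whose data fails its CRC check, or None."""
    with zipfile.ZipFile(zip_path) as zf:
        # testzip() reads every member in full and verifies it against the stored CRC-32.
        return zf.testzip()
```

Unlike a lookup of the stored values, this has to read and decompress all of the data, so it takes time proportional to the size of the archive's contents.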

              • Aaron
                Team Scooter
                • Oct 2007
                • 15997

                #8
                Hello,

                No, the binary scan is finishing without error and without needing to alter the binary buffer option. If you extract these items out of the zip, do you encounter any errors during a binary scan of those contents?

                Trusting CRC values is something BC4 will do if CRC is selected, while Binary does not compute CRC and would bypass these values. If CRC corruption is a concern, then you would want to use Binary (and we'll troubleshoot getting this working).
                Aaron P Scooter Software

                • grindax
                  Fanatic
                  • Feb 2011
                  • 173

                  #9
                  Originally posted by Aaron
                  No, the binary scan is finishing without error and without needing to alter the binary buffer option.
                  It took a lot of back-and-forth but I'm glad you now finally see that I was reporting 2 separate problems.

                  Originally posted by Aaron
                  If you extract these items out of the zip, do you encounter any errors during a binary scan of those contents?
                  If I extract the items out of the zip and do a binary scan on them, there are no errors.

                  Originally posted by Aaron
                  Trusting CRC values is something BC4 will do if CRC is selected
                  Can you please ask the developers to read this thread in its entirety? I still strongly disagree with the current design: even when explicitly choosing to Compare Contents via the Rules dialog, or by right-clicking on a specific file within an expanded ZIP and choosing to Compare Contents, the contents by default are not actually compared at all. At best, it could be called 'Compare metadata'.

                  In the current implementation the little icon in the middle that shows a binary match is highly misleading, because the contents have not been compared at all by Beyond Compare. Any corrupt content within zip files will be shown by Beyond Compare with a little binary match icon. I need to be able to trust that there is no possibility that my file comparison tool is showing me false information.

                  For other file types, Beyond Compare calculates the CRC values itself. I bet that not many people know or expect that when they ask Beyond Compare to compare their archive files' contents, by default Beyond Compare is not comparing the contents at all, and this can surely get people into trouble. Relying on an archive's built-in CRC values in its metadata is fine for a quick comparison, but is of no use when doing a content comparison by any method. It should not be necessary to have to choose Binary Comparison. CRC Comparison should be reliable too.

                  • Aaron
                    Team Scooter
                    • Oct 2007
                    • 15997

                    #10
                    Hello,

                    Sorry, I missed mentioning that I expected foreground Binary scan to work when I suggested you try it. Binary scan was always functioning for me, and I was attempting to troubleshoot your "instant results".

                    A dev has reviewed this, hence my post stating that it's expected behavior and my updated explanation based on what they told me. The Binary scan is designed for full data verification, while CRC and Rules-based scans are by design made to be quicker or to ignore differences. If CRC always ignored the stored CRC code in the zip, it would simply be a slower, less reliable Binary scan and would afford no advantage. The CRC scan is provided specifically for scenarios where you can trust CRC codes, so that scans against zip files or FTP servers that provide xCRC codes can be done quickly. Similarly, a corruption that occurs outside of visible data would not be caught by a Rules-based scan. If you need to verify an exact match and expect to deal with corrupt sources, then you need to use the Binary scan.

                    If you are dealing with corrupt archives, this could be part of why the binary scan is failing. If you run the WinZip "Test" function on both archives (left and right), does either of them report a failure? If you work with new test archives that you create by manually zipping up some large files, do all archives crash similarly?
                    Aaron P Scooter Software

                    • grindax
                      Fanatic
                      • Feb 2011
                      • 173

                      #11
                      Originally posted by Aaron
                      If you are dealing with corrupt archives, this could be part of why the binary scan is failing.
                      There is no corruption involved in the particular case we are discussing in this thread. It is a healthy ZIP file.

                      So the summary is:
                      There's nothing wrong with the zip. It's a new, healthy archive, and extracting it works fine.
                      If I use Beyond Compare to do a binary scan of the extracted contents, everything is fine.
                      If I use Beyond Compare to do a binary scan of the contents of the ZIP, while they are still in the ZIP, it fails with the error messages I quoted previously.

                      I wonder if you tried the steps I outlined a few posts back, where I described exactly how I'd created this zip file.

                      • Aaron
                        Team Scooter
                        • Oct 2007
                        • 15997

                        #12
                        Hello,

                        In my original testing, I did not have a system image, so I used another set of very large test data (a 30 GB nearly empty file and a 30 GB random file), compressed with the latest release of 64-bit WinRAR for Windows. These neither hung the application nor threw an error. I've now taken the same data, placed it on a USB drive, captured it with ImageUSB, re-zipped it, and I do see a logged failure after several minutes. I'll pass this on to our developers on Monday to investigate what is triggering it.

                        As for the concentration on the corruption, you asked that I pass on your concerns to our developers, which I did, and this is the reasoning and response for our design choices with CRC vs Rules-based vs Binary. My follow-up was to determine if the same archive files were exhibiting a mix of both failures, one, or the other.
                        Aaron P Scooter Software

                        • grindax
                          Fanatic
                          • Feb 2011
                          • 173

                          #13
                          Originally posted by Aaron
                          I've taken the same data, placed it on a USB drive, captured it with ImageUSB, re-zipped, and I do see a logged failure after several minutes. I'll pass this on to our developers on Monday to investigate what is triggering it.
                          What were their findings?

                          • Aaron
                            Team Scooter
                            • Oct 2007
                            • 15997

                            #14
                            Hello,

                            Investigating. How much disc space is available on your C:\? BC4 would need to fully extract both items into your temp location when comparing within an archive; C:\ may have run out of disc space during the operation and we have a non-descriptive error message.
                            Aaron P Scooter Software
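
The disk-space question above can be checked with a rough Python sketch. It assumes, as this reply describes, that both sides of the comparison are fully extracted to the temp location at the same time; the function names `uncompressed_total` and `enough_temp_space` and the `free_bytes` parameter are hypothetical, and the sketch is not a description of Beyond Compare's actual extraction logic.

```python
import shutil
import tempfile
import zipfile

def uncompressed_total(zip_path):
    """Sum of the uncompressed sizes recorded for every member of the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return sum(info.file_size for info in zf.infolist())

def enough_temp_space(zip_paths, free_bytes=None):
    """True if the temp volume could hold every archive fully extracted at once."""
    if free_bytes is None:
        # Free space on the volume that holds the system temp directory.
        free_bytes = shutil.disk_usage(tempfile.gettempdir()).free
    needed = sum(uncompressed_total(p) for p in zip_paths)
    return needed <= free_bytes
```

Under that assumption, two archives that each contain the roughly 32 GB image described earlier in the thread would need about 64 GB of temp space in total, even though each zip file is only about 10 GB on disk.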

                            • grindax
                              Fanatic
                              • Feb 2011
                              • 173

                              #15
                              30GB free on my C:\
                              Last edited by grindax; 07-Dec-2016, 04:47 PM.
