Is UTF8 broken?

  • johnm
    Enthusiast
    • Nov 2007
    • 31

    Is UTF8 broken?

    On BC3 a file with this name displays correctly in an FTP directory:

    %vc29¬1.50

    On BC4 build 17451 it displays as:

    %vc29Â¬1.50

    The login onto BC4 is as follows:

    28/12/2013 12:08:21 Username: GJ\johnm
    28/12/2013 12:08:22 Stat> Connected.
    28/12/2013 12:08:22 Recv> 220 %vv831 Version 4.0
    28/12/2013 12:08:22 Sent> HOST yak
    28/12/2013 12:08:22 Recv> 530 Not logged in
    28/12/2013 12:08:22 Sent> USER johnm
    28/12/2013 12:08:22 Recv> 331 Send password
    28/12/2013 12:08:22 Sent> PASS ********
    28/12/2013 12:08:22 Recv> 230 Logged in
    28/12/2013 12:08:22 Sent> FEAT
    28/12/2013 12:08:22 Recv> 211-Extras:
    28/12/2013 12:08:22 Recv> MLST MODIFY*;SIZE*;TYPE*;
    28/12/2013 12:08:22 Recv> XCRC
    28/12/2013 12:08:22 Recv> UTF8
    28/12/2013 12:08:22 Recv> 211 End
    28/12/2013 12:08:22 Sent> OPTS UTF8 ON
    28/12/2013 12:08:23 Recv> 200 Enabling UTF8
    28/12/2013 12:08:23 Sent> TYPE A
    28/12/2013 12:08:23 Recv> 200 Type set to A
    28/12/2013 12:08:23 Sent> SYST
    28/12/2013 12:08:23 Recv> 215 UNIX
    28/12/2013 12:08:23 Sent> TYPE A
    28/12/2013 12:08:23 Recv> 200 Type set to A
    28/12/2013 12:08:23 Sent> PWD
    28/12/2013 12:08:23 Recv> 257 "/"
    28/12/2013 12:08:23 Sent> REST 1
    28/12/2013 12:08:23 Recv> 502 Command not handled: REST 1
    28/12/2013 12:08:23 Sent> REST 0
    28/12/2013 12:08:23 Recv> 502 Command not handled: REST 0
    28/12/2013 12:08:23 Sent> PORT 10,101,68,205,199,246
    28/12/2013 12:08:23 Recv> 200 Port 51190 at 10.101.68.205
    28/12/2013 12:08:23 Sent> MLSD
    28/12/2013 12:08:23 Recv> 150 Opening data connection to 51190 at 10.101.68.205 for list
    28/12/2013 12:08:23 Recv> 226 List sent
    28/12/2013 12:08:23 Load comparison: ftp://johnm@yak:1021/ <->
    28/12/2013 12:08:23 Background content comparison completed in 1.83 seconds
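
For readers following the log above: the client only sends OPTS UTF8 ON because the FEAT reply lists UTF8. A minimal sketch of that check, assuming the reply lines have already been read as text. The function name `parse_feat` is made up for illustration and is not part of BC or of any FTP library.

```python
def parse_feat(reply_lines):
    """Return the set of feature names found in the lines of a FEAT reply."""
    features = set()
    for line in reply_lines:
        text = line.strip()
        # Skip empty lines and the "211-..." / "211 End" status lines.
        if not text or text[:3].isdigit():
            continue
        # The feature name is the first word; parameters may follow it.
        features.add(text.split()[0].upper())
    return features


# Reply lines taken from the log in the first post.
reply = ["211-Extras:", " MLST MODIFY*;SIZE*;TYPE*;", " XCRC", " UTF8", "211 End"]
features = parse_feat(reply)
print("UTF8" in features)  # True
```

With this reply the server advertises UTF8, MLST and XCRC, which matches the commands the client goes on to send.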
  • Zoë
    Team Scooter
    • Oct 2007
    • 2666

    #2
    Which OS are you running BC on, and what does your FTP profile have selected for the filename encoding ("Server" tab, I think)?
    Zoë P Scooter Software

    • johnm
      Enthusiast
      • Nov 2007
      • 31

      #3
      I've seen this on 32-bit Windows Vista and on 64-bit Windows 8.1.

      The FTP profile had Encoding set to "Detect", but changing this to UTF-8 or to ANSI didn't make a difference to what I'm seeing.

      • Zoë
        Team Scooter
        • Oct 2007
        • 2666

        #4
        When you connect using BC3, can you check the log to see if this string is anywhere in it?

        Received invalid UTF-8 sequence, switching to ANSI encoding.
        Zoë P Scooter Software

        • Zoë
          Team Scooter
          • Oct 2007
          • 2666

          #5
          To expand on that a bit: No, BC4's UTF-8 support is not broken. Your FTP server is returning invalid UTF-8 sequences and BC3 had a workaround that I haven't been able to bring forward.

          It generally affects Unix FTP servers, which just treat filenames as arbitrary byte sequences and don't try to validate them. Older Unix servers didn't say they supported UTF-8, so BC and other Windows FTP clients would assume that they were using an ANSI encoding instead, and send filenames that way. When the FTP server was upgraded it started saying it supported UTF-8, but it still sends those filenames back as-is, even though they aren't actually UTF-8. If you were to load the directory listing using Samba or a web server you should see the same behavior.

          BC3 detects that case and switches back to using ANSI all the time. It still ends up uploading invalid UTF-8 sequences, and is limited on what characters it can support, but keeps compatibility as long as you're only accessing the site through FTP. Unfortunately, as part of the BC4 upgrade we significantly changed the way we handle Unicode internally, and I haven't been able to figure out how to re-implement that workaround.
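
The fallback described above could look roughly like the following. This is only a sketch of the general idea, not BC3's actual code: the function name `decode_filename` is hypothetical, and cp1252 is an assumption standing in for whatever ANSI code page the client would use.

```python
def decode_filename(raw, fallback="cp1252"):
    """Decode a filename as UTF-8, falling back to an ANSI code page.

    Returns (text, encoding_used). The fallback encoding is an assumption.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: treat the bytes as ANSI instead. Bytes that the
        # code page does not define become U+FFFD rather than raising.
        return raw.decode(fallback, errors="replace"), fallback


print(decode_filename(b"http\xc2\xac1.0"))  # valid UTF-8
print(decode_filename(b"http\xac1.0"))      # lone AC byte, falls back
```

A lone AC byte is a UTF-8 continuation byte with no lead byte, so strict decoding rejects it and the second call takes the fallback path.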

          If you have telnet/ssh access to the server, the best fix is to use the convmv command to re-encode the filenames as UTF-8.
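
To illustrate what re-encoding the filenames means, here is a small dry-run sketch that works out which names would change. It is not a replacement for convmv; the function name `plan_renames` and the cp1252 source encoding are assumptions, and it only computes a plan without touching any files.

```python
def plan_renames(names, source_encoding="cp1252"):
    """Return (old, new) byte-string pairs for names that are not valid UTF-8.

    `names` is a list of raw byte filenames, e.g. from os.listdir(b".").
    The source encoding is an assumption about how the names were written.
    """
    plan = []
    for raw in names:
        try:
            raw.decode("utf-8")
            continue  # already valid UTF-8, leave it alone
        except UnicodeDecodeError:
            pass
        # Reinterpret the bytes in the assumed legacy encoding, then
        # re-encode the resulting text as UTF-8.
        fixed = raw.decode(source_encoding).encode("utf-8")
        plan.append((raw, fixed))
    return plan


names = [b"http\xac1.0 ~ DEV", b"plain.txt", b"ok\xc2\xac"]
for old, new in plan_renames(names):
    print(old, "->", new)
```

Names that are already valid UTF-8, including plain ASCII ones, are skipped, so running the plan twice would not double-encode anything.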
          Zoë P Scooter Software

          • Zoë
            Team Scooter
            • Oct 2007
            • 2666

            #6
            I see what you're saying about using ANSI not working, though. It looks like if the server says it uses UTF-8, BC4 will always use that, even if the profile is configured to use something different. I'll see if I can fix that for the next release.
            Zoë P Scooter Software

            • johnm
              Enthusiast
              • Nov 2007
              • 31

              #7
              As far as I can tell my FTP server is correctly UTF-8 encoding the filenames. I've even used Microsoft Network Monitor to capture what it sends to BC4.

              With debug logging enabled in BC4 here's what it reports:

              15/01/2014 20:17:00 Sent> MLSD
              15/01/2014 20:17:00 Recv> 150 Go ahead
              15/01/2014 20:17:00 Recv> 226 List sent
              15/01/2014 20:17:00 MODIFY=20100408194726;SIZE=1;TYPE=file; httpÂ¬1.0 ~ DEV
              15/01/2014 20:17:00 FTP ParserID: MLST

              The actual filename on the server is http¬1.0 ~ DEV

              The ¬ character is character code 172 decimal, which is AC hex (it sits in the ANSI/Latin-1 range, above 7-bit ASCII).

              Converted to UTF-8 it becomes two bytes, hex values C2 and AC.

              Network Monitor confirms that I am sending that hex sequence.

              So I don't understand why BC4 is displaying the C2,AC sequence as two separate characters Â¬ rather than deriving the single character ¬ by treating it as a UTF-8 sequence.
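
The byte arithmetic above can be checked in a few lines. This sketch assumes cp1252 as the ANSI code page; it only demonstrates the encoding and says nothing about how BC4 decodes internally.

```python
char = "\u00ac"  # NOT SIGN, the ¬ character (172 decimal, AC hex)

utf8_bytes = char.encode("utf-8")
print(utf8_bytes.hex())             # c2ac

# Decoded as UTF-8, the two bytes give back the single character.
as_utf8 = utf8_bytes.decode("utf-8")

# Decoded as an ANSI code page (cp1252 assumed), each byte becomes its own
# character: C2 is Â and AC is ¬, the two-character display reported above.
as_ansi = utf8_bytes.decode("cp1252")

print(len(as_utf8), len(as_ansi))   # 1 2
```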

              And to answer an earlier question you asked, no, my BC3 log shows no sign of a message "Received invalid UTF-8 sequence, switching to ANSI encoding."

              • Zoë
                Team Scooter
                • Oct 2007
                • 2666

                #8
                Ugh. Ok, I've reproduced the issue here. I must have been using an old build when I was testing it earlier. Yes, there's at least one thing wrong with BC's UTF-8 support, and I'll look into it. The fix may not make it into the next release.
                Zoë P Scooter Software

                • johnm
                  Enthusiast
                  • Nov 2007
                  • 31

                  #9
                  Any news on this? I can't do any more testing in the context we primarily use BC in until it's fixed.

                  John

                  • Zoë
                    Team Scooter
                    • Oct 2007
                    • 2666

                    #10
                    I found the issue, and it's specific to MLSD responses. If you can use LIST temporarily, that should work around the problem. I should be able to fix it for the next release.
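
For context on what an MLSD entry looks like on the wire, here is a rough sketch of parsing one line and decoding the name as UTF-8. It is not BC's parser; `parse_mlsd_line` is a hypothetical name, and the sample bytes are the debug-log line from post #7 with the ¬ written as C2 AC.

```python
def parse_mlsd_line(raw):
    """Split one MLSD entry into (facts, filename).

    The facts end with ';' and a single space separates them from the
    name, so only the first space is treated as the separator.
    """
    facts_part, _, name_part = raw.rstrip(b"\r\n").partition(b" ")
    facts = {}
    for fact in facts_part.decode("ascii").split(";"):
        if fact:
            key, _, value = fact.partition("=")
            facts[key.lower()] = value
    # With UTF8 enabled, the name bytes are assumed to be UTF-8.
    return facts, name_part.decode("utf-8")


raw = b"MODIFY=20100408194726;SIZE=1;TYPE=file; http\xc2\xac1.0 ~ DEV\r\n"
facts, name = parse_mlsd_line(raw)
print(facts["type"], len(name))
```

Splitting on the first space only matters here because the filename itself contains spaces.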
                    Zoë P Scooter Software

                    • johnm
                      Enthusiast
                      • Nov 2007
                      • 31

                      #11
                      Thanks Craig. Clearing the MLSD checkbox on the profile has allowed me to work around the issue.

                      John
