<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>Ces VLC, Ryan:<br>

    </p>

    <div class="moz-cite-prefix">On 2020-07-03 04:00, Ryan Schmidt

      wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:065F6164-8947-40E1-8344-2A264432D5C4@macports.org">

      <pre class="moz-quote-pre" wrap="">On Jul 3, 2020, at 04:53, Ces VLC wrote:

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">…in filenames that have UTF international characters, I often hit the problem of rsync deleting a file and then rewriting it again, just because the UTF normalization is not the same in both disks…. 

…People suggest to use the --iconv flag, but... does this mean that you need to use different iconv settings depending on whether your transfer is APFS->HFS+ or HFS+->APFS? If affirmative, it would be a bit clumsy, IMHO (first detect the disk FS, then choose proper flags). 

Isn't there some way for dealing with this more conveniently, in a way that you don't need to check the disk FS before invoking rsync?

</pre>

      </blockquote>

    </blockquote>

    <blockquote type="cite"

      cite="mid:065F6164-8947-40E1-8344-2A264432D5C4@macports.org">

      <pre class="moz-quote-pre" wrap="">

The issue I'm familiar with is that there can be several valid ways to represent certain strings of UTF-8 characters….</pre>

    </blockquote>

    <p>I don't know nearly enough about rsync, so I hope Ces VLC finds a

      good answer and that I can use it too.   I don't know nearly as

      much as Ryan about MacPorts, and I am grateful for all Ryan's work

      on MacPorts. However, I do know a bit about Unicode, and I have

      recently read up a bit on filenames in APFS, HFS+, and ext3/4 of

      Linux. Let me try to explain the difference between filenames

      which I suspect Ces is encountering. I will say something similar

      to Ryan, but with important differences in terminology related to

      Unicode. I may get details of the file systems wrong. And, none of

      my examples are tested, so some of them may be incorrect.<br>

    </p>

    <p>Fundamental question: when is a filename {Na} on file system A

      the "same" as filename {Nb} on filesystem B? The answer is

      complex.<br>

    </p>

    <p>Fundamental fact: different filesystems store filenames as

      different data structures, with different semantics attached to

      the data. Comparing filename {Na} to {Nb} requires converting {Na}

      to the data structures used in filesystem B, and doing the

      appropriate kind of comparison.</p>

    <ul>

      <li>HFS+ stores filenames as 16-bit code units with UTF-16BE

        semantics. The file system API receives filenames as an array of

        Unicode characters. It normalises the name to NFD(-ish) before

        writing. IIRC an HFS+ file system can be case-insensitive (more

        common) or case-sensitive.<br>

      </li>

      <li>APFS stores filenames as 8-bit code units with UTF-8

        semantics, and also as a 22-bit hash. The file system API

        receives filenames as an array of Unicode characters. It does

        not normalise the name when writing; the filename's characters

        are preserved in the filesystem. It also computes the 22-bit

        hash from the filename. However, the filesystem can be

        configured to normalise the filename before using it to compute

        the hash. Thus the filesystem API can do

        normalisation-insensitive comparison of filenames, by comparing

        their hash values but not the filename code units.  See &lt;<a

          moz-do-not-send="true"

href="https://developer.apple.com/support/downloads/Apple-File-System-Reference.pdf">https://developer.apple.com/support/downloads/Apple-File-System-Reference.pdf</a>&gt;,

        section "j_drec_hashed_key_t". <br>

      </li>

      <li>ext3/4 stores filenames as 8-bit code units with no semantics

        (except that byte values 0x00 and 0x2F '/' are special). The

        POSIX file system API receives filenames as 8-bit code units and

        writes them as is. The filename's bytes are preserved in the

        filesystem. Filename comparisons are 8-bit code unit to code

        unit, with no interpretation as Unicode, or Unicode

        normalisation. Thus comparisons are normalisation-sensitive.</li>

      <li>I suspect (but haven't confirmed) that rsync transmits

        filenames as sequences of bytes, possibly converted to UTF-8

        code units via --iconv, but without any normalisation. <br>

      </li>

    </ul>
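    <p>A minimal Python sketch of the three storage models in the list
      above, to make the differences concrete. It is an illustration
      under assumptions, not a description of the real implementations:
      standard NFD stands in for the NFD-ish form HFS+ uses, CRC-32
      reduced to 22 bits stands in for the real APFS hash (which the
      Apple reference defines in its own way), and all function names
      are hypothetical.</p>

```python
import unicodedata
import zlib


def hfs_plus_stored(name: str) -> bytes:
    """HFS+-like: normalise first, then store as UTF-16BE code units.

    Assumption: plain NFD approximates the NFD-ish form HFS+ uses.
    """
    return unicodedata.normalize("NFD", name).encode("utf-16-be")


def apfs_stored(name: str) -> tuple:
    """APFS-like: keep the UTF-8 code units as given, plus a 22-bit hash
    computed from the normalised name.

    Assumption: CRC-32 reduced to 22 bits is only a stand-in hash.
    """
    normalised = unicodedata.normalize("NFD", name).encode("utf-8")
    return name.encode("utf-8"), zlib.crc32(normalised) % 2**22


def ext4_stored(raw_name: bytes) -> bytes:
    """ext3/4-like: bytes in, the same bytes out, no Unicode semantics."""
    if b"\x00" in raw_name or b"/" in raw_name:
        raise ValueError("0x00 and 0x2F are not allowed in a filename")
    return raw_name


precomposed = "5\u00c5.svg"   # A-ring as U+00C5
decomposed = "5A\u030a.svg"   # A-ring as U+0041 U+030A

# HFS+-like: both spellings end up as the same stored code units.
print(hfs_plus_stored(precomposed) == hfs_plus_stored(decomposed))

# APFS-like: the stored bytes differ, but the hashes agree.
bytes_a, hash_a = apfs_stored(precomposed)
bytes_b, hash_b = apfs_stored(decomposed)
print(bytes_a == bytes_b, hash_a == hash_b)

# ext3/4-like: the two spellings are simply two different byte strings.
print(ext4_stored(precomposed.encode("utf-8"))
      == ext4_stored(decomposed.encode("utf-8")))
```

    <p>In this model only the APFS-like record can both return the name
      as written and match it against a differently normalised
      spelling, because it keeps two pieces of data per name.</p>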

    <p>Terminology: <br>

    </p>

    <ul>

      <li>Unicode character: an abstract concept named by an integer

        value between 0 and about 1.1 million (0x10FFFF).</li>

      <li>Code unit: a unit of storage for characters. Unicode defines

        8-bit, 16-bit, and 32-bit code units. The 16-bit and 32-bit code

        units have variants which map to bytes in "big-endian" (BE) and

        "little-endian" (LE) forms.<br>

      </li>

      <li>UTF (Unicode Transformation Format): an algorithm for mapping

        between Unicode characters and code unit sequences of various

        lengths. UTF-8 maps between Unicode characters and sequences of

        1-4 8-bit code units. UTF-16BE maps between Unicode characters

        and sequences of 1-2 16-bit big-endian code units. UTF-32BE maps

        between Unicode characters and single 32-bit big-endian code

        units.</li>

      <li>Normalisation: an algorithm for taking arbitrary Unicode

        character sequences and removing some differences in

        representation, so that they are more useful for certain

        operations. One of these operations is comparison for equality.

        Background: Unicode provides multiple ways to represent the same

        user-perceived writing system unit, e.g. U+212B Angstrom Sign

        (Å), U+00C5 Latin Capital Letter A with Ring Above (Å), and

        U+0041 Latin Capital Letter A  U+030A Combining Ring Above (Å,

        i.e. A˚) are different for some purposes, but the same for other

        purposes, such as comparison after normalisation. See UAX #15 <i>Unicode

          Normalization Forms</i> &lt;<a moz-do-not-send="true"

          href="http://www.unicode.org/reports/tr15/">http://www.unicode.org/reports/tr15/</a>&gt;.</li>

      <li>NFD: a normalisation algorithm which mostly decomposes

        compound characters: U+00C5 (Å) becomes U+0041 U+030A (Å, i.e.

        A˚).</li>

      <li>Sensitive and insensitive: whether a difference between

        characters is significant or not significant when testing for

        "is the same". File systems can be case-sensitive, in which case

        Case.txt and cAsE.Txt are different; or they can be

        case-insensitive, in which case the two names are the same.

        Similarly, file systems can be normalisation-sensitive, in which

        case 5Å.svg (precomposed, U+00C5) and 5Å.svg (decomposed, U+0041 U+030A) are different, or they can be

        normalisation-insensitive, in which case they are the same.</li>

      <li>Preserving and not-preserving: whether a difference between

        characters, present when names are written to a file system, is

        still present when the file names are read back out of the file

        system. DOS 8.3 FAT filesystems are case-insensitive and case-non-preserving:

        write "case.txt", and you get back "CASE.TXT". Similarly, file

        systems can be normalisation-preserving or

        normalisation-non-preserving. If you write the precomposed 5Å.svg (U+00C5)

        and the decomposed 5Å.svg (U+0041 U+030A) to HFS+, which is

        normalisation-insensitive and normalisation-non-preserving, you get

        back only the decomposed 5Å.svg. If you write them to APFS, which is

        normalisation-<i>preserving</i>, though normalisation-insensitive, you

        get back each name in the same form in which you wrote it.</li>

    </ul>
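    <p>The terms above can be tried out with Python's standard
      unicodedata module. This is a small sketch, not a model of any
      particular file system; the helper names are made up for the
      example, and the case-insensitive comparison is simplified
      (casefold applied after NFD, without the extra normalisation pass
      that full Unicode caseless matching calls for).</p>

```python
import unicodedata

# Three spellings of the same user-perceived letter.
SPELLINGS = {
    "ANGSTROM SIGN": "\u212b",
    "A WITH RING ABOVE": "\u00c5",
    "A + COMBINING RING": "\u0041\u030a",
}


def code_points(text: str) -> str:
    """Unicode characters written as U+XXXX labels."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def code_unit_counts(text: str) -> dict:
    """Number of code units each UTF needs for the text."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16BE": len(text.encode("utf-16-be")) // 2,
        "UTF-32BE": len(text.encode("utf-32-be")) // 4,
    }


def same_name(a: str, b: str, case_sensitive: bool = True,
              normalisation_sensitive: bool = True) -> bool:
    """Compare two names under the chosen sensitivity rules."""
    if not normalisation_sensitive:
        a = unicodedata.normalize("NFD", a)
        b = unicodedata.normalize("NFD", b)
    if not case_sensitive:
        a, b = a.casefold(), b.casefold()
    return a == b


for label, text in SPELLINGS.items():
    nfd = unicodedata.normalize("NFD", text)
    print(label, code_points(text), "NFD:", code_points(nfd),
          code_unit_counts(text))

print(same_name("Case.txt", "cAsE.Txt"))
print(same_name("Case.txt", "cAsE.Txt", case_sensitive=False))
print(same_name("5\u00c5.svg", "5A\u030a.svg"))
print(same_name("5\u00c5.svg", "5A\u030a.svg", normalisation_sensitive=False))
```

    <p>All three spellings have the same NFD form, U+0041 U+030A, which
      is why a normalisation-insensitive comparison treats them as the
      same name.</p>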

    <p>So, the challenge which Ces VLC is giving to rsync is this: read a

      filename {Na} from APFS filesystem A as Unicode characters from

      A's API, convert the name to UTF-8 code units, don't touch

      normalisation, convert to a name {Nb} on HFS+ filesystem B using

      B's API, and save the file in B. HFS+ on B normalises it to

      {Nb_norm}. Later, read filename {Na} from A, convert it to UTF-8,

      convert it to name {Nb} on B; is this the "same" as the existing

      filename {Nb_norm} on B?</p>
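    <p>The scenario above can be modelled in a few lines of Python. This
      is a sketch of the suspected behaviour only: it is not rsync's
      actual algorithm, plain NFD stands in for what HFS+ does, and the
      function names are hypothetical.</p>

```python
import unicodedata


def nfd(name: str) -> str:
    return unicodedata.normalize("NFD", name)


def write_to_hfs_plus(directory: set, name: str) -> None:
    """Model of B's API: the name is normalised before it is stored."""
    directory.add(nfd(name))


def plan_sync(source_names, dest_names, normalise: bool) -> dict:
    """Decide what to copy and what to delete by comparing name sets.

    normalise=False compares names character for character;
    normalise=True compares them after NFD.
    """
    key = nfd if normalise else str
    src = {key(n) for n in source_names}
    dst = {key(n) for n in dest_names}
    return {"copy": sorted(src - dst), "delete": sorted(dst - src)}


# {Na} on A: APFS keeps the precomposed spelling as written.
names_on_a = ["5\u00c5.svg", "plain.txt"]

# First run: every name is written to B, which stores {Nb_norm}.
names_on_b = set()
for name in names_on_a:
    write_to_hfs_plus(names_on_b, name)

# Second run: is {Nb} the "same" as {Nb_norm}?
naive = plan_sync(names_on_a, names_on_b, normalise=False)
aware = plan_sync(names_on_a, names_on_b, normalise=True)
print("naive:", naive)
print("aware:", aware)
```

    <p>In this model the naive comparison wants to delete the decomposed
      name on B and copy the precomposed name from A again, which
      matches the delete-and-rewrite symptom described in the question.
      The normalisation-aware comparison finds nothing to do.</p>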

    <p>There is an rsync FAQ entry which might be relevant: "rsync recopies

      the same files" &lt;<a moz-do-not-send="true"

        href="https://rsync.samba.org/FAQ.html#2">https://rsync.samba.org/FAQ.html#2</a>&gt;.

      It suggests fixing the problem using --iconv to specify filename

      conversions. I haven't looked into rsync enough to know if it will

      solve the problem. The impression I get is that rsync will not

      "first detect the disk FS, then choose proper flags". I suspect

      you will have to do that when you invoke rsync, using your

      knowledge of the source and destination filesystems.<br>

    </p>

    <p>I'm sorry this is so wordy, and I hope you see the distinctions

      I'm trying to explain. And, I hope this helps you figure out a

      solution. Please let the list know what you find out.</p>

    <p>Best regards,<br>

           —Jim DeLaHunt, software engineer, Vancouver, Canada<br>

    </p>

  </body>

</html>