rsync and different UTF normalization in APFS vs HFS+ (macports-users Digest, Vol 167, Issue 3)

Sat Jul 4 00:42:56 UTC 2020

Ces VLC, Ryan:

On 2020-07-03 04:00, Ryan Schmidt wrote:
> On Jul 3, 2020, at 04:53, Ces VLC wrote:
>
>> …in filenames that have UTF international characters, I often hit the problem of rsync deleting a file and then rewriting it again, just because the UTF normalization is not the same in both disks….
>>
>> …People suggest to use the --iconv flag, but... does this mean that you need to use different iconv settings depending on whether your transfer is APFS->HFS+ or HFS+->APFS? If affirmative, it would be a bit clumsy, IMHO (first detect the disk FS, then choose proper flags).
>>
>> Isn't there some way for dealing with this more conveniently, in a way that you don't need to check the disk FS before invoking rsync?
> The issue I'm familiar with is that there can be several valid ways to represent certain strings of UTF-8 characters….

I don't know nearly enough about rsync, so I hope Ces VLC finds a good
answer and that I can use it too. I don't know nearly as much as Ryan
about macports, and I am grateful for all Ryan's work on macports.
However, I do know a bit about Unicode, and I have recently read up a 
bit on filenames in APFS, HFS+, and ext3/4 of Linux. Let me try to 
explain the difference between filenames which I suspect Ces is 
encountering. I will say something similar to Ryan, but with important 
differences in terminology related to Unicode. I may get details of the 
file systems wrong. And, none of my examples are tested, so some of them 
may be incorrect.

Fundamental question: when is a filename {Na} on file system A the 
"same" as filename {Nb} on filesystem B? The answer is complex.

Fundamental fact: different filesystems store filenames as different 
data structures, with different semantics attached to the data. 
Comparing filename {Na} to {Nb} requires converting {Na} to the data 
structures used in filesystem B, and doing the appropriate kind of 
comparison.

  * HFS+ stores filenames as 16-bit code units with UTF-16BE semantics.
    The file system API receives filenames as an array of Unicode
    characters. It normalises the name to NFD(-ish) before writing. IIRC
    an HFS+ file system can be case-insensitive (more common) or
    case-sensitive.
  * APFS stores filenames as 8-bit code units with UTF-8 semantics, and
    also as a 22-bit hash. The file system API receives filenames as an
    array of Unicode characters. It does not normalise the name when
    writing; the filename's characters are preserved in the filesystem.
    It also computes the 22-bit hash from the filename. However, the
    filesystem can be configured to normalise the filename before using
    it to compute the hash. Thus the filesystem API can do
    normalisation-insensitive comparison of filenames, by comparing
    their hash values but not the filename code units.  See
    <https://developer.apple.com/support/downloads/Apple-File-System-Reference.pdf>,
    section "j_drec_hashed_key_t".
  * ext3/4 stores filenames as 8-bit code units with no semantics
    (except that byte values 0x00 and 0x2F '/' are special). The POSIX
    file system API receives filenames as 8-bit code units and writes
    them as is. The filename's bytes are preserved in the filesystem.
    Filename comparisons are 8-bit code unit to code unit, with no
    interpretation as Unicode, or Unicode normalisation. Thus
    comparisons are normalisation-sensitive.
  * I suspect (but haven't confirmed) that rsync transmits filenames as
    sequences of bytes, possibly converted to UTF-8 code units via
    --iconv, but without any normalisation.
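
To make the differences above concrete, here is a minimal Python sketch
of how one name might end up stored and compared under each scheme. It
is a toy model, not any file system's real API: the function names are
hypothetical, standard NFD from Python's unicodedata module stands in
for the "NFD(-ish)" variant of HFS+, and the APFS hash is left out.

```python
import unicodedata

def hfs_plus_stored(name: str) -> bytes:
    # Assumption: standard NFD approximates the HFS+ normalisation.
    return unicodedata.normalize("NFD", name).encode("utf-16-be")

def apfs_stored(name: str) -> bytes:
    # Characters preserved as given, stored with UTF-8 semantics.
    return name.encode("utf-8")

def ext_stored(raw_name: bytes) -> bytes:
    # Opaque bytes, written as is.
    return raw_name

def same_name_bytewise(a: str, b: str) -> bool:
    # Normalisation-sensitive comparison, code unit by code unit.
    return a.encode("utf-8") == b.encode("utf-8")

def same_name_normalised(a: str, b: str) -> bool:
    # Normalisation-insensitive comparison (hash details omitted).
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "5\u00C5.svg"   # U+00C5
decomposed = "5A\u030A.svg"   # U+0041 U+030A

print(hfs_plus_stored(precomposed) == hfs_plus_stored(decomposed))
print(apfs_stored(precomposed).hex(), apfs_stored(decomposed).hex())
print(same_name_bytewise(precomposed, decomposed))
print(same_name_normalised(precomposed, decomposed))
```

In this model the two spellings collapse to one stored form under the
HFS+-like function, stay distinct under the APFS-like and ext-like
functions, and only the normalised comparison treats them as the same
name.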

Terminology:

  * Unicode character: an abstract concept named by an integer value
    between 0 and about 1.1 million (0x10FFFF).
  * Code unit: a unit of storage for characters. Unicode defines 8-bit,
    16-bit, and 32-bit code units. The 16-bit and 32-bit code units have
    variants which map to bytes in "big-endian" (BE) and "little-endian"
    (LE) forms.
  * UTF (Unicode Transformation Format): an algorithm for mapping
    between Unicode characters and code unit sequences of various
    lengths. UTF-8 maps between Unicode characters and sequences of 1-4
    8-bit code units. UTF-16BE maps between Unicode characters and
    sequences of 1-2 16-bit big-endian code units. UTF-32BE maps between
    Unicode characters and single 32-bit big-endian code units.
  * Normalisation: an algorithm for taking arbitrary Unicode character
    sequences and removing some differences in representation, so that
    they are more useful for certain operations. One of these operations
    is comparison for equality. Background: Unicode provides multiple
    ways to represent the same user-perceived writing system unit. E.g.
    U+212B Angstrom Sign (Å), U+00C5 Latin Capital Letter A with Ring
    Above (Å), and U+0041 Latin Capital Letter A followed by U+030A
    Combining Ring Above (Å, i.e. A˚) are different as code point
    sequences, but canonically equivalent, so normalisation maps all
    three to the same form. See UAX #15 /Unicode Normalization Forms/
    <http://www.unicode.org/reports/tr15/>.
  * NFD: a normalisation algorithm which mostly decomposes compound
    characters: U+00C5 (Å) becomes U+0041 U+030A (Å, i.e. A˚).
  * Sensitive and insensitive: whether a difference between characters
    is significant or not significant when testing for "is the same".
    File systems can be case-sensitive, in which case Case.txt and
    cAsE.Txt are different; or they can be case-insensitive, in which
    case the two names are the same. Similarly, file systems can be
    normalisation-sensitive, in which case 5Å.svg and 5Å.svg are
    different, or they can be normalisation-insensitive, in which case
    they are the same.
  * Preserving and not-preserving: whether a difference between
    characters, present when names are written to a file system, is
    still present when the file names are read back out of the file
    system. DOS 8.3 FAT filesystems are case-insensitive and
    case-non-preserving: write "case.txt", and you get back "CASE.TXT".
    Similarly, file systems can be normalisation-preserving or
    normalisation-non-preserving. If you write 5Å.svg and 5Å.svg to
    HFS+, which is normalisation-insensitive and
    normalisation-non-preserving, you get back 5Å.svg. If you write
    them to APFS, which is normalisation-/preserving/, though
    normalisation-insensitive, you get back the same 5Å.svg and 5Å.svg.
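
The terms above can be tried out with a short Python sketch. It uses
only the standard unicodedata module; the helper names (code_units,
code_points) are made up for this illustration, and standard NFD is
used, which may differ in detail from what HFS+ does.

```python
import unicodedata

def code_units(text: str, encoding: str, unit_bytes: int) -> int:
    # Number of code units needed to store text in the given UTF.
    return len(text.encode(encoding)) // unit_bytes

def code_points(text: str) -> list:
    # Unicode characters of text, written in U+XXXX notation.
    return [f"U+{ord(ch):04X}" for ch in text]

forms = {
    "Angstrom Sign": "\u212B",
    "A with Ring Above": "\u00C5",
    "A + Combining Ring": "A\u030A",
}

for label, text in forms.items():
    nfd = unicodedata.normalize("NFD", text)
    print(label, code_points(text), "NFD:", code_points(nfd),
          "UTF-8 units:", code_units(text, "utf-8", 1),
          "UTF-16BE units:", code_units(text, "utf-16-be", 2),
          "UTF-32BE units:", code_units(text, "utf-32-be", 4))

# A character beyond U+FFFF shows the upper end of the ranges.
emoji = "\U0001F600"
print(code_points(emoji),
      code_units(emoji, "utf-8", 1),
      code_units(emoji, "utf-16-be", 2),
      code_units(emoji, "utf-32-be", 4))
```

All three spellings have the same NFD result, U+0041 U+030A, even
though their code unit counts differ from one encoding to the next.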

So, the challenge which Ces VLC is giving to rsync is this: read a
filename {Na} from APFS filesystem A as Unicode characters from A's API,
convert the name to UTF-8 code units without touching normalisation,
convert it to a name {Nb} on HFS+ filesystem B using B's API, and save
the file in B. HFS+ on B normalises the name to {Nb_norm}. Later, rsync
again reads filename {Na} from A, converts it to UTF-8, and converts it
to name {Nb} on B. Is this the "same" as the existing filename
{Nb_norm} on B?
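
The round trip just described can be modelled in a few lines of Python.
This is only a sketch of the suspected mechanism, not of rsync's actual
internals: NfdStore is a hypothetical stand-in for an HFS+-like
destination, standard NFD approximates its normalisation, and the two
plan_ functions are invented for this example.

```python
import unicodedata

class NfdStore:
    """Toy destination that normalises names to NFD when writing."""

    def __init__(self):
        self._names = set()

    def write(self, name: str) -> None:
        # Normalisation-non-preserving: the caller's spelling is lost.
        self._names.add(unicodedata.normalize("NFD", name))

    def names(self) -> set:
        return set(self._names)

def plan_bytewise(source_names, dest_names):
    # Compare names as UTF-8 byte strings, with no normalisation.
    src = {n.encode("utf-8") for n in source_names}
    dst = {n.encode("utf-8") for n in dest_names}
    return sorted(src - dst), sorted(dst - src)

def plan_normalised(source_names, dest_names):
    # Normalise both sides first, then compare.
    src = {unicodedata.normalize("NFD", n) for n in source_names}
    dst = {unicodedata.normalize("NFD", n) for n in dest_names}
    return sorted(src - dst), sorted(dst - src)

source = {"5\u00C5.svg"}          # {Na}, precomposed spelling
dest = NfdStore()
for name in source:
    dest.write(name)              # first sync; stored as {Nb_norm}

to_copy, to_delete = plan_bytewise(source, dest.names())
print("bytewise:", to_copy, to_delete)
print("normalised:", plan_normalised(source, dest.names()))
```

On the second pass the bytewise plan wants to copy {Na} again and to
delete {Nb_norm}, which matches the delete-and-rewrite symptom reported,
while the normalised plan finds nothing to do.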

There is an rsync FAQ which might be relevant: "rsync recopies the same
files" <https://rsync.samba.org/FAQ.html#2>. It suggests fixing the
problem using --iconv to specify filename conversions. I haven't looked 
into rsync enough to know if it will solve the problem. The impression I 
get is that rsync will not "first detect the disk FS, then choose proper 
flags". I suspect you will have to do that when you invoke rsync, using 
your knowledge of the source and destination filesystems.

I'm sorry this is so wordy, and I hope you see the distinctions I'm 
trying to explain. And, I hope this helps you figure out a solution. 
Please let the list know what you find out.

Best regards,
      —Jim DeLaHunt, software engineer, Vancouver, Canada
