rsync and different UTF normalization in APFS vs HFS+
jmr at macports.org
Mon Jul 6 17:56:41 UTC 2020
Ces VLC wrote:
> On Sat, Jul 4, 2020 at 2:43 AM Jim DeLaHunt <list+macports-users at jdlh.com>
>> [...] I hope you see the distinctions I'm trying to explain. And, I hope
>> this helps you figure out a solution. Please let the list know what you
>> find out.
> Thanks a lot, Ryan and Jim, for your messages and for the great information
> you provided. It's very complete, and, yes, Jim, what you described is the
> cause of the problem: rsync just transmits file names as verbatim raw
> sequences of bytes with no conversion at all.
> IMHO, the correct way of fixing this shouldn't be by manually converting
> the encodings yourself with the '--iconv' flag, but actually with a flag
> for performing the check after normalization, which AFAIK doesn't exist (it
> wouldn't matter what normalization, just apply the same normalization to
> all file names before comparing them, and then discard the normalization).
> What I mean is, what's the purpose of rsync considering as different two
> files whose name is identical when being displayed in a terminal? Two
> identical text strings can be normalized in different ways (for example:
> accents in separated codes, or in composed codes), but they are the same
> text. So, if the text is the same, why consider them as different file
Sounds like a perfectly valid feature request for the rsync project.
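As an illustration only, here is a minimal Python sketch of the idea being requested: two names that display identically differ as raw bytes, but compare equal once both are put through the same normalization. The file name and the helper same_after_normalization are hypothetical, and none of this reflects how rsync is actually implemented.

```python
import unicodedata

# "é" as a single precomposed code point (NFC form).
nfc_name = unicodedata.normalize("NFC", "caf\u00e9.txt")
# "e" followed by a combining acute accent (NFD form, close to what HFS+ stores).
nfd_name = unicodedata.normalize("NFD", "caf\u00e9.txt")


def same_after_normalization(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two names after applying the same normalization to both.

    The choice of form does not matter for the comparison, as long as
    both names go through the same one.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# A byte-for-byte comparison sees two different names.
raw_equal = nfc_name.encode("utf-8") == nfd_name.encode("utf-8")
# A comparison after normalization sees one name.
normalized_equal = same_after_normalization(nfc_name, nfd_name)
```

Here raw_equal is False and normalized_equal is True, with either "NFC" or "NFD" passed as the form.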
> I don't understand why such '--normalize-before-compare' flag doesn't exist
> (I insist: no need to specify the normalization algorithm, just apply the
> same algorithm to all file names). It would fix all these problems in an
> elegant and clean way, and, BTW, this would be the behaviour everybody
> expects, if I'm not missing any point here.
It probably just didn't come up before APFS became widespread on macOS.
And still doesn't come up if all your filenames are ASCII.
This behaviour has the slight disadvantage of being technically
incorrect on normalization-sensitive filesystems. On your typical Linux
system, it's entirely possible to have two filenames that differ only in
normalization. And you know if it's possible, then someone somewhere has
a workflow that depends on it.
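A rough way to check whether a given set of names would be affected, sketched in Python under the assumption that the names are already decoded to text: group them by their normalized form and report any group with more than one member. The function normalization_collisions and the sample listing are hypothetical.

```python
import unicodedata
from collections import defaultdict


def normalization_collisions(names, form="NFC"):
    """Return groups of names that become identical after normalization."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize(form, name)].append(name)
    # Only groups with two or more distinct spellings are a problem.
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical listing: the first two names differ only in normalization,
# which a normalization-sensitive filesystem allows side by side.
listing = ["caf\u00e9.txt", "cafe\u0301.txt", "notes.txt"]
collisions = normalization_collisions(listing)
```

On a tree where collisions comes back non-empty, comparing after normalization would treat files that are distinct on that filesystem as the same file.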
It might make sense to have normalize-before-compare turned on by
default on Darwin, and off by default elsewhere, with a flag to enable
or disable as needed. As you say, it could sometimes be preferable
behaviour even on normalization-sensitive systems.
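The platform-dependent default could look something like the following Python sketch. The option name is taken from the suggestion above and is hypothetical; no such flag exists in rsync as far as this thread knows, and sync-sketch is a made-up program name. It assumes Python 3.9 or later for argparse.BooleanOptionalAction.

```python
import argparse
import sys


def build_parser(platform: str = sys.platform) -> argparse.ArgumentParser:
    """Build a parser whose normalization default depends on the platform."""
    parser = argparse.ArgumentParser(prog="sync-sketch")
    parser.add_argument(
        "--normalize-before-compare",
        # Also generates the matching --no-normalize-before-compare option.
        action=argparse.BooleanOptionalAction,
        # On by default on Darwin (sys.platform == "darwin"), off elsewhere.
        default=(platform == "darwin"),
        help="compare file names after applying a common Unicode normalization",
    )
    return parser


darwin_default = build_parser("darwin").parse_args([]).normalize_before_compare
linux_default = build_parser("linux").parse_args([]).normalize_before_compare
linux_enabled = build_parser("linux").parse_args(
    ["--normalize-before-compare"]
).normalize_before_compare
darwin_disabled = build_parser("darwin").parse_args(
    ["--no-normalize-before-compare"]
).normalize_before_compare
```

With this arrangement either default can be overridden explicitly on the command line.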