inserting -stdlib=libstdc++ into cxxflags

Fri Mar 22 17:00:04 UTC 2024

On Sunday March 17 2024 03:44:00 Sergey Fedorov wrote:

>> but if libc++ 5 was maybe still a bit faster overall than libstdc++ the
>situation is now rather reversed though differences remain small

I take that back, the differences aren't always small!

I realised that the so-called "native" benchmark from the libcxx source tree could be used with the libstdc++ from (currently) port:libgcc13. It took some time to figure out how to inject the `-stdlib=macports-libstdc++` argument properly but once I got that working on Linux it transferred without further ado to Mac.

These results just in. Libc++ and all benchmarking code built with `clang++-mp-12 -O3 -march=native -flto`.

libstdc++ is indeed consistently faster. Usually by not much, around 13-15% less user CPU time in the first two comparisons below (though the differences in kernel time spent can be relatively important):

```
> build/libcxx/benchmarks/algorithms.partition_point.libcxx.out
Run on (4 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 1.13, 1.27, 1.34
<snip>
89.198 user_cpu 0.907 kernel_cpu 1:30.11 total_time 99.9%CPU {93360128M 0F 226542R 0I 0O 0k 0w 445c}
> build/libcxx/benchmarks/algorithms.partition_point.native.out
Run on (4 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 1.42, 1.35, 1.34
<snip>
75.612 user_cpu 0.911 kernel_cpu 1:16.53 total_time 99.9%CPU {102424576M 0F 229228R 0I 0O 0k 0w 504c}
```

```
> build/libcxx/benchmarks/algorithms.libcxx.out --benchmark_repetitions=1 --benchmark_filter='_262144$'
Run on (4 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 1.63, 1.75, 1.70
<snip>
220.386 user_cpu 2.961 kernel_cpu 3:43.50 total_time 99.9%CPU {79626240M 0F 542726R 0I 9O 0k 154w 3502c}
> build/libcxx/benchmarks/algorithms.native.out --benchmark_repetitions=1 --benchmark_filter='_262144$'
Run on (4 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 1.66, 1.79, 1.67
<snip>
190.844 user_cpu 2.615 kernel_cpu 3:13.50 total_time 99.9%CPU {89800704M 0F 504812R 0I 9O 0k 149w 1942c}
```

But observe this: as far as I understand it, the following is a "small" version of the above benchmark (filter `_1$` rather than `_262144$`), and it seems to highlight a huge overhead in libc++:

```
> build/libcxx/benchmarks/algorithms.libcxx.out --benchmark_repetitions=1 --benchmark_filter='_1$'
Run on (4 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 1.61, 1.37, 1.36
<snip>
1947.681 user_cpu 221.754 kernel_cpu 36:10.22 total_time 99.9%CPU {78262272M 0F 1503675R 0I 9O 0k 152w 22829c}
> build/libcxx/benchmarks/algorithms.native.out --benchmark_repetitions=1 --benchmark_filter='_1$'
Run on (4 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 1.63, 1.42, 1.36
<snip>
1056.593 user_cpu 8.435 kernel_cpu 17:45.51 total_time 99.9%CPU {78917632M 0F 1458187R 0I 9O 0k 154w 12805c}
```

Here the library from the "bloated" GCC is about twice as fast overall (17:45 versus 36:10 wall-clock), and uses roughly 26x less kernel CPU time (8.4 s versus 221.8 s)!
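
For anyone who wants to recompute the ratios from the summary lines, here is a minimal sketch in Python. It is not part of the original measurements. The field layout (`<user> user_cpu <kernel> kernel_cpu <m:ss.ss> total_time`) is read off the output above, the `.libcxx.out` runs are labelled libc++ and the `.native.out` runs libstdc++, and the names `RUNS`, `parse` and `ratios` are purely illustrative.

```python
import re

# Summary lines copied from the timing output above, truncated after
# total_time. Assumption: *.libcxx.out is libc++, *.native.out is libstdc++.
RUNS = {
    "partition_point": {
        "libc++":    "89.198 user_cpu 0.907 kernel_cpu 1:30.11 total_time",
        "libstdc++": "75.612 user_cpu 0.911 kernel_cpu 1:16.53 total_time",
    },
    "algorithms _262144$": {
        "libc++":    "220.386 user_cpu 2.961 kernel_cpu 3:43.50 total_time",
        "libstdc++": "190.844 user_cpu 2.615 kernel_cpu 3:13.50 total_time",
    },
    "algorithms _1$": {
        "libc++":    "1947.681 user_cpu 221.754 kernel_cpu 36:10.22 total_time",
        "libstdc++": "1056.593 user_cpu 8.435 kernel_cpu 17:45.51 total_time",
    },
}

PATTERN = re.compile(
    r"([\d.]+) user_cpu ([\d.]+) kernel_cpu (\d+):([\d.]+) total_time"
)


def parse(line):
    """Return (user, kernel, wall) in seconds for one summary line."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError(f"unrecognised summary line: {line!r}")
    user, kernel, minutes, seconds = m.groups()
    return float(user), float(kernel), int(minutes) * 60 + float(seconds)


def ratios(run):
    """libc++ time divided by libstdc++ time, for user, kernel and wall."""
    libcxx = parse(run["libc++"])
    libstdcxx = parse(run["libstdc++"])
    return tuple(a / b for a, b in zip(libcxx, libstdcxx))


if __name__ == "__main__":
    for name, run in RUNS.items():
        user, kernel, wall = ratios(run)
        print(f"{name:22s} user x{user:.2f}  kernel x{kernel:.2f}  wall x{wall:.2f}")
```

A ratio above 1 means the libc++ build took longer. For the `_1$` run the kernel ratio comes out a little over 26 and the wall-clock ratio a little over 2.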

I see the same on Linux.

This makes me wonder if I shouldn't try building llvm+clang against macports-libstdc++. I have already managed to do so with lld-17 (which only depends on libc++ via libxml2, and turns out to be "safe to mingle"). Newer clang versions build against their own libc++ even on Linux (when built with clang), which suggests the code has been designed to keep the possibly two C++ runtime versions that get linked separate. It would probably be impossible to use the resulting libLLVM or libclang in dependent ports, but the performance increase might be worth it.

R