Added Apple Silicon Mac support #164

cielavenir · 2020-11-26T12:59:21Z

As follow-up work to #162, I was able to assemble the aarch64 code for Apple Silicon Macs.
It needed slight modifications, but the assembled Android binary still works.

How I tested my work:

make -f Makefile.unx CC=arm64-apple-darwin-gcc AR=arm64-apple-darwin-ar arch=aarch64 host_cpu=aarch64 DEFINES="-fno-stack-check" lib programs/igzip -j8

where

arm64-apple-darwin-gcc
XCODE=$HOME/Downloads/Xcode.app
SYSROOT=$XCODE/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.0.sdk
$XCODE/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang -isysroot $SYSROOT -arch arm64 "$@"

arm64-apple-darwin-ar
XCODE=$HOME/Downloads/Xcode.app
SYSROOT=$XCODE/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.0.sdk
$XCODE/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ar "$@"

Signed-off-by: Taiju Yamada <[email protected]>

gbtucker · 2020-12-03T22:16:19Z

Thanks for the submission. This has a lot of ifdefs for __MACH__. I'd like to get @yuhaoth to comment.

For the dispatcher it seems cleaner and easier to read as:

#if defined(__MACH__) && defined(__aarch64__)
    return PROVIDER_INFO(crc16_t10dif_pmull);
#else // Determine dynamically
    unsigned long auxval = getauxval(AT_HWCAP);
    if (auxval & HWCAP_PMULL)
        return PROVIDER_INFO(crc16_t10dif_pmull);
    return PROVIDER_BASIC(crc16_t10dif);
#endif

It would be good to test a native compile.
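The dispatcher shape discussed above can be tried out in isolation. The following is a minimal sketch, not the isa-l implementation: `crc16_fn`, `crc16_t10dif_base`, `crc16_t10dif_fast`, and `select_crc16_t10dif` are hypothetical stand-ins, and the "fast" routine simply forwards to the portable one so the example runs on any host. It assumes, as the thread does, that every Apple aarch64 CPU has PMULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_PMULL */
#endif

/* Hypothetical function-pointer type for this sketch only. */
typedef uint16_t (*crc16_fn)(uint16_t seed, const uint8_t *buf, size_t len);

/* Portable bit-at-a-time CRC16 T10-DIF (polynomial 0x8BB7, MSB first). */
uint16_t crc16_t10dif_base(uint16_t seed, const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    uint16_t crc = seed;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(buf[i] << 8);
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x8BB7u)
                                  : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Stand-in for a PMULL-accelerated routine (assumption: the real one is
 * written in assembly). It forwards to the base version here. */
uint16_t crc16_t10dif_fast(uint16_t seed, const uint8_t *buf, size_t len)
{
    return crc16_t10dif_base(seed, buf, len);
}

/* Compile-time choice on Darwin, runtime choice on aarch64 Linux. */
crc16_fn select_crc16_t10dif(void)
{
#if defined(__MACH__) && defined(__aarch64__)
    /* Assumption from the thread: PMULL is always present on Apple aarch64. */
    return crc16_t10dif_fast;
#elif defined(__linux__) && defined(__aarch64__)
    unsigned long auxval = getauxval(AT_HWCAP);
    if (auxval & HWCAP_PMULL)
        return crc16_t10dif_fast;
    return crc16_t10dif_base;
#else
    /* Any other host: fall back to the portable routine. */
    return crc16_t10dif_base;
#endif
}
```

Callers obtain the routine once via `select_crc16_t10dif()` and then call it through the returned pointer; keeping each platform in its own preprocessor branch means `getauxval` is never referenced where it does not exist.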

yuhaoth · 2020-12-04T06:39:34Z

crc/aarch64/crc_aarch64_dispatcher.c

@@ -34,6 +34,8 @@ DEFINE_INTERFACE_DISPATCHER(crc16_t10dif)
 	unsigned long auxval = getauxval(AT_HWCAP);
 	if (auxval & HWCAP_PMULL)
 		return PROVIDER_INFO(crc16_t10dif_pmull);
+#elif defined(__aarch64__)


My suggestion is to add it like this:

#ifndef __MACH__
	unsigned long auxval = getauxval(AT_HWCAP);
	if (auxval & HWCAP_PMULL)
		return PROVIDER_INFO(crc16_t10dif_pmull);
	return PROVIDER_BASIC(crc16_t10dif);
#else
	return PROVIDER_INFO(crc16_t10dif_pmull);
#endif

One more thing I need to confirm with you: if the transparent layer can be removed, I think this file should not be compiled on Apple Silicon Macs.

I might not be answering the correct question, but removing __MACH__ leaves getauxval undefined.

I have rewritten the dispatchers.

NEON is part of the aarch64 spec.

My very first assumption was that Apple would never sell an aarch64 CPU without pmull. In that case I don't need a dispatcher.

Under that assumption, removing 12575f5 solves the "aarch64_multibinary.h issue" as a workaround.

yuhaoth · 2020-12-04T06:43:05Z

crc/aarch64/crc_aarch64_dispatcher.c

@@ -81,6 +87,8 @@ DEFINE_INTERFACE_DISPATCHER(crc32_iscsi)
 	if (auxval & HWCAP_PMULL) {
 		return PROVIDER_INFO(crc32_iscsi_refl_pmull);
 	}
+#elif defined(__aarch64__)
+	return PROVIDER_INFO(crc32_iscsi_refl_pmull);


This function might not be the best choice. As far as I know, crc32_iscsi_crc_ext or crc32_iscsi_3crc_fold should be a better choice.

You can test the real performance and pick the best one.
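One way to "test the real performance and pick the best one" is a small timing harness. The sketch below is only an illustration: `crc32c_bitwise` and `crc32c_table` are hypothetical stand-ins for the real candidates (crc32_iscsi_refl_pmull, crc32_iscsi_crc_ext, crc32_iscsi_3crc_fold), whose signatures are not reproduced here, and `pick_fastest` is an assumed helper name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef uint32_t (*crc32c_fn)(uint32_t crc, const uint8_t *buf, size_t len);

/* Stand-in candidate 1: bit-at-a-time CRC32C (reflected poly 0x82F63B78). */
uint32_t crc32c_bitwise(uint32_t crc, const uint8_t *buf, size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Stand-in candidate 2: byte-at-a-time table lookup, table built lazily. */
uint32_t crc32c_table(uint32_t crc, const uint8_t *buf, size_t len)
{
    static uint32_t table[256];
    static int ready;
    if (!ready) {
        for (uint32_t n = 0; n < 256; n++) {
            uint32_t c = n;
            for (int b = 0; b < 8; b++)
                c = (c >> 1) ^ (0x82F63B78u & (0u - (c & 1u)));
            table[n] = c;
        }
        ready = 1;
    }
    crc = ~crc;
    for (size_t i = 0; i < len; i++)
        crc = table[(crc ^ buf[i]) & 0xFFu] ^ (crc >> 8);
    return ~crc;
}

/* Wall-clock seconds spent running fn over the buffer `rounds` times. */
static double seconds_for(crc32c_fn fn, const uint8_t *buf, size_t len, int rounds)
{
    struct timespec t0, t1;
    volatile uint32_t sink = 0; /* keeps the calls from being optimized away */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < rounds; r++)
        sink ^= fn(sink, buf, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Returns the index of the candidate with the lowest measured time. */
size_t pick_fastest(const crc32c_fn *cands, size_t n,
                    const uint8_t *buf, size_t len, int rounds)
{
    assert(cands != NULL && n > 0);
    size_t best = 0;
    double best_time = seconds_for(cands[0], buf, len, rounds);
    for (size_t i = 1; i < n; i++) {
        double t = seconds_for(cands[i], buf, len, rounds);
        if (t < best_time) {
            best_time = t;
            best = i;
        }
    }
    return best;
}
```

Before trusting the timing, it is worth checking that all candidates return the same checksum for the same input, and measuring on the target hardware with realistic buffer sizes, since the ranking can differ between cores.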

Well, I'm sorry.

Although I have asked an acquaintance of mine to test the same binary on an ARM Mac (it worked, so "it does support ARM Mac"), my primary machine is an Intel Mac and my test environment is an iPad (jailbroken, with sshd enabled).

Also, at least the crc32 instruction caused SIGILL on my iPad (from libslz).

Worse, as far as I know, there is no runtime CPU feature detection API on Darwin. That's why I adjusted the dispatcher to middle-range...

It seems the undocumented _get_cpu_capabilities can be used as a "runtime CPU feature detection API".
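As an alternative to the undocumented call, Darwin exposes some CPU feature flags through `sysctlbyname`. The sketch below is not taken from this PR: the helper names are hypothetical, and the key `hw.optional.armv8_crc32` is an assumption that should be checked against the output of `sysctl hw.optional` on the target device. On non-Apple hosts the helper reports "unknown" so the example still compiles and runs.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Returns 1 if the sysctl key reports a nonzero value, 0 if it reports zero,
 * and -1 if the key cannot be read (or the host is not an Apple platform). */
int darwin_sysctl_flag(const char *key)
{
    assert(key != NULL);
#if defined(__APPLE__)
    int value = 0;
    size_t size = sizeof(value);
    if (sysctlbyname(key, &value, &size, NULL, 0) != 0)
        return -1;
    return value != 0;
#else
    (void)key;
    return -1;
#endif
}

/* Assumed key name; verify it exists on the OS versions you target. */
int darwin_has_crc32(void)
{
    return darwin_sysctl_flag("hw.optional.armv8_crc32");
}
```

A dispatcher could treat -1 as "feature absent" and fall back to the conservative routine, which keeps behavior safe on devices where the key is missing.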

Now my concern is https://developer.apple.com/documentation/xcode/writing_arm64_code_for_apple_platforms : "The platforms reserve register x18. Don’t use this register." Allowing the CRC32 instruction could lead into this code path...

The x18 problem should be fixed. I will raise an issue later.

And it looks like runtime CPU feature detection should be added for Apple. As far as I know, the M1 (arm64 Mac mini) is available now. I guess it supports more features.

Could you review aarch64_multibinary.h? I am not sure whether anything in it fails to match the Apple spec.

yuhaoth · 2020-12-04T06:44:03Z

crc/aarch64/crc_aarch64_dispatcher.c

@@ -105,6 +113,8 @@ DEFINE_INTERFACE_DISPATCHER(crc32_gzip_refl)

 	if (auxval & HWCAP_PMULL)
 		return PROVIDER_INFO(crc32_gzip_refl_pmull);
+#elif defined(__aarch64__)


Same as the comment above: crc32_gzip_refl_crc_ext and crc32_gzip_refl_3crc_fold are better choices.

crc/aarch64/crc_aarch64_dispatcher.c

crc/aarch64/crc32_gzip_refl_pmull.h

yuhaoth · 2020-12-04T06:51:57Z

crc/aarch64/crc32_ieee_norm_pmull.h

 	.align	4
 	.set	.lanchor_crc_tab,. + 0
+#ifndef __MACH__


Is it a requirement from Clang? If yes, I think clang is the better choice.

it seems controlled by llvm::MCAsmInfo::HasDotTypeDotSizeDirective (https://llvm.org/doxygen/classllvm_1_1MCAsmInfo.html#a7c3b8692b75d4808f7c888e61f01e1c8) and it is false in Darwin (https://github.com/llvm/llvm-project/blob/release/9.x/llvm/lib/MC/MCAsmInfoDarwin.cpp#L89).

Well, if ARM Windows support is added as well, #if !defined(__MACH__) && !defined(__WIN32__) would be more appropriate.

Yes, you are right.

But for now there is not enough information about runtime CPU feature detection on WIN32. The same goes for Apple.

erasure_code/aarch64/gf_4vect_dot_prod_neon.S

igzip/aarch64/data_struct_aarch64.h

yuhaoth · 2020-12-04T07:11:12Z

@cielavenir, thanks for your hard work.

How about reorganizing #162 and #164? It looks like this PR includes the patches for clang.
I would also like to know which Clang release version works.

About Apple Silicon Mac support, I have some questions.

1. Is there any runtime CPU feature detection API in macOS?
   - x86_64 provides it through the cpuid instruction, and arm64 Linux supports it with getauxval(AT_HWCAP).
   - If not, I prefer changing aarch64/*_multibinary.S directly.
2. Regarding the procedure call standard, is there any difference from AAPCS64?
   - If yes, we should modify include/aarch64_multibinary.h.

igzip/aarch64/igzip_decode_huffman_code_block_aarch64.S

include/aarch64_multibinary.h

crc/aarch64/crc32_gzip_refl_pmull.h

include/aarch64_multibinary.h

yuhaoth

Please modify the commit message.
It looks like my name appears in the commit message :)

yuhaoth · 2020-12-07T08:41:15Z

And please reorganize the patches. It looks like some commits cannot pass the CI tests on their own.
Anyway, the PR as a whole passes the CI tests.

yuhaoth · 2020-12-08T10:06:46Z

I reorganized this PR into #168.
@cielavenir I need your approval to merge #168.

cielavenir · 2020-12-09T02:49:56Z

WIP until the aarch64_multibinary.h issue is cleared.

cielavenir · 2020-12-11T01:04:21Z

By the way, I am also reporting here that reading ID_AA64ISAR0_EL1 in user mode is prohibited (it causes SIGILL) on iDevices.
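When a feature register or instruction may trap, one common fallback is to probe it under a temporary SIGILL handler. The sketch below illustrates that general technique and is not something this PR implements: `probe_runs_without_sigill`, `harmless`, `raises_sigill`, and `read_id_aa64isar0` are hypothetical names. It is not thread-safe, because the jump buffer and the handler are process-wide.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>

static sigjmp_buf probe_env;

/* Handler: abandon the faulting candidate and return to the probe. */
static void on_sigill(int signo)
{
    (void)signo;
    siglongjmp(probe_env, 1);
}

/* Returns 1 if candidate() completed, 0 if it raised SIGILL,
 * and -1 if the handler could not be installed. */
int probe_runs_without_sigill(void (*candidate)(void))
{
    struct sigaction sa, old_sa;
    int ok;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigill;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGILL, &sa, &old_sa) != 0)
        return -1;

    /* savemask = 1 so the signal mask is restored after the jump. */
    if (sigsetjmp(probe_env, 1) == 0) {
        candidate();
        ok = 1;
    } else {
        ok = 0;
    }

    sigaction(SIGILL, &old_sa, NULL); /* put the previous handler back */
    return ok;
}

/* Two portable candidates used to exercise the probe on any host. */
void harmless(void) { }
void raises_sigill(void) { raise(SIGILL); }

#if defined(__aarch64__) && defined(__GNUC__)
/* The access the comment above reports as trapping on iDevices. */
static volatile uint64_t isar0_value;
void read_id_aa64isar0(void)
{
    uint64_t v;
    __asm__ volatile("mrs %0, ID_AA64ISAR0_EL1" : "=r"(v));
    isar0_value = v;
}
#endif
```

On an aarch64 build, `probe_runs_without_sigill(read_id_aa64isar0)` would report whether the register read is permitted on that device; a dispatcher would run such a probe once at startup rather than on every call.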

cielavenir · 2022-07-22T14:34:20Z

@kirbyzhou @rhpvorderman I checked compilation on my (Intel) MacBook.

kirbyzhou · 2022-07-26T03:35:23Z

Checked compilation on my MacBook (Apple M1):

% gh pr checkout 164
% git log --oneline
225b6bd (HEAD -> fix_mach) Fix q_fold_const load
855112d Merge remote-tracking branch 'ciel/fix_mach' into HEAD
acd48c0 fix ASM_DEF_RODATA include
b878e6d Merge remote-tracking branch 'origin/master' into fix_mach
2bcbaf4 (origin/master, origin/HEAD) doc: Add security policy file
...
% brew install gcc@11
% gcc-11 --version
gcc-11 (Homebrew GCC 11.3.0_2) 11.3.0
% ./configure CC=gcc-11 
       isa-l 2.30.0
        =====

        prefix:                 /usr
        sysconfdir:             ${prefix}/etc
        libdir:                 ${exec_prefix}/lib
        includedir:             ${prefix}/include

        compiler:               gcc-11
        cflags:                 -g -O2
        ldflags:                

        debug:                  no
% make -j
...
  CCLD     programs/igzip

kirbyzhou · 2022-07-26T03:57:01Z

But in my benchmark, decompression with the isa-l igzip is much slower than with the cloudflare version: https://github.com/cloudflare/zlib

isa-l % time ./programs/igzip -d < ~/ranger-3.0.0-SNAPSHOT-admin.tar.gz > /dev/null
./programs/igzip -d <  > /dev/null  2.39s user 0.05s system 99% cpu 2.454 total

cloudflare-zlib % time ./minigzip -d < ~/ranger-3.0.0-SNAPSHOT-admin.tar.gz > /dev/null
./minigzip -d <  > /dev/null  1.92s user 0.08s system 99% cpu 2.005 total

kirbyzhou · 2022-07-26T04:05:29Z

On a Linux ARM host:

 isa-l]$ time ./programs/igzip -d < ~/xxx.gz > /dev/null

real	0m0.313s
user	0m0.283s
sys	0m0.021s

cloudflare-zlib]$ time ./minigzip -d < ~/xxx.gz > /dev/null

real	0m0.233s
user	0m0.212s
sys	0m0.021s
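To put the two measurements above on a common scale, the user times can be turned into a slowdown factor. This is a small arithmetic sketch; `slowdown_factor` is a hypothetical helper, and the inputs are the user-time figures quoted in the two comments.

```c
#include <assert.h>

/* Ratio of candidate time to baseline time; values above 1.0 mean
 * the candidate is slower than the baseline. */
double slowdown_factor(double candidate_seconds, double baseline_seconds)
{
    assert(baseline_seconds > 0.0);
    return candidate_seconds / baseline_seconds;
}
```

With the figures above, `slowdown_factor(2.39, 1.92)` is roughly 1.24 on the Apple M1 and `slowdown_factor(0.283, 0.212)` is roughly 1.33 on the Linux ARM host, so igzip decompression was about 24% and 33% slower than cloudflare-zlib in these single runs.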

cdevers-es · 2022-10-13T02:06:37Z

Any news on this? From the last couple of comments from @kirbyzhou , it sounds like this version builds & runs, but the performance seems to have a regression?

rhpvorderman · 2022-10-14T06:39:22Z

@cdevers-es I can confirm that the performance regression that @kirbyzhou mentions also happens on an Olimex Olinuxino A64 development board. This is only for decompression (compression is still faster). As such it is a common aarch64 issue, not a Mac-specific one. Therefore it is my opinion that this PR should be merged as soon as it is ready. The performance regression can be handled later by the people who understand the arm64 code.

gbtucker · 2022-10-27T01:39:52Z

Therefore it is my opinion that this PR should be merged as soon as it is ready. The performance regression can be handled later by the people who understand the arm64 code.

I get an illegal instruction exception with this PR when I cross-compile, which I don't get on the main branch. Perhaps it's breaking something in the existing dispatcher.

make -f Makefile.unx -j 4 host_cpu=aarch64 CC=aarch64-linux-gcc LDFLAGS=-static 'D=NO_SVE2=1' crc16_t10dif_test
qemu-aarch64-static -g 5000 ./crc16_t10dif_test

─── Output/messages ─[111/111]

Program received signal SIGILL, Illegal instruction.
0x0000000000401648 in ?? ()
─── Assembly 
 0x0000000000401648  ? .inst    0x87e70000 ; undefined
 0x000000000040164c  ? udf    #0
 0x0000000000401650  ? tbnz    w0, #3, 0x3fb650
 0x0000000000401654  ? udf    #0
 0x0000000000401658  ? add    x0, x0, #0x40
 0x000000000040165c  ? sub    x3, x3, #0x40
 0x0000000000401660  ? cmp    x3, #0x3f
 0x0000000000401664  ? ldp    q28, q29, [x0, #-64]
 0x0000000000401668  ? ldp    q30, q31, [x0, #-32]
 0x000000000040166c  ? prfm    pldl2strm, [x0, #102

kirbyzhou · 2022-10-27T02:37:59Z

@cdevers-es I can confirm that the performance regression that @kirbyzhou mentions also happens on an Olimex Olinuxino A64 development board. This is only for decompression (compression is still faster). As such it is a common aarch64 issue, not a Mac-specific one. Therefore it is my opinion that this PR should be merged as soon as it is ready. The performance regression can be handled later by the people who understand the arm64 code.

Compression is faster, but the compression ratio is worse.

cielavenir · 2022-10-27T09:30:51Z

oh this fixes the test actually

diff --git a/crc/aarch64/crc16_t10dif_pmull.S b/crc/aarch64/crc16_t10dif_pmull.S
index 2ae3fb7..29af534 100644
--- a/crc/aarch64/crc16_t10dif_pmull.S
+++ b/crc/aarch64/crc16_t10dif_pmull.S
@@ -201,13 +201,7 @@ v_tmp1_x3          .req    v27
 q_fold_const           .req    q17
 v_fold_const           .req    v17
 
-        ldr q_fold_const, fold_constant
-
-fold_constant:
-       .word 0x87e70000
-       .word 0x00000000
-       .word 0x371d0000
-       .word 0x00000000
+        ldr q_fold_const, =0x371d00000000000087e70000;
 
        .align 2
 .crc_fold_loop:

Could someone tell me why my fold_constant version does not work?

cielavenir · 2022-10-27T09:47:18Z

I'm terribly sorry, cielavenir@825d080 works. I feel so ashamed.

Let me check other parts soon.

cielavenir · 2022-10-27T09:56:08Z

Fixed crc16_t10dif_copy_pmull.S as well. It seems those two files are the only users of ldr =.

cielavenir · 2022-10-27T10:15:55Z

checked

for test in *_test;do echo $test; qemu-aarch64-static ./$test;done

checksum32_funcs_test
crc16_t10dif_copy_test
crc16_t10dif_test
crc32_funcs_test
crc64_funcs_test
crc_simple_test
erasure_code_test
erasure_code_base_test
erasure_code_update_test
gf_inverse_test
gf_vect_dot_prod_base_test
gf_vect_dot_prod_test
gf_vect_mad_test
gf_vect_mul_base_test
gf_vect_mul_test
igzip_rand_test
mem_zero_detect_test
pq_check_test
pq_gen_test
xor_check_test
xor_gen_test

igzip_wrapper_hdr_test did not pass, but it does not pass on master either.

gbtucker · 2022-10-28T00:55:06Z

Thanks @cielavenir. I was able to rebase and I don't see any more issues so I should be able to integrate as soon as I can push through our internal CI.

gbtucker · 2022-10-28T15:52:31Z

Integrated

cielavenir · 2022-10-30T08:58:43Z

damn, 825d080 and 33a7a42 did break the compilation:

crc/aarch64/crc16_t10dif_pmull.S:222:9: error: unknown AArch64 fixup kind!
        ldr q_fold_const, fold_constant
        ^

Let me think about how to fix it (I'm not good at asm, so I'd be glad if someone can think of a good idea).

cielavenir · 2022-10-30T09:00:05Z

I mean it broke the compilation on arm64 macOS.

But without 825d080 and 33a7a42, the generated binary is incorrect anyway...

cielavenir · 2022-10-30T09:48:10Z

should be fixed by #226

cielavenir added 6 commits November 22, 2020 01:51

Fixed clang as assembly

c7328c8

Signed-off-by: Taiju Yamada <[email protected]>

fixed clang as build

d6ec9e6

Signed-off-by: Taiju Yamada <[email protected]>

Fixed addressing assembly

e904c34

Signed-off-by: Taiju Yamada <[email protected]>

It should be fine to enable pmull always on Apple Silicon

afa64d0

Signed-off-by: Taiju Yamada <[email protected]>

Fixed assembly (compared with objdump)

6b59dac

Signed-off-by: Taiju Yamada <[email protected]>

Merge branch 'fix_clang_as' into fix_mach

75115c8