Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Apple Silicon Mac support #164

Closed
wants to merge 27 commits into from
Closed

Conversation

cielavenir
Copy link
Contributor

@cielavenir cielavenir commented Nov 26, 2020

As the further work of #162 , I was able to assemble aarch64 code for Apple Silicon Mac.
I needed slight modification but the assembled Android binary still works.

How I tested my work:

make -f Makefile.unx CC=arm64-apple-darwin-gcc AR=arm64-apple-darwin-ar arch=aarch64 host_cpu=aarch64 DEFINES="-fno-stack-check" lib programs/igzip -j8

where

arm64-apple-darwin-gcc
XCODE=$HOME/Downloads/Xcode.app
SYSROOT=$XCODE/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.0.sdk
$XCODE/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang -isysroot $SYSROOT -arch arm64 "$@"

arm64-apple-darwin-ar
XCODE=$HOME/Downloads/Xcode.app
SYSROOT=$XCODE/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.0.sdk
$XCODE/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ar "$@"

Signed-off-by: Taiju Yamada [email protected]

@gbtucker
Copy link
Contributor

gbtucker commented Dec 3, 2020

Thanks for the submission. This has a lot of ifdefs for __MACH__. I'd like to get @yuhaoth to comment.

For the dispatcher it seems cleaner and easier to read as:

#if defined(__MACH__) && defined(__aarch64__)
   return (crc16_t10dif_pmull);

#else // Determine dynamically
    unsigned long auxval = getauxval(AT_HWCAP);
    if (auxval & HWCAP_PMULL)
        return PROVIDER_INFO(crc16_t10dif_pmull);
    return PROVIDER_BASIC(crc16_t10dif);
#endif

It would be good to test a native compile.

@@ -34,6 +34,8 @@ DEFINE_INTERFACE_DISPATCHER(crc16_t10dif)
unsigned long auxval = getauxval(AT_HWCAP);
if (auxval & HWCAP_PMULL)
return PROVIDER_INFO(crc16_t10dif_pmull);
#elif defined(__aarch64__)
Copy link
Contributor

@yuhaoth yuhaoth Dec 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion is add it like

           #ifndef __MACH__
                        unsigned long auxval=getauxval(AT_HWCAP);
                        if(auxval & HWCAP_PMULL)
                                 return PROVIDER_INFO(crc16_t10dif_pmull);
                        return PROVIDER_BASIC(crc16_t10dif);
             #else
                        return PROVIDER_INFO(crc16_t10dif_pmull);
             #endif

And another thing I must confirm with you . If the transparent layer can be remove , I think this file should not be compiled in Apple Silicon Mac.

Copy link
Contributor Author

@cielavenir cielavenir Dec 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might not be answering correct question, but removing __MACH__ causes getauxval undefined.

Copy link
Contributor Author

@cielavenir cielavenir Dec 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have rewritten dispatchers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • aarch64's neon is spec
  • my very first assumption was Apple would not ever sell aarch64 CPU without pmull. In this way I don't need dispatcher.

in above condition, removing 12575f5 solves "aarch64_multibinary.h issue" in workaround way.

@@ -81,6 +87,8 @@ DEFINE_INTERFACE_DISPATCHER(crc32_iscsi)
if (auxval & HWCAP_PMULL) {
return PROVIDER_INFO(crc32_iscsi_refl_pmull);
}
#elif defined(__aarch64__)
return PROVIDER_INFO(crc32_iscsi_refl_pmull);
Copy link
Contributor

@yuhaoth yuhaoth Dec 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function might not be best choice . As I know, crc32_iscsi_crc_ext or crc32_iscsi_3crc_fold should be better choice.

You can test the real performance and pick up the best one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I'm sorry.

Although I have asked an acquaintance of mine to test the same binary on ARM mac (it worked, so "it does support ARM mac"), my primary machine is Intel mac and my test environment is iPad (well, jailbroken and sshd enabled).

Also, at least crc32 instruction caused SIGILL on my iPad (from libslz).

The worse thing, as far as I know, there are no runtime cpu feature detection API on Darwin. That's why I adjust the dispatcher to middle-range...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems undocumented _get_cpu_capabilities can be used as "runtime cpu feature detection API".

Now my concern is https://developer.apple.com/documentation/xcode/writing_arm64_code_for_apple_platforms The platforms reserve register x18. Don’t use this register. Allowing CRC32 instruction could get into this codepath...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

x18 problem should be fix . I will raise an issue later.

And it looks runtime cpu feature detection should be added in Apple. As I known , M1 (mac mini arm64 ) are available now. I guess it has more feature support.

Could you review aarch64_multibinary.h ? I am not sure if there are anything that can not match Apple spec.

@@ -105,6 +113,8 @@ DEFINE_INTERFACE_DISPATCHER(crc32_gzip_refl)

if (auxval & HWCAP_PMULL)
return PROVIDER_INFO(crc32_gzip_refl_pmull);
#elif defined(__aarch64__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above comment. crc32_gzip_refl_crc_ext and crc32_gzip_refl_3crc_fold are better choice.

.align 4
.set .lanchor_crc_tab,. + 0
#ifndef __MACH__
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it requirement from Clang? If yes , I think clang is better choice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems controlled by llvm::MCAsmInfo::HasDotTypeDotSizeDirective (https://llvm.org/doxygen/classllvm_1_1MCAsmInfo.html#a7c3b8692b75d4808f7c888e61f01e1c8) and it is false in Darwin (https://github.com/llvm/llvm-project/blob/release/9.x/llvm/lib/MC/MCAsmInfoDarwin.cpp#L89).

Well, if ARM Windows support will be added as well, #if !defined(__MACH__) && !defined(__WIN32__) is more proper.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes , you are right.

But now , there are no enough information about run time cpu feature detection on WIN32 . That's same with Apple .

@yuhaoth
Copy link
Contributor

yuhaoth commented Dec 4, 2020

@cielavenir , Thanks your hard work.

How about re-org #162 and #164 ? It looks this PR includes patches for clang .
And also I want to know the work Clang release version.

About Apple Silicon Mac support. I have some questions.

  • Is there any runtime cpu feature detection API in Mac OS ?
    • x86_64 provides it with cpu_id instructions and arm64-linux support it with getauxval(AT_HWCAP)
    • If no, I prefer change aarch64/*_multibinary.S directly.
  • About procedure call standard , is there any difference with aapcs64
    • If yes , we should modify include/aarch64_multibinary.h .

Copy link
Contributor

@yuhaoth yuhaoth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please modify the commit message .
It looks my name appear in commit message :)

@yuhaoth
Copy link
Contributor

yuhaoth commented Dec 7, 2020

And please re-org the patches . It looks some commit can not pass CI tests .
Anyway the PR can pass CI tests .

@yuhaoth yuhaoth mentioned this pull request Dec 8, 2020
@yuhaoth
Copy link
Contributor

yuhaoth commented Dec 8, 2020

I re-org this PR into #168 .
@cielavenir I need your approve to merge #168

@cielavenir cielavenir changed the title Added Apple Silicon Mac support [WIP] Added Apple Silicon Mac support Dec 9, 2020
@cielavenir
Copy link
Contributor Author

WIP until aarch_multibinary.h issue is cleared.

@cielavenir
Copy link
Contributor Author

by the way I report here as well that reading ID_AA64ISAR0_EL1 in user mode is prohibited (causes SIGILL) on iDevice.

@cielavenir
Copy link
Contributor Author

@kirbyzhou @rhpvorderman checked compilation by my (intel) macbook

@kirbyzhou
Copy link

checked compilation by my macbook (apple m1)

% gh pr checkout 164
% git log --oneline
225b6bd (HEAD -> fix_mach) Fix q_fold_const load
855112d Merge remote-tracking branch 'ciel/fix_mach' into HEAD
acd48c0 fix ASM_DEF_RODATA include
b878e6d Merge remote-tracking branch 'origin/master' into fix_mach
2bcbaf4 (origin/master, origin/HEAD) doc: Add security policy file
...
% brew install gcc@11
% gcc-11 --version
gcc-11 (Homebrew GCC 11.3.0_2) 11.3.0
% ./configure CC=gcc-11 
       isa-l 2.30.0
        =====

        prefix:                 /usr
        sysconfdir:             ${prefix}/etc
        libdir:                 ${exec_prefix}/lib
        includedir:             ${prefix}/include

        compiler:               gcc-11
        cflags:                 -g -O2
        ldflags:                

        debug:                  no
% make -j
...
  CCLD     programs/igzip

```

@kirbyzhou
Copy link

But in my benchmark, the isa-l version of ungzip is much slower than cloudflare version https://github.com/cloudflare/zlib

isa-l % time ./programs/igzip -d < ~/ranger-3.0.0-SNAPSHOT-admin.tar.gz > /dev/null
./programs/igzip -d <  > /dev/null  2.39s user 0.05s system 99% cpu 2.454 total

cloudflare-zlib % time ./minigzip -d < ~/ranger-3.0.0-SNAPSHOT-admin.tar.gz > /dev/null
./minigzip -d <  > /dev/null  1.92s user 0.08s system 99% cpu 2.005 total

@kirbyzhou
Copy link

Under a linux arm host.

 isa-l]$ time ./programs/igzip -d < ~/xxx.gz > /dev/null

real	0m0.313s
user	0m0.283s
sys	0m0.021s

cloudflare-zlib]$ time ./minigzip -d < ~/xxx.gz > /dev/null

real	0m0.233s
user	0m0.212s
sys	0m0.021s

@cdevers-es
Copy link

Any news on this? From the last couple of comments from @kirbyzhou , it sounds like this version builds & runs, but the performance seems to have a regression?

@rhpvorderman
Copy link
Contributor

@cdevers-es I can confirm that the performance regression that @kirbyzhou mentions also happens on an Olimex Olinuxino A64 development board. This is only for decompression (compression sitll is faster). As such it is a common aarch64 issue, not a Mac specific one. Therefore it is my opinion that this PR should be merged as soon as it is ready. The performance regression can be handled later by the people who understand the arm64 code.

@gbtucker
Copy link
Contributor

Therefore it is my opinion that this PR should be merged as soon as it is ready. The performance regression can be handled later by the people who understand the arm64 code.

I get an illegal instruction exception on this when I cross compile when I don't on main branch. Perhaps it's breaking something in the existing dispatcher.

make -f Makefile.unx -j 4 host_cpu=aarch64 CC=aarch64-linux-gcc LDFLAGS=-static 'D=NO_SVE2=1' crc16_t10dif_test
qemu-aarch64-static -g 5000 ./crc16_t10dif_test

─── Output/messages ─[111/111]

Program received signal SIGILL, Illegal instruction.
0x0000000000401648 in ?? ()
─── Assembly 
 0x0000000000401648  ? .inst    0x87e70000 ; undefined
 0x000000000040164c  ? udf    #0
 0x0000000000401650  ? tbnz    w0, #3, 0x3fb650
 0x0000000000401654  ? udf    #0
 0x0000000000401658  ? add    x0, x0, #0x40
 0x000000000040165c  ? sub    x3, x3, #0x40
 0x0000000000401660  ? cmp    x3, #0x3f
 0x0000000000401664  ? ldp    q28, q29, [x0, #-64]
 0x0000000000401668  ? ldp    q30, q31, [x0, #-32]
 0x000000000040166c  ? prfm    pldl2strm, [x0, #102

@kirbyzhou
Copy link

@cdevers-es I can confirm that the performance regression that @kirbyzhou mentions also happens on an Olimex Olinuxino A64 development board. This is only for decompression (compression sitll is faster). As such it is a common aarch64 issue, not a Mac specific one. Therefore it is my opinion that this PR should be merged as soon as it is ready. The performance regression can be handled later by the people who understand the arm64 code.

compression is faster but compression ratio is worse.

@cielavenir
Copy link
Contributor Author

cielavenir commented Oct 27, 2022

oh this fixes the test actually

diff --git a/crc/aarch64/crc16_t10dif_pmull.S b/crc/aarch64/crc16_t10dif_pmull.S
index 2ae3fb7..29af534 100644
--- a/crc/aarch64/crc16_t10dif_pmull.S
+++ b/crc/aarch64/crc16_t10dif_pmull.S
@@ -201,13 +201,7 @@ v_tmp1_x3          .req    v27
 q_fold_const           .req    q17
 v_fold_const           .req    v17
 
-        ldr q_fold_const, fold_constant
-
-fold_constant:
-       .word 0x87e70000
-       .word 0x00000000
-       .word 0x371d0000
-       .word 0x00000000
+        ldr q_fold_const, =0x371d00000000000087e70000;
 
        .align 2
 .crc_fold_loop:

could someone tell me why (my) this fold_constant does not work?

@cielavenir
Copy link
Contributor Author

I'm terribly sorry, cielavenir@825d080 works. I feel so ashame.

Let me check other parts soon.

@cielavenir
Copy link
Contributor Author

fixed crc16_t10dif_copy_pmull.S as well. It seems those two are ldr = users.

@cielavenir
Copy link
Contributor Author

checked

for test in *_test;do echo $test; qemu-aarch64-static ./$test;done

checksum32_funcs_test
crc16_t10dif_copy_test
crc16_t10dif_test
crc32_funcs_test
crc64_funcs_test
crc_simple_test
erasure_code_test
erasure_code_base_test
erasure_code_update_test
gf_inverse_test
gf_vect_dot_prod_base_test
gf_vect_dot_prod_test
gf_vect_mad_test
gf_vect_mul_base_test
gf_vect_mul_test
igzip_rand_test
mem_zero_detect_test
pq_check_test
pq_gen_test
xor_check_test
xor_gen_test

igzip_wrapper_hdr_test did not pass but master is similarly not passing.

@gbtucker
Copy link
Contributor

Thanks @cielavenir. I was able to rebase and I don't see any more issues so I should be able to integrate as soon as I can push through our internal CI.

@gbtucker
Copy link
Contributor

Integrated

@gbtucker gbtucker closed this Oct 28, 2022
@cielavenir
Copy link
Contributor Author

damn, 825d080 and 33a7a42 did break the compilation:

crc/aarch64/crc16_t10dif_pmull.S:222:9: error: unknown AArch64 fixup kind!
        ldr q_fold_const, fold_constant
        ^

let me think how to fix it (I'm not good at asm, I'll be glad if someone can think of good idea)

@cielavenir
Copy link
Contributor Author

I mean broke the compilation on arm64 macos

But without 825d080 and 33a7a42, the generated binary is incorrect anyway...

@cielavenir
Copy link
Contributor Author

should be fixed by #226

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants