[SLP] Improvement of reordering for consts, splats and ops breaks vectorization of XOR instructions #109725

Closed
ivankelarev opened this issue Sep 23, 2024 · 0 comments

It appears that the change in #87091 partially breaks vectorization of XOR instructions for the following code:

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define i32 @a() #0 {
  br label %1

1:
  %2 = phi i8 [ 0, %0 ], [ %40, %1 ]
  %3 = phi i8 [ 0, %0 ], [ %28, %1 ]
  %4 = phi i8 [ 0, %0 ], [ %16, %1 ]
  %5 = phi i8 [ 0, %0 ], [ %6, %1 ]
  %6 = load i8, ptr null, align 4
  %7 = xor i8 %6, %3
  %8 = xor i8 %7, %4
  %9 = xor i8 %8, %5
  store i8 %9, ptr null, align 4
  %10 = xor i8 %6, %2
  %11 = xor i8 %10, %5
  %12 = add i64 0, 1
  %13 = getelementptr i8, ptr null, i64 %12
  store i8 %11, ptr %13, align 1
  %14 = add i64 0, 1
  %15 = getelementptr i8, ptr null, i64 %14
  %16 = load i8, ptr %15, align 1
  %17 = xor i8 %16, %2
  %18 = xor i8 %17, %3
  %19 = xor i8 %18, %4
  %20 = add i64 0, 2
  %21 = getelementptr i8, ptr null, i64 %20
  store i8 %19, ptr %21, align 2
  %22 = xor i8 %16, %6
  %23 = xor i8 %22, %4
  %24 = add i64 0, 3
  %25 = getelementptr i8, ptr null, i64 %24
  store i8 %23, ptr %25, align 1
  %26 = add i64 0, 2
  %27 = getelementptr i8, ptr null, i64 %26
  %28 = load i8, ptr %27, align 2
  %29 = xor i8 %28, %6
  %30 = xor i8 %29, %2
  %31 = xor i8 %30, %3
  %32 = add i64 0, 4
  %33 = getelementptr i8, ptr null, i64 %32
  store i8 %31, ptr %33, align 4
  %34 = xor i8 %28, %16
  %35 = xor i8 %34, %3
  %36 = add i64 0, 5
  %37 = getelementptr i8, ptr null, i64 %36
  store i8 %35, ptr %37, align 1
  %38 = add i64 0, 3
  %39 = getelementptr i8, ptr null, i64 %38
  %40 = load i8, ptr %39, align 1
  %41 = xor i8 %40, %16
  %42 = xor i8 %41, %6
  %43 = xor i8 %42, %2
  %44 = add i64 0, 6
  %45 = getelementptr i8, ptr null, i64 %44
  store i8 %43, ptr %45, align 2
  %46 = xor i8 %40, %28
  %47 = xor i8 %46, %2
  %48 = add i64 0, 7
  %49 = getelementptr i8, ptr null, i64 %48
  store i8 %47, ptr %49, align 1
  br label %1
}

attributes #0 = { "target-cpu"="core-avx2" }

Before the change, all of the XOR instructions were vectorized:

...
  %5 = load <4 x i8>, ptr null, align 4
  %6 = shufflevector <4 x i8> %5, <4 x i8> poison, <4 x i32> <i32 poison, i32 poison, i32 0, i32 1>
  %7 = shufflevector <2 x i8> %3, <2 x i8> poison, <4 x i32> <i32 0, i32 1, i32 poison, i32 poison>
  %8 = shufflevector <4 x i8> %6, <4 x i8> %7, <4 x i32> <i32 4, i32 5, i32 2, i32 3>
  %9 = xor <4 x i8> %5, %8
...

Now 4 out of 20 XOR instructions are not vectorized:

...
  %6 = load <4 x i8>, ptr null, align 4
  %7 = extractelement <4 x i8> %6, i32 3
  %8 = extractelement <4 x i8> %6, i32 2
  %9 = extractelement <4 x i8> %6, i32 0
  %10 = xor i8 %9, %3
  %11 = extractelement <4 x i8> %6, i32 1
  %12 = xor i8 %11, %2
  %13 = xor i8 %8, %9
  %14 = xor i8 %7, %11
...

This leads to a significant performance degradation in one of our benchmarks. @alexey-bataev, could you please take a look at whether vectorization can be restored for this code?

augusto2112 pushed a commit to augusto2112/llvm-project that referenced this issue Sep 26, 2024
…rdering

When doing the repeated instructions analysis, better to make the
reordering non-profitable, if the number of unique instructions is not
power-of-2. In this case better to keep power-of-2 elements as this
allows better vectorization.

Fixes llvm#109725
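
The condition described in the commit message can be illustrated with a short sketch. This is not the SLP vectorizer's actual code. It only models the check the message names, i.e. whether the number of unique values is a power of 2. The function names and the use of strings to stand in for instructions are assumptions made for illustration.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ...; False for 0 and all other values."""
    return n > 0 and (n & (n - 1)) == 0

def unique_count_allows_reordering(operands):
    """Hypothetical model of the heuristic: treat reordering as
    non-profitable when the number of unique operands is not a power of 2."""
    unique = list(dict.fromkeys(operands))  # de-duplicate, keep first-seen order
    return is_power_of_two(len(unique))

# Three unique values out of four lanes: reordering is treated as non-profitable.
print(unique_count_allows_reordering(["%6", "%16", "%6", "%28"]))   # False
# Four unique values: the power-of-2 check does not rule reordering out.
print(unique_count_allows_reordering(["%6", "%16", "%28", "%40"]))  # True
```

A `True` result here only means this one check passes. The real profitability analysis depends on further factors that the sketch does not model.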
xgupta pushed a commit to xgupta/llvm-project that referenced this issue Oct 4, 2024