[SLP] Improvement of reordering for consts, splats and ops breaks vectorization of XOR instructions #109725

ivankelarev · 2024-09-23T22:43:00Z

It appears that #87091 change partially breaks vectorization of XOR instructions for this code:

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define i32 @a() #0 {
  br label %1

1:
  %2 = phi i8 [ 0, %0 ], [ %40, %1 ]
  %3 = phi i8 [ 0, %0 ], [ %28, %1 ]
  %4 = phi i8 [ 0, %0 ], [ %16, %1 ]
  %5 = phi i8 [ 0, %0 ], [ %6, %1 ]
  %6 = load i8, ptr null, align 4
  %7 = xor i8 %6, %3
  %8 = xor i8 %7, %4
  %9 = xor i8 %8, %5
  store i8 %9, ptr null, align 4
  %10 = xor i8 %6, %2
  %11 = xor i8 %10, %5
  %12 = add i64 0, 1
  %13 = getelementptr i8, ptr null, i64 %12
  store i8 %11, ptr %13, align 1
  %14 = add i64 0, 1
  %15 = getelementptr i8, ptr null, i64 %14
  %16 = load i8, ptr %15, align 1
  %17 = xor i8 %16, %2
  %18 = xor i8 %17, %3
  %19 = xor i8 %18, %4
  %20 = add i64 0, 2
  %21 = getelementptr i8, ptr null, i64 %20
  store i8 %19, ptr %21, align 2
  %22 = xor i8 %16, %6
  %23 = xor i8 %22, %4
  %24 = add i64 0, 3
  %25 = getelementptr i8, ptr null, i64 %24
  store i8 %23, ptr %25, align 1
  %26 = add i64 0, 2
  %27 = getelementptr i8, ptr null, i64 %26
  %28 = load i8, ptr %27, align 2
  %29 = xor i8 %28, %6
  %30 = xor i8 %29, %2
  %31 = xor i8 %30, %3
  %32 = add i64 0, 4
  %33 = getelementptr i8, ptr null, i64 %32
  store i8 %31, ptr %33, align 4
  %34 = xor i8 %28, %16
  %35 = xor i8 %34, %3
  %36 = add i64 0, 5
  %37 = getelementptr i8, ptr null, i64 %36
  store i8 %35, ptr %37, align 1
  %38 = add i64 0, 3
  %39 = getelementptr i8, ptr null, i64 %38
  %40 = load i8, ptr %39, align 1
  %41 = xor i8 %40, %16
  %42 = xor i8 %41, %6
  %43 = xor i8 %42, %2
  %44 = add i64 0, 6
  %45 = getelementptr i8, ptr null, i64 %44
  store i8 %43, ptr %45, align 2
  %46 = xor i8 %40, %28
  %47 = xor i8 %46, %2
  %48 = add i64 0, 7
  %49 = getelementptr i8, ptr null, i64 %48
  store i8 %47, ptr %49, align 1
  br label %1
}

attributes #0 = { "target-cpu"="core-avx2" }

Before the change all the XOR instructions were vectorized:

...
  %5 = load <4 x i8>, ptr null, align 4
  %6 = shufflevector <4 x i8> %5, <4 x i8> poison, <4 x i32> <i32 poison, i32 poison, i32 0, i32 1>
  %7 = shufflevector <2 x i8> %3, <2 x i8> poison, <4 x i32> <i32 0, i32 1, i32 poison, i32 poison>
  %8 = shufflevector <4 x i8> %6, <4 x i8> %7, <4 x i32> <i32 4, i32 5, i32 2, i32 3>
  %9 = xor <4 x i8> %5, %8
...

Now 4 out of 20 XOR instructions are not vectorized:

...
  %6 = load <4 x i8>, ptr null, align 4
  %7 = extractelement <4 x i8> %6, i32 3
  %8 = extractelement <4 x i8> %6, i32 2
  %9 = extractelement <4 x i8> %6, i32 0
  %10 = xor i8 %9, %3
  %11 = extractelement <4 x i8> %6, i32 1
  %12 = xor i8 %11, %2
  %13 = xor i8 %8, %9
  %14 = xor i8 %7, %11
...

This leads to a significant performance degradations in one of our benchmarks. @alexey-bataev, could you please take a look if the vectorizations can be restored for this code?

The text was updated successfully, but these errors were encountered:

…rdering When doing the repeated instructions analysis, better to make the reordering non-profitable, if the number of unique instructions is not power-of-2. In this case better to keep power-of-2 elements as this allows better vectorization. Fixes llvm#109725

ivankelarev assigned alexey-bataev Sep 23, 2024

github-actions bot added the new issue label Sep 23, 2024

EugeneZelenko added llvm:SLPVectorizer and removed new issue labels Sep 23, 2024

alexey-bataev closed this as completed in d7dd31e Sep 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SLP] Improvement of reordering for consts, splats and ops breaks vectorization of XOR instructions #109725

[SLP] Improvement of reordering for consts, splats and ops breaks vectorization of XOR instructions #109725

ivankelarev commented Sep 23, 2024

[SLP] Improvement of reordering for consts, splats and ops breaks vectorization of XOR instructions #109725

[SLP] Improvement of reordering for consts, splats and ops breaks vectorization of XOR instructions #109725

Comments

ivankelarev commented Sep 23, 2024