Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[llvm-mca] Add optional identifier field to mca::Instruction #97867

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

chinmaydd
Copy link
Contributor

While using llvm-mca as a library (in MCA Daemon), we've been having trouble uniquely identifying instructions that go through the pipeline. Inspired by the discussion here, I believe it may make sense to add an optional identifier to each mca::Instruction.

A possible instance of use could be the InstrBuilder incrementing the Identifier by 1 after each instruction is created. I'm happy to add that too.

@llvmbot
Copy link
Collaborator

llvmbot commented Jul 5, 2024

@llvm/pr-subscribers-tools-llvm-mca

Author: Chinmay Deshpande (chinmaydd)

Changes

While using llvm-mca as a library (in MCA Daemon), we've been having trouble uniquely identifying instructions that go through the pipeline. Inspired by the discussion here, I believe it may make sense to add an optional identifier to each mca::Instruction.

A possible instance of use could be the InstrBuilder incrementing the Identifier by 1 after each instruction is created. I'm happy to add that too.


Full diff: https://github.com/llvm/llvm-project/pull/97867.diff

1 Files Affected:

  • (modified) llvm/include/llvm/MCA/Instruction.h (+5)
diff --git a/llvm/include/llvm/MCA/Instruction.h b/llvm/include/llvm/MCA/Instruction.h
index e48a70164bec6..09822e43a827d 100644
--- a/llvm/include/llvm/MCA/Instruction.h
+++ b/llvm/include/llvm/MCA/Instruction.h
@@ -643,6 +643,8 @@ class Instruction : public InstructionBase {
   // True if this instruction has been optimized at register renaming stage.
   bool IsEliminated;
 
+  std::optional<uint64_t> Identifier;
+
 public:
   Instruction(const InstrDesc &D, const unsigned Opcode)
       : InstructionBase(D, Opcode), Stage(IS_INVALID),
@@ -690,6 +692,9 @@ class Instruction : public InstructionBase {
   bool isRetired() const { return Stage == IS_RETIRED; }
   bool isEliminated() const { return IsEliminated; }
 
+  std::optional<uint64_t> getIdentifier() const { return Identifier; }
+  void setIdentifier(uint64_t Id) { Identifier = Id; }
+
   // Forces a transition from state IS_DISPATCHED to state IS_EXECUTED.
   void forceExecuted();
   void setEliminated() { IsEliminated = true; }

Copy link
Contributor

@boomanaiden154 boomanaiden154 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have more information on how you intend to use this? I'm struggling to see how this would be particularly useful, especially with instruction recycling where you would end up picking up the identifier of a previous instruction that has been recycled.

@chinmaydd
Copy link
Contributor Author

chinmaydd commented Jul 9, 2024

Hi @boomanaiden154, thanks for your comment. We are using llvm-mca as a library in MCAD. MCAD tries to tweak the timings of individual instructions based on dynamic information from an emulator-like environment (if available). For instance, we have basic support for identifying aliasing loads and stores based on information retrieved from QEMU. We store metadata relating to address, and size of access with the key as the index of the instruction in the larger trace.

As you can see, the same instruction might have different metadata at different points in the program. For recycled instructions, we ensure to tag them appropriately when they are instantiated. Having a unique identifier to refer to instructions in the pipeline seems useful to me.

@michaelmaitland
Copy link
Contributor

michaelmaitland commented Jul 9, 2024

As you can see, the same instruction might have different metadata at different points in the program. For recycled instructions, we ensure to tag them appropriately when they are instantiated. Having a unique identifier to refer to instructions in the pipeline seems useful to me.

Could you please help me understand what new functionality this would allow in MCAD? It looks like we have a way to identify them.

Edit: I didn't see that the link above was a different branch from the main branch.

I don't have any strong objections to this change. In our private MCAD repo, we just removed the Metadata side of things to get it working with upstream LLVM. This change would allow us to get closer to using upstream LLVM with all MCAD functionality, which is desirable to me. One alternative is that you could create a subclass of mca::Instruction in MCAD that adds the identifier. Another alternative is to hash the mca::Instruction in MCAD.

@chinmaydd
Copy link
Contributor Author

Commenting for bump.

@boomanaiden154
Copy link
Contributor

Commenting for bump.

Can you reply to Michael's questions above? I'd be interest to know why the listed alternatives don't work.

The change seems fine to me. It does increase the size of mca::Instruction, but given it's already heap allocated and large, that's probably not too big of an issue.

@chinmaydd
Copy link
Contributor Author

chinmaydd commented Jul 11, 2024

Thanks for the response @boomanaiden154. I agree with @michaelmaitland that a central reason for this change is to be rid of the custom LLVM dependency that we have been dragging around with MCAD for a couple of years now.

Subclassing mca::Instruction is possible but it will require us to redefine even more llvm-mca functionality within MCAD. For instance, the entire Pipeline works with the InstRef class that holds a pointer to mca::Instruction. We are also looking to use hardware events generated by MCA (such as StallEvent, PressureEvent) to better inform the consumer. This requires that the unique identifier (and therefore access to the relevant metadata) persist through the various stages of the pipeline.

I'm not too confident that hashing will work reliably here. For instance the CyclesLeft field will be updated at different stages of the pipeline -- therefore changing the hash. I could hash the opcode and the operands from InstructionBase but that is not going to help in uniquely identifying instructions. I dont see any other member combination that would yield a unique identifier for differentiating Instructions A and B, that have the same opcode and operands, in the pipeline. Happy to hear suggestions.

@boomanaiden154
Copy link
Contributor

Adding an identifier seems like it might not be a bad option given the current design. If it helps multiple downstream consumers, then it seems reasonable enough to me. We don't need any of this sort of functionality in our downstream usage (so far), but I can see how it would potentially be useful for others.

I'll defer to the primary llvm-mca reviewers here as I'm not really qualified to review changes like this in this area of the project.

@chinmaydd
Copy link
Contributor Author

@michaelmaitland sorry for the ping. Could you weigh in please ? Thanks!

@michaelmaitland
Copy link
Contributor

@mshockwave added metadata to llvm-mcad but it did not have a significant impact on improving accuracy or precision. With that knowledge, I am not sure how important it is to modify upstream like this. SiFive has ripped out usage of metadata in our fork of llvm-mcad and integrated with no changes to upstream llvm.

I am curious whether @chinmaydd has a strong motivation to support metadata in llvm-mcad that would warrant us to take this change here.

I am a supporter of making llvm-mcad work with upstream LLVM. The question remains what approach to take.

@chinmaydd
Copy link
Contributor Author

chinmaydd commented Jul 17, 2024

@michaelmaitland thanks for your comments. I understand that getting MCAD to work with upstream LLVM without the metadata changes is possible. In fact, the version we have been maintaining has a branch that implements this.

But, we have been working on other use cases that may benefit from not providing, but receiving, metadata from MCAD. Consider the following disassembly sample -

image

This is part of the BinaryNinja (a reverse engineering framework) UI annotated with llvm-mca provided cycle counts. The red annotations highlight instructions that llvm-mca reports a StallEvent for. To facilitate this, we have the Broker register a HardwareEvent listener. In the listener, we get the identifier for the instruction through the InstRef and update local storage to reflect this information.

We've had multiple folks be interested in using MCAD like this. Integrating timing information as part of a disassembler UI aids in quickly iterating over and understanding timing changes for different versions of the generated binary (maybe optimizations, patches, etc.)

@chinmaydd
Copy link
Contributor Author

@michaelmaitland sorry for the ping. Any thoughts?

@michaelmaitland
Copy link
Contributor

michaelmaitland commented Jul 24, 2024

It sounds like the value here is for llvm-mcad to explore whether receiving metadata is valuable. IIRC, the original providing metadata did not turn out to be useful.

I have two ideas here:

  1. We hold off on taking this change until there is proof that llvm-mcad benefits from receiving metadata. Otherwise, there are no known instances where this change is valuable. llvm-mcad can cherry pick this change on its own llvm branch during experiments.
  2. We take this change since it is a small addition to the size of the object and non-invasive otherwise? Perhaps we can mark it as experimental, in the case that llvm-mcad receiving metadata proves to not be useful (I.e. no known users of the field exist in the wild).

I lean towards (1), but I wouldn’t mind (2) if someone felt strongly towards it. What do others think?

@chinmaydd
Copy link
Contributor Author

Hi @michaelmaitland, thank you for continuing to engage with this PR. I appreciate your time.

It sounds like the value here is for llvm-mcad to explore whether receiving metadata is valuable

That is correct, but like my previous comment highlights -- we are also using MCAD to "provide" metadata rather than receiving it. This way, the client of the Broker can leverage hardware events, such as stalls and pipeline pressure events, which we store as a mapping from {InstructionIdentifier: vector<HardwareEvents>}.

@michaelmaitland
Copy link
Contributor

the value here is for llvm-mcad to explore whether receiving metadata is valuable. IIRC, the original providing metadata did not turn out to be useful

Sorry, I had it backwards with receiving and providing. My two "ideas" still hold if we swap receiving and providing around:

  1. We hold off on taking this change until there is proof that llvm-mcad benefits from providing metadata. Otherwise, there are no known instances where this change is valuable. llvm-mcad can cherry pick this change on its own llvm branch during experiments.
  2. We take this change since it is a small addition to the size of the object and non-invasive otherwise? Perhaps we can mark it as experimental, in the case that llvm-mcad providing metadata proves to not be useful (I.e. no known users of the field exist in the wild).

@chinmaydd
Copy link
Contributor Author

Thats fair.

  1. Apart from this one change, MCAD is now completely compatible with upstream. We are no longer maintaining a custom fork. We are currently shipping MCAD with a patch file that applies this change.
  2. I'm okay with marking it as experimental for a while. If the community decides to remove it, I would appreciate a ping / mention so that we can update our build instructions.

Thanks !

@michaelmaitland
Copy link
Contributor

@mshockwave it would be great if you could weigh in on the two options.

@RKSimon
Copy link
Collaborator

RKSimon commented Jul 29, 2024

CC @adibiagio

@adibiagio
Copy link
Collaborator

@mshockwave it would be great if you could weigh in on the two options.

I also would like to hear @mshockwave's opinions on this.

Is there a reason why an instruction identifier needs to be a std::optional?
Have you considered using uint32_t (instead of uint64_t) for the underlying type?

We could probably save some space by placing the new identifier field in a smart way. If we better pack fields, we may avoid excessive padding due to alignment requirements, and reduce the total sizeof.

In principle, I don't particularly like the idea of adding fields to mca::instruction which are only used by downstream implementations. However, I don't think this change is problematic, and I don't consider the size change a big issue in practice. I'll let @michaelmaitland and @mshockwave decide on this.

-Andrea

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants