NPUW: Support NF4 DCOFF for CW models #27518
Conversation
…into at/npuw-nf4-dcoff-support
LGTM. Probably OV's built-in for dequantizing a single NF4 value is the bottleneck here, but an optimization round can be done on top of this. Let's go CW DQ next! (Should be really trivial)
@@ -229,7 +229,8 @@ std::shared_ptr<ov::ICompiledModel> Plugin::compile_model(const std::shared_ptr<
     ov::element::Type_t::f32,
     ov::element::Type_t::f64,
     ov::element::Type_t::boolean,
-    ov::element::Type_t::string};
+    ov::element::Type_t::string,
+    ov::element::Type_t::nf4};
Make sure the sorting order is correct here
auto cvt = std::static_pointer_cast<ov::op::v0::Convert>(matched_convrt);
auto matmul = std::static_pointer_cast<ov::op::v0::MatMul>(matched_matmul);

// NB: In case convert and matmul types don't match
cvt->set_destination_type(matmul->inputs()[1].get_element_type());

matched_matmul->input(1).replace_source_output(matched_convrt);
Am I right that you may end up with
Parameter(f16) -> Convert(f32) -> MatMul(f32)
?
For DCOFF it would probably still work, though.
void unpack_nf4f16_scale(const ov::SoPtr<ov::ITensor>& from,
                         const ov::SoPtr<ov::ITensor>& scale,
                         const ov::SoPtr<ov::ITensor>& to,
                         const ov::npuw::util::UnpackOptions& unpack_options) {
No need for the _scale suffix in the name here, since you already take scale as an input (assuming such overloads of unpack imply a scale).
Done
CPU part LGTM
No description provided.