enable serialize prepacked weights into data file #22256

Merged
merged 45 commits on Oct 25, 2024
Changes from all commits
Commits
45 commits
59fca4a
test
frank-dong-ms Aug 30, 2024
b34b3d0
serialize prepack initializers to onnx data file
frank-dong-ms Sep 27, 2024
57c5c58
sync and merge changes
frank-dong-ms Sep 27, 2024
acc23f4
fix matmul_nbits kernel
frank-dong-ms Sep 28, 2024
fe9c81b
code clean up
frank-dong-ms Sep 28, 2024
c7f19ca
bug fix
frank-dong-ms Sep 28, 2024
327cb1c
fix lint style
frank-dong-ms Sep 28, 2024
c6f8b4e
fix CI failure in Linux
frank-dong-ms Sep 30, 2024
46b9bac
fix CI failure in Android
frank-dong-ms Sep 30, 2024
ee818ce
fix test failures
frank-dong-ms Oct 1, 2024
4520d83
disable test for non-CPU and non-PC env, fix several tests
frank-dong-ms Oct 1, 2024
f77d479
more code clean up
frank-dong-ms Oct 1, 2024
d58a024
Merge branch 'main' into frdong/prepack_1
frank-dong-ms Oct 2, 2024
7bf0cd7
fix CI errors regarding memory leak
frank-dong-ms Oct 8, 2024
694a2bc
sync and merge
frank-dong-ms Oct 8, 2024
18b079c
fix CI failures
frank-dong-ms Oct 8, 2024
81eb968
fix CI errors
frank-dong-ms Oct 8, 2024
e48d808
fix training pipeline for session states
frank-dong-ms Oct 8, 2024
600a79e
avoid use smart pointer to wrap tensor
frank-dong-ms Oct 9, 2024
f121fb9
refine memory use
frank-dong-ms Oct 10, 2024
1f96556
fix CI errors
frank-dong-ms Oct 10, 2024
3e0da5c
fix error in mini build
frank-dong-ms Oct 10, 2024
5c2e38c
Merge branch 'main' of https://github.com/Microsoft/onnxruntime into …
frank-dong-ms Oct 10, 2024
1dba2f9
fix CI failures
frank-dong-ms Oct 10, 2024
4681689
disable prepack serialization on x86
frank-dong-ms Oct 11, 2024
83be7dd
take comments, use inlined containers
frank-dong-ms Oct 12, 2024
3c63aad
fix mini build error
frank-dong-ms Oct 12, 2024
394745c
fix comments
frank-dong-ms Oct 15, 2024
a61f1ef
change back to use std containers
frank-dong-ms Oct 15, 2024
3cb8b97
Merge branch 'main' of https://github.com/Microsoft/onnxruntime into …
frank-dong-ms Oct 15, 2024
85356b1
merge main and fix
frank-dong-ms Oct 15, 2024
3910ef6
Merge branch 'main' of https://github.com/Microsoft/onnxruntime into …
frank-dong-ms Oct 15, 2024
d35a7ef
take comments
frank-dong-ms Oct 21, 2024
54c2eab
Merge branch 'main' of https://github.com/Microsoft/onnxruntime into …
frank-dong-ms Oct 21, 2024
c4a7b4a
enhance document
frank-dong-ms Oct 22, 2024
679dd0d
fix mini build
frank-dong-ms Oct 22, 2024
d591a6e
fix comments
frank-dong-ms Oct 23, 2024
00262b2
Merge branch 'main' of https://github.com/Microsoft/onnxruntime into …
frank-dong-ms Oct 23, 2024
390cde5
sync main
frank-dong-ms Oct 23, 2024
ead9032
fix CI
frank-dong-ms Oct 23, 2024
c87f420
take comments
frank-dong-ms Oct 24, 2024
5d0d07d
merge main and fix comments
frank-dong-ms Oct 24, 2024
e6b86e6
fix lint and build issue on web CI
frank-dong-ms Oct 24, 2024
a3e7314
fix API
frank-dong-ms Oct 24, 2024
b832ce9
split test with private API and public api
frank-dong-ms Oct 25, 2024
22 changes: 22 additions & 0 deletions include/onnxruntime/core/framework/op_kernel.h
@@ -79,6 +79,7 @@ class OpKernel {
// the allocator tied to the session if the kernel owns the pre-packed buffer or an
// allocator shared between sessions if the pre-packed buffer is to be shared across sessions
// (i.e.) the kernel does not own the buffer.
// @param save_prepacked_initializers: Set it to true if you intend to save prepacked initializers to an external data file.
// @param is_packed: Set it to true if the kernel packed the tensor or to false
// The kernel is responsible for keeping the packed data and related metadata if is_packed is true,
// and the original initialized constant tensor will be released and not accessible anymore in
@@ -88,6 +89,7 @@

virtual Status
PrePack(const Tensor& /*tensor*/, int /*input_idx*/, AllocatorPtr /*alloc*/,
bool, /*save_prepacked_initializers*/
/*out*/ bool& is_packed, /*out*/ PrePackedWeights* /*prepacked_weights*/) {
is_packed = false;
return Status::OK();
@@ -129,6 +131,26 @@
return Status::OK();
}

// Override this function to get pre-packed tensors from this kernel.
// Only useful for models run on PC with CPU, so ORT can load prepacked weights directly from the
// ONNX data file with mmap instead of prepacking on the fly, which saves a lot of heap memory.
// @param input_idx : The index of the input that was prepacked and whose packed tensor is requested.
// Please refer to the matmul_nbits kernel for a complete example.
virtual std::optional<Tensor> GetPrePackTensor(int /*input_idx*/) {
return std::nullopt;
}

// Override this function to set pre-packed tensors on this kernel and restore the prepacked weight buffer.
// Only useful for models run on PC with CPU, so ORT can load prepacked weights directly from the
// ONNX data file with mmap instead of prepacking on the fly, which saves a lot of heap memory.
// Please refer to the matmul_nbits kernel for a complete example.
// @param input_idx : The input index of the tensor in this kernel.
// @param pre_packed_tensor: The prepacked tensor read from the ONNX data file, used to restore the
// prepacked weight buffer.
virtual Status SetPrePackTensor(int /*input_idx*/, const Tensor& /*pre_packed_tensor*/) {
return Status::OK();
}

const OrtDevice GetDevice(OrtMemType mem_type) const;
const OpKernelInfo& Info() const {
return *op_kernel_info_;
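Taken together, these hooks let a kernel hand its packed buffer to the session for serialization and later re-adopt that buffer from the mmapped data file instead of packing again. Below is a minimal sketch of the override pattern, assuming a hypothetical kernel; MyGemmKernel, packed_buffer_ and packed_size_ are illustrative names, not part of this PR, and the real packing logic is elided.

#include <cstring>
#include <optional>
#include "core/framework/op_kernel.h"

namespace onnxruntime {

// Hypothetical kernel showing how PrePack/GetPrePackTensor/SetPrePackTensor fit together.
class MyGemmKernel : public OpKernel {
 public:
  explicit MyGemmKernel(const OpKernelInfo& info) : OpKernel(info) {}

  Status PrePack(const Tensor& tensor, int input_idx, AllocatorPtr alloc,
                 bool save_prepacked_initializers,
                 /*out*/ bool& is_packed,
                 /*out*/ PrePackedWeights* /*prepacked_weights*/) override {
    is_packed = false;
    if (input_idx != 1) {
      return Status::OK();  // only the weight input is packed in this sketch
    }
    // Pack the weight into a kernel-owned buffer (real layout transformation elided).
    packed_size_ = tensor.SizeInBytes();
    packed_buffer_ = BufferUniquePtr(alloc->Alloc(packed_size_), BufferDeleter(alloc));
    memcpy(packed_buffer_.get(), tensor.DataRaw(), packed_size_);
    is_packed = true;

    if (save_prepacked_initializers) {
      // Wrap the packed bytes in a non-owning Tensor so the session can serialize them
      // into the external data file alongside the other initializers.
      packed_tensor_ = Tensor(DataTypeImpl::GetType<uint8_t>(),
                              TensorShape({static_cast<int64_t>(packed_size_)}),
                              packed_buffer_.get(),
                              OrtMemoryInfo(CPU, OrtAllocatorType::OrtDeviceAllocator));
    }
    return Status::OK();
  }

  std::optional<Tensor> GetPrePackTensor(int /*input_idx*/) override {
    // Hand the packed tensor over for serialization; it still aliases packed_buffer_.
    return std::move(packed_tensor_);
  }

  Status SetPrePackTensor(int input_idx, const Tensor& pre_packed_tensor) override {
    if (input_idx == 1) {
      // Restore path: point the kernel at the mmapped prepacked initializer. The memory is
      // owned by session_state, so an empty deleter keeps the kernel from freeing it.
      packed_buffer_ = BufferUniquePtr(const_cast<void*>(pre_packed_tensor.DataRaw()), BufferDeleter());
    }
    return Status::OK();
  }

 private:
  BufferUniquePtr packed_buffer_;
  size_t packed_size_{0};
  std::optional<Tensor> packed_tensor_{std::nullopt};
};

}  // namespace onnxruntime

Kernels that do not opt in simply inherit the defaults above, which leave is_packed false, return std::nullopt from GetPrePackTensor, and make SetPrePackTensor a no-op.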
29 changes: 27 additions & 2 deletions include/onnxruntime/core/graph/graph.h
@@ -1148,6 +1148,11 @@
void FinalizeFuseSubGraph(const IndexedSubGraph& sub_graph, Node& fused_node);
#endif

// Since one constant initializer could be used by different kernels
// and prepacked differently, use an unordered_map to store prepacked
// initializers in the format <[initializer_name], <[node_name], [prepacked_initializer]>>.
typedef std::unordered_map<std::string, std::unordered_map<std::string, ONNX_NAMESPACE::TensorProto>> PrePackedTensorProtoToSave;

#if !defined(ORT_MINIMAL_BUILD)
/** Gets the GraphProto representation of this Graph. */
const ONNX_NAMESPACE::GraphProto& ToGraphProto();
@@ -1182,18 +1187,26 @@
@param initializer_size_threshold initializers larger or equal to this threshold (in bytes) are saved
in the external file. Initializer smaller than this threshold are included in the onnx file.
@param align_info offset alignment info.
@param save_prepacked_constant_initializers whether to save prepacked initializers into the external data file.
If this is set to false, prepacked initializers are not saved into the onnxruntime data file and the
constant initializers are kept as they are.
@param pre_packed_initializers map used to store all the prepacked initializers.
@returns GraphProto serialization of the graph.
*/
ONNX_NAMESPACE::GraphProto ToGraphProtoWithExternalInitializers(const std::filesystem::path& external_file_path,
const std::filesystem::path& model_file_path,
size_t initializer_size_threshold,
const OffsetAlignmentInfo& align_info) const;
const OffsetAlignmentInfo& align_info,
bool save_prepacked_constant_initializers,
PrePackedTensorProtoToSave& pre_packed_initializers) const;

ONNX_NAMESPACE::GraphProto ToGraphProtoWithExternalInitializers(const std::filesystem::path& external_file_path,
const std::filesystem::path& model_file_path,
size_t initializer_size_threshold) const {
OffsetAlignmentInfo default_options;
return ToGraphProtoWithExternalInitializers(external_file_path, model_file_path, initializer_size_threshold, default_options);
PrePackedTensorProtoToSave pre_packed_initializers;
return ToGraphProtoWithExternalInitializers(external_file_path, model_file_path, initializer_size_threshold, default_options,
false, pre_packed_initializers);
}

/** Gets the ISchemaRegistry instances being used with this Graph. */
@@ -1508,6 +1521,18 @@
private:
void InitializeStateFromModelFileGraphProto();

// Private method used to set up an external initializer properly during model save;
// this external initializer could be the original initializer or a prepacked initializer.
static void SetUpExternalInitializer(const Graph::OffsetAlignmentInfo& align_info,
size_t tensor_bytes_size,
int64_t& external_offset,
std::ofstream& external_stream,
gsl::span<const uint8_t> raw_data,
ONNX_NAMESPACE::TensorProto& output_proto,
const std::filesystem::path& external_file_path,
const ONNX_NAMESPACE::TensorProto& initializer,
bool is_prepacked);

// Add node with specified <node_proto>.
Node& AddNode(const ONNX_NAMESPACE::NodeProto& node_proto,
const ArgNameToTypeMap& name_to_type);
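The nested map above is keyed by initializer name first and by consuming node second, since the same constant can be packed differently per kernel. A small sketch of how a caller might populate it before invoking ToGraphProtoWithExternalInitializers, assuming the typedef is publicly accessible as Graph::PrePackedTensorProtoToSave and with made-up names and sizes:

// Illustration only (hypothetical helper, not part of this PR): build the
// <initializer_name, <node_name, TensorProto>> map consumed by
// ToGraphProtoWithExternalInitializers() when prepacked saving is enabled.
// packed_bytes stands in for a kernel's packed blob.
Graph::PrePackedTensorProtoToSave BuildPrePackedMap(const std::vector<uint8_t>& packed_bytes) {
  Graph::PrePackedTensorProtoToSave pre_packed_initializers;

  ONNX_NAMESPACE::TensorProto packed;
  packed.set_name("layer0.weight");  // hypothetical initializer name
  packed.set_data_type(ONNX_NAMESPACE::TensorProto_DataType_UINT8);
  packed.add_dims(static_cast<int64_t>(packed_bytes.size()));
  packed.set_raw_data(packed_bytes.data(), packed_bytes.size());

  // One initializer can map to several prepacked blobs, one per node that packed it.
  pre_packed_initializers["layer0.weight"]["/attn/MatMulNBits_0"] = std::move(packed);
  return pre_packed_initializers;
}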
@@ -246,6 +246,12 @@ static const char* const kOrtSessionOptionsDisableCPUEPFallback = "session.disab
static const char* const kOrtSessionOptionsOptimizedModelExternalInitializersFileName =
"session.optimized_model_external_initializers_file_name";

// Use this config to save prepacked constant initializers to the ONNX external data file.
// Default is to not save prepacked initializers to the ONNX data file.
// Sample usage: sess_options.add_session_config_entry('session.save_prepacked_constant_initializers', "1")
static const char* const kOrtSessionOptionsSavePrePackedConstantInitializers =
"session.save_prepacked_constant_initializers";

// Use this config to control the minimum size of the initializer when externalizing it during serialization
static const char* const kOrtSessionOptionsOptimizedModelExternalInitializersMinSizeInBytes =
"session.optimized_model_external_initializers_min_size_in_bytes";
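For context, this key is consumed together with the existing optimized-model and external-initializer options defined in this header. A hedged end-to-end sketch using the public C++ API follows; the model and data file names are placeholders:

#include "onnxruntime_cxx_api.h"

// Sketch: save an optimized model whose prepacked weights are written to the external data file.
// "model.onnx", "model_opt.onnx" and "model_opt.onnx.data" are placeholder names.
int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "prepack_save");
  Ort::SessionOptions so;
  so.SetOptimizedModelFilePath(ORT_TSTR("model_opt.onnx"));
  so.AddConfigEntry("session.optimized_model_external_initializers_file_name", "model_opt.onnx.data");
  so.AddConfigEntry("session.optimized_model_external_initializers_min_size_in_bytes", "1024");
  so.AddConfigEntry("session.save_prepacked_constant_initializers", "1");

  // Creating the session runs prepacking and writes model_opt.onnx + model_opt.onnx.data.
  Ort::Session session(env, ORT_TSTR("model.onnx"), so);
  return 0;
}

A second session created against model_opt.onnx can then mmap the already-prepacked weights instead of packing them again, which is the heap-memory saving this PR targets.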
2 changes: 2 additions & 0 deletions onnxruntime/contrib_ops/cpu/bert/attention.cc
@@ -30,6 +30,7 @@
Status Compute(OpKernelContext* context) const override;

Status PrePack(const Tensor& tensor, int input_idx, AllocatorPtr alloc,
bool save_prepacked_initializers,
/*out*/ bool& is_packed,
/*out*/ PrePackedWeights* prepacked_weights) override;

@@ -101,6 +102,7 @@

template <typename T>
Status Attention<T>::PrePack(const Tensor& weights, int input_idx, AllocatorPtr alloc,
bool /*save_prepacked_initializers*/,
/*out*/ bool& is_packed,
/*out*/ PrePackedWeights* prepacked_weights) {
/* The PrePack() massages the weights to speed up Compute(), there is an option to
2 changes: 2 additions & 0 deletions onnxruntime/contrib_ops/cpu/quantization/attention_quant.cc
@@ -24,6 +24,7 @@
Status Compute(OpKernelContext* context) const override;

Status PrePack(const Tensor& tensor, int input_idx, AllocatorPtr alloc,
bool save_prepacked_initializers,
bool& /*out*/ is_packed,
/*out*/ PrePackedWeights* prepacked_weights) override;

@@ -58,6 +59,7 @@

template <typename T>
Status QAttention<T>::PrePack(const Tensor& weights, int input_idx, AllocatorPtr alloc,
bool /*save_prepacked_initializers*/,
/*out*/ bool& is_packed,
/*out*/ PrePackedWeights* prepacked_weights) {
if (1 != input_idx) {
onnxruntime/contrib_ops/cpu/quantization/dynamic_quantize_lstm.cc
Expand Up @@ -13,7 +13,7 @@
DynamicQuantizeLSTM(const OpKernelInfo& info) : OpKernel(info), LSTMBase(info) {}

Status PrePack(const Tensor& tensor, int input_idx,
AllocatorPtr alloc, /*out*/ bool& is_packed,
AllocatorPtr alloc, bool save_prepacked_initializers, /*out*/ bool& is_packed,
/*out*/ PrePackedWeights* prepacked_weights) override;

Status UseSharedPrePackedBuffers(std::vector<BufferUniquePtr>& prepacked_buffers,
@@ -91,6 +91,7 @@
}

Status DynamicQuantizeLSTM::PrePack(const Tensor& tensor, int input_idx, AllocatorPtr alloc,
bool /*save_prepacked_initializers*/,
/*out*/ bool& is_packed,
/*out*/ PrePackedWeights* prepacked_weights) {
is_packed = false;
56 changes: 56 additions & 0 deletions onnxruntime/contrib_ops/cpu/quantization/matmul_nbits.cc
@@ -98,12 +98,19 @@
Status Compute(OpKernelContext* context) const override;

Status PrePack(const Tensor& tensor, int input_idx, AllocatorPtr alloc,
bool save_prepacked_initializers,
/*out*/ bool& is_packed,
/*out*/ PrePackedWeights* prepacked_weights) override;

void ConvertPrepackWeightIntoTensor(const onnxruntime::Tensor& tensor, int input_idx);

Status UseSharedPrePackedBuffers(std::vector<BufferUniquePtr>& prepacked_buffers, int input_idx,
/*out*/ bool& used_shared_buffers) override;

std::optional<Tensor> GetPrePackTensor(int /*input_idx*/) override;

Status SetPrePackTensor(int input_idx, const Tensor& pre_packed_tensor) override;

private:
const size_t K_;
const size_t N_;
@@ -119,6 +126,8 @@
size_t packed_b_size_{0};
IAllocatorUniquePtr<float> scales_fp32_{};
IAllocatorUniquePtr<float> bias_fp32_{};
std::optional<Tensor> packed_tensor_{std::nullopt};
MLDataType prepack_tensor_data_type_;

bool has_zp_input_{false};

@@ -148,8 +157,22 @@
}
};

template <typename T1>
void MatMulNBits<T1>::ConvertPrepackWeightIntoTensor(const onnxruntime::Tensor& tensor, int input_idx) {
if (input_idx == InputIndex::B) {
prepack_tensor_data_type_ = tensor.DataType();
}

TensorShapeVector weights_dims = {static_cast<int64_t>((packed_b_size_ - 1) / prepack_tensor_data_type_->Size()) + 1};
packed_tensor_ = Tensor(prepack_tensor_data_type_,
TensorShape(weights_dims),
packed_b_.get(),
OrtMemoryInfo(CPU, OrtAllocatorType::OrtDeviceAllocator));
}

template <typename T1>
Status MatMulNBits<T1>::PrePack(const Tensor& tensor, int input_idx, /*out*/ AllocatorPtr alloc,
bool save_prepacked_initializers,
/*out*/ bool& is_packed,
/*out*/ PrePackedWeights* prepacked_weights) {
ORT_UNUSED_PARAMETER(prepacked_weights);
@@ -185,11 +208,16 @@
#endif // MLAS_TARGET_AMD64_IX86
}

if (save_prepacked_initializers) {
ConvertPrepackWeightIntoTensor(tensor, input_idx);
}

return Status::OK();
}

template <>
Status MatMulNBits<MLFloat16>::PrePack(const Tensor& tensor, int input_idx, /*out*/ AllocatorPtr alloc,
bool save_prepacked_initializers,
/*out*/ bool& is_packed,
/*out*/ PrePackedWeights* prepacked_weights) {
ORT_UNUSED_PARAMETER(prepacked_weights);
@@ -239,6 +267,34 @@
#endif // MLAS_TARGET_AMD64_IX86
}

if (save_prepacked_initializers) {
ConvertPrepackWeightIntoTensor(tensor, input_idx);
}

return Status::OK();
}

template <typename T1>
std::optional<Tensor> MatMulNBits<T1>::GetPrePackTensor(int input_idx) {
// For this kernel, prepack is performed on input_B and possibly on scales and zero_points.
// During compute, scales and zero_points are kept as they are and only the prepacked
// buffer replaces input_B.
// To cope with this logic, we need to return the latest prepacked buffer and serialize only
// the latest one. So we always return packed_tensor_ here, not only for input_B.
ORT_UNUSED_PARAMETER(input_idx);
return std::move(packed_tensor_);
}

template <typename T1>
Status MatMulNBits<T1>::SetPrePackTensor(int input_idx, const Tensor& pre_packed_tensor) {
if (input_idx == 1) {
// pre_packed_tensor is a constant initialized tensor whose lifecycle is managed by session_state;
// session_state will release the memory behind pre_packed_tensor. packed_b_ must not release that
// memory, so pass an empty/default buffer deleter here.
// The const_cast here is temporary and will be fixed in a follow-up PR.
packed_b_ = BufferUniquePtr(const_cast<void*>(pre_packed_tensor.DataRaw()), BufferDeleter());
}

return Status::OK();
}

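Two ownership details above are easy to miss: the Tensor built by ConvertPrepackWeightIntoTensor only aliases packed_b_, and SetPrePackTensor adopts memory owned by session_state, which is why the buffer deleter is left empty. A standalone sketch of that non-owning wrap, with a made-up buffer and size, using the same Tensor/OrtMemoryInfo types as above:

#include <cstdint>
#include <vector>
#include "core/framework/tensor.h"

// Illustration only (not part of this PR): wrap an existing packed byte blob in a Tensor
// without taking ownership, mirroring ConvertPrepackWeightIntoTensor() above.
onnxruntime::Tensor WrapPackedBytes(std::vector<uint8_t>& packed_bytes) {
  using namespace onnxruntime;
  MLDataType dtype = DataTypeImpl::GetType<uint8_t>();
  // Ceiling division: number of dtype-sized elements needed to cover the packed byte size.
  TensorShapeVector dims = {static_cast<int64_t>((packed_bytes.size() - 1) / dtype->Size()) + 1};
  // The returned Tensor only references packed_bytes; destroying it frees nothing, so the real
  // owner (packed_b_ in the kernel, or session_state for a restored initializer) controls lifetime.
  return Tensor(dtype, TensorShape(dims), packed_bytes.data(),
                OrtMemoryInfo(CPU, OrtAllocatorType::OrtDeviceAllocator));
}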
1 change: 1 addition & 0 deletions onnxruntime/contrib_ops/cpu/skip_layer_norm.cc
@@ -278,6 +278,7 @@

template <typename T, bool simplified>
Status SkipLayerNorm<T, simplified>::PrePack(const Tensor& tensor, int input_idx, AllocatorPtr alloc,
bool /*save_prepacked_initializers*/,
bool& is_packed, PrePackedWeights* prepacked_weights) {
ORT_UNUSED_PARAMETER(prepacked_weights);

2 changes: 1 addition & 1 deletion onnxruntime/contrib_ops/cpu/skip_layer_norm.h
@@ -16,7 +16,7 @@ class SkipLayerNorm final : public OpKernel {
SkipLayerNorm(const OpKernelInfo& op_kernel_info);
Status Compute(OpKernelContext* p_op_kernel_context) const override;

Status PrePack(const Tensor& tensor, int input_idx, AllocatorPtr alloc,
Status PrePack(const Tensor& tensor, int input_idx, AllocatorPtr alloc, bool save_prepacked_initializers,
bool& is_packed, PrePackedWeights* prepacked_weights) override;

private:
1 change: 1 addition & 0 deletions onnxruntime/contrib_ops/cuda/diffusion/group_norm.cc
@@ -95,6 +95,7 @@
}

Status GroupNorm::PrePack(const Tensor& tensor, int input_idx, AllocatorPtr /*alloc*/,
bool /*save_prepacked_initializers*/,
bool& is_packed, PrePackedWeights* /*prepacked_weights*/) {
is_packed = false;

1 change: 1 addition & 0 deletions onnxruntime/contrib_ops/cuda/diffusion/group_norm.h
@@ -17,6 +17,7 @@ class GroupNorm final : public CudaKernel {
Status ComputeInternal(OpKernelContext* context) const override;

Status PrePack(const Tensor& tensor, int input_idx, AllocatorPtr alloc,
bool save_prepacked_initializers,
bool& is_packed, PrePackedWeights* prepacked_weights) override;

private:
@@ -99,6 +99,7 @@ Status QOrderedAttention::PutIntoMergedBias(const Tensor& tensor, AllocatorPtr a
}

Status QOrderedAttention::PrePack(const Tensor& tensor, int input_idx, /*out*/ AllocatorPtr alloc,
bool /*save_prepacked_initializers*/,
/*out*/ bool& is_packed,
/*out*/ PrePackedWeights* /*prepacked_weights*/) {
is_packed = false;
@@ -20,6 +20,7 @@ class QOrderedAttention final : public CudaKernel, public AttentionBase {

public:
Status PrePack(const Tensor& tensor, int input_idx, AllocatorPtr alloc,
bool save_prepacked_initializers,
/*out*/ bool& is_packed,
/*out*/ PrePackedWeights* prepacked_weights) override;

@@ -51,6 +51,7 @@ QOrderedMatMul::QOrderedMatMul(const OpKernelInfo& info) : CudaKernel(info) {
}

Status QOrderedMatMul::PrePack(const Tensor& tensor, int input_idx, AllocatorPtr alloc,
bool /*save_prepacked_initializers*/,
/*out*/ bool& is_packed,
/*out*/ PrePackedWeights* /* prepacked_weights */) {
is_packed = false;
@@ -18,6 +18,7 @@ class QOrderedMatMul final : public CudaKernel {
Status ComputeInternal(OpKernelContext* context) const override;

Status PrePack(const Tensor& tensor, int input_idx, AllocatorPtr alloc,
bool save_prepacked_initializers,
/*out*/ bool& is_packed,
/*out*/ PrePackedWeights* prepacked_weights) override;

6 changes: 6 additions & 0 deletions onnxruntime/core/framework/session_options.h
@@ -83,6 +83,11 @@ struct SessionOptions {
// enable profiling for this session.
bool enable_profiling = false;

// Save pre-packed constant external initializers instead of the original initializers to the onnxruntime data file.
// Only useful for models run on PC with CPU, so ORT can load prepacked weights directly from the
// ONNX data file with mmap instead of prepacking on the fly, which saves a lot of heap memory.
bool save_prepacked_constant_initializers = false;

// Non empty filepath enables serialization of the transformed optimized model to the specified filepath.
//
// Set session config value for ORT_SESSION_OPTIONS_CONFIG_SAVE_MODEL_FORMAT to 'ORT' or 'ONNX' to explicitly
@@ -191,6 +196,7 @@ inline std::ostream& operator<<(std::ostream& os, const SessionOptions& session_
<< " execution_mode:" << session_options.execution_mode
<< " execution_order:" << session_options.execution_order
<< " enable_profiling:" << session_options.enable_profiling
<< " save_prepacked_constant_initializers:" << session_options.save_prepacked_constant_initializers
<< " optimized_model_filepath:" << ORT_TSTR_CONVERT_TO_PRINTABLE_STRING(session_options.optimized_model_filepath)
<< " enable_mem_pattern:" << session_options.enable_mem_pattern
<< " enable_mem_reuse:" << session_options.enable_mem_reuse
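For code paths that construct the internal SessionOptions struct directly (for example unit tests, as opposed to the public API sketch shown earlier), the new field is just a boolean toggle. A minimal sketch under that assumption, with a placeholder output path:

// Sketch only: enabling the flag on the internal onnxruntime::SessionOptions struct.
onnxruntime::SessionOptions so;
so.optimized_model_filepath = ORT_TSTR("model_opt.onnx");   // placeholder output path
so.save_prepacked_constant_initializers = true;             // serialize prepacked weights to the data file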