Commit ce64ac9: Rename.

trivialfis committed Oct 16, 2024
1 parent 69f3c17 commit ce64ac9
Showing 11 changed files with 68 additions and 54 deletions.
26 changes: 17 additions & 9 deletions doc/tutorials/external_memory.rst
@@ -120,8 +120,11 @@
the ``hist`` tree method is employed. For a GPU device, the main memory is the device
memory, whereas the external memory can be either a disk or the CPU memory. XGBoost stages
the cache on CPU memory by default. Users can change the backing storage to disk by
specifying the ``on_host`` parameter in the :py:class:`~xgboost.DataIter`. However, using
- the disk is not recommended. It's likely to make the GPU slower than the CPU. The option is
- here for experimental purposes only.
+ the disk is not recommended as it's likely to make the GPU slower than the CPU. The option
+ is here for experimental purposes only. In addition, the
+ :py:class:`~xgboost.ExtMemQuantileDMatrix` parameters ``max_num_device_pages``,
+ ``min_cache_page_bytes``, and ``max_quantile_batches`` can help control data placement
+ and memory usage.

Inputs to the :py:class:`~xgboost.ExtMemQuantileDMatrix` (through the iterator) must be on
the GPU. This is a current limitation we aim to address in the future.
@@ -157,12 +160,17 @@
evals=[(Xy_train, "Train"), (Xy_valid, "Valid")]
)
- It's crucial to use `RAPIDS Memory Manager (RMM) <https://github.com/rapidsai/rmm>`__ for
- all memory allocation when training with external memory. XGBoost relies on the memory
- pool to reduce the overhead for data fetching. In addition, the open source `NVIDIA Linux
- driver
+ It's crucial to use `RAPIDS Memory Manager (RMM) <https://github.com/rapidsai/rmm>`__ with
+ an asynchronous memory resource for all memory allocation when training with external
+ memory. XGBoost relies on the asynchronous memory pool to reduce the overhead of data
+ fetching. In addition, the open source `NVIDIA Linux driver
<https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/>`__
- is required for ``Heterogeneous memory management (HMM)`` support.
+ is required for ``Heterogeneous memory management (HMM)`` support. Usually, users need
+ not change the :py:class:`~xgboost.ExtMemQuantileDMatrix` parameters
+ ``max_num_device_pages`` and ``min_cache_page_bytes``; they are configured automatically
+ based on the device and don't change model accuracy. However, ``max_quantile_batches``
+ can be useful if :py:class:`~xgboost.ExtMemQuantileDMatrix` runs out of device memory
+ during construction; see :py:class:`~xgboost.QuantileDMatrix` for more information.

In addition to the batch-based data fetching, the GPU version supports concatenating
batches into a single blob for the training data to improve performance. For GPUs
@@ -181,7 +189,7 @@ concatenation can be enabled by:
param = {
"device": "cuda",
"extmem_concat_pages": true,
"extmem_single_page": true,
'subsample': 0.2,
'sampling_method': 'gradient_based',
}
@@ -200,7 +208,7 @@
interconnect between the CPU and the GPU. With the host memory serving as the data cache,
XGBoost can retrieve data with significantly lower overhead. When the input data is dense,
there's minimal to no performance loss for training, except for the initial construction
of the :py:class:`~xgboost.ExtMemQuantileDMatrix`. The initial construction iterates
- through the input data twice, as a result, the most significantly overhead compared to
+ through the input data twice; as a result, the most significant overhead compared to
in-core training is one additional data read when the data is dense. Please note that
there are multiple variants of the platform and they come with different C2C
bandwidths. During initial development of the feature, we used the LPDDR5 480G version,
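To see how the pieces above fit together, here is a minimal end-to-end sketch: an iterator staging its cache on the host, the RMM asynchronous memory resource, and the renamed ``extmem_single_page`` option. The synthetic batches, the batch shapes, and the allocator import path are illustrative assumptions following the development-branch docs above, not part of this commit:

import cupy as cp
import rmm
import xgboost as xgb
from rmm.allocators.cupy import rmm_cupy_allocator

# Asynchronous memory resource, as recommended in the paragraph above.
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
cp.cuda.set_allocator(rmm_cupy_allocator)
xgb.set_config(use_rmm=True)

class Batches(xgb.DataIter):
    """Yields pre-made GPU batches; a stand-in for a real out-of-core source."""

    def __init__(self, batches):
        self._batches = batches
        self._it = 0
        # Stage the cache in CPU memory (the default); on_host=False spills to disk.
        super().__init__(on_host=True)

    def next(self, input_data):
        if self._it == len(self._batches):
            return False
        X, y = self._batches[self._it]
        input_data(data=X, label=y)
        self._it += 1
        return True

    def reset(self):
        self._it = 0

batches = [(cp.random.rand(4096, 16), cp.random.rand(4096)) for _ in range(4)]
Xy = xgb.ExtMemQuantileDMatrix(Batches(batches), max_quantile_batches=32)
booster = xgb.train(
    {
        "device": "cuda",
        "extmem_single_page": True,
        "subsample": 0.2,
        "sampling_method": "gradient_based",
    },
    Xy,
    num_boost_round=8,
)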
49 changes: 27 additions & 22 deletions include/xgboost/c_api.h
@@ -308,35 +308,40 @@ XGB_DLL int XGDMatrixCreateFromCudaArrayInterface
* used by JVM packages. It uses `XGBoostBatchCSR` to accept batches for CSR formatted
* input, and concatenate them into 1 final big CSR. The related functions are:
*
- * - \ref XGBCallbackSetData
- * - \ref XGBCallbackDataIterNext
- * - \ref XGDMatrixCreateFromDataIter
+ * - @ref XGBCallbackSetData
+ * - @ref XGBCallbackDataIterNext
+ * - @ref XGDMatrixCreateFromDataIter
*
- * Another set is used by external data iterator. It accept foreign data iterators as
+ * Another set is used by the external data iterator. It accepts foreign data iterators as
* callbacks. There are two different scenarios where users might want to pass in callbacks
- * instead of raw data. First it's the Quantile DMatrix used by hist and GPU Hist. For
- * this case, the data is first compressed by quantile sketching then merged. This is
- * particular useful for distributed setting as it eliminates 2 copies of data. 1 by a
- * `concat` from external library to make the data into a blob for normal DMatrix
- * initialization, another by the internal CSR copy of DMatrix. The second use case is
- * external memory support where users can pass a custom data iterator into XGBoost for
- * loading data in batches. There are short notes on each of the use cases in respected
- * DMatrix factory function.
+ * instead of raw data. First it's the Quantile DMatrix used by the hist and GPU-based
+ * hist tree method. For this case, the data is first compressed by quantile sketching
+ * then merged. This is particularly useful for the distributed setting as it eliminates
+ * two copies of data: the first by a `concat` from an external library to make the data
+ * into a blob for normal DMatrix initialization, the second by the internal CSR copy of
+ * the DMatrix.
+ *
+ * The second use case is external memory support where users can pass a custom data
+ * iterator into XGBoost for loading data in batches. For both cases, the iterator is only
+ * used during the construction of the DMatrix and can be safely freed after construction
+ * finishes. There are short notes on each of the use cases in the respective DMatrix
+ * factory functions.
*
* Related functions are:
*
* # Factory functions
- * - \ref XGDMatrixCreateFromCallback for external memory
- * - \ref XGQuantileDMatrixCreateFromCallback for quantile DMatrix
+ * - @ref XGDMatrixCreateFromCallback for external memory
+ * - @ref XGQuantileDMatrixCreateFromCallback for quantile DMatrix
+ * - @ref XGExtMemQuantileDMatrixCreateFromCallback for external memory Quantile DMatrix
*
* # Proxy that callers can use to pass data to XGBoost
- * - \ref XGProxyDMatrixCreate
- * - \ref XGDMatrixCallbackNext
- * - \ref DataIterResetCallback
- * - \ref XGProxyDMatrixSetDataCudaArrayInterface
- * - \ref XGProxyDMatrixSetDataCudaColumnar
- * - \ref XGProxyDMatrixSetDataDense
- * - \ref XGProxyDMatrixSetDataCSR
+ * - @ref XGProxyDMatrixCreate
+ * - @ref XGDMatrixCallbackNext
+ * - @ref DataIterResetCallback
+ * - @ref XGProxyDMatrixSetDataCudaArrayInterface
+ * - @ref XGProxyDMatrixSetDataCudaColumnar
+ * - @ref XGProxyDMatrixSetDataDense
+ * - @ref XGProxyDMatrixSetDataCSR
* - ... (data setters)
*
* @{
@@ -515,7 +520,7 @@ XGB_DLL int XGQuantileDMatrixCreateFromCallback
*
* @since 3.0.0
*
- * @note This is still under development, not ready for test yet.
+ * @note This is experimental and subject to change.
*
* @param iter A handle to external data iterator.
* @param proxy A DMatrix proxy handle created by @ref XGProxyDMatrixCreate.
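The callback protocol in this comment is easiest to see through the Python wrapper, where ``DataIter.reset`` and ``DataIter.next`` correspond to @ref DataIterResetCallback and @ref XGDMatrixCallbackNext. A minimal sketch with synthetic NumPy batches (the batch count and shapes are illustrative):

import numpy as np
import xgboost as xgb

class NumpyBatches(xgb.DataIter):
    """Each next() call forwards one batch to the proxy DMatrix setters."""

    def __init__(self, n_batches):
        self._n = n_batches
        self._i = 0
        super().__init__()

    def next(self, input_data):
        if self._i == self._n:
            return False  # the end-of-iteration signal of XGDMatrixCallbackNext
        rng = np.random.default_rng(self._i)
        # Routed through a proxy setter such as XGProxyDMatrixSetDataDense.
        input_data(data=rng.random((256, 8)), label=rng.random(256))
        self._i += 1
        return True

    def reset(self):  # corresponds to DataIterResetCallback
        self._i = 0

# The iterator is only used while the QuantileDMatrix is constructed and can
# be dropped afterwards, matching the note above.
Xy = xgb.QuantileDMatrix(NumpyBatches(3))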
13 changes: 7 additions & 6 deletions python-package/xgboost/core.py
@@ -1574,12 +1574,13 @@ class QuantileDMatrix(DMatrix):
applied to the validation/test data
max_quantile_batches :
- For GPU-based inputs, XGBoost handles incoming batches with multiple growing
- substreams. This parameter sets the maximum number of batches before XGBoost can
- cut the sub-stream and create a new one. This can help bound the memory
- usage. By default, XGBoost grows new sub-streams exponentially until batches are
- exhausted. Only used for the training dataset and the default is None
- (unbounded).
+ For GPU-based inputs from an iterator, XGBoost handles incoming batches with
+ multiple growing sub-streams. This parameter sets the maximum number of batches
+ before XGBoost can cut the sub-stream and create a new one. This can help bound
+ the memory usage. By default, XGBoost grows new sub-streams exponentially until
+ batches are exhausted. Only used for the training dataset, and the default is
+ None (unbounded). Lastly, if `data` is a single batch instead of an iterator,
+ this parameter has no effect.
.. versionadded:: 3.0.0
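As a usage note for the docstring above, a short sketch with synthetic data: with a single in-memory batch the parameter is simply ignored, while an iterator input (as in the sketches earlier on this page) is where a bound caps each growing sub-stream:

import cupy as cp
import xgboost as xgb

X, y = cp.random.rand(2048, 16), cp.random.rand(2048)
# A single batch rather than an iterator: max_quantile_batches has no effect.
Xy = xgb.QuantileDMatrix(X, label=y)
# With an iterator input, xgb.QuantileDMatrix(it, max_quantile_batches=16)
# would bound the sub-stream growth instead.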
2 changes: 1 addition & 1 deletion src/common/error_msg.h
@@ -108,7 +108,7 @@ inline auto NoCategorical(std::string name) {

inline void NoPageConcat(bool concat_pages) {
if (concat_pages) {
LOG(FATAL) << "`extmem_concat_pages` must be false when there's no sampling or when it's "
LOG(FATAL) << "`extmem_single_page` must be false when there's no sampling or when it's "
"running on the CPU.";
}
}
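This guard is user-visible; a minimal sketch with synthetic data that is expected to raise an ``XGBoostError`` mentioning ``extmem_single_page``, matching the C++ and Python tests further down this page (no sampling is configured and training runs on the CPU):

import numpy as np
import xgboost as xgb

X, y = np.random.rand(128, 4), np.random.rand(128)
# extmem_single_page requires gradient-based sampling on a GPU; plain CPU
# training with no sampling trips the NoPageConcat check above.
xgb.train(
    {"extmem_single_page": True},
    xgb.DMatrix(X, label=y),
    num_boost_round=1,
)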
4 changes: 2 additions & 2 deletions src/tree/hist/param.h
@@ -23,7 +23,7 @@ struct HistMakerTrainParam : public XGBoostParameter<HistMakerTrainParam> {
constexpr static std::size_t CudaDefaultNodes() { return static_cast<std::size_t>(1) << 12; }

bool debug_synchronize{false};
- bool extmem_concat_pages{false};
+ bool extmem_single_page{false};

void CheckTreesSynchronized(Context const* ctx, RegTree const* local_tree) const;

@@ -43,7 +43,7 @@ struct HistMakerTrainParam : public XGBoostParameter<HistMakerTrainParam> {
.set_default(NotSet())
.set_lower_bound(1)
.describe("Maximum number of nodes in histogram cache.");
- DMLC_DECLARE_FIELD(extmem_concat_pages).set_default(false);
+ DMLC_DECLARE_FIELD(extmem_single_page).set_default(false);
}
};
} // namespace xgboost::tree
2 changes: 1 addition & 1 deletion src/tree/updater_approx.cc
@@ -278,7 +278,7 @@ class GlobalApproxUpdater : public TreeUpdater {
*sampled = linalg::Empty<GradientPair>(ctx_, gpair->Size(), 1);
auto in = gpair->HostView().Values();
std::copy(in.data(), in.data() + in.size(), sampled->HostView().Values().data());
- error::NoPageConcat(this->hist_param_.extmem_concat_pages);
+ error::NoPageConcat(this->hist_param_.extmem_single_page);
SampleGradient(ctx_, param, sampled->HostView());
}

2 changes: 1 addition & 1 deletion src/tree/updater_gpu_hist.cu
@@ -162,7 +162,7 @@ struct GPUHistMakerDevice {
interaction_constraints(param, static_cast<bst_feature_t>(info.num_col_)),
sampler{std::make_unique<GradientBasedSampler>(
ctx, info.num_row_, batch_param, param.subsample, param.sampling_method,
- batch_ptr_.size() > 2 && this->hist_param_->extmem_concat_pages)} {
+ batch_ptr_.size() > 2 && this->hist_param_->extmem_single_page)} {
if (!param.monotone_constraints.empty()) {
// Copy assigning an empty vector causes an exception in MSVC debug builds
monotone_constraints = param.monotone_constraints;
2 changes: 1 addition & 1 deletion src/tree/updater_quantile_hist.cc
@@ -539,7 +539,7 @@ class QuantileHistMaker : public TreeUpdater {
// Copy gradient into buffer for sampling. This converts C-order to F-order.
std::copy(linalg::cbegin(h_gpair), linalg::cend(h_gpair), linalg::begin(h_sample_out));
}
- error::NoPageConcat(this->hist_param_.extmem_concat_pages);
+ error::NoPageConcat(this->hist_param_.extmem_single_page);
SampleGradient(ctx_, *param, h_sample_out);
auto *h_out_position = &out_position[tree_it - trees.begin()];
if ((*tree_it)->IsMultiTarget()) {
2 changes: 1 addition & 1 deletion tests/cpp/tree/gpu_hist/test_gradient_based_sampler.cu
@@ -85,7 +85,7 @@ TEST(GradientBasedSampler, NoSamplingExternalMemory) {
[&] {
GradientBasedSampler sampler(&ctx, kRows, param, kSubsample, TrainParam::kUniform, true);
},
GMockThrow("extmem_concat_pages"));
GMockThrow("extmem_single_page"));
}

TEST(GradientBasedSampler, UniformSampling) {
18 changes: 9 additions & 9 deletions tests/cpp/tree/test_gpu_hist.cu
@@ -39,7 +39,7 @@ void UpdateTree
ObjInfo task{ObjInfo::kRegression};
std::unique_ptr<TreeUpdater> hist_maker{TreeUpdater::Create("grow_gpu_hist", ctx, &task)};
if (subsample < 1.0) {
hist_maker->Configure(Args{{"extmem_concat_pages", std::to_string(concat_pages)}});
hist_maker->Configure(Args{{"extmem_single_page", std::to_string(concat_pages)}});
} else {
hist_maker->Configure(Args{});
}
@@ -240,31 +240,31 @@ TEST(GpuHist, PageConcatConfig) {

auto learner = std::unique_ptr<Learner>(Learner::Create({p_fmat}));
learner->SetParam("device", ctx.DeviceName());
learner->SetParam("extmem_concat_pages", "true");
learner->SetParam("extmem_single_page", "true");
learner->SetParam("subsample", "0.8");
learner->Configure();

learner->UpdateOneIter(0, p_fmat);
learner->SetParam("extmem_concat_pages", "false");
learner->SetParam("extmem_single_page", "false");
learner->Configure();
// GPU Hist rebuilds the updater after configuration. Training continues
learner->UpdateOneIter(1, p_fmat);

learner->SetParam("extmem_concat_pages", "true");
learner->SetParam("extmem_single_page", "true");
learner->SetParam("subsample", "1.0");
ASSERT_THAT([&] { learner->UpdateOneIter(2, p_fmat); }, GMockThrow("extmem_concat_pages"));
ASSERT_THAT([&] { learner->UpdateOneIter(2, p_fmat); }, GMockThrow("extmem_single_page"));

// Throws error on CPU.
{
auto learner = std::unique_ptr<Learner>(Learner::Create({p_fmat}));
learner->SetParam("extmem_concat_pages", "true");
ASSERT_THAT([&] { learner->UpdateOneIter(0, p_fmat); }, GMockThrow("extmem_concat_pages"));
learner->SetParam("extmem_single_page", "true");
ASSERT_THAT([&] { learner->UpdateOneIter(0, p_fmat); }, GMockThrow("extmem_single_page"));
}
{
auto learner = std::unique_ptr<Learner>(Learner::Create({p_fmat}));
learner->SetParam("extmem_concat_pages", "true");
learner->SetParam("extmem_single_page", "true");
learner->SetParam("tree_method", "approx");
ASSERT_THAT([&] { learner->UpdateOneIter(0, p_fmat); }, GMockThrow("extmem_concat_pages"));
ASSERT_THAT([&] { learner->UpdateOneIter(0, p_fmat); }, GMockThrow("extmem_single_page"));
}
}

2 changes: 1 addition & 1 deletion tests/python-gpu/test_gpu_data_iterator.py
@@ -115,7 +115,7 @@ def test_concat_pages_invalid() -> None:
"device": "cuda",
"subsample": 0.5,
"sampling_method": "gradient_based",
"extmem_concat_pages": True,
"extmem_single_page": True,
"objective": "reg:absoluteerror",
},
Xy,
