# Training framework basics
The core of this package is an implementation of the data input into keras. It is assumed that the input data is available as one or multiple ROOT files per physics data set. Each ROOT file contains a TTree, the branches of which can be used as input features for training. Each entry in the TTree (typically corresponding to one collision event in the context of a HEP analysis) is treated as an independent training sample. The main feature of the implementation is that it natively handles the case of multiple input physics data sets.
The most basic conceptual unit is a data set. A data set represents events from a single physics production process with a defined cross section. Each data set is represented by a `vbfml.input.sequences.DatasetInfo` object. These objects are typically generated automatically using factory functions such as `vbfml.training.input.load_datasets_bucoffea`, which creates data set objects based on a folder full of appropriately named ROOT files.
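A minimal sketch of how this might look, assuming the function takes the directory containing the ROOT files and returns the corresponding `DatasetInfo` objects (the exact signature and return type may differ):

```python
from vbfml.training.input import load_datasets_bucoffea

# Build one DatasetInfo per appropriately named ROOT file in the directory.
# The path below is a placeholder; point it at your own input files.
datasets = load_datasets_bucoffea("/path/to/root/files")
```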
Of course, in reality, a physics data set by itself will not necessarily correspond to a specific class of events in training. For example, we might have data sets `Z`, `W` and `signal`, but in the training we want to merge `Z` and `W` into a common class (`background`). For this purpose, each data set can be assigned a `label` property. By default, the label is equal to the name of the data set (`Z` and `W` in this case). To merge the samples, we simply set the label property of both data sets to a new, identical value, e.g.:

```python
datasets['z'].label = 'background'
```
Operations like this one are made easier using `vbfml.training.util.select_and_label_datasets`, which supports regular-expression-based matching of data set names.
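One possible usage, assuming the function takes the full list of data sets and a mapping from new labels to name-matching regular expressions (the exact signature may differ):

```python
from vbfml.training.util import select_and_label_datasets

# Hypothetical call: keep only data sets whose names match one of the
# regular expressions, and relabel them with the corresponding key.
datasets = select_and_label_datasets(
    all_datasets,
    {"background": "(Z|W)", "signal": "signal"},
)
```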
Each data set can be assigned a numerical `weight` attribute. This weight will be propagated to the per-event weights discussed below.
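For instance, to scale up the contribution of one data set (a minimal sketch, reusing the `datasets` mapping from above with an illustrative weight value):

```python
# Scale all events from the 'z' data set by a constant factor (illustrative).
datasets['z'].weight = 2.0
```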
Data is read from disk with the `vbfml.input.sequences.MultiDatasetSequence` class. It inherits from `keras.utils.Sequence`, which defines the basic interface.
Here's an example that showcases the basic idea behind a `Sequence`:

```python
mds = MultiDatasetSequence(...)
number_of_training_batches = len(mds)

# First training batch
features, labels, weights = mds[0]

# Iteration is also possible:
for features, labels, weights in mds:
    do_something()
```
In all cases, `features`, `labels` and `weights` are `numpy.ndarray`s. In each row (first index), they contain values referring to a separate entry in the original trees. In its columns, the `features` array has the values of the different training features (see below) for a given event, `labels` has a one-hot representation of the class the event belongs to, and `weights` has a single numerical weight that represents the importance of the event.
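For concreteness, the shapes would look roughly as follows for a sequence with two input branches, a batch size of 50 and two classes (illustrative values; batch lengths may vary slightly, as noted below):

```python
features, labels, weights = mds[0]

# Illustrative shapes for branches=["pt", "eta"], batch_size=50, two classes:
# features.shape -> (50, 2)   one row per event, one column per input feature
# labels.shape   -> (50, 2)   one-hot encoded class membership per event
# weights.shape  -> (50,)     one weight per event
```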
The main advantage of this interface is that it can directly be understood by the fitting method in keras:
```python
model = some_keras_model()
model.fit(x=mds)
```
Keras will then take care of looping over and unpacking the training batches.
The implementation of `MultiDatasetSequence` takes care of tailoring the interface to the specific use case here. When initializing the sequence, we can define which branches we want to read from the input files (e.g. "pt" and "eta"), the batch size (how many events are included in each batch), and whether to shuffle the events when generating batches. To avoid frequent disk reading operations, which are slow, batches of events are not read individually from the input files. Instead, an internal buffering method is used in which multiple batches are read at once and placed in an in-memory buffer. The size of this buffer can be controlled with the `batch_buffer_size` parameter, which is specified in units of batches (e.g. in the example below, the buffer would hold 100k batches, which means 50 * 1e5 = 5 million events for a batch size of 50 events). Finally, sample weights can be specified by passing a `weight_expression`, which is any mathematical expression understood by uproot. Branch names in the input tree can be used as variables in such an expression.
Here's an example instantiation:
```python
sequence = MultiDatasetSequence(
    branches=["pt", "eta"],
    batch_size=50,
    shuffle=True,
    batch_buffer_size=int(1e5),
    weight_expression="weight_total*xs/sumw",
)
```
Note that we have not yet specified what data to read. We do this by adding `DatasetInfo` objects to the sequence:

```python
sequence.add_dataset(DatasetInfo(...))
```
When adding multiple data sets to a single sequence, the resulting batches will consist of a mix of the data sets. The code attempts to make each data set's contribution to a batch match, as closely as possible, the ratio of that data set's total event count to the total event count across all data sets (e.g. if you have data set A with 10k events and data set B with 90k events, each batch will contain ~10% of events from data set A). Note that in order to make this easier, batches do not always have exactly the same length.
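For example (with hypothetical `DatasetInfo` objects `dataset_a` and `dataset_b`; the event counts are illustrative):

```python
# Data set A: 10k events -> contributes ~10% of each batch.
sequence.add_dataset(dataset_a)
# Data set B: 90k events -> contributes ~90% of each batch.
sequence.add_dataset(dataset_b)
```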
Again, you can instantiate the sequence manually or use a factory function, such as `vbfml.training.util.build_sequence`.
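The factory route might look roughly like this (the keyword arguments are assumptions; consult the function's signature for the actual interface):

```python
from vbfml.training.util import build_sequence

# Hypothetical call: build a ready-to-use MultiDatasetSequence from a list
# of DatasetInfo objects and the branches to use as training features.
sequence = build_sequence(datasets=datasets, features=["pt", "eta"])
```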