
Training framework basics

Andreas Albert edited this page Oct 28, 2021 · 1 revision

The core of this package is an implementation of the data input into keras. It is assumed that the input data is available as one or multiple ROOT files per physics data set. Each ROOT file contains a TTree, the branches of which can be used as input features for training. Each entry in the TTree (typically corresponding to one collision event in the context of HEP analysis), is understood as an independent sample for training. The main feature of the implementation is that it natively handles the case of multiple input physics data sets.

Data sets

The most basic conceptual unit is a data set. A data set represents events from a single physics production process with a defined cross section. Each data set is represented by a vbfml.input.sequences.DatasetInfo object. These objects are typically generated automatically using factory functions such as vbfml.training.input.load_datasets_bucoffea, which creates data set objects based on a folder of appropriately named ROOT files.
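The exact constructor of DatasetInfo is not documented here, so the following stand-in, with assumed field names, only illustrates the role such an object plays:

```python
from dataclasses import dataclass

# Hypothetical stand-in for vbfml.input.sequences.DatasetInfo;
# the real class may use different field names and defaults.
@dataclass
class DatasetInfo:
    name: str             # physics data set name, e.g. "z" or "w"
    files: list           # ROOT files belonging to this data set
    n_events: int         # total number of events available
    treename: str = "tree"
    label: str = ""       # training class label, defaults to the name
    weight: float = 1.0   # per-data-set weight (see below)

    def __post_init__(self):
        # By default, the label is equal to the data set name
        if not self.label:
            self.label = self.name

z = DatasetInfo(name="z", files=["z_1.root"], n_events=10_000)
print(z.label)  # "z"
```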

From data sets to training class

Of course, in reality, a physics data set by itself will not necessarily correspond to a specific class of events in training. For example, we might have data sets Z, W and signal, but in training we want to merge Z and W into a common class (background). For this purpose, each data set can be assigned a label property. By default, the label is equal to the name of the data set (Z and W in this case). To merge the samples, we simply set the label property of both data sets to a new, identical value, e.g.:

datasets['z'].label = 'background'
datasets['w'].label = 'background'

Operations like this one are made easier using vbfml.training.util.select_and_label_datasets, which supports regular expression based matching of data set names.
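The signature of select_and_label_datasets is not reproduced here; the following is a hypothetical sketch of the regex-based selection and relabeling it provides, using plain objects with a label attribute:

```python
import re
from types import SimpleNamespace

def select_and_label(datasets, pattern_to_label):
    """Keep data sets whose name fully matches one of the regular
    expressions, assigning the corresponding label. Illustrative
    sketch only; not the actual vbfml implementation."""
    selected = {}
    for name, ds in datasets.items():
        for pattern, label in pattern_to_label.items():
            if re.fullmatch(pattern, name):
                ds.label = label
                selected[name] = ds
                break
    return selected

datasets = {n: SimpleNamespace(name=n, label=n) for n in ["z", "w", "signal"]}
selected = select_and_label(datasets, {"z|w": "background", "signal": "signal"})
print(selected["z"].label)  # "background"
```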

Data set weights

Each data set can be assigned a numerical weight attribute. This weight will be propagated to the per-event weights discussed below.
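As a sketch of that propagation (the actual vbfml mechanics may differ), the data set weight simply scales every per-event weight in a batch:

```python
import numpy as np

# Illustrative only: a data set weight of 0.5 scales each event weight
# obtained from the weight expression (see "Reading input data" below).
dataset_weight = 0.5
per_event_weights = np.array([1.0, 2.0, 1.5])
effective_weights = dataset_weight * per_event_weights  # 0.5, 1.0, 0.75
```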

Reading input data

Data is read from disk with the vbfml.input.sequences.MultiDatasetSequence class. It inherits from keras.utils.Sequence, which defines the basic interface.

Sequence interface

Here's an example that showcases the basic idea behind a Sequence:

mds = MultiDatasetSequence(...)
number_of_training_batches = len(mds)

# First training batch
features, labels, weights = mds[0]

# Iteration also possible:
for features, labels, weights in mds:
   do_something()

In all cases, features, labels and weights are numpy.ndarrays. Each row (first index) corresponds to a separate entry in the original trees. Along its columns, the features array holds the values of the different training features (see below) for a given event, labels holds a one-hot representation of the class the event belongs to, and weights holds a single numerical weight representing the importance of the event.
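A minimal numpy illustration of these shapes (all sizes here are made up):

```python
import numpy as np

batch_size, n_features, n_classes = 4, 2, 2   # illustrative sizes

features = np.random.rand(batch_size, n_features)  # one row per event
labels = np.eye(n_classes)[[0, 1, 1, 0]]           # one-hot class labels
weights = np.ones(batch_size)                      # one weight per event

assert features.shape == (batch_size, n_features)
assert labels.shape == (batch_size, n_classes)
assert np.all(labels.sum(axis=1) == 1)  # exactly one hot entry per row
```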

The main advantage of this interface is that it can directly be understood by the fitting method in keras:

model = some_keras_model()
model.fit(x=mds)

Keras will then take care of looping over and unpacking the training batches.
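Conceptually (this is not the actual keras internals), one epoch of that loop indexes the sequence batch by batch and unpacks each triple:

```python
# Rough sketch of how fit() consumes a Sequence during one epoch.
def one_epoch(sequence, train_step):
    for i in range(len(sequence)):
        features, labels, weights = sequence[i]
        train_step(features, labels, sample_weight=weights)
```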

MultiDatasetSequence

The implementation of MultiDatasetSequence takes care of tailoring the interface to this specific use case. When instantiating the sequence, we can define which branches to read from the input files (e.g. "pt" and "eta"), the batch size (how many events are included in each batch), and whether to shuffle the events when generating batches. To avoid frequent disk reads, which are slow, batches of events are not read individually from the input files. Instead, an internal buffering method reads multiple batches at once and places them in an in-memory buffer. The size of this buffer is controlled with the batch_buffer_size parameter, specified in units of batches (e.g. in the example below, the buffer would hold 100k batches, which means 50 * 1e5 = 5 million events for a batch size of 50 events). Finally, sample weights can be specified by passing a weight_expression, which is any mathematical expression understood by uproot. Branch names in the input tree can be used as variables in such an expression.

Here's an example instantiation:

sequence = MultiDatasetSequence(
        branches=["pt","eta"],
        batch_size=50,
        shuffle=True,
        batch_buffer_size=int(1e5),
        weight_expression="weight_total*xs/sumw",
    )
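The buffering behaviour described above can be sketched as follows; BufferedReader and read_fn are hypothetical names, not part of vbfml:

```python
class BufferedReader:
    """Illustrative sketch of batch buffering: read many batches' worth
    of events from disk in one go, then serve batches from memory."""

    def __init__(self, read_fn, batch_size, batch_buffer_size):
        self.read_fn = read_fn  # callable reading n events from disk
        self.batch_size = batch_size
        self.buffer_events = batch_size * batch_buffer_size
        self.buffer = []
        self.disk_reads = 0     # counts actual disk accesses

    def next_batch(self):
        # Refill the buffer only when it cannot serve a full batch
        if len(self.buffer) < self.batch_size:
            self.buffer.extend(self.read_fn(self.buffer_events))
            self.disk_reads += 1
        batch = self.buffer[:self.batch_size]
        self.buffer = self.buffer[self.batch_size:]
        return batch
```

With batch_size=50 and batch_buffer_size=4, four consecutive batches are served from a single disk read; only the fifth triggers another one.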

Note that we have not specified what data to read, yet. We do this by adding DatasetInfo objects to the sequence:

sequence.add_dataset(DatasetInfo(...))

When adding multiple data sets to a single sequence, the resulting batches will consist of a mix of the data sets. The code attempts to make each data set's contribution to a batch match, as closely as possible, the ratio of that data set's total event count to the total event count across all data sets (e.g. if you have data set A with 10k events and data set B with 90k events, each batch will contain ~10% events from data set A). Note that in order to make this easier, batches do not always have exactly the same length.
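The proportional mixing can be sketched as follows (batch_composition is a hypothetical helper, not part of vbfml):

```python
def batch_composition(dataset_sizes, batch_size):
    """Number of events each data set contributes to one batch,
    proportional to its share of the total event count. Rounding is
    why batch lengths can vary slightly. Illustrative sketch only."""
    total = sum(dataset_sizes.values())
    return {name: round(batch_size * n / total)
            for name, n in dataset_sizes.items()}

print(batch_composition({"A": 10_000, "B": 90_000}, 50))
# {'A': 5, 'B': 45} — ~10% of each batch comes from data set A
```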

As with data sets, you can instantiate the sequence manually or use a factory function, such as vbfml.training.util.build_sequence.