API reference

class freeforestml.Cut(func=None, label=None, **columns)[source]

Representation of an analysis cut. The class can be used to apply event selections based on conditions on columns in a pandas dataframe or derived quantities.

Cuts store the condition to be applied to a dataframe. New cut objects accept all events by default. The selection can be limited by passing a lambda to the constructor.

>>> sel_all = Cut()
>>> sel_pos = Cut(lambda df: df.value > 0)

The cut object lives independently of the dataframe. Calling the cut with a dataframe returns a new dataframe containing only rows which pass the selection criteria.

>>> df = pd.DataFrame([0, 1, -2, -3, 4], columns=["value"])
>>> sel_all(df)
   value
0      0
1      1
2     -2
3     -3
4      4
>>> sel_pos(df)
   value
1      1
4      4

The index array for a given data set is calculated by calling the idx_array() method with a dataframe.

>>> sel_pos.idx_array(df)
0    False
1     True
2    False
3    False
4     True
Name: value, dtype: bool

Cuts can be used to build logical expressions using the bitwise operators and (&), or (|), xor (^) and not (~).

>>> sel_even = Cut(lambda df: df.value % 2 == 0)
>>> sel_pos_even = sel_pos & sel_even
>>> sel_pos_even(df)
   value
4      4

Equivalently, cuts support logical operations directly using lambdas.

>>> sel_pos_even_lambda = sel_pos & (lambda df: df.value % 2 == 0)
>>> sel_pos_even_lambda(df)
   value
4      4

Cuts can be named by passing the ‘label’ argument to the constructor. Cut labels can be used during plotting to specify the plotted region.

>>> sel_sr = Cut(lambda df: df.is_sr == 1, label="Signal Region")
>>> sel_sr.label
'Signal Region'

If the application of a cut requires changing the event weights by a so-called scale factor, you can pass additional optional keyword arguments that specify how the new weight should be computed.

>>> sel_sample = Cut(lambda df: df.value % 2 == 0,
...                  weight=lambda df: df.weight * 2)

The argument name ‘weight’ in this example is arbitrary. It is even possible to add new columns to the returned dataframe in this way, however, this is not recommended.
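The effect of such a weighted cut can be sketched with plain pandas (a hypothetical illustration of what the selection and weight lambdas compute, not the class's internal implementation):

```python
import pandas as pd

df = pd.DataFrame({"value": [0, 1, 2, 3], "weight": [1.0, 1.0, 1.0, 1.0]})

# Apply the selection lambda, then recompute the 'weight' column
# with the weight lambda on the surviving rows.
passing = df[df.value % 2 == 0].copy()
passing["weight"] = passing["weight"] * 2

print(passing["weight"].tolist())  # [2.0, 2.0]
```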

__and__(other)[source]

Returns a new cut implementing the logical AND of this cut and the other cut. The other cut can be a Cut object or any callable.

__call__(dataframe)[source]

Applies the internally stored cut to the given dataframe and returns a new dataframe containing only entries passing the event selection.

__init__(func=None, label=None, **columns)[source]

Creates a new cut. The optional func argument is called with the dataframe upon evaluation. The function must return an index array. If the optional function is omitted, every row in the dataframe is accepted by this cut.

__invert__()[source]

Returns a new cut implementing the logical NOT of this cut.

__or__(other)[source]

Returns a new cut implementing the logical OR of this cut and the other cut. The other cut can be a Cut object or any callable.

__xor__(other)[source]

Returns a new cut implementing the logical XOR of this cut and the other cut. The other cut can be a Cut object or any callable.

idx_array(dataframe)[source]

Applies the internally stored cut to the given dataframe and returns an index array specifying which events pass the event selection.

class freeforestml.Process(label, selection=None, range=None, range_var=None)[source]

This class represents a physics process to be selected during training and plotting. The class stores the cuts to select the process’ events from a dataframe, as well as its style and human-readable name for plotting.

__call__(dataframe)[source]

Returns a dataframe containing only the events of this process.

__init__(label, selection=None, range=None, range_var=None)[source]

Returns a new process object. The process has a human-readable name (potentially using latex) and a selection cut. The selection argument can be a cut object or any callable. Stacking of processes is handled by the plotting method.

>>> Process("Top", lambda d: d.is_top)
<Process 'Top': (func)>
>>> Process("VBF", lambda d: d.is_VBFH)
<Process 'VBF': (func)>

The optional argument range accepts a two-value tuple and is a shortcut to define a selection cut accepting events whose ‘range_var’ lies between the given values, boundaries included. The range_var can be a string naming a column in the dataframe or a Variable object.

>>> Process("Z\\rightarrow\\ell\\ell", range=(-599, -500))
<Process 'Z\\rightarrow\\ell\\ell': [-599, -500]>

If the range_var argument is omitted, the value of Process.DEFAULT_RANGE_VAR is used, which defaults to ‘fpid’.
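The range shortcut amounts to an inclusive between-selection on the range variable. A minimal pandas sketch of the idea (using a toy ‘fpid’ column, not the package's internal code):

```python
import pandas as pd

df = pd.DataFrame({"fpid": [-650, -599, -550, -500, -499]})

# Inclusive boundaries, as described above
selected = df[df.fpid.between(-599, -500)]

print(selected.fpid.tolist())  # [-599, -550, -500]
```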

A process behaves like a cut in many ways. For example, its __call__() and idx_array() methods behave identically to those of a Cut.

__repr__()[source]

Returns a string representation of the process.

idx_array(dataframe)[source]

Returns the index array of the given dataframe which selects all events of this process.

class freeforestml.Variable(name, definition, unit=None, blinding=None)[source]

Representation of a quantity derived from the columns of a dataframe. The variable can also directly represent a column of the dataframe.

The variable object defines a human-readable name for the variable and its physical unit. The name and the unit are used for plotting and labeling of axes.

>>> Variable("MMC", "ditau_mmc_mlm_m", "GeV")
<Variable 'MMC' [GeV]>

__call__(dataframe)[source]

Returns an array or series of the variable computed from the given dataframe. This method does not apply the blinding!

__eq__(other)[source]

Compare if two variables are the same.

__hash__ = None

__init__(name, definition, unit=None, blinding=None)[source]

Returns a new variable object. The first argument is a human-readable name (potentially using latex). The second argument defines the value of the variable. This can be a string naming the column of the dataframe or a callable that computes the value when a dataframe is passed to it.

>>> Variable("MMC", "ditau_mmc_mlm_m", "GeV")
<Variable 'MMC' [GeV]>
>>> Variable("$\\Delta \\eta$", lambda df: df.jet_0_eta - df.jet_1_eta)
<Variable '$\\Delta \\eta$'>

The optional argument unit defines the unit of the variable. This information is used for plotting, especially for labeling axes.

The optional blinding argument accepts a blinding object implementing the blinding strategy.

__repr__()[source]

Returns a string representation.

classmethod load_from_h5(path, key)[source]

Create a new Variable instance from an hdf5 file. ‘path’ is the file path and ‘key’ is the path inside the hdf5 file.

save_to_h5(path, key, overwrite=False)[source]

Save the variable definition to an hdf5 file. ‘path’ is the file path and ‘key’ is the path inside the hdf5 file. If overwrite is True, already existing file contents are overwritten.

class freeforestml.variable.BlindingStrategy[source]

The BlindingStrategy class represents a blinding strategy. This is an abstract base class. Sub-classes must implement the __call__ method.

abstract __call__(dataframe, variable, bins, range=None)[source]

Returns the additional selection in order to blind a process. The first argument is the dataframe to operate on. The second argument is the variable whose histogram should be blinded. The arguments bins and range are identical to the ones for the hist method. They might be used in sub-classes to align the blinding cuts to bin borders.

class freeforestml.RangeBlindingStrategy(start, end)[source]

Concrete blinding strategy which removes all events within a given x-axis range. The range might be extended to match the bin borders.

__call__(variable, bins, range=None)[source]

See base class. Returns the additional selection.

__init__(start, end)[source]

Returns a new RangeBlindingStrategy object. When the object is called, it returns a selection removing all events that lie between start and end. The range might be extended to match bin borders.
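Extending a blinding window to bin borders can be sketched with numpy. The helper name below is hypothetical and only illustrates the idea, not the package's actual code:

```python
import numpy as np

def snap_to_bin_edges(start, end, bins, hist_range):
    """Widen (start, end) so both ends coincide with bin borders."""
    edges = np.linspace(hist_range[0], hist_range[1], bins + 1)
    # Largest edge at or below 'start', smallest edge at or above 'end'
    new_start = edges[edges <= start].max()
    new_end = edges[edges >= end].min()
    return new_start, new_end

# 10 bins of width 10 on [0, 100]; the window [33, 57] widens to [30, 60]
print(snap_to_bin_edges(33, 57, 10, (0, 100)))  # (30.0, 60.0)
```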

class freeforestml.CrossValidator(k, mod_var=None, frac_var=None)[source]

Abstract class of a cross validation method.

__eq__(other)[source]

Compare if two cross validators are the same.

__hash__ = None

__init__(k, mod_var=None, frac_var=None)[source]

Creates a new cross validator. The argument k determines the number of folds. The mod_var argument specifies a variable whose value modulo k defines the set. The frac_var argument specifies a variable whose fractional part defines the set. Only one of the two can be used. Both options can be either a string naming a column in the dataframe or a Variable object.
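The two fold-assignment options can be sketched with numpy (a hypothetical illustration mirroring the modulo and fractional-part descriptions above, not the class's internal code):

```python
import numpy as np

k = 5

# mod_var: an integer column (e.g. an event number); its value mod k picks the slice
event_number = np.array([0, 1, 7, 12, 23])
mod_slice = event_number % k
print(mod_slice)  # [0 1 2 2 3]

# frac_var: a float column in [0, 1); its fractional part, scaled by k, picks the slice
random_value = np.array([0.05, 0.25, 0.43, 0.61, 0.99])
frac_slice = np.floor((random_value % 1.0) * k).astype(int)
print(frac_slice)  # [0 1 2 3 4]
```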

classmethod load_from_h5(path, key)[source]

Create a new cross validator instance from an hdf5 file. ‘path’ is the file path and ‘key’ is the path inside the hdf5 file.

retrieve_fold_info(df, cv)[source]

Returns an array of integers specifying which events were used for training/validation/testing in each fold.

save_to_h5(path, key, overwrite=False)[source]

Save the cross validator definition to an hdf5 file. ‘path’ is the file path and ‘key’ is the path inside the hdf5 file. If overwrite is True, already existing file contents are overwritten.

select_cv_set(df, cv, fold_i)[source]

Returns the index array to select all events from the cross validator set specified with cv (‘train’, ‘val’, ‘test’) for the given fold.

abstract select_slice(df, slice_id)[source]

Returns the index array to select all events from the dataset of a given slice.

NB: This method is for internal usage only. There might be more than k slices.

abstract select_test(df, fold_i)[source]

Returns the index array to select all test events from the dataset for the given fold.

abstract select_training(df, fold_i)[source]

Returns the index array to select all training events from the dataset for the given fold.

abstract select_validation(df, fold_i)[source]

Returns the index array to select all validation events from the dataset for the given fold.

class freeforestml.ClassicalCV(k, mod_var=None, frac_var=None)[source]

Performs the k-fold cross validation on half of the data set. The other half is designated as the test set.

fold 0: | Tr | Tr | Tr | Tr | Va | Test |
fold 1: | Tr | Tr | Tr | Va | Tr | Test |
fold 2: | Tr | Tr | Va | Tr | Tr | Test |
fold 3: | Tr | Va | Tr | Tr | Tr | Test |
fold 4: | Va | Tr | Tr | Tr | Tr | Test |

Va=Validation, Tr=Training

select_slice(df, slice_id)[source]

Returns the index array to select all events from the dataset of a given slice.

NB: This method is for internal usage only. There might be more than k slices.

select_test(df, fold_i)[source]

Returns the index array to select all test events from the dataset for the given fold.

select_training(df, fold_i)[source]

Returns the index array to select all training events from the dataset for the given fold.

select_validation(df, fold_i)[source]

Returns the index array to select all validation events from the dataset for the given fold.

class freeforestml.MixedCV(k, mod_var=None, frac_var=None)[source]

Performs the k-fold cross validation where validation and test sets are both interleaved.

fold 0: | Tr | Tr | Tr | Te | Va |
fold 1: | Tr | Tr | Te | Va | Tr |
fold 2: | Tr | Te | Va | Tr | Tr |
fold 3: | Te | Va | Tr | Tr | Tr |
fold 4: | Va | Tr | Tr | Tr | Te |

Va=Validation, Tr=Training, Te=Test

select_slice(df, slice_id)[source]

Returns the index array to select all events from the dataset of a given slice.

NB: This method is for internal usage only. There might be more than k slices.

select_test(df, fold_i)[source]

Returns the index array to select all test events from the dataset for the given fold.

select_training(df, fold_i)[source]

Returns the index array to select all training events from the dataset for the given fold.

select_validation(df, fold_i)[source]

Returns the index array to select all validation events from the dataset for the given fold.

class freeforestml.Normalizer(df, input_list=None)[source]

Abstract normalizer which shifts and scales the distribution such that it has zero mean and unit width.

abstract __call__(df)[source]

Applies the normalization of the input columns to the given dataframe and returns a normalized copy.

abstract __eq__(other)[source]

Check if two normalizers are the same.

__hash__ = None

abstract __init__(df, input_list=None)[source]

Returns a normalizer object with the normalization moments stored internally. The input_list argument specifies which inputs should be normalized. All other columns are left untouched.

classmethod load_from_h5(path, key)[source]

Create a new normalizer instance from an hdf5 file. ‘path’ is the file path and ‘key’ is the path inside the hdf5 file.

abstract property offsets

Every normalizer must reduce to a simple (offset + scale * x) normalization to be used with lwtnn. This property returns the offset parameters for all variables.

save_to_h5(path, key, overwrite=False)[source]

Save the normalizer definition to an hdf5 file. ‘path’ is the file path and ‘key’ is the path inside the hdf5 file. If overwrite is True, already existing file contents are overwritten.

abstract property scales

Every normalizer must reduce to a simple (offset + scale * x) normalization to be used with lwtnn. This property returns the scale parameters for all variables.

class freeforestml.EstimatorNormalizer(df, input_list=None, center=None, width=None)[source]

Normalizer which uses estimators to compute the normalization moments. This method might lead to sub-optimal results if there are outliers.
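The moment-based normalization and its reduction to the (offset + scale * x) form can be sketched with pandas (a sketch of the idea, not the class's actual code):

```python
import pandas as pd

# Toy input column
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Moments estimated from the data
center = df["x"].mean()
width = df["x"].std()

# Normalized column: zero mean, unit width
normed = (df["x"] - center) / width

# Equivalent lwtnn-style parameters: normed == offset + scale * x
scale = 1.0 / width
offset = -center / width
```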

__call__(df)[source]

See base class.

__eq__(other)[source]

See base class.

__hash__ = None

__init__(df, input_list=None, center=None, width=None)[source]

See base class.

property offsets

Every normalizer must reduce to a simple (offset + scale * x) normalization to be used with lwtnn. This property returns the offset parameters for all variables.

property scales

Every normalizer must reduce to a simple (offset + scale * x) normalization to be used with lwtnn. This property returns the scale parameters for all variables.

class freeforestml.HepNet(keras_model, cross_validator, normalizer, input_list, output_list)[source]

Meta model of a concrete neural network around the underlying Keras model. The HEP net handles cross validation, normalization of the input variables, the input weights, and the actual Keras model. A HEP net has no free parameters.

__eq__(other)[source]

Check if two models have the same configuration.

__hash__ = None

__init__(keras_model, cross_validator, normalizer, input_list, output_list)[source]

Creates a new HEP model. The keras_model parameter must be a class that returns a new instance of the compiled model. (The HEP net needs to be able to create multiple models, one for each cross validation fold.)

The cross_validator must be a CrossValidator object.

The normalizer must be a Normalizer class that returns a normalizer. Each cross validation fold uses a separate normalizer with independent normalization weights.

The input and output lists are lists of variables or column names used as inputs and targets of the Keras model. The inputs are normalized.

export(path_base, command='converters/keras2json.py', expression={})[source]

Exports the network such that it can be converted to lwtnn’s json format. The method generates a set of files for each cross validation fold. For every fold, the architecture, the weights, the input variables and their normalization are exported. To simplify the conversion to lwtnn’s json format, the method also creates a bash script which converts all folds.

The path_base argument should be a path or a name of the network. The names of the generated files are created by appending to path_base.

The optional expression can be used to inject the CAF expression when the NN is used. The final json file will contain an entry KEY=VALUE if a variable matches the dict key.

fit(df, weight=None, **kwds)[source]

Calls fit() on all folds. All kwds are passed to fit().

classmethod load(path, **kwds)[source]

Restore a model from a hdf5 file.

predict(df, cv='val', retrieve_fold_info=False, **kwds)[source]

Calls predict() on the Keras model. The argument cv specifies the cross validation set to select: ‘train’, ‘val’, ‘test’. Default is ‘val’.

All other keywords are passed to predict.

save(path)[source]

Save the model and all associated components to a hdf5 file.

class freeforestml.plot.HistogramFactory(*args, **kwds)[source]

Shortcut to create multiple histograms with the same set of processes or in the same region.

__call__(*args, **kwds)[source]

Proxy for hist(). The positional arguments passed to hist() are the positional arguments given to the constructor concatenated with the positional arguments given to this method. The keyword arguments for hist() are the union of the keyword arguments passed to the constructor and to this method. The arguments passed to this method take precedence.

The method returns the return value of hist().

__init__(*args, **kwds)[source]

Accepts any number of positional and keyword arguments. The arguments are stored internally and used as default values for hist(). See __call__().

freeforestml.plot.confusion_matrix(df, x_processes, y_processes, x_label, y_label, weight=None, axes=None, figure=None, atlas=None, info=None, enlarge=1.3, normalize_rows=False, **kwds)[source]

Creates a confusion matrix.

freeforestml.plot.correlation_matrix(df, variables, weight=None, axes=None, figure=None, atlas=None, info=None, enlarge=1.3, normalize_rows=False, **kwds)[source]

Plot the Pearson correlation coefficient matrix. The square matrix is returned as a DataFrame.
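For the unweighted case, the underlying Pearson coefficients can be sketched with pandas’ built-in corr() (an illustration of the quantity being plotted, not this function's exact implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with 'a'
    "c": [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with 'a'
})

# Pearson correlation coefficients, returned as a square DataFrame
corr = df.corr()
```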

freeforestml.plot.hist(dataframe, variable, bins, stacks, selection=None, range=None, blind=None, figure_size=None, weight=None, y_log=False, y_min=None, vlines=[], denominator=0, numerator=-1, ratio_label=None, diff=False, ratio_range=None, atlas=None, info=None, enlarge=1.6, density=False, include_outside=False, return_uhepp=False, **kwds)[source]

Creates a histogram of stacked processes. The first argument is the dataframe to operate on. The ‘variable’ argument defines the x-axis. The variable argument can be a Variable object or a string naming a column in the dataframe.

The ‘bins’ argument can be an integer specifying the number of bins or a list with all bin boundaries. If it is an integer, the argument range is mandatory. The range argument must be a tuple with the lowest and highest bin edge. The properties of a Variable object are used for the x- and y-axis labels.
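The interplay of an integer bins argument with the mandatory range can be sketched with numpy (an illustration of standard equal-width binning, not this function's internal code):

```python
import numpy as np

# bins=5 with range=(0, 10) means 5 equal-width bins, i.e. 6 edges
bins, hist_range = 5, (0.0, 10.0)
edges = np.linspace(hist_range[0], hist_range[1], bins + 1)
print(edges)  # [ 0.  2.  4.  6.  8. 10.]

# Passing a list of bin boundaries instead makes 'range' unnecessary
counts, _ = np.histogram([1.0, 3.0, 3.5, 9.0], bins=edges)
print(counts)  # [1 2 0 0 1]
```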

Stacks must be Stack objects. The plotting style is defined via the stack object.

The optional blind argument controls which stack should be blinded. The argument can be a single stack, a list of stacks or None. By default, no stack is blinded.

This method creates a new figure and axes internally (handled by uhepp). The figure size can be changed with the figure_size argument. If this argument is not None, it must be a tuple of (width, height).

The method returns (figure, axes) which were used during plotting. This might be identical to the figure and axes arguments. If a ratio plot is drawn, the axes return value is a list of the main and ratio axes.

The weight is used to weight the entries. Entries have unit weight if it is omitted. The argument can be a string naming a column or a Variable object.

If the y_log argument is set to True, the y axis will be logarithmic. The axis is enlarged on a logarithmic scale to make room for the ATLAS labels. The optional y_min argument can be used to set the lower limit of the y axis. The default is 0 for linear scale, and 1 for logarithmic scale.

The option vlines can be used to draw vertical lines onto the histogram, e.g., a cut line. The argument should be an array with one item per line. If an item is a number, a red line is drawn at that x-position. If it is a dict, the item ‘x’ determines the position and all other keywords are passed to matplotlib’s plot method.

The ratio_label option controls the label of the ratio plot.

The ratio_range argument controls the y-range of the ratio plot. If set to None, it scales automatically to include all points. The default is None.

If diff is set to True, the difference between the ‘numerator’ and the ‘denominator’ is drawn instead of their ratio.

The module constants ATLAS and INFO are passed to atlasify. Overwrite them to change the badges.

If the density argument is True, the area of each stack is normalized to unity.

If return_uhepp is True, the method returns a UHepPlot object.

freeforestml.plot.human_readable(label)[source]

Converts labels to plain ASCII strings.

freeforestml.plot.roc(df, signal_process, background_process, discriminant, steps=100, selection=None, min=None, max=None, axes=None, weight=None, atlas=None, info=None, enlarge=1.3, return_auc=False)[source]

Creates a ROC curve.

The method returns a dataframe with the signal efficiency and background rejection columns. The length of the dataframe equals the steps parameter.

If return_auc is True, the method returns a tuple with the area under the curve and an uncertainty estimation on the area.
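The returned efficiency/rejection columns can be sketched with numpy and pandas (a hypothetical illustration of the threshold scan, not the package's exact code; the real default for steps is 100, the sketch uses 11 thresholds):

```python
import numpy as np
import pandas as pd

# Toy discriminant values for signal and background events
signal = np.array([0.9, 0.8, 0.7, 0.4])
background = np.array([0.6, 0.3, 0.2, 0.1])

rows = []
for threshold in np.linspace(0.0, 1.0, 11):
    eff = (signal >= threshold).mean()     # signal efficiency
    rej = (background < threshold).mean()  # background rejection
    rows.append((eff, rej))

roc = pd.DataFrame(rows, columns=["signal_efficiency", "background_rejection"])
print(len(roc))  # 11
```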