Glossary#

A#

AUC#

Area Under the Curve

The area under a ROC curve. AUC is a metric that can be used to benchmark different models against each other, with larger values corresponding to better model performance.

See ROC

B#

batch#

A batch is a subset of the datapoints in a dataset. The batches are passed through a neural network one after the other, and after each batch the loss function is evaluated, the gradients are calculated and the model weights are updated using backpropagation.

The larger the batch size, the less frequently these updates happen and the less time each epoch takes. On the flip side, the model parameters are then updated less often, so the progress from one epoch to the next may be slower. Choosing a smaller batch size, on the other hand, leads to more gradient evaluations and more frequent updates of the model weights, and therefore to a longer duration of each epoch.
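
A minimal sketch (with arbitrary example numbers) of how the number of weight updates per epoch follows from the dataset size and the batch size:

```python
import math

# Illustrative numbers: dataset size and two candidate batch sizes
n_datapoints = 10_000

for batch_size in (32, 512):
    # one weight update per batch => ceil(N / batch size) updates per epoch
    updates_per_epoch = math.ceil(n_datapoints / batch_size)
    print(f"batch size {batch_size:>4}: {updates_per_epoch} weight updates per epoch")
```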

bias#

The deviation of the predictions from the true labels in classification and of the predicted values from the true values in regression

binary classification#

Classification problem with only two classes. These are often called positive and negative, e.g. when a test for a disease is performed, and the negative class is associated with the null hypothesis, i.e. the hypothesis that is assumed to be valid until evidence to the contrary is presented.

Contrast to multi-class classification.

C#

calibration#

Calibration refers to the process of checking and correcting the score of a model in terms of its possible interpretation as a probability. For example, in binary classification, the sigmoid-activated model output is often interpreted as the probability that the datapoint belongs to the positive class. However, the fact that the sigmoid function returns values in [0, 1] does not yet guarantee that its values accurately quantify the probability for a datapoint to belong to a certain class. These probabilities and their relation to the model score are determined during calibration.

The principle roughly is the following, illustrated with the example of binary classification in mind:

  • Choose a certain threshold for the model score, above which a datapoint is sorted into the positive category

  • Apply this threshold and

    • average the predicted scores over all datapoints that exceed the threshold

    • among the datapoints exceeding the threshold, determine the fraction that truly belong to the positive class

  • Add a point to the calibration plot with the average predicted score on the x-axis and the fraction of true positives on the y-axis

  • Repeat this with a number of different thresholds

This results in a calibration curve, e.g. such as the one in https://scikit-learn.org/stable/modules/calibration.html#calibration-curves.
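
A minimal sketch of producing such a calibration curve with scikit-learn's calibration_curve; note that scikit-learn bins the scores rather than scanning cumulative thresholds, but the idea is the same. The labels and scores below are randomly generated purely for illustration:

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Hypothetical true labels and model scores, for illustration only
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(0.3 * y_true + rng.uniform(0.0, 0.7, size=1000), 0.0, 1.0)

# For each score bin: fraction of true positives and mean predicted score
frac_positives, mean_score = calibration_curve(y_true, y_score, n_bins=10)
for x, y in zip(mean_score, frac_positives):
    print(f"mean predicted score {x:.2f} -> fraction of positives {y:.2f}")
```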

CL#

confidence level

CNN#

Convolutional Neural Network

A neural network containing one or more convolutional layers.

In a 2D convolutional layer, typically used for images, \(n_f\) two-dimensional filters of size \(f_1 \times f_2\) are slid across the two-dimensional data arrays of size \(n_1 \times n_2\) to create \(n_f\) output data arrays of size \((n_1-f_1+1) \times (n_2-f_2+1)\) (assuming a stride of 1 and no padding). Analogous convolutional layers of different dimensionalities exist as well. Convolutional layers are often followed by pooling layers that aggregate neighbouring pixels or voxels by calculating their maximum or average.

By training and adjusting the filters, the neural network can distill particular patterns in the data and feed them to the following dense layers.
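
A minimal numpy sketch of a single 'valid' 2D convolution (one filter, stride 1, no padding) that reproduces the output size quoted above; the array sizes are arbitrary examples:

```python
import numpy as np

# Hypothetical sizes: a 28x28 input and a single 3x3 filter
n1, n2 = 28, 28
f1, f2 = 3, 3

image = np.random.rand(n1, n2)
kernel = np.random.rand(f1, f2)

# Slide the filter across the image with stride 1 and no padding
output = np.empty((n1 - f1 + 1, n2 - f2 + 1))
for i in range(output.shape[0]):
    for j in range(output.shape[1]):
        output[i, j] = np.sum(image[i:i + f1, j:j + f2] * kernel)

print(output.shape)  # (26, 26) = (n1 - f1 + 1, n2 - f2 + 1)
```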

confidence level#

The probability of not making a type-1 error, i.e. the probability of not wrongly rejecting the null hypothesis when it is in fact true.

\[\text{confidence level} = \text{CL} = 1 - \alpha\]

D#

data augmentation#

Techniques for artificially increasing the size of the dataset

For example, in computer vision, images in the training set may be subjected to random shifts, rotations, shearing, horizontal flipping, changes in brightness, contrast, saturation and other properties.

This can help to increase the model performance by allowing for more and longer training while at the same time avoiding overtraining.

dropout#

A dropout layer in a neural network randomly sets some of the values passed into it from the preceding layer to zero, i.e. randomly drops or deactivates some of its inputs. The fraction of dropped nodes, usually called dropout rate, is a model hyperparameter.

The original paper: G.E. Hinton et al: Improving neural networks by preventing co-adaptation of feature detectors, arXiv:1207.0580
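
A minimal numpy sketch of how dropout acts during training ('inverted' dropout, as used by most frameworks: the surviving inputs are rescaled so that the expected output stays the same); the activations are random illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)
dropout_rate = 0.5           # hyperparameter: fraction of inputs to drop
x = rng.normal(size=10)      # hypothetical inputs from the preceding layer

# Training: randomly zero out inputs and rescale the survivors
mask = rng.random(x.shape) >= dropout_rate
x_train = np.where(mask, x / (1.0 - dropout_rate), 0.0)

# Inference: the dropout layer simply passes its inputs through
x_eval = x

print(x_train)
```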

E#

EBS#

Amazon Elastic Block Store

EC2#

Amazon Elastic Compute Cloud: Virtual servers

ECR#

Amazon Elastic Container Registry: Repositories for Docker containers

F#

FN#

False Negative: The label of a datapoint is predicted to be negative, but is positive in reality.

Caution

False negatives can be particularly dangerous as e.g. a patient who really has a condition is not detected as sick and therefore is not treated.

FP#

False Positive: The label of a datapoint is predicted to be positive, but is negative in reality.

FPR#

False Positive Rate

Defined as

\[\text{FPR} := \frac{\text{FP}}{\text{TN} + \text{FP}} = 1 - \text{specificity} \approx \alpha\]

where \(\text{FP}\) are the false positives and \(\text{TN}\) the true negatives.

It specifies what fraction of the truly negative datapoints were incorrectly classified / predicted to be positive. Therefore, it is related to the type-1 error and its probability \(\alpha\).

H#

hyperparameter#

Hyperparameters characterise the layout and architecture of the model and its associated functions and algorithms. As such, hyperparameters are not and cannot be changed during training. There are two different types of hyperparameters:

  • model hyperparameters: e.g. the number of hidden layers in a neural network, the dropout rate, the activation function of a specific layer

  • algorithm hyperparameters: e.g. the optimiser, its learning rate, the batch size

Hyperparameters can be searched and optimised to maximise model performance. This process is called hyperparameter tuning or hyperparameter optimisation.

Contrast against model parameter.

hyperparameter optimisation#

see hyperparameter tuning

hyperparameter tuning#

Evaluation of the achievable model performance when trying out different values for one or more hyperparameters. Normally, hyperparameter tuning refers to automated strategies for scanning different hyperparameter values and ranges.

Since evaluating a single point in hyperparameter space involves training and validating a model, hyperparameter tuning can be quite time-consuming and resource-intensive. Therefore, normally only a subset of the full hyperparameter space is scanned for a given model.

Tuning strategies broadly fall into three basic categories:

  • grid searches: All possible combinations of the selected hyperparameters and their values are tried out systematically. For categorical hyperparameters, e.g. the choice of the optimiser, all specified options are tried, and for continuous and ordinal hyperparameters, linearly or logarithmically equidistant points within configured ranges may be used (a minimal grid-search sketch follows after this list).

  • random searches: Points in the configured hyperparameter space are picked randomly

  • advanced searches: Advanced searches try to make informed decisions on which hyperparameter point to evaluate next, based on which hyperparameter points were scanned before and how they performed. A typical strategy is Bayesian optimisation together with Gaussian random processes.
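
As an illustration of the first category, a minimal grid search with scikit-learn's GridSearchCV; the estimator, data and parameter grid are arbitrary examples:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical dataset, purely for illustration
X, y = make_classification(n_samples=200, random_state=0)

param_grid = {
    "C": [0.1, 1.0, 10.0],        # logarithmically spaced values
    "kernel": ["linear", "rbf"],  # a categorical hyperparameter
}

# Every combination in the grid is trained and cross-validated
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```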

I#

IAM#

AWS Identity & Access Management

image augmentation#

In image augmentation, transformations are applied to images before feeding them into a model. These transformations can serve to normalise the images, e.g. by rescaling them with a common factor, as well as to effectively increase the size of the datasets by applying random flips, rotations, brightness changes and other transformations. While these random transformations can help protect against overtraining, they can also help the trained model generalise to other images. For example, in the Fashion-MNIST dataset all shoes point with their tips to the left, so random horizontal flips allow the model to also recognise shoes pointing to the right.
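
A minimal sketch using Keras' ImageDataGenerator (other libraries offer equivalent tools); the images here are random arrays, purely for illustration:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Hypothetical data: 8 greyscale images of size 28x28
train_images = np.random.rand(8, 28, 28, 1)
train_labels = np.random.randint(0, 2, size=8)

# Normalisation via a common rescaling factor plus random transformations
datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # normalise pixel values
    rotation_range=15,       # random rotations of up to 15 degrees
    width_shift_range=0.1,   # random horizontal shifts
    height_shift_range=0.1,  # random vertical shifts
    horizontal_flip=True,    # e.g. so that shoes point in both directions
)

# flow() yields batches of randomly transformed images during training
augmented_batch, labels = next(datagen.flow(train_images, train_labels, batch_size=4))
print(augmented_batch.shape)  # (4, 28, 28, 1)
```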

imbalanced data#

When, in binary classification or multi-class classification, the data contain significantly more datapoints in one class than in the other(s).

See Binary Classification with the Stroke Prediction Dataset

K#

KMS#

AWS Key Management Service

L#

L1 regularisation#

When L1 (or lasso) regularisation is activated for a layer, a penalty term proportional to the sum of the absolute values of the weights of that layer is added to the loss function. The strength of the regularisation can be adjusted by scaling the penalty term with a factor.

L2 regularisation#

When L2 (or ridge) regularisation is activated for a layer, a penalty term proportional to the sum of the squared weights of that layer is added to the loss function. The strength of the regularisation can be adjusted by scaling the penalty term with a factor.
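
A minimal sketch of how such penalty terms are typically attached to a layer, here using Keras regularisers; the layer size and scaling factors are arbitrary examples:

```python
from tensorflow.keras import layers, regularizers

# The factor 0.01 scales the penalty term and thereby controls the
# regularisation strength (a hyperparameter)
l1_layer = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l1(0.01))
l2_layer = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l2(0.01))
```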

lasso regularisation#

see L1 regularisation

M#

model parameter#

Model parameters are the parameters adjusted during training to minimise the loss function and fit the model to the training data, e.g. the weights of the edges between the nodes in a neural network.

Contrast against hyperparameter.

multi-class classification#

Classification problem involving more than two classes

Contrast to binary classification.

O#

overfitting#

see overtraining

oversampling#

Method for addressing imbalanced data

See Binary Classification with the Stroke Prediction Dataset

overtraining#

Also called overfitting

When the model memorises specific random fluctuations in the training data. Since the validation data do not contain the exact same datapoints, but rather others with different random fluctuations, the model fails to generalise to the validation data. Therefore, when overtraining occurs, the model performance is worse during validation than during training.

In training, the predicted values lie close to the true values, but the model fails to generalise beyond the specific datapoints, corresponding to a low bias but high variance.

Overtraining may occur when

  • the model is too complex, i.e. it has too many parameters

  • training continues for too long on a too limited dataset

There are several strategies aimed at avoiding overtraining:

  • more training data or data augmentation

  • a simpler model with fewer parameters

  • regularisation techniques such as dropout, L1 regularisation and L2 regularisation

  • stopping the training earlier

P#

power#

The power of a test or classifier quantifies its capability of detecting a positive result. Therefore, it is related to the probability of the type-2 error \(\beta\) by:

\[\text{power} = 1 - \beta\]

See also: TPR

precision#

Defined as

\[\text{precision} := \frac{\text{TP}}{\text{TP} + \text{FP}}\]

where \(\text{TP}\) are the true positives and \(\text{FP}\) the false positives.

It specifies what fraction of the datapoints that were classified/predicted to be positive are in fact truly positive, i.e. which fraction of the positive classifications/predictions is correct. Therefore, e.g. in the context of medical tests, the precision is of special interest to the tested person or patient because it gives the probability for the positive result to be actually true.

R#

recall#

see True Positive Rate (TPR)

ridge regularisation#

see L2 regularisation

ROC#

Receiver Operating Characteristic

The ROC curve shows the true positive rate (TPR) vs. the false positive rate (FPR) for a given model. It essentially shows the balance between type-1 and type-2 errors and visualises to what extent decreasing one will increase the other. Choosing a certain threshold value of the activated classifier output (and thereby defining the rule for associating datapoints with classes) corresponds to picking a working point on a given ROC curve. Moving the threshold value scans the ROC curve, so that a working point with the desired balance of error rates can be picked.

The more the ROC curve extends to the top left corner, i.e. towards high TPRs at low FPRs, the better the performance of a model. Therefore, the area under the curve (AUC) of a ROC curve can be used to benchmark different models against each other.
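
A minimal sketch of obtaining the ROC curve and its AUC with scikit-learn; the labels and scores are randomly generated purely for illustration:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Hypothetical true labels and model scores
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(0.4 * y_true + rng.normal(0.3, 0.2, size=500), 0.0, 1.0)

# Scanning the threshold yields one (FPR, TPR) point per threshold value
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(f"AUC = {auc(fpr, tpr):.3f}")
```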

S#

sample#

In data science, sample refers to a single datapoint.

Since I have a background in particle physics, where the term sample usually refers to a set of generated/simulated datapoints, I tend to avoid it and usually prefer datapoint.

The vocabulary ‘confusion matrix’ that translates between data science and particle physics is the following:

object          | data science | particle physics
----------------|--------------|-----------------------------------------------------------
single entity   | sample       | datapoint
set of entities | dataset      | dataset (if measured); (Monte Carlo) sample (if generated/simulated)

score#

The score of a model is its activated output, e.g. the activated output of the last layer in a neural network.

The unactivated outputs are called logits.

Commonly used activation functions are the sigmoid function for binary classification and the softmax function for multi-class classification.

sensitivity#

see True Positive Rate (TPR)

sigmoid#

Sigmoid functions follow a characteristic ‘S’-shape. In machine learning, sigmoid activation usually refers to using the logistic function

\[f(x) = \frac{1}{1 + e^{-x}}\]

as the activation function.

Since the sigmoid function maps all real numbers to the interval (0, 1), sigmoid activation is typically used in binary classification, with outputs close to 0 associated with one category and outputs close to 1 with the other. The sigmoid-activated network output is also often interpreted as the probability that a datapoint belongs to the second class, but this interpretation has to be taken with a grain of salt, see calibration.
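
A minimal numpy sketch of the logistic function applied to some hypothetical logits:

```python
import numpy as np

def sigmoid(x):
    """Logistic function, mapping real numbers to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical unactivated outputs (logits), purely for illustration
logits = np.array([-4.0, -1.0, 0.0, 2.0, 5.0])
scores = sigmoid(logits)
print(scores)  # values close to 0 -> one class, close to 1 -> the other
```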

softmax#

The softmax function is typically used as the activation function in multi-class classification problems with one-hot encoded labels. It is defined as

\[\sigma(\vec{x})_i = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}\]

Each of the \(n\) target classes corresponds to one entry in the classification vector \(\vec{x}\) and the softmax function provides a mapping \(\mathbb{R}^n \to [0,1]^n\). Furthermore, it provides a normalisation such that the activated entries of the classification vector sum up to unity, i.e. \(\sum_{i=1}^n \sigma(\vec{x})_i = 1\). This is what is naturally expected for discrete probabilities. However, as long as a classifier is not calibrated, it cannot be guaranteed that the activated output of the last layer gives the probabilities for a datapoint to belong to each of the involved target classes.
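
A minimal numpy sketch illustrating that the softmax-activated entries sum up to unity; the logits are arbitrary example values:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax for a single classification vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical classification vector (logits) for a 4-class problem
logits = np.array([2.0, 1.0, 0.1, -1.0])
scores = softmax(logits)
print(scores, scores.sum())  # the activated entries sum to 1.0
```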

specificity#

Defined as

\[\text{specificity} := \frac{\text{TN}}{\text{TN} + \text{FP}} = 1 - \text{FPR}\]

where \(\text{TN}\) are the true negatives and \(\text{FP}\) the false positives.

It specifies what fraction of the truly negative datapoints was correctly classified/predicted to be negative. Therefore, the specificity is related to the FPR.

T#

testing#

The determination of the unbiased model performance.

To this end, the full dataset available during model design, training and development is split up into three distinct parts:

  • the training dataset

  • the validation / hold-out cross-validation or development dataset and

  • the test dataset

While the model parameters are adjusted on the training dataset, the performance of the model during the development phase is estimated from the validation dataset. Between the training runs, the hyperparameters are changed so as to maximise the performance metrics evaluated from the validation dataset. Finally, at the end of the development phase, a specific model and a set of hyperparameters is chosen and afterwards, the model performance is evaluated based on the test dataset. This is an unbiased estimate since the test data were never previously used to make choices regarding the model.

Many times, when getting a proper unbiased estimate of the model performance is not crucial, no separate testing is performed. In such cases, the model performance is simply quantified with the validation results. In practice, this validation stage is then often referred to as ‘testing’.
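
A minimal sketch of such a three-way split using scikit-learn's train_test_split twice; the dataset and the 70/15/15 fractions are arbitrary examples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset, purely for illustration
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# First split off the test set, then split the rest into training
# and validation data
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.15,
                                                  random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest,
                                                  test_size=0.15 / 0.85,
                                                  random_state=0)

print(len(X_train), len(X_val), len(X_test))  # roughly 700 / 150 / 150
```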

TN#

True Negative: The label of a datapoint is predicted to be negative and also is negative in reality

TP#

True Positive: The label of a datapoint is predicted to be positive and also is positive in reality

TPR#

True Positive Rate

Defined as

\[\text{TPR} := \frac{\text{TP}}{\text{TP} + \text{FN}} = \text{sensitivity} = \text{recall} \approx 1 - \beta\]

where \(\text{TP}\) are the true positives and \(\text{FN}\) the false negatives.

It specifies what fraction of the truly positive datapoints were correctly classified / predicted to be positive. Therefore, it is related to the type-2 error and its probability \(\beta\) or, more specifically, to the power \(1 - \beta\).
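
A minimal sketch tying together the rates defined in this glossary, computed from arbitrary example counts:

```python
# Hypothetical confusion-matrix counts, purely for illustration
TP, FP, TN, FN = 80, 10, 95, 15

tpr = TP / (TP + FN)          # sensitivity / recall, approx. 1 - beta
fpr = FP / (TN + FP)          # 1 - specificity, approx. alpha
specificity = TN / (TN + FP)
precision = TP / (TP + FP)

print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}, "
      f"specificity = {specificity:.2f}, precision = {precision:.2f}")
```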

type-1 error#

The error of wrongly rejecting the null hypothesis and accepting the alternative hypothesis. Its probability is denoted with \(\alpha\):

\[\alpha := P(\text{type-1 error})\]

It is related to the confidence level (CL) by

\[\alpha = 1 - \text{confidence level}\]

In binary classification, where the null hypothesis is usually taken to be

  • a negative test

  • the patient is healthy

  • the absence of new physics effects and the validity of the currently established model

or a similarly normal situation, making type-1 errors results in false positives (FP).

type-2 error#

The error of wrongly accepting the null hypothesis and rejecting the alternative hypothesis. Its probability is denoted with \(\beta\):

\[\beta := P(\text{type-2 error})\]

It is related to the power by

\[\beta = 1 - \text{power}\]

In binary classification, where the null hypothesis is usually taken to be

  • a negative test

  • the patient is healthy

  • the absence of new physics effects and the validity of the currently established model

or a similarly normal situation, making type-2 errors results in false negatives (FN).

U#

underfitting#

see undertraining

undertraining#

Also called underfitting.

When the model fails to learn the characteristic properties of the data during training. It is indicated by poor model performance in both training and validation: the predicted values deviate from the true values, corresponding to a high bias.

Undertraining may occur when

  • there is not enough training data

  • there is too much noise in the training data, hiding the real characteristics and dependencies

  • training does not continue long enough

  • the model is inadequate and perhaps too simple to capture the characteristics of the data (e.g. when trying to fit a linear function to datapoints following a sine function)

Possible strategies:

  • more training data

  • cleaner training data with less noise and statistical fluctuations

  • a more sophisticated or flexible model (e.g. more parameters, different types of layers)

V#

variance#

The amount of variation in the model itself.

For example, in function regression, a model may have very low bias, i.e. approximate the given true values very well, but at the same time oscillate and fluctuate wildly in between those true values. Such a model will generalise poorly to new data, see overtraining.

VPC#

Amazon Virtual Private Cloud