Glossary
========

A
-

.. glossary::

   AUC
      *Area Under the Curve*

      The area under a ROC curve. AUC is a metric that can be used to benchmark different
      models against each other, with larger values corresponding to better model
      performance. See :term:`ROC`.

B
-

.. glossary::

   batch
      A batch is a subset of datapoints in a dataset that is processed in sequence by a
      neural network before the loss function is evaluated, the gradients are calculated
      and the model weights are updated using backpropagation. The larger the batch size,
      the less frequently these updates happen and the less time each epoch takes. On the
      flip side, the model parameters are updated less often, so the progress from one
      epoch to the next may be slower. Choosing smaller batch sizes, on the other hand,
      leads to more gradient evaluations and more frequent updates of the model weights,
      and therefore to a longer duration of each epoch.

   bias
      The deviation of the predictions from the true labels in classification and of the
      predicted values from the true values in regression.

   binary classification
      Classification problem with only two classes. These are often *positive* and
      *negative*, e.g. when a test for a disease is performed, and the *negative* class is
      associated with the null hypothesis, i.e. the hypothesis that is assumed to be valid
      until evidence to the contrary is presented.

      Contrast to :term:`multi-class classification`.

C
-

.. glossary::

   calibration
      *Calibration* refers to the process of checking and correcting the :term:`score` of a
      model in terms of its possible interpretation as a probability. For example, in
      :term:`binary classification`, the :term:`sigmoid`-activated model output is often
      interpreted as the probability that the datapoint belongs to the *positive* class.
      However, the fact that the :term:`sigmoid` function returns values in [0, 1] does not
      yet guarantee that its values accurately quantify the probability for a datapoint to
      belong to a certain class. These probabilities and their relation to the model
      :term:`score` are determined during *calibration*.

      The principle is roughly the following, illustrated with the example of
      :term:`binary classification` in mind:

      - Choose a certain threshold for the model :term:`score`, above which a datapoint is
        sorted into the *positive* category
      - Apply this threshold and

        - average the predicted scores over all datapoints that exceed the threshold
        - among the datapoints exceeding the threshold, determine the fraction that truly
          belong to the *positive* class

      - Add a point to the calibration plot with the average predicted score on the x-axis
        and the fraction of true positives on the y-axis
      - Repeat this with a number of different thresholds

      This results in a *calibration curve*, e.g. such as the one in
      https://scikit-learn.org/stable/modules/calibration.html#calibration-curves.

   CL
      :term:`confidence level`

   CNN
      *Convolutional Neural Network*

      A neural network containing one or more convolutional layers. In a 2D convolutional
      layer, typically used for images, :math:`n_f` two-dimensional filters of size
      :math:`f_1 \times f_2` are slid across the two-dimensional data arrays of size
      :math:`n_1 \times n_2` to create :math:`n_f` output data arrays of size
      :math:`(n_1-f_1+1) \times (n_2-f_2+1)`. Analogous convolutional layers of different
      dimensionalities exist as well. Convolutional layers are often followed by pooling
      layers that aggregate neighbouring pixels or voxels by calculating their maximum or
      average. By training and adjusting the filters, the neural network can distill
      particular patterns in the data and feed them to the following dense layers.
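
      A minimal sketch of such an architecture, assuming TensorFlow/Keras (which this
      glossary does not prescribe) and purely illustrative layer sizes, a 28x28 greyscale
      input and 10 target classes, could look like this:

      .. code-block:: python

         import tensorflow as tf

         model = tf.keras.Sequential([
             tf.keras.Input(shape=(28, 28, 1)),
             # 32 filters of size 3x3 are slid across the single-channel input
             tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
             # pooling layer aggregating neighbouring pixels by taking their maximum
             tf.keras.layers.MaxPooling2D((2, 2)),
             tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
             tf.keras.layers.MaxPooling2D((2, 2)),
             # flatten the feature maps and feed them to the following dense layers
             tf.keras.layers.Flatten(),
             tf.keras.layers.Dense(64, activation="relu"),
             tf.keras.layers.Dense(10, activation="softmax"),
         ])
         model.summary()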

   confidence level
      The probability of *not* making a :term:`type-1 error`, i.e. the probability of *not*
      wrongly rejecting the null hypothesis and therefore rightly accepting the null
      hypothesis.

      .. math::

         \text{confidence level} = \text{CL} = 1 - \alpha

D
-

.. glossary::

   data augmentation
      Techniques for artificially increasing the size of the dataset.

      For example, in computer vision, images in the training set may be subjected to
      random shifts, rotations, shearing, horizontal flipping, changes in brightness,
      contrast, saturation and other properties. This can help to increase the model
      performance by allowing for more and longer training while at the same time avoiding
      :term:`overtraining`.

   dropout
      A *dropout layer* in a neural network randomly sets some of the values passed into it
      from the preceding layer to zero, i.e. it randomly drops or deactivates some of its
      inputs. The fraction of dropped nodes, usually called the *dropout rate*, is a model
      :term:`hyperparameter`.

      The original paper: G.E. Hinton et al.: *Improving neural networks by preventing
      co-adaptation of feature detectors*,
      `arXiv:1207.0580 <https://arxiv.org/abs/1207.0580>`_

E
-

.. glossary::

   EBS
      Amazon *Elastic Block Store*

   EC2
      Amazon *Elastic Compute Cloud*: Virtual servers

   ECR
      Amazon *Elastic Container Registry*: Repositories for *Docker* containers

F
-

.. glossary::

   FN
      *False Negative*: The label of a datapoint is predicted to be *negative*, but is
      *positive* in reality.

      .. caution::

         False negatives can be particularly dangerous, as e.g. a patient who really has a
         condition is not detected as sick and therefore is not treated.

   FP
      *False Positive*: The label of a datapoint is predicted to be *positive*, but is
      *negative* in reality.

   FPR
      *False Positive Rate*

      Defined as

      .. math::

         \text{FPR} := \frac{\text{FP}}{\text{TN} + \text{FP}} = 1 - \text{specificity} \approx \alpha

      where :math:`\text{FP}` are the false positives and :math:`\text{TN}` the true
      negatives. It specifies what fraction of the truly negative datapoints were
      incorrectly classified/predicted to be positive. Therefore, it is related to the
      :term:`type-1 error` and its probability :math:`\alpha`.

H
-

.. glossary::

   hyperparameter
      *Hyperparameters* characterise the layout and architecture of the model and its
      associated functions and algorithms. As such, *hyperparameters* are not and cannot be
      changed during training. There are two different types of *hyperparameters*:

      - **model hyperparameters**: e.g. the number of hidden layers in a neural network,
        the :term:`dropout` rate, the activation function of a specific layer
      - **algorithm hyperparameters**: e.g. the optimiser, its learning rate, the batch
        size

      *Hyperparameters* can be searched and optimised to maximise model performance. This
      process is called :term:`hyperparameter tuning` or
      :term:`hyperparameter optimisation`.

      Contrast against :term:`model parameter`.

   hyperparameter optimisation
      see :term:`hyperparameter tuning`

   hyperparameter tuning
      Evaluation of the achievable model performance when trying out different values for
      one or more hyperparameters. Normally, *hyperparameter tuning* refers to automated
      strategies for scanning different :term:`hyperparameter` values and ranges. Since
      evaluating a single point in hyperparameter space involves training and validating a
      model, *hyperparameter tuning* can be quite time-consuming and resource-intensive.
      Therefore, usually only a subset of the full hyperparameter space is scanned rather
      than the whole space.

      Tuning strategies broadly fall into three basic categories:

      - **grid searches**: All possible combinations of the selected hyperparameters and
        their values are tried out systematically. For categorical hyperparameters, e.g.
        the choice of the optimiser, all specified options are tried, and for continuous
        and ordinal hyperparameters, linearly or logarithmically equidistant points within
        configured ranges may be tried.
      - **random searches**: Points in the configured hyperparameter space are picked
        randomly.
      - **advanced searches**: Advanced searches try to make informed decisions on which
        hyperparameter point to evaluate next, based on which hyperparameter points were
        scanned before and how they performed. A typical strategy is Bayesian optimisation
        based on Gaussian processes.
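
      As a minimal illustration of a grid search, here is a sketch using scikit-learn
      (which this glossary does not prescribe); the classifier choice and the parameter
      grid are arbitrary assumptions:

      .. code-block:: python

         from sklearn.datasets import make_classification
         from sklearn.ensemble import RandomForestClassifier
         from sklearn.model_selection import GridSearchCV

         # toy data, only to make the sketch runnable
         X, y = make_classification(n_samples=500, random_state=0)

         # every combination of these hyperparameter values is trained and cross-validated
         param_grid = {
             "n_estimators": [50, 100, 200],
             "max_depth": [3, 5, None],
         }

         search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
         search.fit(X, y)
         print(search.best_params_, search.best_score_)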

I
-

.. glossary::

   IAM
      AWS *Identity & Access Management*

   image augmentation
      In *image augmentation*, transformations are applied to images before feeding them
      into a model. These transformations can serve to normalise the images, e.g. by
      rescaling them with a common factor, as well as to effectively increase the size of
      the datasets by applying random flips, rotations, brightness changes and other
      transformations. While these random transformations can help protect against
      :term:`overtraining`, they can also help the trained model to generalise to other
      images. For example, this would be the case with the *Fashion-MNIST* dataset which,
      among other types of clothes, contains shoes that all point with their tips to the
      left.

   imbalanced data
      When the data contain significantly more datapoints in one class than the other(s),
      in :term:`binary classification` or :term:`multi-class classification`.

      See :doc:`examples/1-binary-stroke-prediction/index`

K
-

.. glossary::

   KMS
      AWS *Key Management Service*

L
-

.. glossary::

   L1 regularisation
      When *L1* (or *lasso*) *regularisation* is activated for a layer, a penalty term
      *proportional to the sum of the absolute values* of the weights of that layer is
      added to the loss function. The strength of the regularisation can be adjusted by
      scaling the penalty term with a factor.

   L2 regularisation
      When *L2* (or *ridge*) *regularisation* is activated for a layer, a penalty term
      *proportional to the sum of the squared values* of the weights of that layer is added
      to the loss function. The strength of the regularisation can be adjusted by scaling
      the penalty term with a factor.

   lasso regularisation
      see :term:`L1 regularisation`

M
-

.. glossary::

   model parameter
      *Model parameters* are the parameters adjusted during training to minimise the loss
      function and fit the model to the training data, e.g. the weights of the edges
      between the nodes in a neural network.

      Contrast against :term:`hyperparameter`.

   multi-class classification
      Classification problem involving more than two classes.

      Contrast to :term:`binary classification`.

O
-

.. glossary::

   overfitting
      see :term:`overtraining`

   oversampling
      Method for addressing :term:`imbalanced data`.

      See :doc:`examples/1-binary-stroke-prediction/index`

   overtraining
      Also called *overfitting*.

      When the model memorises specific random fluctuations in the training data. Since the
      validation data do not contain the exact same datapoints, but rather others with
      different random fluctuations, the model fails to generalise to the validation data.
      Therefore, when *overtraining* occurs, the model performance is worse during
      validation than in training. In training, the *predicted* values lie close to the
      *true* values, but the model fails to generalise beyond the specific datapoints,
      corresponding to a low :term:`bias` but a high :term:`variance`.

      *Overtraining* may occur when

      - the model is too complex, i.e. it has too many parameters
      - training continues for too long on a dataset that is too limited

      There are several strategies aimed at avoiding *overtraining*:

      - more training data
      - early stopping of the training, when the loss and accuracy do not improve anymore
      - a less complex model with fewer parameters
      - regularisation techniques

        - :term:`dropout` layers
        - :term:`L1 regularisation` or :term:`L2 regularisation`

      - :term:`data augmentation`
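
      Several of these counter-measures can be combined. A minimal sketch, assuming
      TensorFlow/Keras and purely illustrative values (dropout rate, L2 factor, patience,
      random toy data), could look like this:

      .. code-block:: python

         import numpy as np
         import tensorflow as tf

         # random toy data, only to make the sketch runnable
         rng = np.random.default_rng(0)
         x_train, y_train = rng.normal(size=(200, 20)), rng.integers(0, 2, size=200)
         x_val, y_val = rng.normal(size=(50, 20)), rng.integers(0, 2, size=50)

         model = tf.keras.Sequential([
             tf.keras.Input(shape=(20,)),
             # L2 (ridge) penalty on this layer's weights
             tf.keras.layers.Dense(64, activation="relu",
                                   kernel_regularizer=tf.keras.regularizers.l2(1e-3)),
             # randomly drop 30% of the inputs to the next layer during training
             tf.keras.layers.Dropout(0.3),
             tf.keras.layers.Dense(1, activation="sigmoid"),
         ])
         model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

         # stop early once the validation loss has not improved for 5 epochs
         early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                                           restore_best_weights=True)
         model.fit(x_train, y_train, validation_data=(x_val, y_val),
                   epochs=50, callbacks=[early_stopping], verbose=0)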

P
-

.. glossary::

   power
      The *power* of a test or classifier quantifies its capability of detecting a
      *positive* result. Therefore, it is related to the probability of the
      :term:`type-2 error` :math:`\beta` by:

      .. math::

         \text{power} = 1 - \beta

      See also: :term:`TPR`

   precision
      Defined as

      .. math::

         \text{precision} := \frac{\text{TP}}{\text{TP} + \text{FP}}

      where :math:`\text{TP}` are the true positives and :math:`\text{FP}` the false
      positives. It specifies what fraction of the datapoints that were
      classified/predicted to be *positive* are in fact truly *positive*, i.e. which
      fraction of the *positive* classifications/predictions is correct. Therefore, e.g. in
      the context of medical tests, the *precision* is of special interest to the tested
      person or patient because it gives the probability that a *positive* result is
      actually true.

R
-

.. glossary::

   recall
      see *True Positive Rate* (:term:`TPR`)

   ridge regularisation
      see :term:`L2 regularisation`

   ROC
      *Receiver Operating Characteristic*

      The ROC curve shows the *true positive rate* (:term:`TPR`) vs. the *false positive
      rate* (:term:`FPR`) for a given model. It essentially gives the balance between
      type-1 and type-2 errors and visualises to what extent decreasing one will increase
      the other. Choosing a certain threshold value of the activated classifier output (and
      thereby defining the rule for associating datapoints with classes) corresponds to
      picking a working point on a given ROC curve. Moving the threshold value scans along
      the ROC curve, so that a working point with the desired balance of error rates can be
      picked. The more the ROC curve extends towards the top left corner, i.e. towards high
      TPRs at low FPRs, the better the performance of a model. Therefore, the *area under
      the curve* (:term:`AUC`) of a ROC curve can be used to benchmark different models
      against each other.

S
-

.. glossary::

   sample
      In data science, *sample* refers to a single datapoint. Since I have a background in
      particle physics, where the term *sample* usually refers to a set of
      generated/simulated datapoints, I tend to avoid it and usually prefer *datapoint*.
      The vocabulary 'confusion matrix' that translates between data science and particle
      physics is the following:

      =============== ============ =================================================
      object          data science particle physics
      =============== ============ =================================================
      single entity   sample       datapoint
      set of entities dataset      - if *measured*: dataset
                                   - if *generated/simulated*: (Monte Carlo) sample
      =============== ============ =================================================

   score
      The *score* of a model is the activated output of the model, e.g. the activated
      output of the last layer in a neural network. The unactivated outputs are called
      *logits*. Commonly used activation functions are

      - :term:`sigmoid` activation in :term:`binary classification`
      - :term:`softmax` activation in :term:`multi-class classification`
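
      As a small illustration of the relation between logits and scores, a numpy-only
      sketch (the logit values are made up) could look like this:

      .. code-block:: python

         import numpy as np

         def sigmoid(z):
             # logistic function, typically used in binary classification
             return 1.0 / (1.0 + np.exp(-z))

         def softmax(z):
             # softmax over the last axis, typically used in multi-class classification;
             # subtracting the maximum improves numerical stability
             e = np.exp(z - np.max(z, axis=-1, keepdims=True))
             return e / e.sum(axis=-1, keepdims=True)

         binary_logits = np.array([-1.2, 0.3, 2.5])
         print(sigmoid(binary_logits))      # scores in (0, 1), one per datapoint

         multiclass_logits = np.array([[2.0, 1.0, 0.1]])
         print(softmax(multiclass_logits))  # scores sum to 1 for each datapoint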

   sensitivity
      see *True Positive Rate* (:term:`TPR`)

   sigmoid
      Sigmoid functions follow a characteristic 'S'-shape. In machine learning, *sigmoid
      activation* usually refers to using the *logistic function*

      .. math::

         f(x) = \frac{1}{1 + e^{-x}}

      as the activation function. Since the *sigmoid function* maps all real numbers to the
      interval (0, 1), *sigmoid activation* is typically used in
      :term:`binary classification`, with outputs close to 0 associated with one category
      and outputs close to 1 with the other. The sigmoid-activated network output is also
      often interpreted as the probability of a datapoint to belong to the latter class,
      but this interpretation has to be taken with a grain of salt, see :term:`calibration`.

   softmax
      The *softmax* function is typically used as the activation function in
      :term:`multi-class classification` problems with one-hot encoded labels. It is
      defined as

      .. math::

         \sigma(\vec{x})_i = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}

      Each of the :math:`n` target classes corresponds to one entry in the classification
      vector :math:`\vec{x}` and the *softmax* function provides a mapping
      :math:`\mathbb{R}^n \to [0,1]^n`. Furthermore, it provides a normalisation such that
      the activated entries of the classification vector sum up to unity, i.e.
      :math:`\sum_{i=1}^n \sigma(\vec{x})_i = 1`. This is what is naturally expected for
      discrete probabilities. However, as long as a classifier is not calibrated, it cannot
      be guaranteed that the activated output of the last layer gives the probabilities for
      a datapoint to belong to each of the involved target classes.

   specificity
      Defined as

      .. math::

         \text{specificity} := \frac{\text{TN}}{\text{TN} + \text{FP}} = 1 - \text{FPR}

      where :math:`\text{TN}` are the true negatives and :math:`\text{FP}` the false
      positives. It specifies what fraction of the truly *negative* datapoints was
      correctly classified/predicted to be *negative*. Therefore, the *specificity* is
      related to the :term:`FPR`.

T
-

.. glossary::

   testing
      The determination of the *unbiased* model performance. To this end, the full dataset
      available during model design, training and development is split up into three
      distinct parts:

      - the *training* dataset,
      - the *validation* / *hold-out cross-validation* or *development* dataset and
      - the *test* dataset.

      While the model parameters are adjusted on the *training* dataset, the performance of
      the model during the development phase is estimated from the *validation* dataset.
      Between the training runs, the hyperparameters are changed so as to maximise the
      performance metrics evaluated from the *validation* dataset. Finally, at the end of
      the development phase, a specific model and a set of hyperparameters are chosen and
      afterwards, the model performance is evaluated based on the *test* dataset. This is
      an unbiased estimate since the *test* data were never previously used to make choices
      regarding the model.

      Often, when getting a properly unbiased estimate of the model performance is not
      crucial, no separate testing is performed. In such cases, the model performance is
      simply quantified with the validation results. In practice, this validation stage is
      then often referred to as 'testing'.
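
      A minimal sketch of such a three-way split, assuming scikit-learn and an arbitrary
      60/20/20 partition, could look like this:

      .. code-block:: python

         from sklearn.datasets import make_classification
         from sklearn.model_selection import train_test_split

         # toy data, only to make the sketch runnable
         X, y = make_classification(n_samples=1000, random_state=0)

         # split off the test set first; it is only touched at the very end
         X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2,
                                                         random_state=0)
         # split the remaining development data into training and validation sets
         X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25,
                                                           random_state=0)
         print(len(X_train), len(X_val), len(X_test))  # 600 200 200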

   TN
      *True Negative*: The label of a datapoint is predicted to be *negative* and also is
      *negative* in reality.

   TP
      *True Positive*: The label of a datapoint is predicted to be *positive* and also is
      *positive* in reality.

   TPR
      *True Positive Rate*

      Defined as

      .. math::

         \text{TPR} := \frac{\text{TP}}{\text{TP} + \text{FN}} = \text{sensitivity} = \text{recall} \approx 1 - \beta

      where :math:`\text{TP}` are the true positives and :math:`\text{FN}` the false
      negatives. It specifies what fraction of the truly positive datapoints were correctly
      classified/predicted to be positive. Therefore, it is related to the
      :term:`type-2 error` and its probability :math:`\beta`, or, more specifically, to the
      :term:`power` :math:`1 - \beta`.

   type-1 error
      The error of wrongly rejecting the null hypothesis and accepting the alternative
      hypothesis. Its probability is denoted with :math:`\alpha`:

      .. math::

         \alpha := P(\text{type-1 error})

      It is related to the :term:`confidence level` (CL) by

      .. math::

         \alpha = 1 - \text{confidence level}

      In :term:`binary classification`, where the null hypothesis is usually taken to be

      - a negative test,
      - the patient is healthy,
      - the absence of new physics effects and the validity of the currently established
        model

      or a similarly *normal* situation, making type-1 errors results in *false positives*
      (:term:`FP`).

   type-2 error
      The error of wrongly accepting the null hypothesis and rejecting the alternative
      hypothesis. Its probability is denoted with :math:`\beta`:

      .. math::

         \beta := P(\text{type-2 error})

      It is related to the :term:`power` by

      .. math::

         \beta = 1 - \text{power}

      In :term:`binary classification`, where the null hypothesis is usually taken to be

      - a negative test,
      - the patient is healthy,
      - the absence of new physics effects and the validity of the currently established
        model

      or a similarly *normal* situation, making type-2 errors results in *false negatives*
      (:term:`FN`).

U
-

.. glossary::

   underfitting
      see :term:`undertraining`

   undertraining
      Also called *underfitting*.

      When the model fails to learn the characteristic properties of the data during
      training. It is indicated by a bad model performance in both training and validation:
      the *predicted* values deviate from the *true* values, corresponding to a high
      :term:`bias`.

      *Undertraining* may occur when

      - there is not enough training data
      - there is too much noise in the training data, hiding the real characteristics and
        dependencies
      - training does not continue long enough
      - the model is inadequate and perhaps too simple to capture the characteristics of
        the data (e.g. as when trying to fit a linear function to datapoints following a
        sine function)

      Possible strategies:

      - more training data
      - cleaner training data with less noise and statistical fluctuations
      - a more sophisticated or flexible model (e.g. more parameters, different types of
        layers)

V
-

.. glossary::

   variance
      The amount of variation in the model itself. For example, in function regression, a
      model may have very low bias, i.e. approximate the given *true values* very well, but
      at the same time oscillate and fluctuate wildly in between those true values. Such a
      model will generalise poorly to new data, see :term:`overtraining`.

   VPC
      Amazon *Virtual Private Cloud*
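
As a closing illustration of how several of the rate definitions above (:term:`TPR`,
:term:`FPR`, precision, specificity) relate to the entries of a confusion matrix, here is a
minimal sketch; it assumes scikit-learn, and the toy labels are made up:

.. code-block:: python

   from sklearn.metrics import confusion_matrix

   y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # 0 = negative class, 1 = positive class
   y_pred = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1]   # predictions after thresholding the scores

   # for binary labels, ravel() returns the counts in the order TN, FP, FN, TP
   tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

   tpr = tp / (tp + fn)           # sensitivity / recall, related to the power
   fpr = fp / (tn + fp)           # 1 - specificity
   precision = tp / (tp + fp)
   specificity = tn / (tn + fp)
   print(tpr, fpr, precision, specificity)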