.. _BinaryStrokePrediction-InputPipeline:

***********************************
Data Preparation and Input Pipeline
***********************************

Now let's move on to bringing the data into a form that can be processed by the
neural network! We'll have a look at the three scripts

- ``2-stroke-prediction-naive.py``
- ``3-stroke-prediction-oversampling.py``
- ``4-stroke-prediction-oversampling-norm.py``

in ``examples/1-binary-stroke-prediction/``.


Treatment of Categorical Variables
==================================

After data loading and cleaning, we start off by processing the categorical
variables and manipulating them so that they can be digested by the network.

.. margin:: from **2-stroke-prediction-naive.py**
   in ``examples/1-binary-stroke-prediction/``

.. code:: python

   # inplace convert string category labels to numerical indices
   categories = sb.inputs.encode_categories(data)

Behind the scenes, the function :func:`spellbook.inputs.encode_categories` loops
over all categorical variables in the dataset and converts them to
:class:`pandas.Categorical`\s. A dictionary containing the mapping of the
category names to numerical indices for each categorical variable is returned::

   categories: {
       'gender': {0: 'female', 1: 'male', 2: 'other'},
       'hypertension': {0: 'no', 1: 'yes'},
       'heart_disease': {0: 'no', 1: 'yes'},
       'ever_married': {0: 'no', 1: 'yes'},
       'work_type': {0: 'children', 1: 'govt', 2: 'never', 3: 'private', 4: 'self'},
       'residence_type': {0: 'rural', 1: 'urban'},
       'smoking_status': {0: 'formerly', 1: 'never', 2: 'smokes', 3: 'unknown'},
       'stroke': {0: 'no', 1: 'yes'}
   }

Finally, for each categorical variable ``var``, an additional column
``var_codes`` is added to the dataset, containing the numerical category code of
each datapoint. These are the columns that we will later feed into the network.
The corresponding code in :func:`spellbook.inputs.encode_categories` looks like
this:

.. margin:: from :func:`spellbook.inputs.encode_categories`
   in :mod:`spellbook.inputs`

.. code:: python

   categories = {}
   for var in data:
       if data[var].dtype == 'object':
           print("variable '{}' is categorical".format(var))
           data[var] = pd.Categorical(data[var])
           # taken from https://stackoverflow.com/a/51102402
           categories[var] = dict(enumerate(data[var].cat.categories))
           # use numerical values instead of strings
           data[var+'_codes'] = data[var].cat.codes
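To see the encoding mechanism in isolation, here is a minimal, self-contained
sketch of the same idea, applied to a toy single-column dataset instead of the
stroke data (illustrative code, not part of *spellbook*):

.. code:: python

   import pandas as pd

   # toy dataset with a single categorical (string-valued) column
   data = pd.DataFrame(
       {'smoking_status': ['never', 'smokes', 'formerly', 'never']})

   categories = {}
   for var in data:
       if data[var].dtype == 'object':
           # convert the strings to a pandas Categorical
           data[var] = pd.Categorical(data[var])
           # remember the mapping of category codes to category names
           categories[var] = dict(enumerate(data[var].cat.categories))
           # add a column holding the integer category codes
           data[var + '_codes'] = data[var].cat.codes

   print(categories)
   # {'smoking_status': {0: 'formerly', 1: 'never', 2: 'smokes'}}
   print(data['smoking_status_codes'].tolist())
   # [1, 2, 0, 1]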
Next, we are going to shuffle the dataset and order the rows randomly. As we saw
before, the original dataset is ordered such that the datapoints for patients
with strokes come first. This is not what we want here: when splitting the
dataset into a training and a validation set, all stroke cases would end up in
the training set and none in the validation set. We do the shuffling with
:meth:`pandas.DataFrame.sample` and afterwards adjust our list of the feature
variables and the name of the target variable to point to the new columns
holding the integer category indices:

.. margin:: from **2-stroke-prediction-naive.py**
   in ``examples/1-binary-stroke-prediction/``

.. code:: python

   # shuffle data either now in pandas or later in TensorFlow
   data = data.sample(frac = 1.0)

   # use new numerical columns for the features
   for var in categories:
       if var == target:
           target = target + '_codes'
       else:
           features[features.index(var)] = var + '_codes'


Feeding the Dataset into *TensorFlow*
=====================================

Now it is time to split the data into a *training set*, which
:meth:`tf.keras.Model.fit` uses to adjust the model parameters so that they
describe the datapoints, and a *validation set*. The latter is used to evaluate
the model performance and to benchmark different models against each other when
changing model :term:`hyperparameter`\s such as the number of layers, the number
of nodes in a layer or the number of training epochs.

While doing the split, we are also going to convert the training and validation
datasets into objects that can be fed into the network. There are at least three
different ways of doing this in terms of the objects and datatypes involved:

#. from a :class:`pandas.DataFrame` to a :class:`tensorflow.data.Dataset`
   where features and labels are combined in one object
#. from a :class:`pandas.DataFrame` to two :class:`tensorflow.Tensor`\s,
   one for the features and one for the labels
#. from a :class:`pandas.DataFrame` to two :class:`numpy.ndarray`\s,
   one for the features and one for the labels

Since columns in a :class:`pandas.DataFrame` can be accessed in much the same
way as entries in a *Python* :class:`dict`, it is also possible to feed features
stored in a :class:`pandas.DataFrame` and labels stored in a
:class:`pandas.Series` directly into *TensorFlow* networks. The separation of a
single :class:`pandas.DataFrame` into separate feature and label sets for
training, validation and testing is implemented in
:func:`spellbook.inputs.split_pddataframe_to_pddataframes`.


Option 1: Using *TensorFlow* Datasets
-------------------------------------

This approach is taken in ``2-stroke-prediction-naive.py``. It is implemented in
:func:`spellbook.inputs.split_pddataframe_to_tfdatasets` and can be used like
this:

.. margin:: from **2-stroke-prediction-naive.py**
   in ``examples/1-binary-stroke-prediction/``

.. code:: python

   train, val = sb.inputs.split_pddataframe_to_tfdatasets(
       data, target, features, n_train=3500)
   print(train.cardinality())   # print the size of the dataset
   print(val.cardinality())

Using 3500 datapoints for the training set and the remaining 1409 for the
validation set corresponds to reserving a fraction of 71.3% of all data for
training. A typical recommendation is to use about 70% for training when dealing
with datasets containing a few thousand datapoints. As the size of the total
dataset increases, the fraction reserved for training can increase as well, and
when a million datapoints are available, it may be sufficient to use only about
1% of them for validation. The logic behind this is to use as many datapoints as
possible for training while at the same time ensuring that the validation (and
possibly the :term:`testing`) set is large enough to provide numerically stable
results for the metrics used to quantify the model performance. The smaller the
validation set, the larger the statistical uncertainty on the metrics will be.
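To make these numbers concrete, here is a small back-of-the-envelope sketch for
turning a training fraction into absolute split sizes. The helper is purely
illustrative and is not the ``calculate_splits`` function that *spellbook* uses
internally:

.. code:: python

   def split_counts(n_total, frac_train):
       """illustrative only: absolute train/validation sizes from a fraction"""
       n_train = int(n_total * frac_train)
       n_val = n_total - n_train
       return n_train, n_val

   # the cleaned stroke dataset contains 3500 + 1409 = 4909 datapoints
   print(split_counts(4909, 0.713))   # (3500, 1409)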
Under the hood, in :func:`spellbook.inputs.split_pddataframe_to_tfdatasets`, the
:class:`pandas.DataFrame` is split into two separate frames, one for the
features and one for the target labels. These are converted to two
:class:`numpy.ndarray`\s, which are then used to initialise the
:class:`tensorflow.data.Dataset` via the
:func:`tensorflow.data.Dataset.from_tensor_slices` function. Finally, the split
is applied:

.. margin:: from :func:`spellbook.inputs.split_pddataframe_to_tfdatasets`
   in :mod:`spellbook.inputs`

.. code:: python

   n = len(data)
   n_train, n_val, n_test = calculate_splits(
       n, frac_train, frac_val, n_train, n_val)

   # separate features and labels
   data_features = data[features]
   data_labels = data[target]

   # create a TensorFlow Dataset
   dataset = tf.data.Dataset.from_tensor_slices(
       (data_features.values, data_labels.values))

   # split it into train/val/test
   train = dataset.take(n_train)
   val = dataset.skip(n_train).take(n_val)
   if n_test:
       test = dataset.skip(n_train).skip(n_val).take(n_test)
       return((train, val, test))
   else:
       return((train, val))

Note that this does not shuffle or batch the resulting datasets. Shuffling may
be done

- before, in *pandas*: ``data = data.sample(frac = 1.0)``
- or afterwards, in *TensorFlow*:
  ``train = train.shuffle(buffer_size = train.cardinality())``

Since we shuffled the data before converting and splitting them, we can proceed
to divide them into batches in *TensorFlow* with

.. margin:: from **2-stroke-prediction-naive.py**
   in ``examples/1-binary-stroke-prediction/``

.. code:: python

   train = train.batch(batch_size = 100)
   val = val.batch(batch_size = 100)

The :term:`batch` size of ``100`` is chosen so that each batch contains at least
some stroke cases.


Option 2: Using *TensorFlow* Tensors
------------------------------------

Used in ``3-stroke-prediction-oversampling.py`` and implemented in
:func:`spellbook.inputs.separate_tfdataset_to_tftensors`. Just like before,
*TensorFlow* :class:`tf.data.Dataset`\s for training and validation are created.
This time, they are each split into two separate :class:`tf.Tensor`\s - one for
the features and one for the labels:

.. margin:: from **3-stroke-prediction-oversampling.py**
   in ``examples/1-binary-stroke-prediction/``

.. code:: python

   train, val = sb.inputs.split_pddataframe_to_tfdatasets(
       data, target, features, n_train=7000)
   print(train.cardinality())   # print the size of the dataset
   print(val.cardinality())

   # separate features and labels
   train_features, train_labels = sb.inputs.separate_tfdataset_to_tftensors(train)
   val_features, val_labels = sb.inputs.separate_tfdataset_to_tftensors(val)

Please don't mind the increased size of the training set for now - this script
uses *oversampling* to deal with the imbalance in the ``stroke`` categories. We
will look into this in more detail later.

The advantage of this :class:`tf.Tensor`-based approach is that it yields a
separate object containing just the target labels, which can then be used when
evaluating the model performance, e.g. when calculating a confusion matrix or a
:term:`ROC` curve from comparisons of the predicted labels against the actual
true target labels.

Internally, :func:`spellbook.inputs.separate_tfdataset_to_tftensors` proceeds as
follows:

.. margin:: from :func:`spellbook.inputs.separate_tfdataset_to_tftensors`
   in :mod:`spellbook.inputs`

.. code:: python

   # unpack the feature and label tensors from the dataset
   # and reshape them into two separate tuples of tensors
   features, labels = zip(*dataset)

   # [...]

   # which are then stacked to form the features and labels tensors
   features = tf.stack(features)
   labels = tf.stack(labels)
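The unpack-and-stack step can be reproduced on a small in-memory dataset. The
following self-contained sketch (illustrative code, not taken from *spellbook*)
shows the shapes of the two resulting tensors:

.. code:: python

   import tensorflow as tf

   # toy dataset of (features, label) pairs, mimicking the structure
   # returned by split_pddataframe_to_tfdatasets
   dataset = tf.data.Dataset.from_tensor_slices(
       ([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0, 1, 0]))

   # unpack into two tuples of per-datapoint tensors ...
   features, labels = zip(*dataset)

   # ... and stack them into one features and one labels tensor
   features = tf.stack(features)
   labels = tf.stack(labels)

   print(features.shape)   # (3, 2)
   print(labels.shape)     # (3,)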
Finally, since :class:`tf.Tensor`\s cannot be batched, in this approach the
batching is left to the call to :meth:`tf.keras.Model.fit` later, after model
setup.


.. _BinaryStrokePrediction-NumpyArrays:

Option 3: Using *NumPy* Arrays
------------------------------

Used in ``4-stroke-prediction-oversampling-norm.py`` and implemented in
:func:`spellbook.inputs.split_pddataframe_to_nparrays`. The third option is to
split the dataset into :class:`numpy.ndarray`\s:

.. margin:: from **4-stroke-prediction-oversampling-norm.py**
   in ``examples/1-binary-stroke-prediction/``

.. code:: python

   train_features, train_labels, val_features, val_labels \
       = sb.inputs.split_pddataframe_to_nparrays(
           data, target, features, n_train=7000)
   print(train_features.shape)   # print the size of the dataset
   print(train_labels.shape)
   print(val_features.shape)
   print(val_labels.shape)

In principle, the function :func:`spellbook.inputs.split_pddataframe_to_nparrays`
works like this:

.. margin:: from :func:`spellbook.inputs.split_pddataframe_to_nparrays`
   in :mod:`spellbook.inputs`

.. code:: python

   train = data.iloc[:n_train]
   val = data.iloc[n_train:].iloc[:n_val]
   result = [train[features].values, train[target].values,
             val[features].values, val[target].values]

   # [...]

   return tuple(result)

As with the :class:`tf.Tensor`\s in the previous approach, the batching is left
to the call to :meth:`tf.keras.Model.fit` later, after model setup.
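To illustrate where the batch size then enters, the following sketch feeds the
arrays from the code block above into :meth:`tf.keras.Model.fit`. The model is
only a placeholder with an arbitrary layer structure, and the batch size and
number of epochs are illustrative values rather than the settings used in the
example scripts:

.. code:: python

   import tensorflow as tf

   # placeholder model, only to illustrate where the batch size goes;
   # the actual model setup is discussed later
   model = tf.keras.Sequential([
       tf.keras.layers.Dense(10, activation='relu'),
       tf.keras.layers.Dense(1, activation='sigmoid')
   ])
   model.compile(optimizer='adam', loss='binary_crossentropy')

   # batching happens here, via the batch_size argument,
   # rather than on the arrays themselves
   model.fit(train_features, train_labels,
             validation_data=(val_features, val_labels),
             batch_size=100, epochs=5)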