Data Preparation and Input Pipeline#
Now let’s move on to bringing the data into a form that can be processed by the neural network!
We’ll have a look at the three scripts
2-stroke-prediction-naive.py
3-stroke-prediction-oversampling.py
4-stroke-prediction-oversampling-norm.py
in examples/1-binary-stroke-prediction/.
Treatment of Categorical Variables#
After data loading and cleaning, we start off by processing the categorical variables and manipulating them so that they can be digested by the network.
# inplace convert string category labels to numerical indices
categories = sb.inputs.encode_categories(data)
Behind the scenes, the function spellbook.inputs.encode_categories()
loops over all categorical variables in the dataset and converts them to
pandas.Categoricals.
A dictionary containing the mapping of the category names to numerical indices
for each categorical variable is returned:
categories: {
'gender': {0: 'female', 1: 'male', 2: 'other'},
'hypertension': {0: 'no', 1: 'yes'},
'heart_disease': {0: 'no', 1: 'yes'},
'ever_married': {0: 'no', 1: 'yes'},
'work_type': {0: 'children', 1: 'govt', 2: 'never', 3: 'private', 4: 'self'},
'residence_type': {0: 'rural', 1: 'urban'},
'smoking_status': {0: 'formerly', 1: 'never', 2: 'smokes', 3: 'unknown'},
'stroke': {0: 'no', 1: 'yes'}
}
Finally, for each categorical variable var, an additional column var_codes
is added to the dataset, containing the numerical category codes for each
datapoint. These are the columns that we will later feed into the network.
The corresponding code in spellbook.inputs.encode_categories() looks like this:
categories = {}
for var in data:
    if data[var].dtype == 'object':
        print("variable '{}' is categorical".format(var))
        data[var] = pd.Categorical(data[var])
        # taken from https://stackoverflow.com/a/51102402
        categories[var] = dict(enumerate(data[var].cat.categories))
        # use numerical values instead of strings
        data[var+'_codes'] = data[var].cat.codes
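As a side note, the returned mapping comes in handy later for translating
numerical model outputs back into human-readable category names. A minimal
sketch (the code 1 is just an illustrative value):

# translate a category code back into its name using the mapping
# returned by spellbook.inputs.encode_categories()
predicted_code = 1                           # illustrative value
print(categories['stroke'][predicted_code])  # -> 'yes'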
Next, we are going to shuffle the dataset so that its rows are ordered
randomly. As we saw before, the original dataset is ordered such that the
datapoints for patients with strokes come first. This is not what we want
here: when splitting the dataset into a training and a validation set, the
stroke cases would end up only in the training set and not in the validation
set. We do the shuffling with pandas.DataFrame.sample() and afterwards adjust
our list of the feature variables and the name of the target variable to
point to the new columns holding the integer category indices:
# shuffle data either now in pandas or later in TensorFlow
data = data.sample(frac = 1.0)

# use new numerical columns for the features
for var in categories:
    if var == target:
        target = target + '_codes'
    else:
        features[features.index(var)] = var + '_codes'
Feeding the Dataset into TensorFlow#
Now it is time to split the data into a training set and a validation set.
The training set is used by tf.keras.Model.fit() to adjust the model
parameters so that the model describes the datapoints. The validation set is
used to evaluate the model performance and to benchmark different models
against each other when changing model hyperparameters such as the number of
layers, the number of nodes in a layer or the number of training epochs.
While doing the split, we are also going to convert the training and validation datasets into objects that can be fed into the network. There are at least three different ways of doing this in terms of the objects and datatypes involved:

1. from a pandas.DataFrame to a tf.data.Dataset in which features and labels are combined in one object
2. from a pandas.DataFrame to two tf.Tensors, one for the features and one for the labels
3. from a pandas.DataFrame to two numpy.ndarrays, one for the features and one for the labels
Since columns in a pandas.DataFrame can be accessed in much the same way as
entries in a Python dict, it is actually also possible to directly feed
features stored in a pandas.DataFrame and labels stored in a pandas.Series
into TensorFlow networks.
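For illustration, such a direct approach could look like the following
sketch, which assumes a compiled Keras model named model as well as the
features and target variables defined earlier:

# sketch: feed pandas objects directly into a compiled Keras model
history = model.fit(
    x = data[features],  # pandas.DataFrame with the feature columns
    y = data[target],    # pandas.Series with the label column
    epochs = 10,
    batch_size = 100)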
The separation of a single pandas.DataFrame into separate feature and label
sets for training, validation and testing is implemented in
spellbook.inputs.split_pddataframe_to_pddataframes().
Option 1: Using TensorFlow Datasets#
This approach is taken in 2-stroke-prediction-naive.py. It is implemented in
spellbook.inputs.split_pddataframe_to_tfdatasets() and can be used like this:
train, val = sb.inputs.split_pddataframe_to_tfdatasets(data, target, features, n_train=3500)
print(train.cardinality()) # print the size of the dataset
print(val.cardinality())
Using 3500 datapoints for the training set and the remaining 1409 for the validation set corresponds to reserving about 71.3% of all data for training. A typical recommendation is to use about 70% for training when dealing with datasets containing a few thousand datapoints. As the size of the total dataset increases, the fraction reserved for training can increase: when a million datapoints are available, it may be sufficient to use only about 1% of them for validation. Basically, the logic behind this is to use as many datapoints as possible for training while at the same time ensuring that the validation (and possibly the testing) set has sufficient size to provide numerically stable results for the metrics used to quantify the model performance. The smaller the validation set is, the larger the statistical uncertainty on the metrics will be.
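As a quick sanity check of the numbers quoted above:

# 3500 of the 4909 datapoints are reserved for training
print(3500 / (3500 + 1409))  # -> 0.7129... i.e. about 71.3%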
Under the hood, in spellbook.inputs.split_pddataframe_to_tfdatasets(),
the pandas.DataFrame is split into two separate frames, one for the features
and one for the target labels. These are converted to two numpy.ndarrays,
which are then used to initialise the tf.data.Dataset via
tf.data.Dataset.from_tensor_slices(). Finally, the split is applied:
n = len(data)
n_train, n_val, n_test = calculate_splits(n, frac_train, frac_val,
                                          n_train, n_val)

# separate features and labels
data_features = data[features]
data_labels = data[target]

# create a TensorFlow Dataset
dataset = tf.data.Dataset.from_tensor_slices(
    (data_features.values, data_labels.values))

# split it into train/val/test
train = dataset.take(n_train)
val = dataset.skip(n_train).take(n_val)
if n_test:
    test = dataset.skip(n_train).skip(n_val).take(n_test)
    return((train, val, test))
else:
    return((train, val))
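Each element of the resulting tf.data.Dataset is then a pair of one feature
tensor and one label tensor, which can be verified for example like this:

# peek at the first (features, label) pair in the training set
for feature_tensor, label_tensor in train.take(1):
    print(feature_tensor)  # the feature values of one datapoint
    print(label_tensor)    # the corresponding label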
Note that this does not shuffle or batch the resulting dataset. Shuffling may
be done beforehand in pandas:
data = data.sample(frac = 1.0)
or afterwards in TensorFlow:
train = train.shuffle(buffer_size = train.cardinality())
Since we shuffled the data before converting and splitting them, we can proceed to divide them into batches in TensorFlow with
train = train.batch(batch_size = 100)
val = val.batch(batch_size = 100)
The batch size of 100 is chosen so that each batch contains at least some
stroke cases.
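To get a feeling for whether this works out, we can for example count the
stroke cases in the first training batch - a small sketch, assuming the
labels are the 0/1 category codes shown above:

import tensorflow as tf

# count the stroke cases (label code 1) in the first batch
for feature_batch, label_batch in train.take(1):
    n_strokes = int(tf.reduce_sum(tf.cast(label_batch, tf.int32)))
    print('stroke cases in this batch:', n_strokes)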
Option 2: Using TensorFlow Tensors#
This approach is used in 3-stroke-prediction-oversampling.py and implemented
in spellbook.inputs.separate_tfdataset_to_tftensors().
Just like before, TensorFlow tf.data.Datasets for training and validation are
created. This time, they are each split into two separate tf.Tensors - one
for the features and one for the labels:
train, val = sb.inputs.split_pddataframe_to_tfdatasets(data, target, features, n_train=7000)
print(train.cardinality()) # print the size of the dataset
print(val.cardinality())
# separate features and labels
train_features, train_labels = sb.inputs.separate_tfdataset_to_tftensors(train)
val_features, val_labels = sb.inputs.separate_tfdataset_to_tftensors(val)
Please don’t mind the increased size of the training set for now - this
script uses oversampling to deal with the imbalance in the stroke categories.
We will look into this in more detail later.
The advantage of this tf.Tensor-based approach is that it yields a separate
object containing just the target labels, which can then be used when
evaluating the model performance, e.g. when calculating a confusion matrix or
a ROC curve from comparisons of the predicted labels against the actual true
target labels.
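As an illustration - not taken from the scripts - a confusion matrix could be
computed from these objects roughly like this, assuming a trained binary
classifier model that outputs one probability per datapoint:

import tensorflow as tf

# compare predicted labels against the separate tensor of true labels
probabilities = model.predict(val_features)
predicted_labels = tf.squeeze(tf.cast(probabilities > 0.5, tf.int64))
print(tf.math.confusion_matrix(val_labels, predicted_labels))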
Internally, spellbook.inputs.separate_tfdataset_to_tftensors() proceeds as
follows:
# unpack the feature and label tensors from the dataset
# and reshape them into two separate tuples of tensors
features, labels = zip(*dataset)
# [...]
# which are then stacked to form the features and labels tensors
features = tf.stack(features)
labels = tf.stack(labels)
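To see what the zip/stack pattern does, here is a minimal toy example (not
from the scripts):

import tensorflow as tf

# three datapoints with two features each, plus three labels
toy = tf.data.Dataset.from_tensor_slices(
    ([[1, 2], [3, 4], [5, 6]], [0, 1, 0]))

# transpose the (features, label) pairs into two tuples of tensors
features, labels = zip(*toy)
print(tf.stack(features))  # tensor of shape (3, 2)
print(tf.stack(labels))    # tensor of shape (3,)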
Finally, since tf.Tensors cannot be batched the way tf.data.Datasets can, the
batching in this approach is left to the call to tf.keras.Model.fit() later,
after model setup.
Option 3: Using NumPy Arrays#
This approach is used in 4-stroke-prediction-oversampling-norm.py and
implemented in spellbook.inputs.split_pddataframe_to_nparrays().
The third option is to split the dataset into numpy.ndarrays:
train_features, train_labels, val_features, val_labels \
    = sb.inputs.split_pddataframe_to_nparrays(
        data, target, features, n_train=7000)
print(train_features.shape) # print the size of the dataset
print(train_labels.shape)
print(val_features.shape)
print(val_labels.shape)
The function spellbook.inputs.split_pddataframe_to_nparrays() works like this
in principle:
train = data.iloc[:n_train]
val = data.iloc[n_train:].iloc[:n_val]

result = [train[features].values, train[target].values,
          val[features].values, val[target].values]
# [...]
return tuple(result)
Like with the tf.Tensors in the previous approach, batching is left to the
call to tf.keras.Model.fit() later, after model setup.
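For both the tf.Tensor and the numpy.ndarray approach, the batch size can
then simply be passed to the training call, for example (a sketch assuming a
compiled model):

# batching is handled by Keras itself via the batch_size argument
model.fit(
    train_features, train_labels,
    validation_data = (val_features, val_labels),
    epochs = 10,
    batch_size = 100)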