Splitting

Functions to split the dataset.

ecgan.utils.splitting.create_splits(data, label, folds, seed, method=SplitMethods.MIXED, split=(0.8, 0.2))

Take given data and label tensors and split them into n folds with train, test and validation data.

Currently we support the creation of two kinds of splits:

Mixed train split: A standard split of all data into train, test and validation sets using the sklearn StratifiedShuffleSplit, frequently used in classification tasks. The split tuple must sum to 1 and contains the respective fractions for the (training+validation, test) sets. Using at least 60% of the data for training is common.
Healthy only train split: Training is performed only on healthy data. To create the training set, all normal data (label==0) is shuffled and the training set contains \(train\_x = len(normal\_data) * split[0]\) samples. The test and validation sets contain the remaining normal samples and all anomalous samples. This implementation follows the cross-validation setup and uses the same test set for all folds, so the abnormal validation data also remains the same. The split tuple determines how much of the normal data is used in train/vali and test. Keep possible data imbalances in mind.

Parameters

data (Union[Tensor, ndarray]) -- Input data of shape (num_samples, seq_len, channels).
label (Union[Tensor, ndarray]) -- Input labels as tensor.
folds (int) -- Number of folds.
seed (int) -- PyTorch/Numpy RNG seed to deterministically create splits.
method (SplitMethods) -- How to split the dataset: either a random split ('mixed'), or a split in which only the normal class is used during training ('normal_only') while both normal and abnormal classes are used during validation/testing. 0 if no instances of the normal data are used, 1 if all instances are used in the test set.
split (Tuple[float, float]) -- Fraction of data in the (train+vali, test) sets.

Return type

Dict

Returns

Split indices dictionary containing n folds with indices to construct n datasets consisting of (train_x, test_x, vali_x, train_y, test_y, vali_y).
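
The mixed split described above relies on sklearn's StratifiedShuffleSplit. The following is a minimal sketch of only the stratified hold-out step on toy data, not the ecgan implementation: the toy labels, the variable names and the use of a single split are assumptions, and the way create_splits builds its folds and validation sets is not shown.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels (assumed for illustration): 80 normal (0) and 20 abnormal (1) samples.
label = np.array([0] * 80 + [1] * 20)
# StratifiedShuffleSplit only uses X for its length, so a placeholder suffices here.
placeholder = np.zeros((len(label), 1))

split = (0.8, 0.2)  # (train+vali, test) fractions, must sum to 1
seed = 42

# Hold out a test set of size split[1] while preserving the class ratio.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=split[1], random_state=seed)
train_vali_idx, test_idx = next(splitter.split(placeholder, label))

# Fraction of abnormal samples in each part; both match the overall 20%.
train_vali_ratio = label[train_vali_idx].mean()
test_ratio = label[test_idx].mean()
```

Because the split is stratified, the 80/20 class ratio of the toy labels carries over to both index sets, which is the main reason to prefer it over a plain shuffle for imbalanced classification data.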

ecgan.utils.splitting.train_only_normal_split(label, folds, seed, split=(0.85, 0.15))

Take the given label tensor and split its indices into a train, test and validation set.

Training is performed only on healthy data. To create the training set, all normal data (label==0) is shuffled and the training set contains \(train\_x = len(normal\_data) * split[0]\) samples. The test and validation sets contain the remaining normal samples and all anomalous samples. This implementation follows the cross-validation setup and uses the same test set for all folds, so the abnormal validation data also remains the same. The split tuple determines how much of the normal data is used in train/vali and test. Keep possible data imbalances in mind.

Parameters

label (Tensor) -- Input labels as tensor.
folds (int) -- Number of splits performed.
seed (int) -- Random seed.
split (Tuple[float, float]) -- Fraction of data in the (train, test) set.

Return type

Dict

Returns

Index dictionary with n folds containing indices used in train, test and validation set.
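
The size formula for the healthy-only training set can be made concrete with a small NumPy sketch. This is an illustration under stated assumptions, not the ecgan code: the toy labels, the variable names, the truncation with int() and the single pooled evaluation set are all hypothetical, and the way the library divides the evaluation pool into test and validation sets per fold is not shown.

```python
import numpy as np

# Toy labels (assumed for illustration): 80 normal (0) and 20 abnormal (1) samples.
label = np.array([0] * 80 + [1] * 20)
split = (0.85, 0.15)
seed = 42

rng = np.random.default_rng(seed)

# Shuffle the indices of all normal samples; abnormal samples are never trained on.
normal_idx = rng.permutation(np.flatnonzero(label == 0))
abnormal_idx = np.flatnonzero(label != 0)

# train_x = len(normal_data) * split[0]; truncating to an int is an assumption.
n_train = int(len(normal_idx) * split[0])
train_idx = normal_idx[:n_train]

# Remaining normal samples plus all abnormal samples form the evaluation pool.
eval_idx = np.concatenate([normal_idx[n_train:], abnormal_idx])
```

With these toy numbers, 68 normal samples go to training, while the evaluation pool holds 12 normal and 20 abnormal samples. The pool is therefore dominated by the abnormal class even though that class is the minority overall, which is the imbalance the description warns about.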

ecgan.utils.splitting.mixed_split(data, label, folds, seed, split=(0.85, 0.15))

Take given data and label tensors and split them into a training and test set.

Parameters

data (Tensor) -- Input dataset as tensor.
label (Tensor) -- Input labels as tensor.
folds (int) -- Number of folds.
seed (int) -- PyTorch/Numpy RNG seed to deterministically create splits.
split (Tuple[float, float]) -- Fractions of data in the (train+vali, test) sets.

Return type

Dict

Returns

Index dictionary with n folds containing indices used in train, test and validation set.

ecgan.utils.splitting.load_split(data, label, index_dict, fold)

Load split from a given fold of a previous run.

Return type: Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]

ecgan.utils.splitting.select_channels(data, channels)

Select channels based on their indices (given as a list or an int).

If an int n is selected, the first n columns are used. However, it is usually preferred to pass a list of ints to be sure to select the correct columns. Passed channels are zero-indexed.

Parameters

data (Tensor) -- Tensor containing series of shape (num_samples, seq_len, num_channels).
channels (Union[int, List[int]]) -- Selected channels.

Return type

Tensor
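
The difference between passing an int and passing a list can be seen in a short sketch. It uses NumPy arrays in place of tensors, and select_channels_sketch is a hypothetical stand-in written from the description above, not the library function itself.

```python
from typing import List, Union

import numpy as np


def select_channels_sketch(data: np.ndarray, channels: Union[int, List[int]]) -> np.ndarray:
    """Hypothetical re-implementation of the documented selection rule."""
    if isinstance(channels, int):
        # An int n keeps the first n channels.
        return data[:, :, :channels]
    # A list keeps exactly the given zero-indexed channels, in the given order.
    return data[:, :, channels]


# Toy data of shape (num_samples, seq_len, num_channels) = (2, 3, 4).
data = np.arange(2 * 3 * 4).reshape(2, 3, 4)

first_two = select_channels_sketch(data, 2)    # channels 0 and 1
picked = select_channels_sketch(data, [0, 3])  # channels 0 and 3
```

Both results have shape (2, 3, 2), but they hold different channels. Passing an explicit list avoids silently picking the wrong columns when the channels of interest are not the leading ones.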

ecgan.utils.splitting.verbose_channel_selection(data, channels)

Verbose output corresponding to the select_channels function.

Parameters

data (Tensor) -- Tensor containing series of shape (num_samples, seq_len, num_channels).
channels (Union[int, List[int]]) -- Selected channels.

Return type

None