Splitting
Functions to split the dataset.
- ecgan.utils.splitting.create_splits(data, label, folds, seed, method=SplitMethods.MIXED, split=(0.8, 0.2))[source]
Take the given data and label tensors and split them into n folds with train, test and validation data.
Currently we support the creation of two kinds of splits:
Mixed train split: A standard split of all data into train, test and validation sets using the sklearn StratifiedShuffleSplit, frequently used in classification tasks. The split tuple has to sum to 1 and contains the respective fractions for the (training+validation, test) sets. Using at least 60% of the data for training is common.
Healthy only train split: Training is only performed on healthy data. To create the training set, all normal data (label==0) is shuffled and the training set contains \(train\_x = len(normal\_data) * split[0]\) samples. The test and validation sets contain the remaining normal samples and all anomalous samples. In this implementation we follow the cross-validation setup and use the same test set for all folds. This means that the abnormal validation data also remains the same. The split tuple determines how much of the normal data is used in train/vali and in test. Make sure to keep possible data imbalances in mind.
- Parameters
data (Union[Tensor, ndarray]) -- Input data of shape (num_samples, seq_len, channels).
label (Union[Tensor, ndarray]) -- Input labels as tensor.
folds (int) -- Amount of folds.
seed (int) -- PyTorch/Numpy RNG seed to deterministically create splits.
method (SplitMethods) -- Indicator of how to split dataset. Based on random split ('mixed') or splitting the data such that only the normal class is used during training ('normal_only') and normal and abnormal classes are used during validation/testing. 0 if no instance of the normal data are used, 1 if all instances are used in the test set.
split (Tuple[float, float]) -- Fraction of data in the (train+vali, test) sets.
- Return type
Dict
- Returns
Split indices dictionary containing n folds with indices to construct n datasets consisting of (train_x, test_x, vali_x, train_y, test_y, vali_y).
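The mixed split described above can be illustrated with a minimal sketch. It is not the ecgan implementation: the helper name `sketch_mixed_split`, the use of `StratifiedKFold` to form the folds, and the dictionary keys `"train"`, `"vali"` and `"test"` are assumptions made for illustration only. Only `StratifiedShuffleSplit` is named in the documentation.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit


def sketch_mixed_split(labels, folds, seed, split=(0.8, 0.2)):
    """Hypothetical helper: one stratified test hold-out plus stratified folds."""
    labels = np.asarray(labels)
    # sklearn only needs the number of samples here, not the actual series.
    placeholder = np.zeros((len(labels), 1))

    # Hold out a single stratified test set; split[1] is the test fraction.
    holdout = StratifiedShuffleSplit(n_splits=1, test_size=split[1], random_state=seed)
    train_vali_idx, test_idx = next(holdout.split(placeholder, labels))

    # Assumption: the remaining samples are divided into `folds` stratified
    # train/validation pairs, while the test set stays the same for every fold.
    kfold = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    sub_labels = labels[train_vali_idx]
    index_dict = {}
    for fold, (train_pos, vali_pos) in enumerate(
        kfold.split(placeholder[train_vali_idx], sub_labels)
    ):
        index_dict[fold] = {
            "train": train_vali_idx[train_pos],
            "vali": train_vali_idx[vali_pos],
            "test": test_idx,
        }
    return index_dict


# Toy labels: 80 normal (0) and 20 abnormal (1) samples.
toy_labels = np.array([0] * 80 + [1] * 20)
mixed = sketch_mixed_split(toy_labels, folds=4, seed=42)
```

Because both steps are stratified, each fold in this sketch keeps roughly the class ratio of the full dataset in its train, validation and test indices.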
- ecgan.utils.splitting.train_only_normal_split(label, folds, seed, split=(0.85, 0.15))[source]
Take the given label tensor and create a train, test and validation split.
Training is only performed on healthy data. To create the training set, all normal data (label==0) is shuffled and the training set contains \(train\_x = len(normal\_data) * split[0]\) samples. The test and validation sets contain the remaining normal samples and all anomalous samples. In this implementation we follow the cross-validation setup and use the same test set for all folds. This means that the abnormal validation data also remains the same. The split tuple determines how much of the normal data is used in train/vali and in test. Make sure to keep possible data imbalances in mind.
- Parameters
label (Tensor) -- Input labels as tensor.
folds (int) -- Amount of splits performed.
seed (int) -- Random seed.
split (Tuple[float, float]) -- Fraction of data in the (train, test) set.
- Return type
Dict
- Returns
Index dictionary with n folds containing indices used in train, test and validation set.
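A rough sketch of a healthy-only split is shown below. It is not the library's code. The helper name `sketch_normal_only_split`, the dictionary keys, the choice to treat `split[0]` as the share of normal samples available for train+vali, and the even division of abnormal samples between validation and test are all assumptions made for illustration.

```python
import numpy as np


def sketch_normal_only_split(labels, folds, seed, split=(0.85, 0.15)):
    """Hypothetical helper: train on normal data only (requires folds >= 2)."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)

    # Shuffle the indices of normal (label == 0) and abnormal samples.
    normal_idx = rng.permutation(np.flatnonzero(labels == 0))
    abnormal_idx = rng.permutation(np.flatnonzero(labels != 0))

    # Assumption: split[0] of the normal samples is available for train+vali,
    # the remaining normal samples go to the test set.
    n_train_vali = int(len(normal_idx) * split[0])
    normal_train_vali = normal_idx[:n_train_vali]
    normal_test = normal_idx[n_train_vali:]

    # Assumption: abnormal samples are halved between validation and test,
    # so both abnormal subsets are identical in every fold.
    half = len(abnormal_idx) // 2
    abnormal_vali, abnormal_test = abnormal_idx[:half], abnormal_idx[half:]
    test_idx = np.concatenate([normal_test, abnormal_test])

    # Rotate one chunk of normal data into the validation set per fold.
    chunks = np.array_split(normal_train_vali, folds)
    index_dict = {}
    for fold in range(folds):
        train_idx = np.concatenate([c for i, c in enumerate(chunks) if i != fold])
        vali_idx = np.concatenate([chunks[fold], abnormal_vali])
        index_dict[fold] = {"train": train_idx, "vali": vali_idx, "test": test_idx}
    return index_dict


# Toy labels: 100 normal (0) and 20 abnormal (1) samples.
toy_labels_normal = np.array([0] * 100 + [1] * 20)
normal_only = sketch_normal_only_split(toy_labels_normal, folds=5, seed=0)
```

In this sketch no abnormal sample ever appears in a training set, and the test indices are shared by all folds, which mirrors the behaviour described above.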
- ecgan.utils.splitting.mixed_split(data, label, folds, seed, split=(0.85, 0.15))[source]
Take given data and label tensors and split them into a training and test set.
- Parameters
data (Tensor) -- Input dataset as tensor.
label (Tensor) -- Input labels as tensor.
folds (int) -- Amount of folds.
seed (int) -- PyTorch/Numpy RNG seed to deterministically create splits.
split (Tuple[float, float]) -- Fraction of data in the test set.
- Return type
Dict
- Returns
Index dictionary with n folds containing indices used in train, test and validation set.
- ecgan.utils.splitting.load_split(data, label, index_dict, fold)[source]
Load split from a given fold of a previous run.
- Return type
Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]
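Loading a fold from an index dictionary could look roughly like the sketch below. The layout of `index_dict` is not documented here, so the keys `"train"`, `"test"` and `"vali"` are hypothetical, as is the helper name `sketch_load_split`. The return order is assumed to follow the (train_x, test_x, vali_x, train_y, test_y, vali_y) order given for create_splits.

```python
import torch


def sketch_load_split(data, label, index_dict, fold):
    """Hypothetical helper: index data and labels with the stored fold indices."""
    indices = index_dict[fold]
    # Assumed key names; the real dictionary layout may differ.
    train = torch.as_tensor(indices["train"], dtype=torch.long)
    test = torch.as_tensor(indices["test"], dtype=torch.long)
    vali = torch.as_tensor(indices["vali"], dtype=torch.long)
    return data[train], data[test], data[vali], label[train], label[test], label[vali]


# Toy data: 6 series with 4 time steps and 1 channel each.
toy_data = torch.arange(6 * 4, dtype=torch.float32).reshape(6, 4, 1)
toy_label = torch.tensor([0, 0, 0, 1, 0, 1])
toy_index_dict = {0: {"train": [0, 1], "test": [2, 3], "vali": [4, 5]}}

train_x, test_x, vali_x, train_y, test_y, vali_y = sketch_load_split(
    toy_data, toy_label, toy_index_dict, fold=0
)
```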
- ecgan.utils.splitting.select_channels(data, channels)[source]
Select channels based on their indices (given as a list or an int).
If an int n is selected, the first n columns are used. However, it is usually preferred to pass a list of ints to be sure to select the correct columns. Passed channels are zero-indexed.
- Parameters
data (Tensor) -- Tensor containing series of shape (num_samples, seq_len, num_channels).
channels (Union[int, List[int]]) -- Selected channels.
- Return type
Tensor
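The difference between passing an int and a list can be sketched as follows. This is an illustration of the described behaviour, not the library's source, and the helper name `sketch_select_channels` is hypothetical.

```python
from typing import List, Union

import torch


def sketch_select_channels(data: torch.Tensor, channels: Union[int, List[int]]) -> torch.Tensor:
    """Hypothetical helper: pick channels from (num_samples, seq_len, num_channels)."""
    if isinstance(channels, int):
        # An int n keeps the first n channels.
        return data[:, :, :channels]
    # A list keeps exactly the listed zero-indexed channels, in the given order.
    return data[:, :, channels]


# Toy tensor: 2 samples, 3 time steps, 4 channels.
series = torch.arange(2 * 3 * 4).reshape(2, 3, 4)
first_two = sketch_select_channels(series, 2)
picked = sketch_select_channels(series, [0, 3])
```

Here `first_two` holds channels 0 and 1, while `picked` holds channels 0 and 3, which is why an explicit list is the safer choice when specific columns are needed.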
- ecgan.utils.splitting.verbose_channel_selection(data, channels)[source]
Verbose output corresponding to the select_channels function.
- Parameters
data (Tensor) -- Tensor containing series of shape (num_samples, seq_len, num_channels).
channels (Union[int, List[int]]) -- Selected channels.
- Return type
None