Cleansing

Implementation of the data cleansing process.

class ecgan.preprocessing.cleansing.DataCleanser(lower_fault_threshold=None, upper_fault_threshold=None, nan_threshold=None, target_shape=None)

Check for dead or faulty sensors, NaNs and correct shape.

A DataCleanser object can be used for some or all of the above tasks. Most often, the ecgan.preprocessing.cleansing.DataCleanser.should_cleanse() method is called, which runs all checks on the series. Each check can also be called individually. The input is generally expected to be a single 2D series of shape (seq_len, features), where features are the different data channels. If no threshold or condition is set, all values are accepted.

Parameters
  • lower_fault_threshold (Optional[int]) -- Lowest value accepted without removing the series from the dataset.

  • upper_fault_threshold (Optional[int]) -- Highest value accepted without removing the series from the dataset.

  • nan_threshold (Optional[float]) -- Upper limit on the allowed fraction of NaNs. The series is removed if more than \((self.nan\_threshold \cdot 100)\%\) of all values are NaN.

  • target_shape (Optional[Tuple[int, int]]) -- Accepted shape of series.

should_cleanse(series)

Conduct checks for a given 2D time series to determine if it should be cleansed.

The sample is to be removed from the dataset if any check fails.

Performed checks: shape, NaNs, dead sensors and faulty sensors (see the individual check methods).

Parameters

series (ndarray) -- 2D series of shape (seq_len, features).

Return type

bool

Returns

Flag indicating whether the sample should be removed from the final dataset.
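
The "remove if any check fails" rule can be illustrated with a minimal standalone sketch. This is not the ecgan implementation: the function name should_cleanse_sketch and the two example checks are hypothetical, and each check is assumed to return True when the sample should be removed.

```python
import numpy as np


def should_cleanse_sketch(series, checks):
    """Return True if at least one check flags the series for removal.

    `checks` is an iterable of callables that each take the series and
    return True when the series should be removed (an assumption of
    this sketch, mirroring the return value described above).
    """
    return any(check(series) for check in checks)


# Two hypothetical checks, used for illustration only.
example_checks = [
    lambda s: s.ndim != 2,               # not a 2D (seq_len, features) array
    lambda s: bool(np.isnan(s).any()),   # contains at least one NaN
]

clean = np.zeros((10, 2))
broken = clean.copy()
broken[3, 1] = np.nan

print(should_cleanse_sketch(clean, example_checks))   # False
print(should_cleanse_sketch(broken, example_checks))  # True
```

Because any() short-circuits, later checks are skipped as soon as one check flags the series.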

check_shape(series)

Check if the sample should be removed because of its shape.

If no target_shape was specified when creating the instance, the series is only expected to be 2D, i.e. of shape (seq_len, features).

Parameters

series (ndarray) -- 2D series of shape (seq_len, features).

Return type

bool

Returns

Flag indicating whether the sample should be removed from the final dataset.
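
A minimal sketch of such a shape check in plain NumPy, under the assumption that a missing target shape means "any 2D array is fine". The function name shape_should_be_cleansed is hypothetical and the code is not taken from ecgan.

```python
import numpy as np


def shape_should_be_cleansed(series, target_shape=None):
    """Return True if the series should be removed because of its shape."""
    if target_shape is None:
        # No explicit target: only require a 2D (seq_len, features) array.
        return series.ndim != 2
    # Explicit target: the shape has to match exactly.
    return series.shape != tuple(target_shape)


sample = np.zeros((100, 2))
print(shape_should_be_cleansed(sample))            # False
print(shape_should_be_cleansed(sample, (100, 2)))  # False
print(shape_should_be_cleansed(sample, (50, 2)))   # True
```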

check_for_nan(series)

Check for NaN values in the data.

Data is marked for cleansing when at least \((self.nan\_threshold \cdot 100)\%\) of the values of one feature are NaN. The data is expected to be a single time series sample of shape (seq_len, features), i.e. a 2D array. If a series contains NaNs but fewer than the threshold allows, the remaining NaNs can be imputed using the ecgan.preprocessing.preprocessor.BasePreprocessor.

Parameters

series (ndarray) -- 2D series of shape (seq_len, features).

Return type

bool

Returns

Flag indicating whether the sample should be removed from the final dataset.
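
The per-feature NaN criterion can be sketched as follows. This is an illustration under stated assumptions, not the ecgan code: the function name nan_should_be_cleansed is hypothetical, the comparison uses "at least" (>=) as worded in this method's description, and a threshold of None is taken to accept every series.

```python
import numpy as np


def nan_should_be_cleansed(series, nan_threshold=None):
    """Return True if any feature has at least `nan_threshold` NaNs (as a fraction)."""
    if nan_threshold is None:
        # Assumption: without a threshold, nothing is removed.
        return False
    # Fraction of NaN entries per feature (column).
    nan_fraction_per_feature = np.isnan(series).mean(axis=0)
    return bool(np.any(nan_fraction_per_feature >= nan_threshold))


series = np.array([
    [1.0, 2.0],
    [np.nan, 2.0],
    [np.nan, 2.0],
    [1.0, 2.0],
])  # feature 0 is 50% NaN, feature 1 has no NaNs

print(nan_should_be_cleansed(series, nan_threshold=0.5))   # True
print(nan_should_be_cleansed(series, nan_threshold=0.75))  # False
print(nan_should_be_cleansed(series))                      # False
```

With >=, a threshold of 0 would flag every series, including NaN-free ones. If the "more than" wording of the nan_threshold parameter is the intended one, a strict > comparison would be the matching choice.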

static check_for_dead_sensor(series)

Check for dead sensors in the data.

A sensor is considered dead, and the series is subsequently marked as 'to be cleansed', if the variance (and thus the standard deviation) of that sensor is close to zero.

Parameters

series (ndarray) -- 2D series of shape (seq_len, features).

Return type

bool

Returns

Flag indicating whether the sample should be removed from the final dataset.
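
A minimal sketch of a variance-based dead sensor test. The function name has_dead_sensor and the tolerance tol are assumptions made for illustration (the documentation does not state how "close to zero" is defined), and the input is assumed to be NaN-free.

```python
import numpy as np


def has_dead_sensor(series, tol=1e-8):
    """Return True if any channel has a variance close to zero.

    `tol` is a hypothetical absolute tolerance for "close to zero".
    """
    variances = np.var(series, axis=0)  # one variance per channel
    return bool(np.any(np.isclose(variances, 0.0, atol=tol)))


rng = np.random.default_rng(0)
live = rng.normal(size=100)   # channel with a varying signal
dead = np.full(100, 1.5)      # channel stuck at a constant value

print(has_dead_sensor(np.column_stack([live, dead])))  # True
print(has_dead_sensor(live.reshape(-1, 1)))            # False
```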

check_for_faulty_sensor(series)

Check for faulty sensors in the data.

Data is marked for cleansing if any value lies below lower_fault_threshold or above upper_fault_threshold, or if all values are NaN.

Parameters

series (ndarray) -- 2D series of shape (seq_len, features).

Return type

bool

Returns

Flag indicating whether the sample should be removed from the final dataset.
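
The threshold logic can be sketched in plain NumPy as below. This is not the ecgan implementation: the function name has_faulty_sensor is hypothetical, "all values are NaN" is interpreted as the whole series being NaN, and a bound of None is assumed to disable that side of the check.

```python
import numpy as np


def has_faulty_sensor(series, lower=None, upper=None):
    """Return True if the series is all NaN or leaves the [lower, upper] range."""
    if np.all(np.isnan(series)):
        # Assumption: an all-NaN series is flagged regardless of the bounds.
        return True
    # NaN-aware min/max so that isolated NaNs do not mask real values.
    if lower is not None and np.nanmin(series) < lower:
        return True
    if upper is not None and np.nanmax(series) > upper:
        return True
    return False


series = np.array([[0.1, 0.2], [0.3, 5.0]])

print(has_faulty_sensor(series, lower=-1, upper=1))    # True, 5.0 exceeds the upper bound
print(has_faulty_sensor(series, lower=-1, upper=10))   # False
print(has_faulty_sensor(np.full((2, 2), np.nan)))      # True
```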