All APIs of PyGrinder

PyGrinder

PyGrinder: a Python toolkit for grinding data beans into the incomplete for real-world data simulation by introducing missing values with different missingness patterns

pygrinder.mcar(X, p)[source]

Create completely random missing values (MCAR case).

Parameters:
  • X (Union[ndarray, Tensor]) – Data vector. If X has any missing values, they should be numpy.nan.

  • p (float) – The probability that values may be masked as missing completely at random. Values are selected at random regardless of whether they are originally missing or observed: a selected value that is originally missing stays missing, and a selected value that is originally observed is masked as missing. Therefore, if the given X already contains missing data, the final missing rate of the output X falls in the range [original_missing_rate, original_missing_rate+p] and is not necessarily equal to original_missing_rate+p, because some of the values selected for artificial masking may already be missing, in which case masking them changes nothing.

Returns:

Original X with artificial missing values. Both originally-missing and artificially-missing values are left as NaN.

Return type:

corrupted_X
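
The effect of existing missing values on the final missing rate can be checked with a minimal NumPy sketch. This is not PyGrinder's implementation: the helper name mcar_sketch, the seed argument, and the per-entry Bernoulli draw are assumptions made for illustration only.

```python
import numpy as np


def mcar_sketch(X, p, seed=0):
    """Mask each entry with probability p, whether it is observed or already NaN."""
    rng = np.random.default_rng(seed)
    corrupted = np.array(X, dtype=float)  # work on a float copy of the input
    selected = rng.random(corrupted.shape) < p  # one Bernoulli(p) draw per entry
    corrupted[selected] = np.nan  # no effect where the entry is already NaN
    return corrupted


X = np.arange(1000, dtype=float).reshape(100, 10)
X[:20] = np.nan  # 20% of the entries are missing before masking
original_rate = np.isnan(X).mean()
final_rate = np.isnan(mcar_sketch(X, p=0.3)).mean()
print(original_rate, final_rate)
```

Under these assumptions the expected final rate is original_missing_rate + (1 - original_missing_rate) * p, which lies below original_missing_rate + p whenever X already contains missing values.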

pygrinder.mcar_little_test(X)[source]

Little’s MCAR Test. Refer to [42] for more details.

Notes

This implementation is inspired by https://github.com/RianneSchouten/pyampute/blob/master/pyampute/exploration/mcar_statistical_tests.py. Note that this function should be used carefully. Rejecting the null hypothesis may not always mean that the data is not MCAR, nor is accepting the null hypothesis a guarantee that the data is MCAR.

Parameters:

X (Union[DataFrame, ndarray]) – Time series data containing missing values that should be in shape of [n_steps, n_features], i.e. have 2 dimensions.

Returns:

The p-value of a chi-square hypothesis test. Null hypothesis: the time series is missing completely at random (MCAR).

Return type:

p_value

pygrinder.mar_logistic(X, obs_rate, missing_rate)[source]

Create random missing values (MAR case) with a logistic model. First, a subset of the variables without missing values is randomly selected. Missing values will be introduced into the remaining variables according to a logistic model with random weights. This implementation is inspired by the tutorial https://rmisstastic.netlify.app/how-to/python/generate_html/how%20to%20generate%20missing%20values

Parameters:
  • X (Union[Tensor, ndarray]) – A time series data vector without any missing data. Shape of [n_steps, n_features].

  • obs_rate (float) – The proportion of variables without missing values that will be used for fitting the logistic masking model.

  • missing_rate (float) – The proportion of missing values to generate for variables which will have missing values.

Returns:

Original X with artificial missing values. Both originally-missing and artificially-missing values are left as NaN.

Return type:

corrupted_X
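
The procedure described above (pick fully observed variables, then mask the remaining ones through a logistic model with random weights) can be sketched in NumPy as follows. This is a simplified illustration, not PyGrinder's code: the name mar_logistic_sketch, the standardisation of the logits, and the bisection search for the intercept are all assumptions.

```python
import numpy as np


def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def mar_logistic_sketch(X, obs_rate, missing_rate, seed=0):
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float)
    n_steps, n_features = X.shape

    # Features that stay fully observed and drive the masking model.
    n_obs = max(int(obs_rate * n_features), 1)
    idx_obs = rng.choice(n_features, n_obs, replace=False)
    idx_mis = np.setdiff1d(np.arange(n_features), idx_obs)

    # Random weights: one column per feature that will receive missing values.
    weights = rng.standard_normal((n_obs, len(idx_mis)))
    logits = X[:, idx_obs] @ weights
    logits = (logits - logits.mean(axis=0)) / (logits.std(axis=0) + 1e-8)

    # Bisection on the intercept so the mean masking probability
    # of each masked feature approaches missing_rate.
    lo = np.full(len(idx_mis), -50.0)
    hi = np.full(len(idx_mis), 50.0)
    for _ in range(60):
        mid = (lo + hi) / 2
        too_high = _sigmoid(logits + mid).mean(axis=0) > missing_rate
        hi = np.where(too_high, mid, hi)
        lo = np.where(too_high, lo, mid)
    probs = _sigmoid(logits + (lo + hi) / 2)

    # Sample the mask; missingness depends only on the observed features (MAR).
    mask = rng.random(probs.shape) < probs
    corrupted = X.copy()
    block = corrupted[:, idx_mis]  # fancy indexing returns a copy
    block[mask] = np.nan
    corrupted[:, idx_mis] = block
    return corrupted, idx_obs


X = np.random.default_rng(1).standard_normal((500, 6))
corrupted, idx_obs = mar_logistic_sketch(X, obs_rate=0.5, missing_rate=0.3)
print(np.isnan(corrupted).mean(axis=0))  # per-feature missing rate
```

In this sketch the columns listed in idx_obs never receive missing values, while the remaining columns are masked at roughly the requested rate.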

pygrinder.mnar_x(X, offset=0)[source]

Create not-random missing values related to the values themselves (MNAR-x case, or self-masking MNAR case). This case follows the setting in Ipsen et al. “not-MIWAE: Deep Generative Modelling with Missing Not at Random Data” [43].

Parameters:
  • X (Union[ndarray, Tensor, None]) – Data vector. If X has any missing values, they should be numpy.nan.

  • offset (float) – The weight of the standard deviation. In the MNAR-x case, for each time series, the values larger than the mean of that time series plus offset*standard deviation will be missing.

Returns:

Original X with artificial missing values. Both originally-missing and artificially-missing values are left as NaN.

Return type:

corrupted_X
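
The thresholding rule for offset can be illustrated with a short NumPy sketch. It is not PyGrinder's implementation: the name mnar_x_sketch, the 3-dimensional input layout, and the choice to compute the statistics per sample over all steps and features are assumptions.

```python
import numpy as np


def mnar_x_sketch(X, offset=0.0):
    """Self-masking: hide values above mean + offset * std of their own series."""
    corrupted = np.array(X, dtype=float)
    # Assumption: X is [n_samples, n_steps, n_features] and the statistics are
    # taken per sample over all steps and features, ignoring existing NaNs.
    mean = np.nanmean(corrupted, axis=(1, 2), keepdims=True)
    std = np.nanstd(corrupted, axis=(1, 2), keepdims=True)
    threshold = mean + offset * std
    # Comparisons with NaN are False, so existing NaNs are left in place.
    corrupted[corrupted > threshold] = np.nan
    return corrupted


X = np.arange(24, dtype=float).reshape(2, 4, 3)
corrupted = mnar_x_sketch(X, offset=0.0)
print(np.isnan(corrupted).sum())
```

With offset=0 every value above its series mean is masked; a larger offset raises the threshold and masks fewer values.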

pygrinder.mnar_t(X, cycle=20, pos=10, scale=3)[source]

Create not-random missing values related to temporal dynamics (MNAR-t case). In particular, the missingness is generated by an intensity function f(t) = exp(3*torch.sin(cycle*t + pos)). This case mainly follows the setting in https://hawkeslib.readthedocs.io/en/latest/tutorial.html.

Parameters:
  • X (Union[ndarray, Tensor]) – Data vector. If X has any missing values, they should be numpy.nan.

  • cycle (float) – The cycle of the used intensity function

  • pos (float) – The displacement of the used intensity function

  • scale (float) – The scale number to control the missing rate

Returns:

Original X with artificial missing values. Both originally-missing and artificially-missing values are left as NaN.

Return type:

corrupted_X
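
A rough NumPy sketch of time-dependent masking with the intensity function above is given below. The description does not state how the intensity is turned into a mask, so the rule used here (an entry stays observed when a uniform draw times scale falls below the intensity), the normalisation of time to [0, 1], and the name mnar_t_sketch are assumptions rather than PyGrinder's behaviour.

```python
import numpy as np


def mnar_t_sketch(X, cycle=20, pos=10, scale=3, seed=0):
    rng = np.random.default_rng(seed)
    corrupted = np.array(X, dtype=float)
    n_samples, n_steps, n_features = corrupted.shape
    # Assumption: time is normalised to [0, 1] along the step axis.
    t = np.linspace(0, 1, n_steps).reshape(1, n_steps, 1)
    intensity = np.exp(3 * np.sin(cycle * t + pos))  # f(t) from the description
    # Assumption: an entry stays observed when a uniform draw times `scale`
    # falls below the intensity, so low-intensity steps lose more values.
    keep = rng.random(corrupted.shape) * scale < intensity
    corrupted[~keep] = np.nan
    return corrupted, intensity.ravel()


X = np.ones((8, 50, 4))
corrupted, intensity = mnar_t_sketch(X)
print(np.isnan(corrupted).mean(axis=(0, 2)))  # missing rate at each step
```

Under this assumed rule, a larger scale lowers the chance that an entry is kept and therefore raises the overall missing rate.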

pygrinder.rdo(X, p)[source]

Create missingness in the data by randomly dropping observations.

Parameters:
  • X (Union[ndarray, Tensor]) – Data vector. If X has any missing values, they should be numpy.nan.

  • p (float) – The proportion of the observed values that will be randomly masked as missing. RDO (randomly drop observations) randomly selects values from the observed values to be masked as missing. The number of selected observations is determined by p and the total number of observed values in X, e.g. if p=0.1 and there are 1000 observed values in X, then 0.1*1000=100 values will be randomly selected to be masked as missing. If the result is not an integer, the number of selected values is rounded to the nearest integer.

Returns:

Original X with artificial missing values. Both originally-missing and artificially-missing values are left as NaN.

Return type:

corrupted_X
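
The counting rule for p can be sketched in NumPy as follows. The helper rdo_sketch and its seed argument are hypothetical and only mirror the description above; they are not PyGrinder's implementation.

```python
import numpy as np


def rdo_sketch(X, p, seed=0):
    """Drop a fixed share p of the currently observed values."""
    rng = np.random.default_rng(seed)
    corrupted = np.array(X, dtype=float)  # contiguous float copy
    flat = corrupted.reshape(-1)  # view on the copy, so writes go through
    observed = np.flatnonzero(~np.isnan(flat))
    # Python's round() sends exact halves to the nearest even integer.
    n_drop = int(round(p * observed.size))
    dropped = rng.choice(observed, size=n_drop, replace=False)
    flat[dropped] = np.nan
    return corrupted


X = np.arange(1200, dtype=float).reshape(100, 12)
X[:, :2] = np.nan  # 200 values already missing, 1000 observed
corrupted = rdo_sketch(X, p=0.1)
print(np.isnan(corrupted).sum())
```

Because only observed values are eligible, the number of newly masked values is fixed by p and the observed count, unlike the MCAR case where a selected value may already be missing.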

pygrinder.seq_missing(X, p, seq_len, feature_idx=None, step_idx=None)[source]

Create subsequence missing data.

Parameters:
  • X (Union[ndarray, Tensor]) – Data vector. If X has any missing values, they should be numpy.nan.

  • p (float) – The probability that values may be masked as missing completely at random.

  • seq_len (int) – The length of missing sequence.

  • feature_idx (Optional[list]) – The indices of features for missing sequences to be corrupted.

  • step_idx (Optional[list]) – The indices of steps for a missing sequence to start with.

Returns:

Original X with artificial missing values. Both originally-missing and artificially-missing values are left as NaN.

Return type:

corrupted_X

pygrinder.block_missing(X, factor, block_len, block_width, feature_idx=None, step_idx=None)[source]

Create block missing data.

Parameters:
  • X (Union[ndarray, Tensor]) – Data vector. If X has any missing values, they should be numpy.nan.

  • factor (float) – The actual missing rate of block_missing is hard to control strictly. Hence, factor is used to help adjust the final missing rate.

  • block_len (int) – The length of the mask block.

  • block_width (int) – The width of the mask block.

  • feature_idx (Optional[list]) – The indices of features for a missing block to start with.

  • step_idx (Optional[list]) – The indices of steps for a missing block to start with.

Returns:

Original X with artificial missing values. Both originally-missing and artificially-missing values are left as NaN.

Return type:

corrupted_X

pygrinder.calc_missing_rate(X)[source]

Calculate the original missing rate of the raw data.

Parameters:

X (Union[ndarray, Tensor, DataFrame]) – Data array/tensor/frame that may contain missing values.

Returns:

The original missing rate of the raw data. Its value should be in the range [0,1].

Return type:

missing_rate
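
As a point of reference, the missing rate of an array is the share of NaN entries among all entries. The sketch below uses a hypothetical helper, missing_rate_sketch, and plain NumPy; it is not PyGrinder's code.

```python
import numpy as np


def missing_rate_sketch(X):
    """Share of NaN entries among all entries of X."""
    X = np.asarray(X, dtype=float)
    return float(np.isnan(X).sum() / X.size)


X = np.array([[1.0, np.nan, 3.0],
              [np.nan, np.nan, 6.0]])
rate = missing_rate_sketch(X)
print(rate)
```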

pygrinder.masked_fill(X, mask, val)[source]

Like torch.Tensor.masked_fill(), fill elements in given X with val where mask is True.

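Parameters:
  • X – The data in which elements are filled.

  • mask – The mask that marks the elements to fill; elements where mask is True are replaced.

  • val – The value to fill in.
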
Returns:

Mask filled X.

Return type:

filled_X
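
For NumPy input, the same fill-where-True behaviour can be sketched with numpy.where. The helper masked_fill_sketch is hypothetical and only illustrates the semantics described above.

```python
import numpy as np


def masked_fill_sketch(X, mask, val):
    """Return a copy of X with val written wherever mask is True."""
    X = np.asarray(X, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    return np.where(mask, val, X)


X = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.array([[True, False], [False, True]])
filled = masked_fill_sketch(X, mask, np.nan)
print(filled)
```

The input X is left unchanged in this sketch because numpy.where builds a new array.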

pygrinder.fill_and_get_mask(X, nan=0)[source]

Fill missing values in X with nan and return the missing mask.

Parameters:
  • X (Union[Tensor, ndarray]) – Data with missing values

  • nan (int/float, optional, default=0) – Value used to fill NaN values. Only valid when return_masks is True. If return_masks is False, the NaN values will be kept as NaN.

Return type:

Union[Tuple[ndarray, ...], Tuple[Tensor, ...]]

Returns:

  • X – Original X with artificial missing values. X is for model input. Both originally-missing and artificially-missing values are filled with given parameter nan.

  • missing_mask – The mask indicating all missing values in X: 1 marks observed values and 0 marks missing values.
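
The relation between the two return values can be sketched with NumPy as follows. The helper fill_and_get_mask_sketch and the float32 mask dtype are assumptions for illustration; the sketch follows the convention stated above that 1 marks observed values and 0 marks missing values.

```python
import numpy as np


def fill_and_get_mask_sketch(X, nan=0):
    """Return X with NaNs replaced by `nan`, plus a mask of observed entries."""
    X = np.asarray(X, dtype=float)
    is_missing = np.isnan(X)
    missing_mask = (~is_missing).astype(np.float32)  # 1 = observed, 0 = missing
    filled = np.where(is_missing, nan, X)
    return filled, missing_mask


X = np.array([[1.0, np.nan], [np.nan, 4.0]])
filled, missing_mask = fill_and_get_mask_sketch(X, nan=0)
print(filled)
print(missing_mask)
```

Keeping the mask alongside the filled array lets downstream code tell a genuine observation equal to the fill value apart from a filled-in missing entry.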

pygrinder.fill_and_get_mask_torch(X, nan=0)[source]

Fill missing values in torch tensor X with nan and return the missing mask.

Parameters:
  • X (Tensor) – Time series data generated from X_intact, with artificially missing values added.

  • nan (int/float, optional, default=0) – Value used to fill NaN values. Only valid when return_masks is True. If return_masks is False, the NaN values will be kept as NaN.

Return type:

Tuple[Tensor, ...]

Returns:

  • X – Original X with artificial missing values. X is for model input. Both originally-missing and artificially-missing values are filled with given parameter nan.

  • missing_mask – The mask indicating all missing values in X: 1 marks observed values and 0 marks missing values.

pygrinder.fill_and_get_mask_numpy(X, nan=0)[source]

Fill missing values in numpy array X with nan and return the missing mask.

Parameters:
  • X (np.ndarray) – Time series data generated from X_intact, with artificially missing values added.

  • nan (int/float, optional, default=0) – Value used to fill NaN values. Only valid when return_masks is True. If return_masks is False, the NaN values will be kept as NaN.

Return type:

Tuple[ndarray, ...]

Returns:

  • X – Original X with artificial missing values. X is for model input. Both originally-missing and artificially-missing values are filled with given parameter nan.

  • missing_mask – The mask indicating all missing values in X: 1 marks observed values and 0 marks missing values.