All APIs of PyGrinder

PyGrinder

PyGrinder: a Python toolkit for grinding data beans into the incomplete.

pygrinder.mcar(X, p)

Create completely random missing values (MCAR case).

Parameters:
  • X (Union[ndarray, Tensor]) – Data vector. If X has any missing values, they should be numpy.nan.

  • p (float, in (0, 1)) – The probability that a value is masked as missing completely at random. Values are selected at random regardless of whether they are originally missing or observed: a selected value that is already missing stays missing, and a selected value that is observed is masked as missing. Therefore, if the given X already contains missing data, the final missing rate of the output X lies in the range [original_missing_rate, original_missing_rate + p] and is generally not equal to original_missing_rate + p, because masking a value that is already missing has no effect.

Returns:

corrupted_X – Original X with artificial missing values. Both originally-missing and artificially-missing values are left as NaN.

Return type:

array-like
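The range stated for the final missing rate can be checked with a small NumPy-only sketch. The helper mcar_sketch below is a hypothetical stand-in written for illustration; it is not PyGrinder's implementation and only mimics the behaviour described above.

```python
import numpy as np

def mcar_sketch(X, p, seed=0):
    """Hypothetical NumPy-only illustration of MCAR masking (not PyGrinder's code)."""
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float)         # copy so the input is left untouched
    selected = rng.random(X.shape) < p   # each entry is selected with probability p
    X[selected] = np.nan                 # entries that were already NaN simply stay NaN
    return X

X = np.arange(10_000, dtype=float).reshape(100, 100)
X[:20] = np.nan                          # 20% of the values are originally missing
p = 0.3
corrupted = mcar_sketch(X, p)

orig_rate = np.isnan(X).mean()
final_rate = np.isnan(corrupted).mean()
# final_rate falls between orig_rate and orig_rate + p
print(orig_rate, final_rate)
```

With the real library, the corresponding call would be pygrinder.mcar(X, p) as documented above.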

pygrinder.mcar_little_test(X)

Little’s MCAR Test. Refer to [34] for more details.

Notes

This implementation is inspired by https://github.com/RianneSchouten/pyampute/blob/master/pyampute/exploration/mcar_statistical_tests.py. Note that this function should be used carefully. Rejecting the null hypothesis may not always mean that the data is not MCAR, nor is accepting the null hypothesis a guarantee that the data is MCAR.

Parameters:

X (Union[DataFrame, ndarray]) – Time series data containing missing values. It should be 2-dimensional, with shape [n_steps, n_features].

Returns:

p_value – The p-value of a chi-square hypothesis test. Null hypothesis: the time series is missing completely at random (MCAR).

Return type:

float

pygrinder.mar_logistic(X, obs_rate, missing_rate)

Create random missing values (MAR case) with a logistic model. First, a subset of the variables without missing values is randomly selected. Missing values will be introduced into the remaining variables according to a logistic model with random weights. This implementation is inspired by the tutorial https://rmisstastic.netlify.app/how-to/python/generate_html/how%20to%20generate%20missing%20values

Parameters:
  • X (shape of [n_steps, n_features]) – A time series data vector without any missing data.

  • obs_rate (float) – The proportion of variables without missing values that will be used for fitting the logistic masking model.

  • missing_rate (float) – The proportion of missing values to generate for variables which will have missing values.

Returns:

corrupted_X – Original X with artificial missing values. Both originally-missing and artificially-missing values are left as NaN.

Return type:

array-like
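The mechanism described for the MAR case (keep a random subset of variables fully observed, then mask the remaining variables through a logistic model with random weights) can be sketched in plain NumPy. Everything below is an assumption made for illustration: the helper name mar_logistic_sketch, the bisection used to calibrate the intercept, and the extra return values are not taken from PyGrinder.

```python
import numpy as np

def mar_logistic_sketch(X, obs_rate, missing_rate, seed=0):
    """Hypothetical illustration of logistic MAR masking (not PyGrinder's code)."""
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float)
    n_steps, n_features = X.shape

    # 1. Randomly choose the variables that stay fully observed.
    n_obs = max(int(obs_rate * n_features), 1)
    perm = rng.permutation(n_features)
    obs_idx, mis_idx = perm[:n_obs], perm[n_obs:]

    # 2. Random weights map the observed variables to one logit per
    #    remaining variable.
    weights = rng.normal(size=(n_obs, len(mis_idx)))
    logits = X[:, obs_idx] @ weights

    # 3. Bisection on a per-column intercept so that the mean masking
    #    probability of each column matches missing_rate.
    lo = np.full(len(mis_idx), -50.0)
    hi = np.full(len(mis_idx), 50.0)
    for _ in range(60):
        mid = (lo + hi) / 2
        mean_prob = (1 / (1 + np.exp(-(logits + mid)))).mean(axis=0)
        too_high = mean_prob > missing_rate
        hi = np.where(too_high, mid, hi)
        lo = np.where(too_high, lo, mid)
    probs = 1 / (1 + np.exp(-(logits + (lo + hi) / 2)))

    # 4. Sample the mask and write NaN into the selected entries.
    mask = rng.random(probs.shape) < probs
    X[:, mis_idx] = np.where(mask, np.nan, X[:, mis_idx])
    return X, obs_idx, mis_idx

X = np.random.default_rng(42).normal(size=(2000, 5))
corrupted, obs_idx, mis_idx = mar_logistic_sketch(X, obs_rate=0.4, missing_rate=0.3)
print(np.isnan(corrupted).mean(axis=0))  # per-variable missing rates
```

Because the masking probability depends only on the fully observed variables, the missingness is MAR rather than MCAR.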

pygrinder.mnar_x(X, offset=0)

Create not-random missing values related to the values themselves (MNAR-x case, or self-masking MNAR case). This case follows the setting in Ipsen et al. “not-MIWAE: Deep Generative Modelling with Missing Not at Random Data” [35].

Parameters:
  • X (Union[ndarray, Tensor, None]) – Data vector. If X has any missing values, they should be numpy.nan.

  • offset (float) – The weight applied to the standard deviation. In the MNAR-x case, the values in each time series that are larger than that series' mean plus offset * standard deviation are masked as missing.

Returns:

corrupted_X – Original X with artificial missing values. Both originally-missing and artificially-missing values are left as NaN.

Return type:

array-like
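The thresholding rule behind offset can be illustrated with a few lines of NumPy. This is a minimal sketch under stated assumptions: X is taken to be 2-dimensional with one time series per column, and the helper name mnar_x_sketch is hypothetical; PyGrinder's own function may handle shapes differently.

```python
import numpy as np

def mnar_x_sketch(X, offset=0.0):
    """Hypothetical self-masking rule: mask values above mean + offset * std."""
    X = np.array(X, dtype=float)       # copy so the input is left untouched
    mean = np.nanmean(X, axis=0)       # per-series mean (one series per column)
    std = np.nanstd(X, axis=0)         # per-series standard deviation
    X[X > mean + offset * std] = np.nan
    return X

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0],
              [5.0, 50.0]])

masked_0 = mnar_x_sketch(X, offset=0)  # the last two rows become NaN
masked_1 = mnar_x_sketch(X, offset=1)  # only the last row exceeds mean + 1 std
print(masked_0)
print(masked_1)
```

A larger offset raises the threshold and therefore lowers the resulting missing rate.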

pygrinder.mnar_t(X, cycle=20, pos=10, scale=3)

Create not-random missing values related to temporal dynamics (MNAR-t case). In particular, the missingness is generated by an intensity function f(t) = exp(3*torch.sin(cycle*t + pos)). This case mainly follows the setting in https://hawkeslib.readthedocs.io/en/latest/tutorial.html.

Parameters:
  • X (Union[ndarray, Tensor]) – Data vector. If X has any missing values, they should be numpy.nan.

  • cycle (float) – The cycle of the intensity function.

  • pos (float) – The displacement of the intensity function.

  • scale (float) – The scale number that controls the missing rate.

Returns:

corrupted_X – Original X with artificial missing values. Both originally-missing and artificially-missing values are left as NaN.

Return type:

array-like
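To get a feel for the intensity function, it can be evaluated directly. The sketch below uses NumPy instead of torch and assumes the time axis is rescaled to [0, 1]; both choices are assumptions for illustration. How PyGrinder converts the intensity into a missing mask is not described here, so the sketch stops at the intensity itself.

```python
import numpy as np

def intensity(t, cycle=20, pos=10):
    """f(t) = exp(3 * sin(cycle * t + pos)), as given in the description above."""
    return np.exp(3 * np.sin(cycle * t + pos))

t = np.linspace(0.0, 1.0, 200)   # assumed: time steps rescaled to [0, 1]
f = intensity(t)

# The sine term lies in [-1, 1], so f is bounded by exp(-3) and exp(3).
print(f.min(), f.max())
```

Changing cycle alters how often the intensity peaks over the series, while pos shifts where the peaks fall.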

pygrinder.calc_missing_rate(X)

Calculate the originally missing rate of the raw data.

Parameters:

X (Union[ndarray, Tensor]) – Data array that may contain missing values.

Returns:

originally_missing_rate – The originally missing rate of the raw data. Its value should be in the range [0, 1].

Return type:

float
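The missing rate is the fraction of entries that are NaN. A minimal NumPy sketch of that computation follows; the helper name missing_rate_sketch is hypothetical, not PyGrinder's code.

```python
import numpy as np

def missing_rate_sketch(X):
    """Fraction of NaN entries in X (hypothetical helper for illustration)."""
    return float(np.isnan(X).sum() / X.size)

X = np.array([[1.0, np.nan],
              [np.nan, 4.0],
              [5.0, 6.0]])
rate = missing_rate_sketch(X)
print(rate)  # 2 of the 6 entries are NaN
```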

pygrinder.masked_fill(X, mask, val)

Like torch.Tensor.masked_fill(), fill elements in given X with val where mask is True.

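Parameters:
  • X – The data to fill.

  • mask – The boolean mask marking the elements to fill.

  • val – The value to fill in where mask is True.

Returns:

filled_X – Mask filled X.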
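For NumPy arrays, the same fill-where-True behaviour can be sketched with np.where. The helper masked_fill_sketch is a hypothetical illustration and not PyGrinder's implementation.

```python
import numpy as np

def masked_fill_sketch(X, mask, val):
    """Return a copy of X with val written wherever mask is True."""
    return np.where(mask, val, X)

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
mask = np.array([[True, False],
                 [False, True]])

filled = masked_fill_sketch(X, mask, val=np.nan)
print(filled)  # the diagonal is replaced by NaN, the rest is unchanged
```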

pygrinder.fill_and_get_mask(X, nan=0)

Fill missing values in X with nan and return the missing mask.

Parameters:
  • X (Union[Tensor, ndarray]) – Data with missing values

  • nan (int/float, optional, default=0) – Value used to fill NaN values. Only valid when return_masks is True. If return_masks is False, the NaN values will be kept as NaN.

Return type:

Union[Tuple[ndarray, ...], Tuple[Tensor, ...]]

Returns:

  • X – Original X with artificial missing values. X is for model input. Both originally-missing and artificially-missing values are filled with given parameter nan.

  • missing_mask – The mask indicates all missing values in X. In it, 1 indicates observed values, and 0 indicates missing values.
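The mask convention above (1 for observed, 0 for missing) can be seen in a small NumPy sketch. The helper fill_and_get_mask_sketch and the float32 mask dtype are assumptions for illustration, not PyGrinder's implementation.

```python
import numpy as np

def fill_and_get_mask_sketch(X, nan=0):
    """Hypothetical helper: fill NaNs with `nan` and return the missing mask."""
    is_missing = np.isnan(X)
    missing_mask = (~is_missing).astype(np.float32)  # 1 = observed, 0 = missing
    filled = np.where(is_missing, nan, X)            # NaNs replaced by the fill value
    return filled, missing_mask

X = np.array([[1.0, np.nan],
              [np.nan, 4.0]])
filled, missing_mask = fill_and_get_mask_sketch(X, nan=-1)
print(filled)
print(missing_mask)
```

Keeping the mask alongside the filled array lets downstream code tell a genuine observation equal to the fill value apart from a filled-in gap.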

pygrinder.fill_and_get_mask_torch(X, nan=0)

Fill missing values in torch tensor X with nan and return the missing mask.

Parameters:
  • X (Tensor) – Time series data generated from X_intact, with artificially missing values added.

  • nan (int/float, optional, default=0) – Value used to fill NaN values. Only valid when return_masks is True. If return_masks is False, the NaN values will be kept as NaN.

Return type:

Tuple[Tensor, ...]

Returns:

  • X – Original X with artificial missing values. X is for model input. Both originally-missing and artificially-missing values are filled with given parameter nan.

  • missing_mask – The mask indicates all missing values in X. In it, 1 indicates observed values, and 0 indicates missing values.

pygrinder.fill_and_get_mask_numpy(X, nan=0)

Fill missing values in numpy array X with nan and return the missing mask.

Parameters:
  • X (np.ndarray) – Time series data generated from X_intact, with artificially missing values added.

  • nan (int/float, optional, default=0) – Value used to fill NaN values. Only valid when return_masks is True. If return_masks is False, the NaN values will be kept as NaN.

Return type:

Tuple[ndarray, ...]

Returns:

  • X – Original X with artificial missing values. X is for model input. Both originally-missing and artificially-missing values are filled with given parameter nan.

  • missing_mask – The mask indicates all missing values in X. In it, 1 indicates observed values, and 0 indicates missing values.