All APIs of PyGrinder
PyGrinder
PyGrinder: a Python toolkit for grinding data beans into the incomplete for real-world data simulation by introducing missing values with different missingness patterns
- pygrinder.mcar(X, p)
Create completely random missing values (MCAR case).
- Parameters:
X (Union[ndarray, Tensor]) – Data vector. If X has any missing values, they should be numpy.nan.
p (float) – The probability that a value is masked as missing completely at random. Values are selected at random regardless of whether they are originally missing or observed. Selected values that are originally missing stay missing; selected values that are originally observed are masked as missing. Therefore, if the given X already contains missing data, the final missing rate of the output X lies in the range [original_missing_rate, original_missing_rate + p] and is generally not equal to original_missing_rate + p, because some of the selected values may already be missing and masking them has no effect.
- Returns:
Original X with artificial missing values. Both originally-missing and artificially-missing values are left as NaN.
- Return type:
corrupted_X
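The bound on the final missing rate described for p can be checked with a minimal NumPy sketch. This is not PyGrinder's implementation: the helper name `mcar_sketch`, its `seed` argument, and the use of `numpy.random.default_rng` are assumptions made for illustration only.

```python
import numpy as np


def mcar_sketch(X, p, seed=0):
    """Mask each entry independently with probability p (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float)  # work on a copy so the input is untouched
    # Entries are selected regardless of whether they are observed or missing.
    selected = rng.random(X.shape) < p
    X[selected] = np.nan  # masking an already-missing entry changes nothing
    return X


X = np.arange(1000, dtype=float).reshape(100, 10)
X[:20] = np.nan  # 20% of the values are missing before masking
corrupted = mcar_sketch(X, p=0.3)

original_rate = np.isnan(X).mean()
final_rate = np.isnan(corrupted).mean()
# In expectation, final_rate is 0.2 + 0.8 * 0.3 = 0.44, which lies between
# original_rate and original_rate + p.
print(original_rate, final_rate)
```

Because 20% of the selected entries are expected to be missing already, the increase in the missing rate is smaller than p.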
- pygrinder.mcar_little_test(X)
Little’s MCAR Test. Refer to [42] for more details.
Notes
This implementation is inspired by https://github.com/RianneSchouten/pyampute/blob/master/pyampute/exploration/mcar_statistical_tests.py. Use this function with care: rejecting the null hypothesis does not always mean that the data is not MCAR, and failing to reject it does not guarantee that the data is MCAR.
- Parameters:
X (Union[DataFrame, ndarray]) – Time series data containing missing values that should be in shape of [n_steps, n_features], i.e. have 2 dimensions.
- Returns:
The p-value of a chi-square hypothesis test. Null hypothesis: the time series is missing completely at random (MCAR).
- Return type:
p_value
- pygrinder.mar_logistic(X, obs_rate, missing_rate)
Create random missing values (MAR case) with a logistic model. First, a subset of the variables without missing values is randomly selected. Missing values will be introduced into the remaining variables according to a logistic model with random weights. This implementation is inspired by the tutorial https://rmisstastic.netlify.app/how-to/python/generate_html/how%20to%20generate%20missing%20values
- Parameters:
X (Union[Tensor, ndarray]) – A time series data vector without any missing data. Shape of [n_steps, n_features].
obs_rate (float) – The proportion of variables without missing values that will be used for fitting the logistic masking model.
missing_rate (float) – The proportion of missing values to generate for variables which will have missing values.
- Returns:
Original X with artificial missing values. Both originally-missing and artificially-missing values are left as NaN.
- Return type:
corrupted_X
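The procedure described above (keep a random subset of variables fully observed, then mask the remaining ones through a logistic model with random weights) can be sketched in NumPy as follows. This is an illustration, not PyGrinder's code: the helper `mar_logistic_sketch`, the standardisation of the logits, and the bisection search for the intercept are all assumptions of this sketch.

```python
import numpy as np


def mar_logistic_sketch(X, obs_rate, missing_rate, seed=0):
    """Illustrative MAR masking driven by a logistic model (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float)  # copy of shape [n_steps, n_features]
    n_features = X.shape[1]

    # Randomly pick the variables that stay fully observed.
    n_obs = max(int(obs_rate * n_features), 1)
    perm = rng.permutation(n_features)
    obs_idx, mis_idx = perm[:n_obs], perm[n_obs:]

    # Logistic model with random weights on the observed variables.
    W = rng.standard_normal((n_obs, len(mis_idx)))
    logits = X[:, obs_idx] @ W
    logits = (logits - logits.mean(axis=0)) / (logits.std(axis=0) + 1e-8)

    # Bisection on a per-variable intercept so that the mean masking
    # probability is close to missing_rate (a choice of this sketch).
    lo = np.full(len(mis_idx), -50.0)
    hi = np.full(len(mis_idx), 50.0)
    for _ in range(60):
        mid = (lo + hi) / 2
        mean_prob = (1 / (1 + np.exp(-(logits + mid)))).mean(axis=0)
        too_high = mean_prob > missing_rate
        hi = np.where(too_high, mid, hi)
        lo = np.where(too_high, lo, mid)
    probs = 1 / (1 + np.exp(-(logits + (lo + hi) / 2)))

    # Missingness depends only on the observed variables, hence MAR.
    mask = rng.random(probs.shape) < probs
    X_mis = X[:, mis_idx]  # fancy indexing returns a copy
    X_mis[mask] = np.nan
    X[:, mis_idx] = X_mis
    return X, obs_idx


X = np.random.default_rng(1).standard_normal((500, 6))
corrupted, obs_idx = mar_logistic_sketch(X, obs_rate=0.5, missing_rate=0.2)
mis_rate = np.isnan(np.delete(corrupted, obs_idx, axis=1)).mean()
print(sorted(obs_idx), mis_rate)
```

In this sketch the observed variables never receive missing values, and the missing rate among the remaining variables is close to missing_rate only on average.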
- pygrinder.mnar_x(X, offset=0)
Create not-random missing values related to the values themselves (MNAR-x case, or self-masking MNAR case). This case follows the setting in Ipsen et al. “not-MIWAE: Deep Generative Modelling with Missing Not at Random Data” [43].
- Parameters:
- Returns:
Original X with artificial missing values. Both originally-missing and artificially-missing values are left as NaN.
- Return type:
corrupted_X
- pygrinder.mnar_t(X, cycle=20, pos=10, scale=3)
Create not-random missing values related to temporal dynamics (MNAR-t case). In particular, the missingness is generated by an intensity function f(t) = exp(3*torch.sin(cycle*t + pos)). This case mainly follows the setting in https://hawkeslib.readthedocs.io/en/latest/tutorial.html.
- Parameters:
- Returns:
Original X with artificial missing values. Both originally-missing and artificially-missing values are left as NaN.
- Return type:
corrupted_X
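The intensity function above can be evaluated directly to see how it behaves over time. The sketch below uses NumPy instead of torch and assumes that t is a normalised time axis on [0, 1]; that assumption, and the helper name `intensity`, are not taken from PyGrinder. How the intensity is turned into a missing mask is not specified here, so the sketch stops at the intensity itself.

```python
import numpy as np


def intensity(t, cycle=20, pos=10):
    """f(t) = exp(3 * sin(cycle * t + pos)), written with NumPy."""
    return np.exp(3 * np.sin(cycle * t + pos))


# Assumption: one value of t per time step, spread evenly over [0, 1].
t = np.linspace(0, 1, 200)
f = intensity(t)

# sin is bounded by [-1, 1], so f(t) always stays within [exp(-3), exp(3)].
print(f.min(), f.max())
```

A larger cycle makes the intensity oscillate faster along the time axis, and pos shifts its phase.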
- pygrinder.rdo(X, p)
Create missingness in the data by randomly dropping observations.
- Parameters:
X (Union[ndarray, Tensor]) – Data vector. If X has any missing values, they should be numpy.nan.
p (float) – The proportion of the observed values that will be randomly masked as missing. RDO (randomly drop observations) randomly selects values from the observed values to be masked as missing. The number of selected observations is determined by p and the total number of observed values in X, e.g. if p=0.1 and there are 1000 observed values in X, then 0.1*1000=100 values will be randomly selected to be masked as missing. If the result is not an integer, the number of selected values is rounded to the nearest integer.
- Returns:
Original X with artificial missing values. Both originally-missing and artificially-missing values are left as NaN.
- Return type:
corrupted_X
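The arithmetic in the description of p can be reproduced with a short NumPy sketch. It is an illustration under assumptions, not PyGrinder's implementation: `rdo_sketch` and its `seed` argument are hypothetical, and Python's built-in `round` is used for the rounding step.

```python
import numpy as np


def rdo_sketch(X, p, seed=0):
    """Mask a proportion p of the observed values (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float)  # copy so the input is untouched
    # Flat positions (C order) of the values that are currently observed.
    observed = np.flatnonzero(~np.isnan(X).reshape(-1))
    n_drop = int(round(p * observed.size))
    drop = rng.choice(observed, size=n_drop, replace=False)
    # Turn the chosen flat positions back into a boolean mask of X's shape.
    mask = np.zeros(X.size, dtype=bool)
    mask[drop] = True
    X[mask.reshape(X.shape)] = np.nan
    return X


X = np.ones((30, 40))
X[:5] = np.nan  # 200 missing values, 1000 observed values
out = rdo_sketch(X, p=0.1)
n_missing = int(np.isnan(out).sum())  # 200 original + 0.1 * 1000 dropped
print(n_missing)
```

Unlike mcar, the selection here is made only among observed values, so every selected value adds to the number of missing values.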
- pygrinder.seq_missing(X, p, seq_len, feature_idx=None, step_idx=None)
Create subsequence missing data.
- Parameters:
X (Union[ndarray, Tensor]) – Data vector. If X has any missing values, they should be numpy.nan.
p (float) – The probability that values may be masked as missing completely at random.
seq_len (int) – The length of missing sequence.
feature_idx (Optional[list]) – The indices of features for missing sequences to be corrupted.
step_idx (Optional[list]) – The indices of steps for a missing sequence to start with.
- Returns:
Original X with artificial missing values. Both originally-missing and artificially-missing values are left as NaN.
- Return type:
corrupted_X
- pygrinder.block_missing(X, factor, block_len, block_width, feature_idx=None, step_idx=None)
Create block missing data.
- Parameters:
X (Union[ndarray, Tensor]) – Data vector. If X has any missing values, they should be numpy.nan.
factor (float) – The actual missing rate of block_missing is hard to control strictly. Hence, factor is used to help adjust the final missing rate.
block_len (int) – The length of the mask block.
block_width (int) – The width of the mask block.
feature_idx (Optional[list]) – The indices of features for a missing block to start with.
step_idx (Optional[list]) – The indices of steps for a missing block to start with.
- Returns:
Original X with artificial missing values. Both originally-missing and artificially-missing values are left as NaN.
- Return type:
corrupted_X
- pygrinder.masked_fill(X, mask, val)
Like torch.Tensor.masked_fill(), fill elements in given X with val where mask is True.
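The fill semantics can be illustrated with a plain NumPy analogue. The sketch below does not call pygrinder.masked_fill; it only shows the behaviour described above using `numpy.where`, with example values chosen for illustration.

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.array([[True, False], [False, True]])
val = -1.0

# np.where takes val where mask is True and keeps X everywhere else.
filled = np.where(mask, val, X)
# filled is [[-1., 2.], [3., -1.]]; X itself is left unchanged.
print(filled)
```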
- pygrinder.fill_and_get_mask(X, nan=0)
Fill missing values in X with nan and return the missing mask.
- Parameters:
- Return type:
- Returns:
X – Original X with artificial missing values. X is for model input. Both originally-missing and artificially-missing values are filled with given parameter nan.
missing_mask – The mask indicates all missing values in X. In it, 1 indicates observed values, and 0 indicates missing values.
- pygrinder.fill_and_get_mask_torch(X, nan=0)
Fill missing values in torch tensor X with nan and return the missing mask.
- Parameters:
X (Tensor) – Time series data generated from X_intact, with artificially missing values added.
nan (int/float, optional, default=0) – Value used to fill NaN values. Only valid when return_masks is True. If return_masks is False, the NaN values will be kept as NaN.
- Return type:
- Returns:
X – Original X with artificial missing values. X is for model input. Both originally-missing and artificially-missing values are filled with given parameter nan.
missing_mask – The mask indicates all missing values in X. In it, 1 indicates observed values, and 0 indicates missing values.
- pygrinder.fill_and_get_mask_numpy(X, nan=0)
Fill missing values in numpy array X with nan and return the missing mask.
- Parameters:
X (np.ndarray) – Time series data generated from X_intact, with artificially missing values added.
nan (int/float, optional, default=0) – Value used to fill NaN values. Only valid when return_masks is True. If return_masks is False, the NaN values will be kept as NaN.
- Return type:
- Returns:
X – Original X with artificial missing values. X is for model input. Both originally-missing and artificially-missing values are filled with given parameter nan.
missing_mask – The mask indicates all missing values in X. In it, 1 indicates observed values, and 0 indicates missing values.
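The mask convention shared by the fill_and_get_mask functions (1 for observed values, 0 for missing values) can be sketched in NumPy. This is a minimal illustration, not PyGrinder's implementation: the helper name `fill_and_get_mask_sketch` and the float32 mask dtype are assumptions.

```python
import numpy as np


def fill_and_get_mask_sketch(X, nan=0):
    """Return (filled X, missing_mask) using the convention described above."""
    X = np.asarray(X, dtype=float)
    is_missing = np.isnan(X)
    # 1 marks observed values, 0 marks missing values.
    missing_mask = (~is_missing).astype(np.float32)
    # Replace every NaN with the fill value and keep all other entries.
    X_filled = np.where(is_missing, nan, X)
    return X_filled, missing_mask


X = np.array([[1.0, np.nan], [np.nan, 4.0]])
X_filled, missing_mask = fill_and_get_mask_sketch(X, nan=0)
# X_filled is [[1., 0.], [0., 4.]] and missing_mask is [[1., 0.], [0., 1.]].
print(X_filled, missing_mask)
```

Returning the mask together with the filled data lets a model tell a filled-in value apart from an observed value that happens to equal the fill value.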