pypots.data package¶
pypots.data.dataset¶
The package that includes dataset classes for PyPOTS.
- class pypots.data.dataset.BaseDataset(data, return_X_ori, return_X_pred, return_y, file_type='hdf5')[source]¶
Bases:
Dataset
Base dataset class for models in PyPOTS.
- Parameters:
data (Union[dict, str]) – The dataset for model input; it should be a dictionary or a path string locating a data file in a supported format. If it is a dict, 'X' is mandatory while 'X_ori', 'X_pred', and 'y' are optional. X is the time-series input data and may contain missing values; it should be array-like of shape [n_samples, n_steps (sequence length), n_features]. X_ori is optional: if X is constructed from X_ori with specially designed artificial missingness, your model may need X_ori for evaluation or loss calculation during training (e.g. SAITS); it should have the same shape as X. X_pred is optional and holds the forecasting targets for the model to predict in forecasting tasks. X_pred should be array-like of shape [n_samples, n_steps (sequence length), n_features], and its shape may differ from that of X. Keep in mind, though, that X_pred contains time-series forecasting results of X, hence the two have the same n_samples, while their n_steps and n_features may differ. X_pred may contain missing values, just like X. y should be array-like of shape [n_samples], holding the classification labels of X. If data is a path string, it should point to a data file, e.g. an h5 file, which contains key-value pairs like a dict and must include keys such as 'X'.
return_X_ori (bool) – Whether to return X_ori and indicating_mask in __getitem__() if they are given. If True, for example, during training of models that need the original X, the Dataset class will return X_ori in __getitem__() for model input. Otherwise, X_ori and indicating_mask won't be included in the data list returned by __getitem__().
return_X_pred (bool) – Whether to return X_pred and X_pred_missing_mask in __getitem__() if they are given. If True, for example, during training of forecasting models, the Dataset class will return the forecasting X in __getitem__() for model input. Otherwise, X_pred and its missing mask X_pred_missing_mask won't be included in the data list returned by __getitem__().
return_y (bool) – Whether to return y (i.e. labels) in __getitem__() if they exist in the given data. If True, for example, during training of classification models, the Dataset class will return labels in __getitem__() for model input. Otherwise, labels won't be included in the data returned by __getitem__(). This parameter exists because the same Dataset class is used for all training/validating/testing stages: big datasets stored in h5 files already have both X and y saved, but labels must not be read from the file during validating and testing, and _fetch_data_from_file() serves all three stages, so this parameter makes the distinction.
file_type (str) – The type of the given file if data is a path string.
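For illustration, here is a minimal sketch of assembling a data dict and feeding it to BaseDataset (the shapes and the NaN-injection threshold are arbitrary choices for this example; it assumes __len__() reports the sample count):

```python
import numpy as np
from pypots.data.dataset import BaseDataset

n_samples, n_steps, n_features = 32, 24, 5
X = np.random.randn(n_samples, n_steps, n_features)
X[X < -1.5] = np.nan                          # inject some missing values
y = np.random.randint(0, 2, size=n_samples)   # classification labels of X

dataset = BaseDataset(
    data={"X": X, "y": y},
    return_X_ori=False,
    return_X_pred=False,
    return_y=True,
)
print(len(dataset))  # 32, i.e. n_samples
```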
pypots.data.saving¶
Data saving utilities.
- pypots.data.saving.save_dict_into_h5(data_dict, saving_path, file_name=None)[source]¶
Save the given data (in a dictionary) into the given h5 file.
- Parameters:
data_dict (dict) – The data to be saved, which should be a Python dictionary.
saving_path (str) – If file_name is not given, this should be a path to a file with the ".h5" suffix. If file_name is given, this should be a path to a directory. If parent directories don't exist, they will be created.
file_name (str, optional (default=None)) – The name of the H5 file to be saved, which should carry the ".h5" suffix. It's optional: if not set, saving_path should be a path to a file with the ".h5" suffix.
- Return type:
None
- pypots.data.saving.load_dict_from_h5(file_path)[source]¶
Load the data from the given h5 file and return it as a Python dictionary.
Notes
This implementation was inspired by https://github.com/SiggiGue/hdfdict/blob/master/hdfdict/hdfdict.py#L93
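A short usage sketch of the two helpers above (the file path is illustrative):

```python
import numpy as np
from pypots.data.saving import save_dict_into_h5, load_dict_from_h5

data = {"X": np.random.randn(8, 24, 5), "y": np.arange(8)}
# saving_path ends with ".h5", so file_name can be omitted
save_dict_into_h5(data, saving_path="./example_dataset.h5")
loaded = load_dict_from_h5("./example_dataset.h5")
print(loaded.keys())  # dict_keys(['X', 'y'])
```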
pypots.data.generating¶
Utilities for random data generation.
pypots.data.load_specific_datasets¶
Functions to load supported open-source time-series datasets.
- pypots.data.load_specific_datasets.list_supported_datasets()[source]¶
Return the datasets natively supported by PyPOTS so far.
- Returns:
A list including all supported datasets.
- Return type:
SUPPORTED_DATASETS
- pypots.data.load_specific_datasets.load_specific_dataset(dataset_name, use_cache=True)[source]¶
Load specific datasets supported by PyPOTS. Different from tsdb.load(), which merely returns raw data, load_specific_dataset here performs some preprocessing operations, like truncating time series to generate samples with the same length.
- Parameters:
dataset_name (str) – The name of the dataset to load, which should be one of the datasets returned by list_supported_datasets().
use_cache (bool) – Whether to use the locally cached copy of the dataset if one exists.
- Returns:
A dict containing the preprocessed dataset. Users only need to continue the preprocessing steps to generate the data they want, e.g. standardizing and splitting.
- Return type:
data
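A usage sketch of the two functions above ('physionet_2012' is used here on the assumption that it appears in the supported list; check list_supported_datasets() for the actual names):

```python
from pypots.data.load_specific_datasets import (
    list_supported_datasets,
    load_specific_dataset,
)

print(list_supported_datasets())   # all natively supported dataset names
data = load_specific_dataset("physionet_2012", use_cache=True)
print(data.keys())                 # the preprocessed dataset, returned as a dict
```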
pypots.data.utils¶
Data utils.
- pypots.data.utils.turn_data_into_specified_dtype(data, dtype='tensor')[source]¶
Turn the given data into the specified data type.
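A brief usage sketch (assuming dtype accepts "tensor" and "ndarray"; only "tensor" is confirmed by the default in the signature above):

```python
import numpy as np
from pypots.data.utils import turn_data_into_specified_dtype

arr = np.zeros((4, 3))
tensor = turn_data_into_specified_dtype(arr, dtype="tensor")    # ndarray -> torch.Tensor
back = turn_data_into_specified_dtype(tensor, dtype="ndarray")  # back to ndarray (assumed option)
print(type(tensor), type(back))
```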
- pypots.data.utils.parse_delta(missing_mask)[source]¶
Generate the time-gap matrix (i.e. the delta matrix) from the missing mask. Please refer to [38] for its math definition.
- Parameters:
missing_mask (Union[ndarray, Tensor]) – Binary masks indicating missing data (0 means a missing value, 1 means an observed value). Shape of [n_steps, n_features] or [n_samples, n_steps, n_features].
- Returns:
The delta matrix indicating the time gaps between observed values, with the same shape as missing_mask.
- Return type:
delta
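For intuition, here is a minimal NumPy sketch of the time-gap computation for a single 2-D sample, assuming unit time intervals between consecutive steps (a standalone illustration following the definition in [38], not the actual pypots implementation):

```python
import numpy as np

def parse_delta_2d(missing_mask: np.ndarray) -> np.ndarray:
    """Compute the delta (time-gap) matrix for one sample.

    missing_mask: shape [n_steps, n_features]; 1 = observed, 0 = missing.
    """
    n_steps, n_features = missing_mask.shape
    delta = np.zeros((n_steps, n_features))
    for t in range(1, n_steps):
        # If the previous value was observed, the gap resets to one step;
        # otherwise it keeps accumulating.
        delta[t] = np.where(missing_mask[t - 1] == 1, 1, 1 + delta[t - 1])
    return delta

mask = np.array([[1, 0],
                 [0, 0],
                 [1, 1]])
print(parse_delta_2d(mask))
# [[0. 0.]
#  [1. 1.]
#  [2. 2.]]
```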
- pypots.data.utils.sliding_window(time_series, window_len, sliding_len=None)[source]¶
Generate time-series samples with the sliding-window method, truncating windows from time-series data with a given sequence length.
Given a time series of shape [seq_len, n_features] (where seq_len is the total sequence length of the time series), this sliding_window function generates time-series samples from it with the sliding-window method. The number of generated samples is seq_len//sliding_len, and the returned numpy ndarray has shape [seq_len//sliding_len, n_steps, n_features].
- Parameters:
time_series (Union[ndarray, Tensor]) – The time-series data, with len(shape) == 2, i.e. shape [total_length, n_features].
window_len (int) – The length of the sliding window, i.e. the number of time steps in each generated data sample.
sliding_len (Optional[int]) – The sliding length of the window for each moving step. It will be set to window_len if None.
- Returns:
The generated time-series data samples of shape [seq_len//sliding_len, n_steps, n_features].
- Return type:
samples
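To make the truncation behavior concrete, here is a hypothetical standalone re-implementation of the windowing logic (pypots.data.utils.sliding_window is the real API; note how an incomplete tail window is dropped, as mentioned under inverse_sliding_window below):

```python
import numpy as np

def sliding_window_sketch(time_series, window_len, sliding_len=None):
    # Non-overlapping windows by default, as described above.
    sliding_len = window_len if sliding_len is None else sliding_len
    total_len = time_series.shape[0]
    # Only complete windows are kept; an incomplete tail is dropped.
    start_indices = range(0, total_len - window_len + 1, sliding_len)
    return np.stack([time_series[i:i + window_len] for i in start_indices])

ts = np.arange(20, dtype=float).reshape(10, 2)  # [total_length=10, n_features=2]
samples = sliding_window_sketch(ts, window_len=4, sliding_len=2)
print(samples.shape)  # (4, 4, 2)
```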
- pypots.data.utils.inverse_sliding_window(X, sliding_len)[source]¶
Restore the original time-series data from the generated sliding-window samples. Note that this is the inverse operation of the sliding_window function, but there is no guarantee that the restored data is identical to the original, because: 1) the sliding length may be larger than the window size, leaving gaps between the restored segments; 2) if values in the samples were changed, the overlapping parts may differ from the original data after averaging; 3) incomplete samples at the tail may have been dropped during the sliding-window operation, so the restored data may be shorter than the original.
- Parameters:
X – The generated time-series samples with sliding window method, shape of [n_samples, n_steps, n_features], where n_steps is the window size of the used sliding window method.
sliding_len – The sliding length of the window for each moving step in the sliding window method used to generate X.
- Returns:
The restored time-series data with shape of [total_length, n_features].
- Return type:
restored_data
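A short round-trip sketch using the two functions documented above (with non-overlapping windows, so the restored series matches the original except for any dropped tail):

```python
import numpy as np
from pypots.data.utils import sliding_window, inverse_sliding_window

ts = np.random.randn(100, 3)  # [total_length, n_features]
samples = sliding_window(ts, window_len=10, sliding_len=10)  # non-overlapping windows
restored = inverse_sliding_window(samples, sliding_len=10)
print(samples.shape, restored.shape)  # e.g. (10, 10, 3) and (100, 3)
```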