pypots.data package¶
pypots.data.dataset¶
The package that includes dataset classes for PyPOTS.
- class pypots.data.dataset.BaseDataset(data, return_X_ori, return_X_pred, return_y, file_type='hdf5')[source]¶
Bases:
Dataset
Base dataset class for models in PyPOTS.
- Parameters:
data (Union[dict, str]) – The dataset for model input; it should be a dictionary or a path string locating a data file in a supported format. If it is a dict, 'X' is mandatory while 'X_ori', 'X_pred', and 'y' are optional. X is the time-series input data and may contain missing values; it should be array-like of shape [n_samples, n_steps (sequence length), n_features]. X_ori is optional: if X is constructed from X_ori with specially designed artificial missingness, your model may need X_ori for evaluation or loss calculation during training (e.g. SAITS); it should have the same shape as X. X_pred is optional and holds the forecasting targets for the model to predict in forecasting tasks. X_pred should be array-like of shape [n_samples, n_steps (sequence length), n_features], and its shape may differ from that of X. Keep in mind, though, that X_pred contains time-series forecasting results of X, hence the two have the same n_samples, while their n_steps and n_features may differ. X_pred may contain missing values, just like X. y should be array-like of shape [n_samples], holding the classification labels of X. If data is a path string, it should point to a data file, e.g. an h5 file, which contains key-value pairs like a dict and must include keys such as 'X'.
return_X_ori (bool) – Whether to return X_ori and indicating_mask in __getitem__() if they are given. If True, for example, during training of models that need the original X, the Dataset class will return X_ori in __getitem__() for model input. Otherwise, X_ori and indicating_mask won't be included in the data list returned by __getitem__().
return_X_pred (bool) – Whether to return X_pred and X_pred_missing_mask in __getitem__() if they are given. If True, for example, during training of forecasting models, the Dataset class will return the forecasting X in __getitem__() for model input. Otherwise, X_pred and its missing mask X_pred_missing_mask won't be included in the data list returned by __getitem__().
return_y (bool) – Whether to return y (i.e. labels) in __getitem__() if they exist in the given data. If True, for example, during training of classification models, the Dataset class will return labels in __getitem__() for model input. Otherwise, labels won't be included in the data returned by __getitem__(). This parameter exists because the same Dataset class is used for all training/validating/testing stages: big datasets stored in h5 files already have both X and y saved, but labels must not be read from the file during validating and testing, and _fetch_data_from_file() serves all three stages, so this parameter makes the distinction.
file_type (str) – The type of the given file if data is a path string.
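For illustration, here is a minimal sketch of assembling a data dict and feeding it to BaseDataset (the shapes and the NaN-injection threshold are arbitrary choices for this example; it assumes __len__() reports the sample count):

```python
import numpy as np
from pypots.data.dataset import BaseDataset

n_samples, n_steps, n_features = 32, 24, 5
X = np.random.randn(n_samples, n_steps, n_features)
X[X < -1.5] = np.nan                          # inject some missing values
y = np.random.randint(0, 2, size=n_samples)   # classification labels of X

dataset = BaseDataset(
    data={"X": X, "y": y},
    return_X_ori=False,
    return_X_pred=False,
    return_y=True,
)
print(len(dataset))  # 32, i.e. n_samples
```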
pypots.data.saving¶
Data saving utilities.
- pypots.data.saving.save_dict_into_h5(data_dict, saving_path, file_name=None)[source]¶
Save the given data (in a dictionary) into the given h5 file.
- Parameters:
data_dict (dict) – The data to be saved, which should be a Python dictionary.
saving_path (str) – If file_name is not given, this should be a path to a file with the ".h5" suffix. If file_name is given, this should be a path to a directory. If parent directories don't exist, they will be created.
file_name (str, optional (default=None)) – The name of the H5 file to be saved, which should carry the ".h5" suffix. It's optional: if not set, saving_path should be a path to a file with the ".h5" suffix.
- Return type:
None
- pypots.data.saving.load_dict_from_h5(file_path)[source]¶
Load the data from the given h5 file and return it as a Python dictionary.
Notes
This implementation was inspired by https://github.com/SiggiGue/hdfdict/blob/master/hdfdict/hdfdict.py#L93
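A short usage sketch of the two helpers above (the file path is illustrative):

```python
import numpy as np
from pypots.data.saving import save_dict_into_h5, load_dict_from_h5

data = {"X": np.random.randn(8, 24, 5), "y": np.arange(8)}
# saving_path ends with ".h5", so file_name can be omitted
save_dict_into_h5(data, saving_path="./example_dataset.h5")
loaded = load_dict_from_h5("./example_dataset.h5")
print(loaded.keys())  # dict_keys(['X', 'y'])
```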
pypots.data.generating¶
Utilities for random data generation.
pypots.data.load_specific_datasets¶
Functions to load supported open-source time-series datasets.
- pypots.data.load_specific_datasets.list_supported_datasets()[source]¶
Return the datasets natively supported by PyPOTS so far.
- Returns:
A list including all supported datasets.
- Return type:
SUPPORTED_DATASETS
- pypots.data.load_specific_datasets.load_specific_dataset(dataset_name, use_cache=True)[source]¶
Load specific datasets supported by PyPOTS. Different from tsdb.load(), which merely returns raw data, load_specific_dataset here performs some preprocessing operations, like truncating time series to generate samples with the same length.
- Parameters:
dataset_name (str) – The name of the dataset to load, which should be one of the datasets returned by list_supported_datasets().
use_cache (bool) – Whether to use the locally cached copy of the dataset if one exists.
- Returns:
A dict containing the preprocessed dataset. Users only need to continue the preprocessing steps to generate the data they want, e.g. standardizing and splitting.
- Return type:
data
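A usage sketch of the two functions above ('physionet_2012' is used here on the assumption that it appears in the supported list; check list_supported_datasets() for the actual names):

```python
from pypots.data.load_specific_datasets import (
    list_supported_datasets,
    load_specific_dataset,
)

print(list_supported_datasets())   # all natively supported dataset names
data = load_specific_dataset("physionet_2012", use_cache=True)
print(data.keys())                 # the preprocessed dataset, returned as a dict
```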
pypots.data.utils¶
Data utils.
- pypots.data.utils.turn_data_into_specified_dtype(data, dtype='tensor')[source]¶
Turn the given data into the specified data type.
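A brief usage sketch (assuming dtype accepts "tensor" and "ndarray"; only "tensor" is confirmed by the default in the signature above):

```python
import numpy as np
from pypots.data.utils import turn_data_into_specified_dtype

arr = np.zeros((4, 3))
tensor = turn_data_into_specified_dtype(arr, dtype="tensor")    # ndarray -> torch.Tensor
back = turn_data_into_specified_dtype(tensor, dtype="ndarray")  # back to ndarray (assumed option)
print(type(tensor), type(back))
```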
- pypots.data.utils.parse_delta(missing_mask)[source]¶
Generate the time-gap matrix (i.e. the delta matrix) from the missing mask. Please refer to [38] for its math definition.
- Parameters:
missing_mask (Union[ndarray, Tensor]) – Binary masks indicating missing data (0 means a missing value, 1 means an observed value). Shape of [n_steps, n_features] or [n_samples, n_steps, n_features].
- Returns:
The delta matrix indicating the time gaps between observed values, with the same shape as missing_mask.
- Return type:
delta
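For intuition, here is a minimal NumPy sketch of the time-gap computation for a single 2-D sample, assuming unit time intervals between consecutive steps (a standalone illustration following the definition in [38], not the actual pypots implementation):

```python
import numpy as np

def parse_delta_2d(missing_mask: np.ndarray) -> np.ndarray:
    """Compute the delta (time-gap) matrix for one sample.

    missing_mask: shape [n_steps, n_features]; 1 = observed, 0 = missing.
    """
    n_steps, n_features = missing_mask.shape
    delta = np.zeros((n_steps, n_features))
    for t in range(1, n_steps):
        # If the previous value was observed, the gap resets to one step;
        # otherwise it keeps accumulating.
        delta[t] = np.where(missing_mask[t - 1] == 1, 1, 1 + delta[t - 1])
    return delta

mask = np.array([[1, 0],
                 [0, 0],
                 [1, 1]])
print(parse_delta_2d(mask))
# [[0. 0.]
#  [1. 1.]
#  [2. 2.]]
```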
- pypots.data.utils.sliding_window(time_series, window_len, sliding_len=None)[source]¶
Generate time-series samples with the sliding-window method, truncating windows from time-series data with a given sequence length.
Given a time series of shape [seq_len, n_features] (where seq_len is the total sequence length of the time series), this sliding_window function generates time-series samples from it with the sliding-window method. The number of generated samples is seq_len//sliding_len, and the returned numpy ndarray has shape [seq_len//sliding_len, n_steps, n_features].
- Parameters:
time_series (Union[ndarray, Tensor]) – The time-series data, with len(shape) == 2, i.e. shape [total_length, n_features].
window_len (int) – The length of the sliding window, i.e. the number of time steps in each generated data sample.
sliding_len (Optional[int]) – The sliding length of the window for each moving step. It will be set to window_len if None.
- Returns:
The generated time-series data samples of shape [seq_len//sliding_len, n_steps, n_features].
- Return type:
samples
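To make the truncation behavior concrete, here is a hypothetical standalone re-implementation of the windowing logic (pypots.data.utils.sliding_window is the real API; note how an incomplete tail window is dropped, as mentioned under inverse_sliding_window below):

```python
import numpy as np

def sliding_window_sketch(time_series, window_len, sliding_len=None):
    # Non-overlapping windows by default, as described above.
    sliding_len = window_len if sliding_len is None else sliding_len
    total_len = time_series.shape[0]
    # Only complete windows are kept; an incomplete tail is dropped.
    start_indices = range(0, total_len - window_len + 1, sliding_len)
    return np.stack([time_series[i:i + window_len] for i in start_indices])

ts = np.arange(20, dtype=float).reshape(10, 2)  # [total_length=10, n_features=2]
samples = sliding_window_sketch(ts, window_len=4, sliding_len=2)
print(samples.shape)  # (4, 4, 2)
```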
- pypots.data.utils.inverse_sliding_window(X, sliding_len)[source]¶
Restore the original time-series data from the generated sliding-window samples. Note that this is the inverse operation of the sliding_window function, but there is no guarantee that the restored data is identical to the original, because: 1) the sliding length may be larger than the window size, leaving gaps between the restored segments; 2) if values in the samples were changed, the overlapping parts may differ from the original data after averaging; 3) incomplete samples at the tail may have been dropped during the sliding-window operation, so the restored data may be shorter than the original.
- Parameters:
X – The generated time-series samples with sliding window method, shape of [n_samples, n_steps, n_features], where n_steps is the window size of the used sliding window method.
sliding_len – The sliding length of the window for each moving step in the sliding window method used to generate X.
- Returns:
The restored time-series data with shape of [total_length, n_features].
- Return type:
restored_data
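A short round-trip sketch using the two functions documented above (with non-overlapping windows, so the restored series matches the original except for any dropped tail):

```python
import numpy as np
from pypots.data.utils import sliding_window, inverse_sliding_window

ts = np.random.randn(100, 3)  # [total_length, n_features]
samples = sliding_window(ts, window_len=10, sliding_len=10)  # non-overlapping windows
restored = inverse_sliding_window(samples, sliding_len=10)
print(samples.shape, restored.shape)  # e.g. (10, 10, 3) and (100, 3)
```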