pypots.data package#

pypots.data.dataset#

The package containing dataset classes for PyPOTS.

class pypots.data.dataset.BaseDataset(data, return_X_ori, return_X_pred, return_y, file_type='hdf5')[source]#

Bases: Dataset

Base dataset class for models in PyPOTS.

Parameters:
  • data (Union[dict, str]) – The dataset for model input, either a dictionary or a path string locating a data file in a supported format. If it is a dict, the key ‘X’ is mandatory, while ‘X_ori’, ‘X_pred’, and ‘y’ are optional. X is the time-series input data and may contain missing values; it should be array-like of shape [n_samples, n_steps (sequence length), n_features]. X_ori is optional: if X is constructed from X_ori with specially designed artificial missingness, your model may need X_ori for evaluation or loss calculation during training (e.g. SAITS); it should have the same shape as X. X_pred is optional and holds the forecasting targets for the model to predict in forecasting tasks. X_pred should be array-like of shape [n_samples, n_steps (sequence length), n_features], and its shape may differ from that of X: since X_pred contains the forecasting targets of X, both share the same n_samples, but their n_steps and n_features may differ. Like X, X_pred may contain missing values. y should be array-like of shape [n_samples], holding the classification labels of X. If data is a path string, it should point to a data file, e.g. an h5 file, which contains key-value pairs like a dict and must include the key ‘X’, etc.

  • return_X_ori (bool) – Whether to return X_ori and indicating_mask in __getitem__() if X_ori is given. If True, for example, during training of models that need the original X, the Dataset class will return X_ori in __getitem__() for model input. Otherwise, X_ori and indicating_mask won’t be included in the data list returned by __getitem__().

  • return_X_pred (bool) – Whether to return X_pred and X_pred_missing_mask in __getitem__() if X_pred is given. If True, for example, during training of forecasting models, the Dataset class will return X_pred in __getitem__() for model input. Otherwise, X_pred and its missing mask X_pred_missing_mask won’t be included in the data list returned by __getitem__().

  • return_y (bool) – Whether to return y (i.e. labels) in __getitem__() if they exist in the given data. If True, for example, during training of classification models, the Dataset class will return labels in __getitem__() for model input. Otherwise, labels won’t be included in the data returned by __getitem__(). This parameter exists because the same Dataset class is used for the training, validating, and testing stages. Big datasets stored in h5 files already have both X and y saved, but labels shouldn’t be read from the file during validating and testing; since _fetch_data_from_file() works for all three stages, this parameter is needed to make the distinction.

  • file_type (str) – The type of the given file if data is a path string.
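
A minimal usage sketch with toy data (the arrays below are illustrative, not from PyPOTS):

>>> import numpy as np
>>> from pypots.data.dataset import BaseDataset
>>> X = np.random.randn(100, 24, 10)  # [n_samples, n_steps, n_features]
>>> X[X < -1.5] = np.nan  # NaNs mark missing values
>>> y = np.random.randint(0, 2, size=100)  # binary classification labels
>>> dataset = BaseDataset(
...     {"X": X, "y": y},
...     return_X_ori=False,
...     return_X_pred=False,
...     return_y=True,
... )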

pypots.data.saving#

Data saving utilities.

pypots.data.saving.save_dict_into_h5(data_dict, saving_path, file_name=None)[source]#

Save the given data (in a dictionary) into the given h5 file.

Parameters:
  • data_dict (dict) – The data to be saved, should be a Python dictionary.

  • saving_path (str) – If file_name is not given, the given path should be a path to a file with the “.h5” suffix. If file_name is given, the given path should be a path to a directory. If parent directories don’t exist, they will be created.

  • file_name (str, optional (default=None)) – The name of the H5 file to be saved, which should have the “.h5” suffix. It’s optional. If not set, saving_path should be a path to a file with the “.h5” suffix.

Return type:

None
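
A minimal sketch of both calling conventions (the paths are illustrative):

>>> import numpy as np
>>> from pypots.data.saving import save_dict_into_h5
>>> data = {"X": np.random.randn(100, 24, 10)}
>>> save_dict_into_h5(data, "datasets/my_data.h5")
>>> save_dict_into_h5(data, "datasets", file_name="my_data.h5")  # equivalent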

pypots.data.saving.load_dict_from_h5(file_path)[source]#

Load the data from the given h5 file and return as a Python dictionary.

Notes

This implementation was inspired by https://github.com/SiggiGue/hdfdict/blob/master/hdfdict/hdfdict.py#L93

Parameters:

file_path (str) – The path to the h5 file.

Returns:

data – The data loaded from the given h5 file.

Return type:

dict
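
A minimal sketch loading the file saved above:

>>> from pypots.data.saving import load_dict_from_h5
>>> data = load_dict_from_h5("datasets/my_data.h5")
>>> data["X"].shape
(100, 24, 10)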

pypots.data.saving.pickle_dump(data, path)[source]#

Pickle the given object.

Parameters:
  • data (object) – The object to be pickled.

  • path (str) – Saving path.

Return type:

The given path if pickling succeeded, else None

pypots.data.saving.pickle_load(path)[source]#

Load pickled object from file.

Parameters:

path (str) – Local path of the pickled object.

Returns:

Pickled object.

Return type:

object
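
A minimal round-trip sketch for pickle_dump() and pickle_load() (the object and path are illustrative):

>>> from pypots.data.saving import pickle_dump, pickle_load
>>> stats = {"mean": 0.5, "std": 1.2}
>>> saved_path = pickle_dump(stats, "artifacts/scaler.pkl")  # the path on success, else None
>>> pickle_load("artifacts/scaler.pkl") == stats
True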

pypots.data.generating#

Utilities for random data generating.

pypots.data.generating.gene_complete_random_walk(n_samples=1000, n_steps=24, n_features=10, mu=0.0, std=1.0, random_state=None)[source]#

Generate complete random walk time-series data, i.e. having no missing values.

Parameters:
  • n_samples (int, default=1000) – The number of training time-series samples to generate.

  • n_steps (int, default=24) – The number of time steps (length) of generated time-series samples.

  • n_features (int, default=10) – The number of features (dimensions) of generated time-series samples.

  • mu (float, default=0.0) – Mean of the normal distribution from which the random walk steps are sampled.

  • std (float, default=1.0) – Standard deviation of the normal distribution from which the random walk steps are sampled.

  • random_state (int, default=None) – Random seed for data generation.

Returns:

ts_samples – Generated random walk time series.

Return type:

array, shape of [n_samples, n_steps, n_features]
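
A minimal sketch:

>>> from pypots.data.generating import gene_complete_random_walk
>>> samples = gene_complete_random_walk(n_samples=100, n_steps=24, n_features=10, random_state=42)
>>> samples.shape
(100, 24, 10)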

pypots.data.generating.gene_complete_random_walk_for_classification(n_classes=2, n_samples_each_class=500, n_steps=24, n_features=10, shuffle=True, random_state=None)[source]#

Generate complete random walk time-series data for the classification task.

Parameters:
  • n_classes (int, must be >=1, default=2) – Number of classes (types) of the generated data.

  • n_samples_each_class (int, default=500) – Number of samples for each class to generate.

  • n_steps (int, default=24) – Number of time steps in each sample.

  • n_features (int, default=10) – Number of features.

  • shuffle (bool, default=True) – Whether to shuffle generated samples. If not, you can separate samples of each class according to n_samples_each_class. For example, X_class0=X[:n_samples_each_class], X_class1=X[n_samples_each_class:n_samples_each_class*2]

  • random_state (int, default=None) – Random seed for data generation.

Return type:

Tuple[ndarray, ndarray]

Returns:

  • X (array, shape of [n_samples, n_steps, n_features]) – Generated time-series data.

  • y (array, shape of [n_samples]) – Labels indicating classes of time-series samples.
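
A minimal sketch; with shuffle=False the samples stay grouped by class, as described above:

>>> from pypots.data.generating import gene_complete_random_walk_for_classification
>>> X, y = gene_complete_random_walk_for_classification(
...     n_classes=2, n_samples_each_class=500, shuffle=False, random_state=42
... )
>>> X.shape, y.shape
((1000, 24, 10), (1000,))
>>> X_class0, X_class1 = X[:500], X[500:]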

pypots.data.generating.gene_complete_random_walk_for_anomaly_detection(n_samples=1000, n_steps=24, n_features=10, mu=0.0, std=1.0, anomaly_proportion=0.1, anomaly_fraction=0.02, anomaly_scale_factor=2.0, random_state=None)[source]#

Generate random walk time-series data for the anomaly-detection task.

Parameters:
  • n_samples (int, default=1000) – The number of training time-series samples to generate.

  • n_features (int, default=10) – The number of features (dimensions) of generated time-series samples.

  • n_steps (int, default=24) – The number of time steps (length) of generated time-series samples.

  • mu (float, default=0.0) – Mean of the normal distribution from which the random walk steps are sampled.

  • std (float, default=1.0) – Standard deviation of the normal distribution from which the random walk steps are sampled.

  • anomaly_proportion (float, default=0.1) – Proportion of anomaly samples in all samples.

  • anomaly_fraction (float, default=0.02) – Fraction of anomaly points in each anomaly sample.

  • anomaly_scale_factor (float, default=2.0) – Scale factor for value scaling to create anomaly points in time series samples.

  • random_state (int, default=None) – Random seed for data generation.

Return type:

Tuple[ndarray, ndarray]

Returns:

  • X (array, shape of [n_samples, n_steps, n_features]) – Generated time-series data.

  • y (array, shape of [n_samples]) – Labels indicating if time-series samples are anomalies.
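
A minimal sketch; with anomaly_proportion=0.1, about 10% of the labels in y should mark anomaly samples:

>>> from pypots.data.generating import gene_complete_random_walk_for_anomaly_detection
>>> X, y = gene_complete_random_walk_for_anomaly_detection(
...     n_samples=1000, anomaly_proportion=0.1, random_state=42
... )
>>> X.shape, y.shape
((1000, 24, 10), (1000,))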

pypots.data.generating.gene_random_walk(n_steps=24, n_features=10, n_classes=2, n_samples_each_class=1000, missing_rate=0.1)[source]#

Generate a random-walk dataset.

Parameters:
  • n_steps (int, default=24) – Number of time steps in each sample.

  • n_features (int, default=10) – Number of features.

  • n_classes (int, default=2) – Number of classes (types) of the generated data.

  • n_samples_each_class (int, default=1000) – Number of samples for each class to generate.

  • missing_rate (float, default=0.1) – The rate of randomly missing values to generate, should be in [0,1).

Returns:

data – A dictionary containing the generated data.

Return type:

dict
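
A minimal sketch; since the exact keys of the returned dict are not specified here, they are inspected rather than assumed:

>>> from pypots.data.generating import gene_random_walk
>>> data = gene_random_walk(n_steps=24, n_features=10, missing_rate=0.1)
>>> sorted(data.keys())  # inspect the provided splits and arrays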

pypots.data.generating.gene_physionet2012(artificially_missing_rate=0.1)[source]#

Generate a fully-prepared PhysioNet-2012 dataset for model testing.

Parameters:

artificially_missing_rate (float, default=0.1) – The rate of artificially missing values to generate for model evaluation. This ratio is calculated based on the number of observed values, i.e. if artificially_missing_rate = 0.1, then 10% of the observed values will be randomly masked as missing data and held out for model evaluation.

Returns:

data – A dictionary containing the generated PhysioNet-2012 dataset.

Return type:

dict
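
A minimal sketch (the PhysioNet-2012 data is fetched and preprocessed on first use; the dict keys are inspected rather than assumed):

>>> from pypots.data.generating import gene_physionet2012
>>> data = gene_physionet2012(artificially_missing_rate=0.1)
>>> sorted(data.keys())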

pypots.data.load_preprocessing#

Preprocessing functions to load supported open-source time-series datasets.

pypots.data.load_preprocessing.preprocess_physionet2012(data)[source]#

The preprocessing function for dataset PhysioNet-2012.

Parameters:

data (dict) – A data dict from tsdb.load_dataset().

Returns:

dataset – A dict containing the processed data, including:

  X : pandas.DataFrame
    A dataframe containing all time-series vectors from the 11988 patients, distinguished by the column RecordID.

  y : pandas.Series
    The 11988 classification labels of all patients, indicating whether they were deceased.

Return type:

dict
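
A minimal sketch; the dataset name passed to tsdb.load_dataset() is an assumption and may differ across tsdb versions:

>>> import tsdb
>>> from pypots.data.load_preprocessing import preprocess_physionet2012
>>> raw = tsdb.load_dataset("physionet_2012")  # dataset name is an assumption
>>> processed = preprocess_physionet2012(raw)
>>> processed["X"].head()  # time-series vectors, distinguished by column RecordID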

pypots.data.load_specific_datasets#

Functions to load supported open-source time-series datasets.

pypots.data.load_specific_datasets.list_supported_datasets()[source]#

Return the datasets natively supported by PyPOTS so far.

Returns:

A list including all supported datasets.

Return type:

SUPPORTED_DATASETS

pypots.data.load_specific_datasets.load_specific_dataset(dataset_name, use_cache=True)[source]#

Load specific datasets supported by PyPOTS. Unlike tsdb.load_dataset(), which merely returns raw data, load_specific_dataset performs some preprocessing operations, such as truncating time series to generate samples of equal length.

Parameters:
  • dataset_name (str) – The name of the dataset to be loaded, which should be supported, i.e. in SUPPORTED_DATASETS.

  • use_cache (bool) – Whether to use cache. This is an argument of tsdb.load_dataset().

Returns:

A dict containing the preprocessed dataset. Users only need to continue with further processing steps to generate the data they want, e.g. standardizing and splitting.

Return type:

data
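
A minimal sketch combining the two functions above (pick a dataset name from the returned list; ‘physionet_2012’ is an assumption):

>>> from pypots.data.load_specific_datasets import (
...     list_supported_datasets,
...     load_specific_dataset,
... )
>>> list_supported_datasets()  # names of all natively supported datasets
>>> data = load_specific_dataset("physionet_2012", use_cache=True)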

pypots.data.utils#

Data utils.

pypots.data.utils.turn_data_into_specified_dtype(data, dtype='tensor')[source]#

Turn the given data into the specified data type.

Return type:

Union[ndarray, Tensor]
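
A minimal sketch; the alternative dtype string ‘ndarray’ is an assumption inferred from the Union[ndarray, Tensor] return type:

>>> import numpy as np
>>> from pypots.data.utils import turn_data_into_specified_dtype
>>> arr = np.zeros((2, 24, 10))
>>> tensor = turn_data_into_specified_dtype(arr, dtype="tensor")  # torch.Tensor
>>> back = turn_data_into_specified_dtype(tensor, dtype="ndarray")  # assumed dtype name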

pypots.data.utils.parse_delta(missing_mask)[source]#

Generate the time-gap matrix (i.e. the delta matrix) from the missing mask. Please refer to [28] for its math definition.

Parameters:

missing_mask (Union[ndarray, Tensor]) – The binary mask indicating missing data (0 means a missing value, 1 means an observed value). Shape of [n_steps, n_features] or [n_samples, n_steps, n_features].

Returns:

The delta matrix indicating the time gaps between observed values, with the same shape as missing_mask.

Return type:

delta
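
A minimal sketch under the GRU-D/BRITS convention, where the gap starts at 0 and accumulates across missing steps (the exact values are illustrative):

>>> import numpy as np
>>> from pypots.data.utils import parse_delta
>>> missing_mask = np.array(
...     [[1, 1],
...      [0, 1],
...      [0, 1],
...      [1, 1]]
... )  # 0 = missing, 1 = observed, shape [n_steps, n_features]
>>> delta = parse_delta(missing_mask)
>>> # Feature 0 has two missing steps, so its gap grows, e.g. [0, 1, 2, 3];
>>> # feature 1 is fully observed, so its gap stays [0, 1, 1, 1].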

pypots.data.utils.sliding_window(time_series, window_len, sliding_len=None)[source]#

Generate time-series samples with the sliding-window method, truncating windows from time-series data with a given window length.

Given a time series of shape [seq_len, n_features] (seq_len is the total sequence length of the time series), this sliding_window function will generate time-series samples from it with the sliding-window method. The number of generated samples is seq_len//sliding_len, and the returned numpy ndarray has shape [seq_len//sliding_len, window_len, n_features].

Parameters:
  • time_series (np.ndarray) – The time-series data, len(shape)=2, i.e. of shape [total_length, n_features].

  • window_len (int) – The length of the sliding window, i.e. the number of time steps in the generated data samples.

  • sliding_len (int, default=None) – The sliding length of the window at each moving step. It will be set to window_len if None.

Returns:

samples – The generated time-series data samples of shape [seq_len//sliding_len, window_len, n_features].

Return type:

np.ndarray
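
A minimal sketch with non-overlapping windows (leaving sliding_len as None makes it default to window_len):

>>> import numpy as np
>>> from pypots.data.utils import sliding_window
>>> time_series = np.random.randn(96, 10)  # [total_length, n_features]
>>> samples = sliding_window(time_series, window_len=24)
>>> samples.shape
(4, 24, 10)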