pypots.data package#
pypots.data.dataset#
The package that includes dataset classes for PyPOTS.
- class pypots.data.dataset.BaseDataset(data, return_X_ori, return_X_pred, return_y, file_type='hdf5')[source]#
Bases:
Dataset
Base dataset class for models in PyPOTS.
- Parameters:
data (Union[dict, str]) – The dataset for model input, which should be a dictionary or a path string locating a data file in a supported format. If it is a dict, 'X' is mandatory while 'X_ori', 'X_pred', and 'y' are optional. X is time-series data for input and may contain missing values. It should be array-like of shape [n_samples, n_steps (sequence length), n_features]. X_ori is optional. If X is constructed from X_ori with specially designed artificial missingness, your model may need X_ori for evaluation or loss calculation during training (e.g. SAITS). It should have the same shape as X. X_pred is optional; it holds the forecasting targets for the model to predict in forecasting tasks. X_pred should be array-like of shape [n_samples, n_steps (sequence length), n_features], and its shape may differ from that of X. Note, however, that X_pred contains the forecasting targets of X, so it has the same number of samples as X, i.e. their n_samples are equal, while their n_steps and n_features may differ. Like X, X_pred may contain missing values. y should be array-like of shape [n_samples], holding the classification labels of X. If data is a path string, it should point to a data file, e.g. an h5 file, which contains key-value pairs like a dict and must include the key 'X', among others.
return_X_ori (bool) – Whether to return X_ori and indicating_mask in __getitem__() if they are given. If True, for example during training of models that need the original X, the Dataset class will return X_ori in __getitem__() for model input. Otherwise, X_ori and the indicating mask won't be included in the data list returned by __getitem__().
return_X_pred (bool) – Whether to return X_pred and X_pred_missing_mask in __getitem__() if they are given. If True, for example during training of forecasting models, the Dataset class will return the forecasting target X_pred in __getitem__() for model input. Otherwise, X_pred and its missing mask X_pred_missing_mask won't be included in the data list returned by __getitem__().
return_y (bool) – Whether to return y (i.e. labels) in __getitem__() if they exist in the given data. If True, for example during training of classification models, the Dataset class will return labels in __getitem__() for model input. Otherwise, labels won't be included in the data returned by __getitem__(). This parameter exists because the same Dataset class is needed for the training, validating, and testing stages. Big datasets stored in h5 files already have both X and y saved, but labels must not be read from the file when validating and testing with _fetch_data_from_file(), which serves all three stages. Hence this parameter for distinction.
file_type (str) – The type of the given file if data is a path string.
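For illustration, a data dict of the shape described above can be assembled with NumPy like this (the array values here are arbitrary toy data, not a real dataset):

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_steps, n_features = 32, 24, 10

# Fully observed series, then punch artificial holes into a copy:
X_ori = rng.standard_normal((n_samples, n_steps, n_features))
X = X_ori.copy()
holes = rng.random(X.shape) < 0.1   # ~10% artificial missingness
X[holes] = np.nan                   # missing values are represented as NaNs

y = rng.integers(0, 2, size=n_samples)  # optional classification labels

data = {
    "X": X,          # mandatory: model input, may contain missing values
    "X_ori": X_ori,  # optional: original series for loss/evaluation (e.g. SAITS)
    "y": y,          # optional: labels of shape [n_samples]
}
```

A dict like this can be passed directly as the `data` argument, while the h5-file path alternative expects the same keys stored in the file.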
pypots.data.saving#
Data saving utilities.
- pypots.data.saving.save_dict_into_h5(data_dict, saving_path, file_name=None)[source]#
Save the given data (in a dictionary) into the given h5 file.
- Parameters:
data_dict (dict) – The data to be saved, which should be a Python dictionary.
saving_path (str) – If file_name is not given, the given path should be a path to a file with the ".h5" suffix. If file_name is given, the given path should be a path to a directory. If parent directories don't exist, they will be created.
file_name (str, optional (default=None)) – The name of the H5 file to be saved, which should carry the ".h5" suffix. It's optional. If not set, saving_path should be a path to a file with the ".h5" suffix.
- Return type:
None
- pypots.data.saving.load_dict_from_h5(file_path)[source]#
Load the data from the given h5 file and return as a Python dictionary.
Notes
This implementation was inspired by https://github.com/SiggiGue/hdfdict/blob/master/hdfdict/hdfdict.py#L93
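The round trip these two helpers perform can be sketched with plain h5py: nested dicts map to h5 groups and arrays map to datasets. This is an illustrative sketch of the idea (as in the hdfdict implementation linked above), not the PyPOTS code itself:

```python
import os
import tempfile

import h5py
import numpy as np

# A nested dict of the kind save_dict_into_h5 handles.
data_dict = {"train": {"X": np.random.randn(4, 24, 10)}, "n_classes": np.array(2)}

path = os.path.join(tempfile.mkdtemp(), "toy.h5")

def save(h5group, d):
    # Recursively mirror the dict structure: sub-dicts become groups,
    # array-likes become datasets.
    for key, value in d.items():
        if isinstance(value, dict):
            save(h5group.create_group(key), value)
        else:
            h5group.create_dataset(key, data=value)

def load(h5group):
    # Recursively rebuild a Python dict; item[()] reads a full dataset.
    return {
        key: load(item) if isinstance(item, h5py.Group) else item[()]
        for key, item in h5group.items()
    }

with h5py.File(path, "w") as f:
    save(f, data_dict)
with h5py.File(path, "r") as f:
    restored = load(f)
```

The restored dict has the same nesting as the original, which is what lets a single h5 file stand in for the `data` dict accepted by the dataset classes.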
pypots.data.generating#
Utilities for random data generating.
- pypots.data.generating.gene_complete_random_walk(n_samples=1000, n_steps=24, n_features=10, mu=0.0, std=1.0, random_state=None)[source]#
Generate complete random walk time-series data, i.e. having no missing values.
- Parameters:
n_samples (int, default=1000) – The number of training time-series samples to generate.
n_steps (int, default=24) – The number of time steps (length) of generated time-series samples.
n_features (int, default=10) – The number of features (dimensions) of generated time-series samples.
mu (float, default=0.0) – Mean of the normal distribution from which random-walk steps are sampled.
std (float, default=1.0) – Standard deviation of the normal distribution from which random-walk steps are sampled.
random_state (int, default=None) – Random seed for data generation.
- Returns:
ts_samples – Generated random walk time series.
- Return type:
array, shape of [n_samples, n_steps, n_features]
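The generation described above amounts to a cumulative sum of Gaussian steps along the time axis. A minimal NumPy sketch of that idea (not the library implementation):

```python
import numpy as np

def random_walk(n_samples=1000, n_steps=24, n_features=10,
                mu=0.0, std=1.0, seed=None):
    rng = np.random.default_rng(seed)
    # Each step is drawn from N(mu, std); a walk is the running sum of steps.
    steps = rng.normal(mu, std, size=(n_samples, n_steps, n_features))
    return np.cumsum(steps, axis=1)

ts_samples = random_walk(n_samples=8, seed=0)
```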
- pypots.data.generating.gene_complete_random_walk_for_classification(n_classes=2, n_samples_each_class=500, n_steps=24, n_features=10, shuffle=True, random_state=None)[source]#
Generate complete random walk time-series data for the classification task.
- Parameters:
n_classes (int, must be >= 1, default=2) – Number of classes (types) of the generated data.
n_samples_each_class (int, default=500) – Number of samples for each class to generate.
n_steps (int, default=24) – Number of time steps in each sample.
n_features (int, default=10) – Number of features.
shuffle (bool, default=True) – Whether to shuffle generated samples. If not, you can separate samples of each class according to n_samples_each_class. For example, X_class0=X[:n_samples_each_class], X_class1=X[n_samples_each_class:n_samples_each_class*2]
random_state (int, default=None) – Random seed for data generation.
- Returns:
X (array, shape of [n_samples, n_steps, n_features]) – Generated time-series data.
y (array, shape of [n_samples]) – Labels indicating classes of time-series samples.
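One common way to make the classes separable is to give each class a different step mean; that choice is an assumption of this sketch, not necessarily how PyPOTS generates its classes:

```python
import numpy as np

def random_walk_for_classification(n_classes=2, n_samples_each_class=500,
                                   n_steps=24, n_features=10,
                                   shuffle=True, seed=None):
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [], []
    for c in range(n_classes):
        # Shift the step mean per class so the walks drift apart.
        steps = rng.normal(loc=float(c), scale=1.0,
                           size=(n_samples_each_class, n_steps, n_features))
        X_parts.append(np.cumsum(steps, axis=1))
        y_parts.append(np.full(n_samples_each_class, c))
    X, y = np.concatenate(X_parts), np.concatenate(y_parts)
    if shuffle:
        idx = rng.permutation(len(X))
        X, y = X[idx], y[idx]
    return X, y

X, y = random_walk_for_classification(n_samples_each_class=50, seed=0)
```

With shuffle=False, samples stay grouped by class, which is what makes the slicing recipe in the shuffle parameter description above work.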
- pypots.data.generating.gene_complete_random_walk_for_anomaly_detection(n_samples=1000, n_steps=24, n_features=10, mu=0.0, std=1.0, anomaly_proportion=0.1, anomaly_fraction=0.02, anomaly_scale_factor=2.0, random_state=None)[source]#
Generate random walk time-series data for the anomaly-detection task.
- Parameters:
n_samples (int, default=1000) – The number of training time-series samples to generate.
n_features (int, default=10) – The number of features (dimensions) of generated time-series samples.
n_steps (int, default=24) – The number of time steps (length) of generated time-series samples.
mu (float, default=0.0) – Mean of the normal distribution from which random-walk steps are sampled.
std (float, default=1.0) – Standard deviation of the normal distribution from which random-walk steps are sampled.
anomaly_proportion (float, default=0.1) – Proportion of anomaly samples in all samples.
anomaly_fraction (float, default=0.02) – Fraction of anomaly points in each anomaly sample.
anomaly_scale_factor (float, default=2.0) – Scale factor for value scaling to create anomaly points in time series samples.
random_state (int, default=None) – Random seed for data generation.
- Returns:
X (array, shape of [n_samples, n_steps, n_features]) – Generated time-series data.
y (array, shape of [n_samples]) – Labels indicating if time-series samples are anomalies.
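The three anomaly parameters compose as follows: a proportion of the samples are marked anomalous, and within each such sample a fraction of the points is scaled. A hedged NumPy sketch of that mechanism (the exact scaling rule PyPOTS uses may differ):

```python
import numpy as np

def random_walk_for_anomaly_detection(n_samples=1000, n_steps=24, n_features=10,
                                      anomaly_proportion=0.1,
                                      anomaly_fraction=0.02,
                                      anomaly_scale_factor=2.0, seed=None):
    rng = np.random.default_rng(seed)
    X = np.cumsum(rng.normal(size=(n_samples, n_steps, n_features)), axis=1)
    y = np.zeros(n_samples, dtype=int)
    # Mark anomaly_proportion of the samples as anomalous.
    anomaly_idx = rng.choice(n_samples, size=int(n_samples * anomaly_proportion),
                             replace=False)
    y[anomaly_idx] = 1
    n_points = n_steps * n_features
    n_anomaly_points = max(1, int(n_points * anomaly_fraction))
    for i in anomaly_idx:
        # Scale anomaly_fraction of this sample's points to create outliers.
        flat = X[i].reshape(-1)  # view into X[i]
        pts = rng.choice(n_points, size=n_anomaly_points, replace=False)
        flat[pts] *= anomaly_scale_factor
    return X, y

X, y = random_walk_for_anomaly_detection(n_samples=100, seed=0)
```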
- pypots.data.generating.gene_random_walk(n_steps=24, n_features=10, n_classes=2, n_samples_each_class=1000, missing_rate=0.1)[source]#
Generate a random-walk dataset.
- Parameters:
n_steps (int, default=24) – Number of time steps in each sample.
n_features (int, default=10) – Number of features.
n_classes (int, default=2) – Number of classes (types) of the generated data.
n_samples_each_class (int, default=1000) – Number of samples for each class to generate.
missing_rate (float, default=0.1) – The rate of randomly missing values to generate, should be in [0,1).
- Returns:
data – A dictionary containing the generated data.
- Return type:
dict
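Conceptually this combines the per-class generation above with random missingness at the given rate. A sketch of that combination (the actual dict keys returned by gene_random_walk may differ, e.g. it may also split the data into train/val/test sets):

```python
import numpy as np

def gene_random_walk_sketch(n_steps=24, n_features=10, n_classes=2,
                            n_samples_each_class=1000, missing_rate=0.1,
                            seed=None):
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [], []
    for c in range(n_classes):
        # Per-class walks, separated by step mean (an assumption of this sketch).
        steps = rng.normal(loc=float(c),
                           size=(n_samples_each_class, n_steps, n_features))
        X_parts.append(np.cumsum(steps, axis=1))
        y_parts.append(np.full(n_samples_each_class, c))
    X_ori = np.concatenate(X_parts)
    y = np.concatenate(y_parts)
    # Knock out missing_rate of the values at random to create X.
    X = X_ori.copy()
    X[rng.random(X.shape) < missing_rate] = np.nan
    return {"X": X, "X_ori": X_ori, "y": y}

data = gene_random_walk_sketch(n_samples_each_class=50, seed=0)
```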
- pypots.data.generating.gene_physionet2012(artificially_missing_rate=0.1)[source]#
Generate a fully-prepared PhysioNet-2012 dataset for model testing.
- Parameters:
artificially_missing_rate (float, default=0.1) – The rate of artificially missing values to generate for model evaluation. This ratio is calculated based on the number of observed values, i.e. if artificially_missing_rate = 0.1, then 10% of the observed values will be randomly masked as missing data and hold out for model evaluation.
- Returns:
data – A dictionary containing the generated PhysioNet-2012 dataset.
- Return type:
dict
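The artificial-masking step described above (mask a fraction of the observed values, remember where, and hold them out for evaluation) can be sketched like this; the helper name mask_observed is hypothetical, used only for illustration:

```python
import numpy as np

def mask_observed(X, rate=0.1, seed=None):
    """Randomly mask `rate` of the *observed* values in X as extra missingness."""
    rng = np.random.default_rng(seed)
    X = X.copy()
    observed = np.flatnonzero(~np.isnan(X))   # flat indices of observed values
    n_mask = int(len(observed) * rate)
    hit = rng.choice(observed, size=n_mask, replace=False)
    X.reshape(-1)[hit] = np.nan               # hold these out for evaluation
    indicating_mask = np.zeros(X.size, dtype=bool)
    indicating_mask[hit] = True               # marks artificially-masked cells
    return X, indicating_mask.reshape(X.shape)

rng = np.random.default_rng(1)
X_ori = rng.standard_normal((100, 24, 10))
X_ori[rng.random(X_ori.shape) < 0.2] = np.nan  # pre-existing missingness
X, ind_mask = mask_observed(X_ori, rate=0.1, seed=0)
```

The indicating mask is what lets imputation error be computed only on the held-out cells, where the ground truth is known.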
pypots.data.load_preprocessing#
Preprocessing functions to load supported open-source time-series datasets.
- pypots.data.load_preprocessing.preprocess_physionet2012(data)[source]#
The preprocessing function for dataset PhysioNet-2012.
- Parameters:
data (dict) – A data dict from tsdb.load_dataset().
- Returns:
A dict containing the processed data, including:
X : pandas.DataFrame
A dataframe containing all time-series vectors from the 11988 patients, distinguished by the column RecordID.
y : pandas.Series
The 11988 classification labels of all patients, indicating whether they were deceased.
- Return type:
dict
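Given a long-format DataFrame of that shape, the per-patient series can be recovered with a groupby on RecordID. A toy sketch (the column names HR and Temp are made up for illustration; only RecordID comes from the description above):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the returned X: two patients, three time steps each.
X = pd.DataFrame({
    "RecordID": [0, 0, 0, 1, 1, 1],
    "HR":       [80, 82, np.nan, 95, np.nan, 97],
    "Temp":     [36.6, np.nan, 36.8, 38.1, 38.3, np.nan],
})
y = pd.Series([0, 1])  # one mortality label per patient

# Group rows into one [n_steps, n_features] array per patient.
per_patient = {
    rid: g.drop(columns="RecordID").to_numpy()
    for rid, g in X.groupby("RecordID")
}
```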
pypots.data.load_specific_datasets#
Functions to load supported open-source time-series datasets.
- pypots.data.load_specific_datasets.list_supported_datasets()[source]#
Return the datasets natively supported by PyPOTS so far.
- Returns:
SUPPORTED_DATASETS – A list including all supported datasets.
- Return type:
list
- pypots.data.load_specific_datasets.load_specific_dataset(dataset_name, use_cache=True)[source]#
Load specific datasets supported by PyPOTS. Different from tsdb.load_dataset(), which returns only the raw data, load_specific_dataset here performs some preprocessing operations, like truncating time series to generate samples of the same length.
- Parameters:
dataset_name (str) – The name of the dataset to load, which should be one of the datasets returned by list_supported_datasets().
use_cache (bool, default=True) – Whether to use the locally cached dataset if it is available.
- Returns:
data – A dict containing the preprocessed dataset. Users only need to continue the preprocessing steps to generate the data they want, e.g. standardizing and splitting.
- Return type:
dict
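The "truncating time series to generate samples with the same length" step mentioned above can be sketched as follows: drop series shorter than the target length and cut the rest to it (a simple illustration, not the library's exact policy):

```python
import numpy as np

def truncate_to_same_length(series_list, n_steps):
    # Keep only series long enough, then cut each to its first n_steps steps.
    kept = [s[:n_steps] for s in series_list if len(s) >= n_steps]
    return np.stack(kept)  # shape [n_kept, n_steps, n_features]

series_list = [np.random.randn(30, 4),
               np.random.randn(24, 4),
               np.random.randn(20, 4)]  # variable-length series
samples = truncate_to_same_length(series_list, n_steps=24)
```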
pypots.data.utils#
Data utils.
- pypots.data.utils.turn_data_into_specified_dtype(data, dtype='tensor')[source]#
Turn the given data into the specified data type.
- pypots.data.utils.parse_delta(missing_mask)[source]#
Generate the time-gap matrix (i.e. the delta matrix) from the missing mask. Please refer to [31] for its math definition.
- Parameters:
missing_mask (Union[ndarray, Tensor]) – Binary masks indicating missing data (0 means a missing value, 1 means an observed value). Shape [n_steps, n_features] or [n_samples, n_steps, n_features].
- Returns:
delta – The delta matrix indicating the time gaps between observed values, with the same shape as missing_mask.
- Return type:
Union[ndarray, Tensor]
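Under the commonly used GRU-D definition of the delta matrix (assuming unit-spaced time steps; the library may handle arbitrary timestamps), each entry counts the steps since the feature was last observed. A minimal sketch for a 2-D mask:

```python
import numpy as np

def parse_delta_sketch(missing_mask):
    """Time gaps since the last observation, per feature, for unit-spaced steps.

    Follows the common GRU-D recurrence:
      delta[0] = 0
      delta[t] = 1 + (1 - mask[t-1]) * delta[t-1]
    i.e. the gap resets after an observation and keeps growing while missing.
    """
    n_steps, n_features = missing_mask.shape
    delta = np.zeros((n_steps, n_features))
    for t in range(1, n_steps):
        delta[t] = 1 + (1 - missing_mask[t - 1]) * delta[t - 1]
    return delta

mask = np.array([[1, 1],
                 [0, 1],
                 [0, 1],
                 [1, 1]])
delta = parse_delta_sketch(mask)
```

Here feature 0 is missing at steps 1 and 2, so its gap grows to 3 by step 3, while the always-observed feature 1 stays at gap 1 after step 0.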
- pypots.data.utils.sliding_window(time_series, window_len, sliding_len=None)[source]#
Generate time-series samples with the sliding-window method, i.e. truncating windows of a given length from time-series data.
Given a time series of shape [seq_len, n_features] (seq_len is the total sequence length of the time series), this sliding_window function generates time-series samples from it with the sliding-window method. The number of generated samples is seq_len//sliding_len, and the returned numpy ndarray has shape [seq_len//sliding_len, window_len, n_features].
- Parameters:
time_series (np.ndarray) – Time-series data with len(shape)=2, i.e. of shape [total_length, n_features].
window_len (int) – The length of the sliding window, i.e. the number of time steps in each generated data sample.
sliding_len (int, default=None) – The sliding length of the window for each moving step. It will be set to window_len if None.
- Returns:
samples – The generated time-series data samples of shape [seq_len//sliding_len, window_len, n_features].
- Return type:
np.ndarray
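The mechanism can be sketched in a few lines of NumPy; how the tail of the sequence (a final partial window) is handled here may differ from the library's exact behavior:

```python
import numpy as np

def sliding_window_sketch(time_series, window_len, sliding_len=None):
    # Non-overlapping windows by default; set sliding_len < window_len to overlap.
    sliding_len = window_len if sliding_len is None else sliding_len
    seq_len = len(time_series)
    starts = range(0, seq_len - window_len + 1, sliding_len)
    return np.stack([time_series[s:s + window_len] for s in starts])

ts = np.random.randn(100, 5)  # [seq_len, n_features]
samples = sliding_window_sketch(ts, window_len=24, sliding_len=24)
```

With seq_len=100 and sliding_len=24 this yields 100//24 = 4 windows, matching the sample count stated above.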