Welcome to HyperImpute’s documentation!

HyperImpute - A library for NaNs and nulls.

[![Test In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1zGm4VeXsJ-0x6A5_icnknE7mbJ0knUig?usp=sharing) [![Tests PR](https://github.com/vanderschaarlab/hyperimpute/actions/workflows/test_pr.yml/badge.svg)](https://github.com/vanderschaarlab/hyperimpute/actions/workflows/test_pr.yml) [![Tests Full](https://github.com/vanderschaarlab/hyperimpute/actions/workflows/test_full.yml/badge.svg)](https://github.com/vanderschaarlab/hyperimpute/actions/workflows/test_full.yml) [![Tutorials](https://github.com/vanderschaarlab/hyperimpute/actions/workflows/test_tutorials.yml/badge.svg)](https://github.com/vanderschaarlab/hyperimpute/actions/workflows/test_tutorials.yml) [![Documentation Status](https://readthedocs.org/projects/hyperimpute/badge/?version=latest)](https://hyperimpute.readthedocs.io/en/latest/?badge=latest) [![arXiv](https://img.shields.io/badge/arXiv-2206.07769-b31b1b.svg)](https://arxiv.org/abs/2206.07769) [![](https://pepy.tech/badge/hyperimpute)](https://pypi.org/project/hyperimpute/) [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT) [![Python 3.7+](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/release/python-370/) [![slack](https://img.shields.io/badge/chat-on%20slack-purple?logo=slack)](https://join.slack.com/t/vanderschaarlab/shared_invite/zt-1pzy8z7ti-zVsUPHAKTgCd1UoY8XtTEw) ![image](https://github.com/vanderschaarlab/hyperimpute/raw/main/docs/arch.png "HyperImpute")

HyperImpute simplifies the selection process of a data imputation algorithm for your ML pipelines. It includes various novel algorithms for missing data and is compatible with sklearn.

HyperImpute features

  • 🚀 Fast and extensible dataset imputation algorithms, compatible with sklearn.

  • 🔑 New iterative imputation method: HyperImpute.

  • 🌀 Classic methods: MICE, MissForest, GAIN, MIRACLE, MIWAE, Sinkhorn, SoftImpute, etc.

  • 🔥 Pluginable architecture.

🚀 Installation

The library can be installed from PyPI using

$ pip install hyperimpute

or from source, using

$ pip install .

💥 Sample Usage

List available imputers

from hyperimpute.plugins.imputers import Imputers

imputers = Imputers()

imputers.list()

Impute a dataset using one of the available methods

import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

method = "gain"

plugin = Imputers().get(method)
out = plugin.fit_transform(X.copy())

print(method, out)

Specify the baseline models for HyperImpute

import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

plugin = Imputers().get(
    "hyperimpute",
    optimizer="hyperband",
    classifier_seed=["logistic_regression"],
    regression_seed=["linear_regression"],
)

out = plugin.fit_transform(X.copy())
print(out)

Use an imputer with a SKLearn pipeline

import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
y = pd.Series([1, 2, 1, 2])

imputer = Imputers().get("hyperimpute")

estimator = Pipeline(
    [
        ("imputer", imputer),
        ("forest", RandomForestRegressor(random_state=0, n_estimators=100)),
    ]
)

estimator.fit(X, y)

Write a new imputation plugin

from sklearn.impute import KNNImputer
from hyperimpute.plugins.imputers import Imputers, ImputerPlugin

imputers = Imputers()

knn_imputer = "custom_knn"

class KNN(ImputerPlugin):
    def __init__(self) -> None:
        super().__init__()
        self._model = KNNImputer(n_neighbors=2, weights="uniform")

    @staticmethod
    def name():
        return knn_imputer

    @staticmethod
    def hyperparameter_space():
        return []

    def _fit(self, *args, **kwargs):
        self._model.fit(*args, **kwargs)
        return self

    def _transform(self, *args, **kwargs):
        return self._model.transform(*args, **kwargs)

imputers.add(knn_imputer, KNN)

assert imputers.get(knn_imputer) is not None

Benchmark imputation models on a dataset

from sklearn.datasets import load_iris
from hyperimpute.plugins.imputers import Imputers
from hyperimpute.utils.benchmarks import compare_models

X, y = load_iris(as_frame=True, return_X_y=True)

imputer = Imputers().get("hyperimpute")

compare_models(
    name="example",
    evaluated_model=imputer,
    X_raw=X,
    ref_methods=["ice", "missforest"],
    scenarios=["MAR"],
    miss_pct=[0.1, 0.3],
    n_iter=2,
)

📓 Tutorials

⚡ Imputation methods

The following table contains the default imputation plugins:

| Strategy | Description | Code |
| --- | --- | --- |
| HyperImpute | Iterative imputer using both regression and classification methods based on linear models, trees, XGBoost, CatBoost and neural nets | [`plugin_hyperimpute.py`](src/hyperimpute/plugins/imputers/plugin_hyperimpute.py) |
| Mean | Replace the missing values using the mean along each column with [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) | [`plugin_mean.py`](src/hyperimpute/plugins/imputers/plugin_mean.py) |
| Median | Replace the missing values using the median along each column with [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) | [`plugin_median.py`](src/hyperimpute/plugins/imputers/plugin_median.py) |
| Most-frequent | Replace the missing values using the most frequent value along each column with [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) | [`plugin_most_freq.py`](src/hyperimpute/plugins/imputers/plugin_most_freq.py) |
| MissForest | Iterative imputation method based on Random Forests using [`IterativeImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer) and [`ExtraTreesRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html) | [`plugin_missforest.py`](src/hyperimpute/plugins/imputers/plugin_missforest.py) |
| ICE | Iterative imputation method based on regularized linear regression using [`IterativeImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer) and [`BayesianRidge`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html) | [`plugin_ice.py`](src/hyperimpute/plugins/imputers/plugin_ice.py) |
| MICE | Multiple imputations based on ICE using [`IterativeImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer) and [`BayesianRidge`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html) | [`plugin_mice.py`](src/hyperimpute/plugins/imputers/plugin_mice.py) |
| SoftImpute | [Low-rank matrix approximation via nuclear-norm regularization](https://jmlr.org/papers/volume16/hastie15a/hastie15a.pdf) | [`plugin_softimpute.py`](src/hyperimpute/plugins/imputers/plugin_softimpute.py) |
| EM | Iterative procedure which uses other variables to impute a value (Expectation), then checks whether that is the most likely value (Maximization) - [EM imputation algorithm](https://joon3216.github.io/research_materials/2019/em_imputation.html) | [`plugin_em.py`](src/hyperimpute/plugins/imputers/plugin_em.py) |
| Sinkhorn | [Missing Data Imputation using Optimal Transport](https://arxiv.org/pdf/2002.03860.pdf) | [`plugin_sinkhorn.py`](src/hyperimpute/plugins/imputers/plugin_sinkhorn.py) |
| GAIN | [GAIN: Missing Data Imputation using Generative Adversarial Nets](https://arxiv.org/abs/1806.02920) | [`plugin_gain.py`](src/hyperimpute/plugins/imputers/plugin_gain.py) |
| MIRACLE | [MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms](https://arxiv.org/abs/2111.03187) | [`plugin_miracle.py`](src/hyperimpute/plugins/imputers/plugin_miracle.py) |
| MIWAE | [MIWAE: Deep Generative Modelling and Imputation of Incomplete Data](https://arxiv.org/abs/1812.02633) | [`plugin_miwae.py`](src/hyperimpute/plugins/imputers/plugin_miwae.py) |
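
For a quick side-by-side comparison, the snippet below runs several of these plugins on the same toy dataframe. It is a minimal sketch that only uses the API already shown in the Sample Usage section; the chosen plugin names are just examples.

import numpy as np
import pandas as pd

from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

imputers = Imputers()

# Run a handful of plugins on the same incomplete dataframe and inspect the results.
for method in ["mean", "median", "ice", "softimpute"]:
    out = imputers.get(method).fit_transform(X.copy())
    print(f"--- {method} ---")
    print(out)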

🔨 Tests

Install the testing dependencies using

pip install .[testing]

The tests can be executed using

pytest -vsx

Citing

If you use this code, please cite the associated paper:

@article{Jarrett2022HyperImpute,
  doi = {10.48550/ARXIV.2206.07769},
  url = {https://arxiv.org/abs/2206.07769},
  author = {Jarrett, Daniel and Cebere, Bogdan and Liu, Tennison and Curth, Alicia and van der Schaar, Mihaela},
  keywords = {Machine Learning (stat.ML), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {HyperImpute: Generalized Iterative Imputation with Automatic Model Selection},
  year = {2022},
  booktitle={39th International Conference on Machine Learning},
}

API documentation

Imputers

Imputers

hyperimpute.plugins.imputers.plugin_hyperimpute module

hyperimpute.plugins.imputers.plugin_EM module

class EM(maxit: int = 500, convergence_threshold: float = 1e-08)

Bases: TransformerMixin

The EM algorithm is an optimization algorithm that assumes a distribution for the partially missing data and tries to maximize the expected complete data log-likelihood under that distribution.

Steps:
  1. For an input dataset X with missing values, we assume that the values are sampled from distribution N(Mu, Sigma).

  2. We generate the “observed” and “missing” masks from X, and choose some initial values for Mu = Mu0 and Sigma = Sigma0.

  3. The EM loop tries to approximate the (Mu, Sigma) pair by some iterative means under the conditional distribution of missing components.

  4. The E step finds the conditional expectation of the “missing” data, given the observed values and current estimates of the parameters. These expectations are then substituted for the “missing” data.

  5. In the M step, maximum likelihood estimates of the parameters are computed as though the missing data had been filled in.

  6. The X_reconstructed contains the approximation after each iteration.

Parameters:
  • maxit – int, default=500 maximum number of imputation rounds to perform.

  • convergence_threshold – float, default=1e-08 Minimum ratio difference between iterations before stopping.

Paper: “Maximum Likelihood from Incomplete Data via the EM Algorithm”, A. P. Dempster, N. M. Laird and D. B. Rubin
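
For intuition, here is a small, self-contained NumPy sketch of the Expectation/Maximization alternation described in the steps above, assuming a single multivariate Gaussian N(Mu, Sigma). It illustrates the scheme only and is not the library's implementation; in particular, the E step is reduced here to a conditional-mean fill-in.

import numpy as np

def em_impute_sketch(X, maxit=100, convergence_threshold=1e-8):
    """Illustrative EM-style imputation under X ~ N(Mu, Sigma) (simplified sketch)."""
    X = X.astype(float)
    missing = np.isnan(X)

    # Initialise: fill missing entries with column means, then estimate (Mu0, Sigma0).
    col_means = np.nanmean(X, axis=0)
    X = np.where(missing, col_means, X)
    Mu, Sigma = X.mean(axis=0), np.cov(X, rowvar=False)

    for _ in range(maxit):
        # E step: replace each missing block with its conditional expectation
        # given the observed entries of the same row and the current (Mu, Sigma).
        for i in range(X.shape[0]):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():  # fully missing row: fall back to the current mean
                X[i] = Mu
                continue
            Soo = Sigma[np.ix_(o, o)] + 1e-6 * np.eye(o.sum())  # regularised for stability
            Smo = Sigma[np.ix_(m, o)]
            X[i, m] = Mu[m] + Smo @ np.linalg.solve(Soo, X[i, o] - Mu[o])

        # M step: re-estimate the parameters as if the data were complete.
        Mu_new, Sigma_new = X.mean(axis=0), np.cov(X, rowvar=False)

        # Stop once the parameter updates become negligible.
        if np.abs(Mu_new - Mu).max() < convergence_threshold and np.abs(Sigma_new - Sigma).max() < convergence_threshold:
            break
        Mu, Sigma = Mu_new, Sigma_new

    return X

X = np.array([[1.0, 2.0, np.nan], [2.0, np.nan, 6.0], [3.0, 6.0, 9.0], [4.0, 8.0, 12.0]])
print(em_impute_sketch(X))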

_converged(Mu: ndarray, Sigma: ndarray, Mu_new: ndarray, Sigma_new: ndarray) bool

Checks if the EM loop has converged.

Parameters:
  • Mu – np.ndarray The previous value of the mean.

  • Sigma – np.ndarray The previous value of the variance.

  • Mu_new – np.ndarray The new value of the mean.

  • Sigma_new – np.ndarray The new value of the variance.

Returns:

True/False if the algorithm has converged.

Return type:

bool

_em(X_reconstructed: ndarray, Mu: ndarray, Sigma: ndarray, observed: ndarray, missing: ndarray) Tuple[ndarray, ndarray, ndarray]

The EM step.

Parameters:
  • X_reconstructed – np.ndarray The current imputation approximation.

  • Mu – np.ndarray The previous value of the mean.

  • Sigma – np.ndarray The previous value of the variance.

  • observed – np.ndarray Mask of the observed values in the original input.

  • missing – np.ndarray Mask of the missing values in the original input.

Returns:

The new approximation of the mean. ndarray: The new approximation of the variance. ndarray: The new imputed dataset.

Return type:

ndarray

_impute_em(X: ndarray) ndarray

The EM imputation core loop.

Parameters:

X – np.ndarray The dataset with missing values.

Raises:

RuntimeError – raised if the static checks on the final result fail.

Returns:

The dataset with imputed values.

Return type:

ndarray

fit_transform(**kwargs: Any) Any

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

class EMPlugin(random_state: int = 0, maxit: int = 500, convergence_threshold: float = 1e-08)

Bases: ImputerPlugin

Imputation plugin for completing missing values using the EM strategy.

The EM algorithm is an optimization algorithm that assumes a distribution for the partially missing data and tries to maximize the expected complete data log-likelihood under that distribution.

Steps:
  1. For an input dataset X with missing values, we assume that the values are sampled from distribution N(Mu, Sigma).

  2. We generate the “observed” and “missing” masks from X, and choose some initial values for Mu = Mu0 and Sigma = Sigma0.

  3. The EM loop tries to approximate the (Mu, Sigma) pair by some iterative means under the conditional distribution of missing components.

  4. The E step finds the conditional expectation of the “missing” data, given the observed values and current estimates of the parameters. These expectations are then substituted for the “missing” data.

  5. In the M step, maximum likelihood estimates of the parameters are computed as though the missing data had been filled in.

  6. The X_reconstructed contains the approximation after each iteration.

Parameters:
  • maxit – int, default=500 maximum number of imputation rounds to perform.

  • convergence_threshold – float, default=1e-08 Minimum ratio difference between iterations before stopping.

Example

>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("EM")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])

Reference: “Maximum Likelihood from Incomplete Data via the EM Algorithm”, A. P. Dempster, N. M. Laird and D. B. Rubin

_abc_impl = <_abc_data object>
_fit(**kwargs: Any) Any
_transform(**kwargs: Any) Any
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
module_relative_path: Optional[Path]
static name() str
plugin

alias of EMPlugin

hyperimpute.plugins.imputers.plugin_gain module

class GainImputation(batch_size: int = 256, n_epochs: int = 1000, hint_rate: float = 0.9, loss_alpha: float = 10)

Bases: TransformerMixin

GAIN Imputation for static data using Generative Adversarial Nets. The training steps are:

  • The generator imputes the missing components conditioned on what is actually observed, and outputs a completed vector.

  • The discriminator takes a completed vector and attempts to determine which components were actually observed and which were imputed.

Parameters:
  • batch_size – int The batch size for the training steps.

  • n_epochs – int Number of epochs for training.

  • hint_rate – float Percentage of additional information for the discriminator.

  • loss_alpha – int Hyperparameter for the generator loss.

Paper: J. Yoon, J. Jordon, M. van der Schaar, “GAIN: Missing Data Imputation using Generative Adversarial Nets,” ICML, 2018. Original code: https://github.com/jsyoon0823/GAIN
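
To make the two training steps concrete, here is a compact PyTorch sketch of the GAIN objective: the generator fills in missing entries, the discriminator predicts which entries were observed, and a hint matrix partially reveals the true mask. It is an independent illustration of the idea with toy data and tiny networks; it does not use the library's GainModel class, and all names and sizes in it are illustrative.

import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, hint_rate, alpha = 128, 4, 0.9, 10.0

# Toy data; M = 1 marks observed entries, M = 0 marks missing ones.
X = torch.rand(n, d)
M = (torch.rand(n, d) > 0.3).float()
X_obs = X * M

G = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d), nn.Sigmoid())
D = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

for _ in range(200):
    Z = torch.rand(n, d)                             # noise for the missing components
    B = (torch.rand(n, d) < hint_rate).float()
    H = B * M + 0.5 * (1 - B)                        # hint: reveal part of the true mask

    X_in = M * X_obs + (1 - M) * Z                   # observed values + noise placeholders
    G_sample = G(torch.cat([X_in, M], dim=1))        # generator completes the vector
    X_hat = M * X_obs + (1 - M) * G_sample           # keep observed entries untouched

    # Discriminator step: guess which components of the completed vector were observed.
    D_prob = D(torch.cat([X_hat.detach(), H], dim=1))
    d_loss = -torch.mean(M * torch.log(D_prob + 1e-8) + (1 - M) * torch.log(1 - D_prob + 1e-8))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the discriminator on imputed entries, reconstruct observed ones.
    D_prob = D(torch.cat([X_hat, H], dim=1))
    g_loss = -torch.mean((1 - M) * torch.log(D_prob + 1e-8)) + alpha * torch.mean(
        (M * (X_obs - G_sample)) ** 2
    )
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# Final imputation keeps observed values and fills the rest with generator outputs.
X_imputed = M * X_obs + (1 - M) * G_sample.detach()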

fit(X: Tensor) GainImputation

Train the GAIN model.

Parameters:

X – incomplete dataset.

Returns:

the updated model.

Return type:

self

fit_transform(X: Tensor) Tensor

Imputes the provided dataset using the GAIN strategy.

Parameters:

X – np.ndarray A dataset with missing values.

Returns:

The imputed dataset.

Return type:

Xhat

transform(Xmiss: Tensor) Tensor

Return imputed data by trained GAIN model.

Parameters:

Xmiss – the array with missing data

Returns:

the array without missing data

Return type:

torch.Tensor

Raises:

RuntimeError – if the result contains np.nans.

class GainModel(dim: int, h_dim: int, loss_alpha: float = 10)

Bases: object

The core model for GAIN Imputation.

Parameters:
  • dim – int Number of features.

  • h_dim – int Size of the hidden layer.

  • loss_alpha – int Hyperparameter for the generator loss.

discr_loss(X: Tensor, M: Tensor, H: Tensor) Tensor
discriminator(X: Tensor, hints: Tensor) Tensor
gen_loss(X: Tensor, M: Tensor, H: Tensor) Tensor
generator(X: Tensor, mask: Tensor) Tensor
class GainPlugin(batch_size: int = 128, n_epochs: int = 100, hint_rate: float = 0.8, loss_alpha: int = 10, random_state: int = 0)

Bases: ImputerPlugin

Imputation plugin for completing missing values using the GAIN strategy.

Method:

Details in the GainImputation class implementation.

Example

>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("gain")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])
_abc_impl = <_abc_data object>
_fit(**kwargs: Any) Any
_transform(**kwargs: Any) Any
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
module_relative_path: Optional[Path]
static name() str
plugin

alias of GainPlugin

sample_M(m: int, n: int, p: float) ndarray

Hint Vector Generation

Parameters:
  • m – number of rows

  • n – number of columns

  • p – hint rate

Returns:

generated random values

Return type:

np.ndarray

sample_Z(m: int, n: int) ndarray

Random sample generator for Z.

Parameters:
  • m – number of rows

  • n – number of columns

Returns:

generated random values

Return type:

np.ndarray

sample_idx(m: int, n: int) ndarray

Mini-batch generation

Parameters:
  • m – number of rows

  • n – number of columns

Returns:

generated random indices

Return type:

np.ndarray

hyperimpute.plugins.imputers.plugin_miracle module

class MiraclePlugin(lr: float = 0.001, batch_size: int = 1024, num_outputs: int = 1, n_hidden: int = 32, reg_lambda: float = 1, reg_beta: float = 1, DAG_only: bool = False, reg_m: float = 1.0, window: int = 10, max_steps: int = 400, seed_imputation: str = 'mean', random_state: int = 0)

Bases: ImputerPlugin

MIRACLE (Missing data Imputation Refinement And Causal LEarning) iteratively refines the imputation of a baseline method by simultaneously modeling the missingness-generating mechanism and encouraging the imputation to be consistent with the causal structure of the data.

Example

>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("miracle")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])

Reference: “MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms”, Trent Kyono, Yao Zhang, Alexis Bellot, Mihaela van der Schaar
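
As with any plugin, the constructor arguments from the signature above (for example seed_imputation and max_steps) can be passed through Imputers().get; the values below are purely illustrative.

import numpy as np

from hyperimpute.plugins.imputers import Imputers

X = [[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]]

# seed_imputation selects the baseline imputer that MIRACLE refines; max_steps bounds the training loop.
plugin = Imputers().get("miracle", seed_imputation="mean", max_steps=100)
print(plugin.fit_transform(X))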

_abc_impl = <_abc_data object>
_fit(X: DataFrame, *args: Any, **kwargs: Any) MiraclePlugin
_get_seed_imputer(method: str) ImputerPlugin
_transform(X: DataFrame) DataFrame
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
classmethod load(buff: bytes) MiraclePlugin
module_relative_path: Optional[Path]
static name() str
save() bytes
plugin

alias of MiraclePlugin

hyperimpute.plugins.imputers.plugin_ice module

hyperimpute.plugins.imputers.plugin_mice module

class MicePlugin(n_imputations: int = 1, max_iter: int = 100, tol: float = 0.001, initial_strategy: int = 0, imputation_order: int = 0, random_state: int = 0)

Bases: ImputerPlugin

Imputation plugin for completing missing values using the Multivariate Iterative chained equations and multiple imputations.

Method:

Multivariate iterative chained equations (MICE) methods model each feature with missing values as a function of the other features, in a round-robin fashion. For each step of the round-robin imputation we use a BayesianRidge estimator, which performs a regularized linear regression. The class sklearn.impute.IterativeImputer can generate multiple imputations of the same incomplete dataset, and we can then learn a regression or classification model on each of them. Setting sample_posterior=True for the IterativeImputer randomly draws each fill-in value from the Gaussian posterior of the predictions. If each IterativeImputer uses a different random_state, this results in multiple imputations, each of which can be used to train a predictive model. The final result is the average of the n_imputations estimates (see the sketch after the parameter list below).

Parameters:
  • n_imputations – int, default=1 Number of multiple imputations to perform.

  • max_iter – int, default=100 Maximum number of imputation rounds to perform.

  • random_state – int, default=0 Seed of the pseudo-random number generator to use.
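
A minimal sketch of the multiple-imputation averaging described above, written directly against scikit-learn; it mirrors the idea rather than the plugin's exact internals.

import numpy as np
import pandas as pd

# IterativeImputer is experimental in scikit-learn and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]], dtype=float)

n_imputations = 5
imputations = []
for seed in range(n_imputations):
    # sample_posterior=True draws each fill-in from the Gaussian posterior of the predictions;
    # a different random_state per imputer yields a different completed dataset.
    imputer = IterativeImputer(sample_posterior=True, random_state=seed, max_iter=100)
    imputations.append(imputer.fit_transform(X))

# The final result is the average of the n_imputations estimates.
X_mice = pd.DataFrame(np.mean(imputations, axis=0), columns=X.columns)
print(X_mice)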

Example

>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("mice")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])
_abc_impl = <_abc_data object>
_fit(**kwargs: Any) Any
_transform(**kwargs: Any) Any
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
imputation_order_vals = ['ascending', 'descending', 'roman', 'arabic', 'random']
initial_strategy_vals = ['mean', 'median', 'most_frequent', 'constant']
module_relative_path: Optional[Path]
static name() str
plugin

alias of MicePlugin

hyperimpute.plugins.imputers.plugin_missforest module

hyperimpute.plugins.imputers.plugin_sinkhorn module

hyperimpute.plugins.imputers.plugin_softimpute module

class SoftImpute(maxit: int = 1000, convergence_threshold: float = 1e-05, max_rank: int = 2, shrink_lambda: float = 0, cv_len: int = 3, random_state: int = 0)

Bases: TransformerMixin

The SoftImpute algorithm fits a low-rank matrix approximation to a matrix with missing values via nuclear-norm regularization. The algorithm can be used to impute quantitative data. To calibrate the nuclear-norm regularization parameter (shrink_lambda), we perform cross-validation (_cv_softimpute). An illustrative sketch of the underlying soft-thresholded SVD iteration is shown after the parameter list below.

Parameters:
  • maxit – int, default=1000 Maximum number of imputation rounds to perform.

  • convergence_threshold – float, default=1e-5 Minimum ratio difference between iterations before stopping.

  • max_rank – int, default=2 Perform a truncated SVD on each iteration with this value as its rank.

  • shrink_lambda – float, default=0 Value by which we shrink singular values on each iteration. If it is missing, it is calibrated using cross-validation.

  • cv_len – int, default=3 The length of the grid on which the cross-validation is performed.
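
For intuition, here is a small NumPy sketch of the soft-thresholded SVD iteration behind SoftImpute. It is illustrative only; the library's implementation additionally calibrates the shrinkage value via cross-validation as described above.

import numpy as np

def softimpute_sketch(X, shrink_lambda=0.5, maxit=1000, convergence_threshold=1e-5):
    """Fill missing entries via iterative soft-thresholded SVD (illustrative sketch)."""
    mask = np.isnan(X)
    X_hat = np.where(mask, 0.0, X)  # start with zeros in the missing positions

    for _ in range(maxit):
        # SVD of the current completion, then shrink the singular values towards zero.
        U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)
        X_new = (U * np.maximum(s - shrink_lambda, 0.0)) @ Vt

        # Observed entries stay fixed; only the missing ones are updated.
        X_new = np.where(mask, X_new, X)

        if np.linalg.norm(X_new - X_hat) < convergence_threshold * max(np.linalg.norm(X_hat), 1.0):
            return X_new
        X_hat = X_new

    return X_hat

X = np.array([[1.0, 1.0, 1.0, 1.0], [np.nan] * 4, [1.0, 2.0, 2.0, 1.0], [2.0, 2.0, 2.0, 2.0]])
print(softimpute_sketch(X))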

Example

>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("softimpute")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])

Reference: “Spectral Regularization Algorithms for Learning Large Incomplete Matrices”, by Mazumder, Hastie, and Tibshirani.

_approximate_shrink_val(X: ndarray) float

Try to calibrate the shrinkage step using cross-validation. It simulates more missing items and tests the performance of different shrinkage values.

Parameters:

X – np.ndarray The dataset to use.

Returns:

The value to use for the shrinkage step.

Return type:

float

_converged(Xold: ndarray, X: ndarray, mask: ndarray) bool

Checks if the SoftImpute algorithm has converged.

Parameters:
  • Xold – np.ndarray The previous version of the imputed dataset.

  • X – np.ndarray The new version of the imputed dataset.

  • mask – np.ndarray The original missing mask.

Returns:

True/False if the algorithm has converged.

Return type:

bool

_simulate_more_nan(X: ndarray, mask: ndarray) ndarray

Generate more missing values for cross-validation.

Parameters:
  • X – np.ndarray The dataset to use.

  • mask – np.ndarray The existing missing positions

Returns:

A new version of X with more missing values.

Return type:

Xsim

_softimpute(X: ndarray, shrink_val: float) ndarray

Core loop of the algorithm. It approximates the imputed X using the SVD decomposition in a loop, until the algorithm converges or the maxit iteration limit is reached.

Parameters:
  • X – np.ndarray The previous version of the imputed dataset.

  • shrink_val – float The value by which we shrink singular values on each iteration.

Returns:

The imputed dataset.

Return type:

X_hat

_svd(X: ndarray, shrink_val: float) ndarray

Reconstructs X from low-rank thresholded SVD.

Parameters:
  • X – np.ndarray The previous version of the imputed dataset.

  • shrink_val – float The value by which we shrink singular values on each iteration.

Raises:

RuntimeError – raised if the static checks on the final result fail.

Returns:

new candidate for the result.

Return type:

X_reconstructed

fit(**kwargs: Any) Any
fit_transform(**kwargs: Any) Any

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

classmethod load(buff: bytes) SoftImpute
save() bytes
transform(**kwargs: Any) Any
class SoftImputePlugin(maxit: int = 1000, convergence_threshold: float = 1e-05, max_rank: int = 2, shrink_lambda: float = 0, cv_len: int = 3, random_state: int = 0)

Bases: ImputerPlugin

Imputation plugin for completing missing values using the SoftImpute strategy.

Method:

Details in the SoftImpute class implementation.

Example

>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("softimpute")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])
              0             1             2             3
0  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00
1  3.820605e-16  1.708249e-16  1.708249e-16  3.820605e-16
2  1.000000e+00  2.000000e+00  2.000000e+00  1.000000e+00
3  2.000000e+00  2.000000e+00  2.000000e+00  2.000000e+00
_abc_impl = <_abc_data object>
_fit(**kwargs: Any) Any
_transform(**kwargs: Any) Any
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
module_relative_path: Optional[Path]
static name() str
plugin

alias of SoftImputePlugin

hyperimpute.plugins.imputers.plugin_miwae module

class MIWAEPlugin(n_epochs: int = 500, batch_size: int = 256, latent_size: int = 1, n_hidden: int = 1, random_state: int = 0, K: int = 20)

Bases: ImputerPlugin

MIWAE imputation plugin

Parameters:
  • n_epochs – int Number of training iterations

  • batch_size – int Batch size

  • latent_size – int dimension of the latent space

  • n_hidden – int number of hidden units

  • K – int Number of importance samples (IS) used during training

  • random_state – int random seed

Example

>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("miwae")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])

Reference: “MIWAE: Deep Generative Modelling and Imputation of Incomplete Data”, Pierre-Alexandre Mattei, Jes Frellsen Original code: https://github.com/pamattei/miwae

_abc_impl = <_abc_data object>
_fit(X: DataFrame, *args: Any, **kwargs: Any) MIWAEPlugin
_miwae_impute(iota_x: Tensor, mask: Tensor, L: int) Tensor
_miwae_loss(iota_x: Tensor, mask: Tensor) Tensor
_transform(X: DataFrame) DataFrame
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
module_relative_path: Optional[Path]
static name() str
plugin

alias of MIWAEPlugin

weights_init(layer: Any) None

hyperimpute.plugins.imputers.plugin_mean module

class MeanPlugin(random_state: int = 0)

Bases: ImputerPlugin

Imputation plugin for completing missing values using the Mean Imputation strategy.

Method:

The Mean Imputation strategy replaces the missing values using the mean along each column.

Example

>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("mean")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])
_abc_impl = <_abc_data object>
_fit(X: DataFrame, *args: Any, **kwargs: Any) MeanPlugin
_transform(X: DataFrame) DataFrame
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
module_relative_path: Optional[Path]
static name() str
plugin

alias of MeanPlugin

hyperimpute.plugins.imputers.plugin_median module

class MedianPlugin(random_state: int = 0)

Bases: ImputerPlugin

Imputation plugin for completing missing values using the Median Imputation strategy.

Method:

The Median Imputation strategy replaces the missing values using the median along each column.

Example

>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("median")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])
     0    1    2    3
0  1.0  1.0  1.0  1.0
1  1.0  2.0  2.0  1.0
2  1.0  2.0  2.0  1.0
3  2.0  2.0  2.0  2.0
_abc_impl = <_abc_data object>
_fit(X: DataFrame, *args: Any, **kwargs: Any) MedianPlugin
_transform(X: DataFrame) DataFrame
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
module_relative_path: Optional[Path]
static name() str
plugin

alias of MedianPlugin

Prediction models

Classifiers

hyperimpute.plugins.prediction.classifiers.plugin_logistic_regression module

class LogisticRegressionPlugin(C: float = 1.0, solver: int = 1, multi_class: int = 0, class_weight: int = 0, max_iter: int = 10000, penalty: str = 'l2', model: Optional[Any] = None, random_state: int = 0, hyperparam_search_iterations: Optional[int] = None, **kwargs: Any)

Bases: ClassifierPlugin

Classification plugin based on the Logistic Regression classifier.

Method:

Logistic regression is a linear model for classification rather than regression. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

Parameters:
  • C – float Inverse of regularization strength; must be a positive float.

  • solver – str Algorithm to use in the optimization problem: [‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’]

  • multi_class – str If the option chosen is ‘ovr’, then a binary problem is fit for each label. For ‘multinomial’ the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary. ‘multinomial’ is unavailable when solver=’liblinear’. ‘auto’ selects ‘ovr’ if the data is binary, or if solver=’liblinear’, and otherwise selects ‘multinomial’.

  • class_weight – str Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.

  • max_iter – int Maximum number of iterations taken for the solvers to converge.

Example

>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="classifiers").get("logistic_regression")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y) # returns the probabilities for each class
_abc_impl = <_abc_data object>
_fit(X: DataFrame, *args: Any, **kwargs: Any) LogisticRegressionPlugin
_predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
_predict_proba(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
classes = ['auto', 'ovr', 'multinomial']
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
module_relative_path: Optional[Path]
static name() str
solvers = ['newton-cg', 'lbfgs', 'sag', 'saga']
weights = ['balanced', None]
plugin

alias of LogisticRegressionPlugin

hyperimpute.plugins.prediction.classifiers.plugin_random_forest module

class RandomForestPlugin(n_estimators: int = 100, criterion: int = 0, max_features: int = 0, min_samples_split: int = 2, min_samples_leaf: int = 1, max_depth: Optional[int] = 3, random_state: int = 0, hyperparam_search_iterations: Optional[int] = None, **kwargs: Any)

Bases: ClassifierPlugin

Classification plugin based on Random forests.

Method:

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

Parameters:
  • n_estimators – int The number of trees in the forest.

  • criterion – str The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.

  • max_features – str The number of features to consider when looking for the best split.

  • min_samples_split – int The minimum number of samples required to split an internal node.

  • bootstrap – bool Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.

  • min_samples_leaf – int The minimum number of samples required to be at a leaf node.

Example

>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="classifiers").get("random_forest")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y)
_abc_impl = <_abc_data object>
_fit(X: DataFrame, *args: Any, **kwargs: Any) RandomForestPlugin
_predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
_predict_proba(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
criterions = ['gini', 'entropy']
features = ['sqrt', 'log2', None]
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
module_relative_path: Optional[Path]
static name() str
plugin

alias of RandomForestPlugin

hyperimpute.plugins.prediction.classifiers.plugin_xgboost module

class XGBoostPlugin(n_estimators: int = 100, reg_lambda: Optional[float] = None, reg_alpha: Optional[float] = None, colsample_bytree: Optional[float] = None, colsample_bynode: Optional[float] = None, colsample_bylevel: Optional[float] = None, max_depth: Optional[int] = 3, subsample: Optional[float] = None, lr: Optional[float] = None, min_child_weight: Optional[int] = None, max_bin: int = 256, booster: int = 0, grow_policy: int = 0, nthread: int = 1, random_state: int = 0, eta: float = 0.3, hyperparam_search_iterations: Optional[int] = None, **kwargs: Any)

Bases: ClassifierPlugin

Classification plugin based on the XGBoost classifier.

Method:

Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler, weaker models. The XGBoost algorithm handles a wide variety of data types, relationships, and distributions robustly, and exposes a large number of hyperparameters that can be fine-tuned.

Parameters:
  • n_estimators – int The maximum number of estimators at which boosting is terminated.

  • max_depth – int Maximum depth of a tree.

  • reg_lambda – float L2 regularization term on weights (xgb’s lambda).

  • reg_alpha – float L1 regularization term on weights (xgb’s alpha).

  • colsample_bytree – float Subsample ratio of columns when constructing each tree.

  • colsample_bynode – float Subsample ratio of columns for each split.

  • colsample_bylevel – float Subsample ratio of columns for each level.

  • subsample – float Subsample ratio of the training instance.

  • lr – float Boosting learning rate

  • booster – str Specify which booster to use: gbtree, gblinear or dart.

  • min_child_weight – int Minimum sum of instance weight(hessian) needed in a child.

  • max_bin – int Number of bins for histogram construction.

  • random_state – float Random number seed.

Example

>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="classifiers").get("xgboost")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y)
_abc_impl = <_abc_data object>
_fit(X: DataFrame, *args: Any, **kwargs: Any) XGBoostPlugin
_predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
_predict_proba(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
booster = ['gbtree', 'gblinear', 'dart']
grow_policy = ['depthwise', 'lossguide']
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
module_relative_path: Optional[Path]
static name() str
plugin

alias of XGBoostPlugin

hyperimpute.plugins.prediction.classifiers.plugin_catboost module

class CatBoostPlugin(n_estimators: Optional[int] = 10, depth: Optional[int] = None, grow_policy: int = 0, model: Optional[Any] = None, hyperparam_search_iterations: Optional[int] = None, random_state: int = 0, l2_leaf_reg: float = 3, learning_rate: float = 0.001, min_data_in_leaf: int = 1, random_strength: float = 1, **kwargs: Any)

Bases: ClassifierPlugin

Classification plugin based on the CatBoost framework.

Method:

CatBoost provides a gradient-boosting framework that handles categorical features using a permutation-driven alternative to the classical algorithm. It uses Ordered Boosting to mitigate overfitting and Symmetric Trees for faster execution.

Parameters:
  • learning_rate – float The learning rate used for training.

  • depth – int

  • iterations – int

  • grow_policy – int

Example

>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="classifiers").get("catboost")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y) # returns the probabilities for each class
_abc_impl = <_abc_data object>
_fit(X: DataFrame, *args: Any, **kwargs: Any) CatBoostPlugin
_predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
_predict_proba(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
grow_policies: List[Optional[str]] = [None, 'Depthwise', 'SymmetricTree', 'Lossguide']
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
static name() str
plugin

alias of CatBoostPlugin

hyperimpute.plugins.prediction.classifiers.plugin_neural_nets module

class BasicNet(n_unit_in: int, categories_cnt: int, n_layers_hidden: int = 1, n_units_hidden: int = 100, nonlin: str = 'relu', lr: float = 0.001, weight_decay: float = 0.001, n_iter: int = 300, batch_size: int = 1024, n_iter_print: int = 10, random_state: int = 0, patience: int = 10, n_iter_min: int = 100, dropout: float = 0.1, clipping_value: int = 1, batch_norm: bool = True, early_stopping: bool = True)

Bases: Module

Basic neural net.

Parameters:
  • n_unit_in (int) – Number of features

  • categories (int) –

  • n_layers_hidden (int) – Number of hypothesis layers (n_layers_hidden x n_units_hidden + 1 x Linear layer)

  • n_units_hidden (int) – Number of hidden units in each hypothesis layer

  • nonlin (string, default 'relu') – Nonlinearity to use in NN. Can be ‘elu’, ‘relu’, ‘selu’ or ‘leaky_relu’.

  • lr (float) – learning rate for optimizer. step_size equivalent in the JAX version.

  • weight_decay (float) – l2 (ridge) penalty for the weights.

  • n_iter (int) – Maximum number of iterations.

  • batch_size (int) – Batch size

  • n_iter_print (int) – Number of iterations after which to print updates and check the validation loss.

  • random_state (int) – random_state used

  • val_split_prop (float) – Proportion of samples used for validation split (can be 0)

  • patience (int) – Number of iterations to wait before early stopping after decrease in validation loss

  • n_iter_min (int) – Minimum number of iterations to go through before starting early stopping

  • clipping_value (int, default 1) – Gradients clipping value

_backward_hooks: Dict[int, Callable]
_buffers: Dict[str, Optional[Tensor]]
_check_tensor(X: Tensor) Tensor
_forward_hooks: Dict[int, Callable]
_forward_pre_hooks: Dict[int, Callable]
_is_full_backward_hook: Optional[bool]
_load_state_dict_post_hooks: Dict[int, Callable]
_load_state_dict_pre_hooks: Dict[int, Callable]
_modules: Dict[str, Optional[Module]]
_non_persistent_buffers_set: Set[str]
_parameters: Dict[str, Optional[Parameter]]
_state_dict_hooks: Dict[int, Callable]
forward(X: Tensor) Tensor

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

train(X: Tensor, y: Tensor) BasicNet

Sets the module in training mode.

This has any effect only on certain modules. See documentations of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Parameters:

mode (bool) – whether to set training mode (True) or evaluation mode (False). Default: True.

Returns:

self

Return type:

Module

training: bool
class NeuralNetsPlugin(n_layers_hidden: int = 1, n_units_hidden: int = 100, nonlin: str = 'relu', lr: float = 0.001, weight_decay: float = 0.001, n_iter: int = 1000, batch_size: int = 128, n_iter_print: int = 10, random_state: int = 0, patience: int = 10, n_iter_min: int = 100, dropout: float = 0.1, clipping_value: int = 1, batch_norm: bool = True, early_stopping: bool = True, hyperparam_search_iterations: Optional[int] = None, **kwargs: Any)

Bases: ClassifierPlugin

Classification plugin based on Neural networks.

Example

>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="classifiers").get("neural_nets")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y) # returns the probabilities for each class
_abc_impl = <_abc_data object>
_fit(X: DataFrame, *args: Any, **kwargs: Any) NeuralNetsPlugin
_predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
_predict_proba(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
module_relative_path: Optional[Path]
static name() str
plugin

alias of NeuralNetsPlugin

Regressors

hyperimpute.plugins.prediction.regression.plugin_linear_regression module

class LinearRegressionPlugin(solver: int = 0, max_iter: Optional[int] = 10000, tol: float = 0.001, hyperparam_search_iterations: Optional[int] = None, random_state: int = 0, **kwargs: Any)

Bases: RegressionPlugin

Regression plugin based on the Linear Regression.

Example

>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="regression").get("linear_regression")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y) # returns the predicted values
_abc_impl = <_abc_data object>
_fit(X: DataFrame, *args: Any, **kwargs: Any) LinearRegressionPlugin
_predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
module_relative_path: Optional[Path]
static name() str
solvers = ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
plugin

alias of LinearRegressionPlugin

hyperimpute.plugins.prediction.regression.plugin_random_forest_regressor module

class RandomForestRegressionPlugin(n_estimators: int = 100, criterion: int = 0, max_features: int = 0, min_samples_split: int = 2, min_samples_leaf: int = 1, max_depth: Optional[int] = 3, hyperparam_search_iterations: Optional[int] = None, random_state: int = 0, **kwargs: Any)

Bases: RegressionPlugin

Regression plugin based on Random forests.

Method:

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

Parameters:
  • n_estimators – int The number of trees in the forest.

  • criterion – str The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.

  • max_features – str The number of features to consider when looking for the best split.

  • min_samples_split – int The minimum number of samples required to split an internal node.

  • bootstrap – bool Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.

  • min_samples_leaf – int The minimum number of samples required to be at a leaf node.

Example

>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="regression").get("random_forest")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y)
_abc_impl = <_abc_data object>
_fit(X: DataFrame, *args: Any, **kwargs: Any) RandomForestRegressionPlugin
_predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
criterions = ['squared_error', 'absolute_error', 'poisson']
features = ['sqrt', 'log2', None]
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
module_relative_path: Optional[Path]
static name() str
plugin

alias of RandomForestRegressionPlugin

hyperimpute.plugins.prediction.regression.plugin_xgboost_regressor module

class XGBoostRegressorPlugin(reg_lambda: Optional[float] = None, reg_alpha: Optional[float] = None, colsample_bytree: Optional[float] = None, colsample_bynode: Optional[float] = None, colsample_bylevel: Optional[float] = None, n_estimators: int = 100, max_depth: Optional[int] = 3, lr: Optional[float] = None, random_state: int = 0, subsample: Optional[float] = None, min_child_weight: Optional[int] = None, max_bin: int = 256, booster: int = 0, grow_policy: int = 0, eta: float = 0.3, hyperparam_search_iterations: Optional[int] = None, **kwargs: Any)

Bases: RegressionPlugin

Regression plugin based on the XGBoostRegressor.

Method:

Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler, weaker models. The XGBoostRegressor algorithm handles a wide variety of data types, relationships, and distributions robustly, and exposes a large number of hyperparameters that can be fine-tuned.

Parameters:
  • n_estimators – int The maximum number of estimators at which boosting is terminated.

  • max_depth – int Maximum depth of a tree.

  • reg_lambda – float L2 regularization term on weights (xgb’s lambda).

  • reg_alpha – float L1 regularization term on weights (xgb’s alpha).

  • colsample_bytree – float Subsample ratio of columns when constructing each tree.

  • colsample_bynode – float Subsample ratio of columns for each split.

  • colsample_bylevel – float Subsample ratio of columns for each level.

  • subsample – float Subsample ratio of the training instance.

  • learning_rate – float Boosting learning rate

  • booster – str Specify which booster to use: gbtree, gblinear or dart.

  • min_child_weight – int Minimum sum of instance weight(hessian) needed in a child.

  • max_bin – int Number of bins for histogram construction.

  • tree_method – str Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoostRegressor will choose the most conservative option available.

  • random_state – float Random number seed.

Example

>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="regressors").get("xgboost")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y)
_abc_impl = <_abc_data object>
_fit(X: DataFrame, *args: Any, **kwargs: Any) XGBoostRegressorPlugin
_predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
grow_policy = ['depthwise', 'lossguide']
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
module_relative_path: Optional[Path]
static name() str
plugin

alias of XGBoostRegressorPlugin

hyperimpute.plugins.prediction.regression.plugin_catboost_regressor module

class CatBoostRegressorPlugin(depth: Optional[int] = None, grow_policy: int = 0, n_estimators: Optional[int] = 10, hyperparam_search_iterations: Optional[int] = None, random_state: int = 0, l2_leaf_reg: float = 3, learning_rate: float = 0.001, min_data_in_leaf: int = 1, random_strength: float = 1, **kwargs: Any)

Bases: RegressionPlugin

Regression plugin based on the CatBoost framework.

Method:

CatBoost provides a gradient-boosting framework that handles categorical features using a permutation-driven alternative to the classical algorithm. It uses Ordered Boosting to mitigate overfitting and Symmetric Trees for faster execution.

Example

>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="regression").get("catboost_regressor")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y) # returns the predicted values
_abc_impl = <_abc_data object>
_fit(X: DataFrame, *args: Any, **kwargs: Any) CatBoostRegressorPlugin
_predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
_predict_proba(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
grow_policies: List[Optional[str]] = [None, 'Depthwise', 'SymmetricTree', 'Lossguide']
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
static name() str
plugin

alias of CatBoostRegressorPlugin

hyperimpute.plugins.prediction.regression.plugin_neural_nets_regression module

class BasicNet(n_unit_in: int, n_layers_hidden: int = 1, n_units_hidden: int = 100, nonlin: str = 'relu', lr: float = 0.001, weight_decay: float = 0.001, n_iter: int = 300, batch_size: int = 1024, n_iter_print: int = 10, random_state: int = 0, patience: int = 10, n_iter_min: int = 100, dropout: float = 0.1, clipping_value: int = 1, batch_norm: bool = True, early_stopping: bool = True)

Bases: Module

Basic neural net.

Parameters:
  • n_unit_in (int) – Number of features

  • n_layers_hidden (int) – Number of hypothesis layers (n_layers_hidden x n_units_hidden + 1 x Linear layer)

  • n_units_hidden (int) – Number of hidden units in each hypothesis layer

  • nonlin (string, default 'relu') – Nonlinearity to use in NN. Can be ‘elu’, ‘relu’, ‘selu’ or ‘leaky_relu’.

  • lr (float) – learning rate for optimizer. step_size equivalent in the JAX version.

  • weight_decay (float) – l2 (ridge) penalty for the weights.

  • n_iter (int) – Maximum number of iterations.

  • batch_size (int) – Batch size

  • n_iter_print (int) – Number of iterations after which to print updates and check the validation loss.

  • seed (int) – Seed used

  • val_split_prop (float) – Proportion of samples used for validation split (can be 0)

  • patience (int) – Number of iterations to wait before early stopping after decrease in validation loss

  • n_iter_min (int) – Minimum number of iterations to go through before starting early stopping

  • clipping_value (int, default 1) – Gradients clipping value

_backward_hooks: Dict[int, Callable]
_buffers: Dict[str, Optional[Tensor]]
_check_tensor(X: Tensor) Tensor
_forward_hooks: Dict[int, Callable]
_forward_pre_hooks: Dict[int, Callable]
_is_full_backward_hook: Optional[bool]
_load_state_dict_post_hooks: Dict[int, Callable]
_load_state_dict_pre_hooks: Dict[int, Callable]
_modules: Dict[str, Optional[Module]]
_non_persistent_buffers_set: Set[str]
_parameters: Dict[str, Optional[Parameter]]
_state_dict_hooks: Dict[int, Callable]
forward(X: Tensor) Tensor

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

train(X: Tensor, y: Tensor) BasicNet

Sets the module in training mode.

This has any effect only on certain modules. See documentations of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Parameters:

mode (bool) – whether to set training mode (True) or evaluation mode (False). Default: True.

Returns:

self

Return type:

Module

training: bool
class NeuralNetsRegressionPlugin(n_layers_hidden: int = 1, n_units_hidden: int = 100, nonlin: str = 'relu', lr: float = 0.001, weight_decay: float = 0.001, n_iter: int = 1000, batch_size: int = 512, n_iter_print: int = 10, patience: int = 10, n_iter_min: int = 100, dropout: float = 0.1, clipping_value: int = 1, batch_norm: bool = True, early_stopping: bool = True, hyperparam_search_iterations: Optional[int] = None, random_state: int = 0, **kwargs: Any)

Bases: RegressionPlugin

Regression plugin based on Neural networks.

Example

>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="regression").get("neural_nets_regression")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y) # returns the predicted values
_abc_impl = <_abc_data object>
_fit(X: DataFrame, *args: Any, **kwargs: Any) NeuralNetsRegressionPlugin
_predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
module_relative_path: Optional[Path]
static name() str
plugin

alias of NeuralNetsRegressionPlugin

Utils

Utils

hyperimpute.plugins.utils.simulate module

Original code: https://rmisstastic.netlify.app/how-to/python/generate_html/how%20to%20generate%20missing%20values

MAR_mask(X: ndarray, p: float, p_obs: float, sample_columns: bool = True) ndarray

Missing at random mechanism with a logistic masking model. First, a subset of variables with no missing values is randomly selected. The remaining variables have missing values according to a logistic model with random weights, re-scaled so as to attain the desired proportion of missing values on those variables.

Parameters:
  • X – Data for which missing values will be simulated.

  • p – Proportion of missing values to generate for variables which will have missing values.

  • p_obs – Proportion of variables with no missing values that will be used for the logistic masking model.

Returns:

Mask of generated missing values (True if the value is missing).

Return type:

mask
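
A brief usage sketch, assuming (as documented above) that the returned mask marks missing entries with True:

import numpy as np

from hyperimpute.plugins.utils.simulate import MAR_mask

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# 30% missing values on the masked variables; p_obs=0.5 keeps half of the variables fully observed.
mask = np.asarray(MAR_mask(X, p=0.3, p_obs=0.5), dtype=bool)

X_miss = X.copy()
X_miss[mask] = np.nan  # True entries become missing
print(np.isnan(X_miss).mean())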

MNAR_mask_logistic(X: ndarray, p: float, p_params: float = 0.3, exclude_inputs: bool = True) ndarray

Missing not at random mechanism with a logistic masking model. It implements two mechanisms: (i) Missing probabilities are selected with a logistic model, taking all variables as inputs. Hence, values that are inputs can also be missing. (ii) Variables are split into a set of inputs for a logistic model, and a set whose missing probabilities are determined by the logistic model. The inputs are then masked MCAR (hence, missing values from the second set will depend on masked values). In either case, weights are random and the intercept is selected to attain the desired proportion of missing values.

Parameters:
  • X – Data for which missing values will be simulated.

  • p – Proportion of missing values to generate for variables which will have missing values.

  • p_params – Proportion of variables that will be used for the logistic masking model (only if exclude_inputs).

  • exclude_inputs – True: mechanism (ii) is used, False: (i)

Returns:

Mask of generated missing values (True if the value is missing).

Return type:

mask

MNAR_mask_quantiles(X: ndarray, p: float, q: float, p_params: float, cut: str = 'both', MCAR: bool = False) ndarray

Missing not at random mechanism with quantile censorship. First, a subset of variables which will have missing variables is randomly selected. Then, missing values are generated on the q-quantiles at random. Since missingness depends on quantile information, it depends on masked values, hence this is a MNAR mechanism.

Parameters:
  • X – Data for which missing values will be simulated.

  • p – Proportion of missing values to generate for variables which will have missing values.

  • q – Quantile level at which the cuts should occur

  • p_params – Proportion of variables that will have missing values

  • cut – ‘both’, ‘upper’ or ‘lower’. Where the cut should be applied. For instance, if q=0.25 and cut=’upper’, then missing values will be generated in the upper quartiles of selected variables.

  • MCAR – If true, masks variables that were not selected for quantile censorship with a MCAR mechanism.

Returns:

Mask of generated missing values (True if the value is missing).

Return type:

mask

MNAR_self_mask_logistic(X: ndarray, p: float) ndarray

Missing not at random mechanism with a logistic self-masking model. Variables have missing values probabilities given by a logistic model, taking the same variable as input (hence, missingness is independent from one variable to another). The intercepts are selected to attain the desired missing rate.

Parameters:
  • X – Data for which missing values will be simulated.

  • p – Proportion of missing values to generate for variables which will have missing values.

Returns:

Mask of generated missing values (True if the value is missing).

Return type:

mask

fit_intercepts(X: ndarray, coeffs: ndarray, p: float, self_mask: bool = False) ndarray
pick_coeffs(X: ndarray, idxs_obs: List[int] = [], idxs_nas: List[int] = [], self_mask: bool = False) ndarray
simulate_nan(X: ndarray, p_miss: float, mecha: str = 'MCAR', opt: str = 'logistic', p_obs: float = 0.5, q: float = 0, sample_columns: bool = True) dict

Generate missing values for a specific missing-data mechanism and proportion of missing values.

Parameters:
  • X – Data for which missing values will be simulated.

  • p_miss – Proportion of missing values to generate for variables which will have missing values.

  • mecha – Indicates the missing-data mechanism to be used. “MCAR” by default, “MAR”, “MNAR” or “MNARsmask”

  • opt – For mecha = “MNAR”, it indicates how the missing-data mechanism is generated: using a logistic regression (“logistic”), a quantile censorship (“quantile”) or logistic regression for generating a self-masked MNAR mechanism (“selfmasked”).

  • p_obs – If mecha = “MAR”, or mecha = “MNAR” with opt = “logistic” or “quantile”, proportion of variables with no missing values that will be used for the logistic masking model.

  • q – If mecha = “MNAR” and opt = “quantile”, quantile level at which the cuts should occur.

Returns:

  • ‘X_init’: the initial data matrix.

  • ’X_incomp’: the data with the generated missing values.

  • ’mask’: a matrix indexing the generated missing values.

Return type:

A dictionary containing
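
A short usage sketch based on the signature and the return keys documented above:

import numpy as np

from hyperimpute.plugins.utils.simulate import simulate_nan

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))

# 20% missing values under a MAR mechanism; half of the variables stay fully observed.
res = simulate_nan(X, p_miss=0.2, mecha="MAR", p_obs=0.5)

X_incomp = np.asarray(res["X_incomp"])  # data with the generated missing values
mask = np.asarray(res["mask"])          # indexes the generated missing values
print(np.isnan(X_incomp).mean(), mask.mean())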

hyperimpute.utils.benchmarks module

hyperimpute.utils.tester module

class Eval(metric: str = 'aucroc')

Bases: object

Helper class for evaluating the performance of the models.

Parameters:

metric – str, default=”aucroc” The type of metric to use for evaluation. Potential values: [“aucprc”, “aucroc”].

average_precision_score(y_test: ndarray, y_pred_proba: ndarray) float
get_metric() str
roc_auc_score(y_test: ndarray, y_pred_proba: ndarray) float
score_proba(y_test: ndarray, y_pred_proba: ndarray) float
evaluate_estimator(estimator: Any, X: DataFrame, Y: Series, n_folds: int = 3, metric: str = 'aucroc', seed: int = 0, pretrained: bool = False, *args: Any, **kwargs: Any) Dict
evaluate_regression(estimator: Any, X: DataFrame, Y: Series, n_folds: int = 3, seed: int = 0, *args: Any, **kwargs: Any) Dict
score_classification_model(estimator: Any, X_train: DataFrame, X_test: Series, y_train: DataFrame, y_test: Series) float
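
A minimal usage sketch of evaluate_estimator, following the signature above; the structure of the returned dictionary is not documented here, so it is simply printed.

from sklearn.datasets import load_breast_cancer

from hyperimpute.plugins.prediction import Predictions
from hyperimpute.utils.tester import evaluate_estimator

X, y = load_breast_cancer(as_frame=True, return_X_y=True)

# A classifier plugin, evaluated with 3-fold cross-validated AUCROC (the documented defaults).
model = Predictions(category="classifiers").get("logistic_regression")
scores = evaluate_estimator(model, X, y, n_folds=3, metric="aucroc")
print(scores)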