Welcome to HyperImpute’s documentation!
HyperImpute - A library for NaNs and nulls.
HyperImpute simplifies the selection process of a data imputation algorithm for your ML pipelines. It includes various novel algorithms for missing data and is compatible with sklearn.
HyperImpute features
🚀 Fast and extensible dataset imputation algorithms, compatible with sklearn.
🔑 New iterative imputation method: HyperImpute.
🌀 Classic methods: MICE, MissForest, GAIN, MIRACLE, MIWAE, Sinkhorn, SoftImpute, etc.
🔥 Pluggable architecture.
🚀 Installation
The library can be installed from PyPI using
$ pip install hyperimpute
or from source, using
$ pip install .
💥 Sample Usage
List available imputers
from hyperimpute.plugins.imputers import Imputers
imputers = Imputers()
imputers.list()
Impute a dataset using one of the available methods
import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers
X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
method = "gain"
plugin = Imputers().get(method)
out = plugin.fit_transform(X.copy())
print(method, out)
Specify the baseline models for HyperImpute
import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers
X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
plugin = Imputers().get(
"hyperimpute",
optimizer="hyperband",
classifier_seed=["logistic_regression"],
regression_seed=["linear_regression"],
)
out = plugin.fit_transform(X.copy())
print(out)
Use an imputer with a SKLearn pipeline
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from hyperimpute.plugins.imputers import Imputers
X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
y = pd.Series([1, 2, 1, 2])
imputer = Imputers().get("hyperimpute")
estimator = Pipeline(
[
("imputer", imputer),
("forest", RandomForestRegressor(random_state=0, n_estimators=100)),
]
)
estimator.fit(X, y)
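Once fitted, the pipeline can be used like any other sklearn estimator. For illustration, predicting on the same toy data:
predictions = estimator.predict(X)  # the imputer fills the NaNs before the forest scores the rows
print(predictions)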
Write a new imputation plugin
from sklearn.impute import KNNImputer
from hyperimpute.plugins.imputers import Imputers, ImputerPlugin
imputers = Imputers()
knn_imputer = "custom_knn"
class KNN(ImputerPlugin):
def __init__(self) -> None:
super().__init__()
self._model = KNNImputer(n_neighbors=2, weights="uniform")
@staticmethod
def name():
return knn_imputer
@staticmethod
def hyperparameter_space():
return []
def _fit(self, *args, **kwargs):
self._model.fit(*args, **kwargs)
return self
def _transform(self, *args, **kwargs):
return self._model.transform(*args, **kwargs)
imputers.add(knn_imputer, KNN)
assert imputers.get(knn_imputer) is not None
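Once registered, the custom plugin behaves like any built-in imputer. A minimal usage sketch, reusing the toy DataFrame from the earlier examples:
import numpy as np
import pandas as pd

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
out = imputers.get(knn_imputer).fit_transform(X.copy())
print(out)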
Benchmark imputation models on a dataset
from sklearn.datasets import load_iris
from hyperimpute.plugins.imputers import Imputers
from hyperimpute.utils.benchmarks import compare_models
X, y = load_iris(as_frame=True, return_X_y=True)
imputer = Imputers().get("hyperimpute")
compare_models(
name="example",
evaluated_model=imputer,
X_raw=X,
ref_methods=["ice", "missforest"],
scenarios=["MAR"],
miss_pct=[0.1, 0.3],
n_iter=2,
)
📓 Tutorials
⚡ Imputation methods
The following table contains the default imputation plugins:
Strategy | Description | Code
---|---|---
HyperImpute | Iterative imputer using both regression and classification methods based on linear models, trees, XGBoost, CatBoost and neural nets | [plugin_hyperimpute.py](src/hyperimpute/plugins/imputers/plugin_hyperimpute.py)
Mean | Replace the missing values using the mean along each column with [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) | [plugin_mean.py](src/hyperimpute/plugins/imputers/plugin_mean.py)
Median | Replace the missing values using the median along each column with [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) | [plugin_median.py](src/hyperimpute/plugins/imputers/plugin_median.py)
Most-frequent | Replace the missing values using the most frequent value along each column with [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) | [plugin_most_freq.py](src/hyperimpute/plugins/imputers/plugin_most_freq.py)
MissForest | Iterative imputation method based on Random Forests using [IterativeImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer) and [ExtraTreesRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html) | [plugin_missforest.py](src/hyperimpute/plugins/imputers/plugin_missforest.py)
ICE | Iterative imputation method based on regularized linear regression using [IterativeImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer) and [BayesianRidge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html) | [plugin_ice.py](src/hyperimpute/plugins/imputers/plugin_ice.py)
MICE | Multiple imputations based on ICE using [IterativeImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer) and [BayesianRidge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html) | [plugin_mice.py](src/hyperimpute/plugins/imputers/plugin_mice.py)
SoftImpute | [Low-rank matrix approximation via nuclear-norm regularization](https://jmlr.org/papers/volume16/hastie15a/hastie15a.pdf) | [plugin_softimpute.py](src/hyperimpute/plugins/imputers/plugin_softimpute.py)
EM | Iterative procedure which uses other variables to impute a value (Expectation), then checks whether that is the most likely value (Maximization) - [EM imputation algorithm](https://joon3216.github.io/research_materials/2019/em_imputation.html) | [plugin_em.py](src/hyperimpute/plugins/imputers/plugin_em.py)
Sinkhorn | [Missing Data Imputation using Optimal Transport](https://arxiv.org/pdf/2002.03860.pdf) | [plugin_sinkhorn.py](src/hyperimpute/plugins/imputers/plugin_sinkhorn.py)
GAIN | [GAIN: Missing Data Imputation using Generative Adversarial Nets](https://arxiv.org/abs/1806.02920) | [plugin_gain.py](src/hyperimpute/plugins/imputers/plugin_gain.py)
MIRACLE | [MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms](https://arxiv.org/abs/2111.03187) | [plugin_miracle.py](src/hyperimpute/plugins/imputers/plugin_miracle.py)
MIWAE | [MIWAE: Deep Generative Modelling and Imputation of Incomplete Data](https://arxiv.org/abs/1812.02633) | [plugin_miwae.py](src/hyperimpute/plugins/imputers/plugin_miwae.py)
🔨 Tests
Install the testing dependencies using
pip install .[testing]
The tests can be executed using
pytest -vsx
Citing
If you use this code, please cite the associated paper:
@article{Jarrett2022HyperImpute,
doi = {10.48550/ARXIV.2206.07769},
url = {https://arxiv.org/abs/2206.07769},
author = {Jarrett, Daniel and Cebere, Bogdan and Liu, Tennison and Curth, Alicia and van der Schaar, Mihaela},
keywords = {Machine Learning (stat.ML), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {HyperImpute: Generalized Iterative Imputation with Automatic Model Selection},
year = {2022},
booktitle={39th International Conference on Machine Learning},
}
API documentation
Imputers
hyperimpute.plugins.imputers.plugin_hyperimpute module
hyperimpute.plugins.imputers.plugin_EM module
- class EM(maxit: int = 500, convergence_threshold: float = 1e-08)
Bases:
TransformerMixin
The EM algorithm is an optimization algorithm that assumes a distribution for the partially missing data and tries to maximize the expected complete data log-likelihood under that distribution.
- Steps:
For an input dataset X with missing values, we assume that the values are sampled from distribution N(Mu, Sigma).
We generate the “observed” and “missing” masks from X, and choose some initial values for Mu = Mu0 and Sigma = Sigma0.
The EM loop tries to approximate the (Mu, Sigma) pair by some iterative means under the conditional distribution of missing components.
The E step finds the conditional expectation of the “missing” data, given the observed values and current estimates of the parameters. These expectations are then substituted for the “missing” data.
In the M step, maximum likelihood estimates of the parameters are computed as though the missing data had been filled in.
The X_reconstructed contains the approximation after each iteration.
- Parameters:
maxit – int, default=500 maximum number of imputation rounds to perform.
convergence_threshold – float, default=1e-08 Minimum ratio difference between iterations before stopping.
Paper: “Maximum Likelihood from Incomplete Data via the EM Algorithm”, A. P. Dempster, N. M. Laird and D. B. Rubin
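For intuition, the steps above can be sketched in a few lines of NumPy. This is a hypothetical, simplified illustration (it omits the conditional-covariance correction of the exact M step) and not the library's implementation; the helper name em_impute_sketch is an assumption.
import numpy as np

def em_impute_sketch(X, maxit=50, eps=1e-08):
    # Simplified E/M loop for Gaussian data (illustration only).
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    mu = np.nanmean(X, axis=0)                    # initial Mu0
    X_hat = np.where(missing, mu, X)              # start from mean-filled data
    sigma = np.cov(X_hat, rowvar=False)           # initial Sigma0
    for _ in range(maxit):
        for i in range(X.shape[0]):               # E step: conditional expectations, row by row
            m, o = missing[i], ~missing[i]
            if not m.any() or not o.any():
                continue
            coef = sigma[np.ix_(m, o)] @ np.linalg.pinv(sigma[np.ix_(o, o)])
            X_hat[i, m] = mu[m] + coef @ (X_hat[i, o] - mu[o])
        mu_new = X_hat.mean(axis=0)               # M step on the completed data
        sigma_new = np.cov(X_hat, rowvar=False)
        if np.abs(mu_new - mu).max() < eps:
            break
        mu, sigma = mu_new, sigma_new
    return X_hat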
- _converged(Mu: ndarray, Sigma: ndarray, Mu_new: ndarray, Sigma_new: ndarray) bool
Checks if the EM loop has converged.
- Parameters:
Mu – np.ndarray The previous value of the mean.
Sigma – np.ndarray The previous value of the variance.
Mu_new – np.ndarray The new value of the mean.
Sigma_new – np.ndarray The new value of the variance.
- Returns:
True/False if the algorithm has converged.
- Return type:
bool
- _em(X_reconstructed: ndarray, Mu: ndarray, Sigma: ndarray, observed: ndarray, missing: ndarray) Tuple[ndarray, ndarray, ndarray]
The EM step.
- Parameters:
X_reconstructed – np.ndarray The current imputation approximation.
Mu – np.ndarray The previous value of the mean.
Sigma – np.ndarray The previous value of the variance.
observed – np.ndarray Mask of the observed values in the original input.
missing – np.ndarray Mask of the missing values in the original input.
- Returns:
The new approximation of the mean, the new approximation of the variance, and the new imputed dataset.
- Return type:
Tuple[ndarray, ndarray, ndarray]
- _impute_em(X: ndarray) ndarray
The EM imputation core loop.
- Parameters:
X – np.ndarray The dataset with missing values.
- Raises:
RuntimeError – raised if the static checks on the final result fail.
- Returns:
The dataset with imputed values.
- Return type:
ndarray
- fit_transform(**kwargs: Any) Any
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- class EMPlugin(random_state: int = 0, maxit: int = 500, convergence_threshold: float = 1e-08)
Bases:
ImputerPlugin
Imputation plugin for completing missing values using the EM strategy.
The EM algorithm is an optimization algorithm that assumes a distribution for the partially missing data and tries to maximize the expected complete data log-likelihood under that distribution.
- Steps:
For an input dataset X with missing values, we assume that the values are sampled from distribution N(Mu, Sigma).
We generate the “observed” and “missing” masks from X, and choose some initial values for Mu = Mu0 and Sigma = Sigma0.
The EM loop tries to approximate the (Mu, Sigma) pair by some iterative means under the conditional distribution of missing components.
The E step finds the conditional expectation of the “missing” data, given the observed values and current estimates of the parameters. These expectations are then substituted for the “missing” data.
In the M step, maximum likelihood estimates of the parameters are computed as though the missing data had been filled in.
The X_reconstructed contains the approximation after each iteration.
- Parameters:
maxit – int, default=500 maximum number of imputation rounds to perform.
convergence_threshold – float, default=1e-08 Minimum ratio difference between iterations before stopping.
Example
>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("EM")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])
Reference: “Maximum Likelihood from Incomplete Data via the EM Algorithm”, A. P. Dempster, N. M. Laird and D. B. Rubin
- _abc_impl = <_abc_data object>
- _fit(**kwargs: Any) Any
- _transform(**kwargs: Any) Any
- module_relative_path: Optional[Path]
- static name() str
hyperimpute.plugins.imputers.plugin_gain module
- class GainImputation(batch_size: int = 256, n_epochs: int = 1000, hint_rate: float = 0.9, loss_alpha: float = 10)
Bases:
TransformerMixin
GAIN Imputation for static data using Generative Adversarial Nets. The training steps are:
The generator imputes the missing components conditioned on what is actually observed, and outputs a completed vector.
The discriminator takes a completed vector and attempts to determine which components were actually observed and which were imputed.
- Parameters:
batch_size – int The batch size for the training steps.
n_epochs – int Number of epochs for training.
hint_rate – float Percentage of additional information for the discriminator.
loss_alpha – int Hyperparameter for the generator loss.
Paper: J. Yoon, J. Jordon, M. van der Schaar, “GAIN: Missing Data Imputation using Generative Adversarial Nets,” ICML, 2018. Original code: https://github.com/jsyoon0823/GAIN
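For reference, a hedged PyTorch sketch of the two losses described above, following the GAIN paper rather than this module's exact code (the tensor names D_prob, M, X and X_hat are assumptions):
import torch

def discriminator_loss(D_prob, M):
    # The discriminator should output ~1 on observed entries (M == 1) and ~0 on imputed ones.
    return -torch.mean(M * torch.log(D_prob + 1e-8) + (1 - M) * torch.log(1.0 - D_prob + 1e-8))

def generator_loss(D_prob, M, X, X_hat, loss_alpha=10.0):
    # Fool the discriminator on the imputed entries, and reconstruct the observed ones.
    adversarial = -torch.mean((1 - M) * torch.log(D_prob + 1e-8))
    reconstruction = torch.mean((M * X - M * X_hat) ** 2) / torch.mean(M)
    return adversarial + loss_alpha * reconstruction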
- fit(X: Tensor) GainImputation
Train the GAIN model.
- Parameters:
X – incomplete dataset.
- Returns:
the updated model.
- Return type:
self
- fit_transform(X: Tensor) Tensor
Imputes the provided dataset using the GAIN strategy.
- Parameters:
X – np.ndarray A dataset with missing values.
- Returns:
The imputed dataset.
- Return type:
Xhat
- transform(Xmiss: Tensor) Tensor
Return imputed data by trained GAIN model.
- Parameters:
Xmiss – the array with missing data
- Returns:
the array without missing data
- Return type:
torch.Tensor
- Raises:
RuntimeError – if the result contains np.nans.
- class GainModel(dim: int, h_dim: int, loss_alpha: float = 10)
Bases:
object
The core model for GAIN Imputation.
- Parameters:
dim – float Number of features.
h_dim – float Size of the hidden layer.
loss_alpha – int Hyperparameter for the generator loss.
- discr_loss(X: Tensor, M: Tensor, H: Tensor) Tensor
- discriminator(X: Tensor, hints: Tensor) Tensor
- gen_loss(X: Tensor, M: Tensor, H: Tensor) Tensor
- generator(X: Tensor, mask: Tensor) Tensor
- class GainPlugin(batch_size: int = 128, n_epochs: int = 100, hint_rate: float = 0.8, loss_alpha: int = 10, random_state: int = 0)
Bases:
ImputerPlugin
Imputation plugin for completing missing values using the GAIN strategy.
- Method:
Details in the GainImputation class implementation.
Example
>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("gain")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])
- _abc_impl = <_abc_data object>
- _fit(**kwargs: Any) Any
- _transform(**kwargs: Any) Any
- module_relative_path: Optional[Path]
- static name() str
- plugin
alias of
GainPlugin
- sample_M(m: int, n: int, p: float) ndarray
Hint Vector Generation
- Parameters:
m – number of rows
n – number of columns
p – hint rate
- Returns:
generated random values
- Return type:
np.ndarray
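A hint matrix of this kind is typically just a Bernoulli(p) binary sample. A minimal NumPy sketch of such a sampler, illustrating the idea rather than this module's exact code (sample_hint_sketch is a hypothetical name):
import numpy as np

def sample_hint_sketch(m, n, p):
    # Each entry is 1 with probability p and 0 otherwise.
    return (np.random.uniform(0.0, 1.0, size=(m, n)) < p).astype(float)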
- sample_Z(m: int, n: int) ndarray
Random sample generator for Z.
- Parameters:
m – number of rows
n – number of columns
- Returns:
generated random values
- Return type:
np.ndarray
- sample_idx(m: int, n: int) ndarray
Mini-batch generation
- Parameters:
m – number of rows
n – number of columns
- Returns:
generated random indices
- Return type:
np.ndarray
hyperimpute.plugins.imputers.plugin_miracle module
- class MiraclePlugin(lr: float = 0.001, batch_size: int = 1024, num_outputs: int = 1, n_hidden: int = 32, reg_lambda: float = 1, reg_beta: float = 1, DAG_only: bool = False, reg_m: float = 1.0, window: int = 10, max_steps: int = 400, seed_imputation: str = 'mean', random_state: int = 0)
Bases:
ImputerPlugin
MIRACLE (Missing data Imputation Refinement And Causal LEarning) iteratively refines the imputation of a baseline by simultaneously modeling the missingness generating mechanism and encouraging the imputation to be consistent with the causal structure of the data.
Example
>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("miracle")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])
Reference: “MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms”, Trent Kyono, Yao Zhang, Alexis Bellot, Mihaela van der Schaar
- _abc_impl = <_abc_data object>
- _fit(X: DataFrame, *args: Any, **kwargs: Any) MiraclePlugin
- _get_seed_imputer(method: str) ImputerPlugin
- _transform(X: DataFrame) DataFrame
- classmethod load(buff: bytes) MiraclePlugin
- module_relative_path: Optional[Path]
- static name() str
- save() bytes
- plugin
alias of
MiraclePlugin
hyperimpute.plugins.imputers.plugin_ice module
hyperimpute.plugins.imputers.plugin_mice module
- class MicePlugin(n_imputations: int = 1, max_iter: int = 100, tol: float = 0.001, initial_strategy: int = 0, imputation_order: int = 0, random_state: int = 0)
Bases:
ImputerPlugin
Imputation plugin for completing missing values using the Multivariate Iterative chained equations and multiple imputations.
- Method:
Multivariate imputation by chained equations (MICE) models each feature with missing values as a function of the other features, in a round-robin fashion. For each step of the round-robin imputation, we use a BayesianRidge estimator, which performs a regularized linear regression. The class sklearn.impute.IterativeImputer is able to generate multiple imputations of the same incomplete dataset, so a regression or classification model can be trained on different imputations of the same data. Setting sample_posterior=True for the IterativeImputer randomly draws each filled-in value from the Gaussian posterior of the predictions. If each IterativeImputer uses a different random_state, this results in multiple imputations, each of which can be used to train a predictive model. The final result is the average of all the n_imputations estimates (a minimal sklearn-only sketch of this scheme follows the parameter list below).
- Parameters:
n_imputations – int, default=1 Number of multiple imputations to perform.
max_iter – int, default=100 Maximum number of imputation rounds to perform.
random_state – int, default=0 Seed of the pseudo random number generator to use.
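To make the scheme above concrete, here is a hedged, sklearn-only sketch of multiple imputation by averaging several IterativeImputer runs with sample_posterior=True; it illustrates the idea rather than reproducing this plugin's code (mice_average_sketch is a hypothetical helper):
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def mice_average_sketch(X, n_imputations=5, max_iter=100):
    # Each imputer draws from the Gaussian posterior of its predictions, so
    # different random states yield different imputations; we average them.
    draws = [
        IterativeImputer(sample_posterior=True, max_iter=max_iter, random_state=seed).fit_transform(X)
        for seed in range(n_imputations)
    ]
    return np.mean(draws, axis=0)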
Example
>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("mice")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])
- _abc_impl = <_abc_data object>
- _fit(**kwargs: Any) Any
- _transform(**kwargs: Any) Any
- imputation_order_vals = ['ascending', 'descending', 'roman', 'arabic', 'random']
- initial_strategy_vals = ['mean', 'median', 'most_frequent', 'constant']
- module_relative_path: Optional[Path]
- static name() str
- plugin
alias of
MicePlugin
hyperimpute.plugins.imputers.plugin_missforest module
hyperimpute.plugins.imputers.plugin_sinkhorn module
hyperimpute.plugins.imputers.plugin_softimpute module
- class SoftImpute(maxit: int = 1000, convergence_threshold: float = 1e-05, max_rank: int = 2, shrink_lambda: float = 0, cv_len: int = 3, random_state: int = 0)
Bases:
TransformerMixin
The SoftImpute algorithm fits a low-rank matrix approximation to a matrix with missing values via nuclear-norm regularization. The algorithm can be used to impute quantitative data. To calibrate the nuclear-norm regularization parameter (shrink_lambda), we perform cross-validation (_cv_softimpute).
- Parameters:
maxit – int, default=1000 Maximum number of imputation rounds to perform.
convergence_threshold – float, default=1e-5 Minimum ratio difference between iterations before stopping.
max_rank – int, default=2 Perform a truncated SVD on each iteration with this value as its rank.
shrink_lambda – float, default=0 Value by which we shrink singular values on each iteration. If it is missing, it is calibrated using cross-validation.
cv_len – int, default=3 The length of the grid on which the cross-validation is performed.
Example
>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("softimpute")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])
Reference: “Spectral Regularization Algorithms for Learning Large Incomplete Matrices”, by Mazumder, Hastie, and Tibshirani.
- _approximate_shrink_val(X: ndarray) float
Try to calibrate the shrinkage step using cross-validation. It simulates more missing items and tests the performance of different shrinkage values.
- Parameters:
X – np.ndarray The dataset to use.
- Returns:
The value to use for the shrinkage step.
- Return type:
float
- _converged(Xold: ndarray, X: ndarray, mask: ndarray) bool
Checks if the SoftImpute algorithm has converged.
- Parameters:
Xold – np.ndarray The previous version of the imputed dataset.
X – np.ndarray The new version of the imputed dataset.
mask – np.ndarray The original missing mask.
- Returns:
True/False if the algorithm has converged.
- Return type:
bool
- _simulate_more_nan(X: ndarray, mask: ndarray) ndarray
Generate more missing values for cross-validation.
- Parameters:
X – np.ndarray The dataset to use.
mask – np.ndarray The existing missing positions
- Returns:
A new version of X with more missing values.
- Return type:
Xsim
- _softimpute(X: ndarray, shrink_val: float) ndarray
Core loop of the algorithm. It approximates the imputed X using SVD decomposition in a loop, until the algorithm converges or the maxit limit is reached.
- Parameters:
X – np.ndarray The previous version of the imputed dataset.
shrink_val – float The value by which we shrink singular values on each iteration.
- Returns:
The imputed dataset.
- Return type:
X_hat
- _svd(X: ndarray, shrink_val: float) ndarray
Reconstructs X from low-rank thresholded SVD.
- Parameters:
X – np.ndarray The previous version of the imputed dataset.
shrink_val – float The value by which we shrink singular values on each iteration.
- Raises:
RuntimeError – raised if the static checks on the final result fail.
- Returns:
new candidate for the result.
- Return type:
X_reconstructed
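The operation described above can be sketched in a few NumPy lines: soft-threshold the singular values by shrink_val and rebuild the matrix. This is an illustration under that assumption, not the library's exact implementation (it also ignores the max_rank truncation):
import numpy as np

def svd_soft_threshold_sketch(X_filled, shrink_val):
    # X_filled: current imputation candidate, with missing entries already filled in.
    U, s, Vt = np.linalg.svd(X_filled, full_matrices=False)
    s_shrunk = np.maximum(s - shrink_val, 0.0)   # soft-threshold the singular values
    return (U * s_shrunk) @ Vt                   # low-rank reconstruction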
- fit(**kwargs: Any) Any
- fit_transform(**kwargs: Any) Any
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- classmethod load(buff: bytes) SoftImpute
- save() bytes
- transform(**kwargs: Any) Any
- class SoftImputePlugin(maxit: int = 1000, convergence_threshold: float = 1e-05, max_rank: int = 2, shrink_lambda: float = 0, cv_len: int = 3, random_state: int = 0)
Bases:
ImputerPlugin
Imputation plugin for completing missing values using the SoftImpute strategy.
- Method:
Details in the SoftImpute class implementation.
Example
>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("softimpute")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])
              0             1             2             3
0  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00
1  3.820605e-16  1.708249e-16  1.708249e-16  3.820605e-16
2  1.000000e+00  2.000000e+00  2.000000e+00  1.000000e+00
3  2.000000e+00  2.000000e+00  2.000000e+00  2.000000e+00
- _abc_impl = <_abc_data object>
- _fit(**kwargs: Any) Any
- _transform(**kwargs: Any) Any
- module_relative_path: Optional[Path]
- static name() str
- plugin
alias of
SoftImputePlugin
hyperimpute.plugins.imputers.plugin_miwae module
- class MIWAEPlugin(n_epochs: int = 500, batch_size: int = 256, latent_size: int = 1, n_hidden: int = 1, random_state: int = 0, K: int = 20)
Bases:
ImputerPlugin
MIWAE imputation plugin
- Parameters:
n_epochs – int Number of training iterations
batch_size – int Batch size
latent_size – int dimension of the latent space
n_hidden – int number of hidden units
K – int Number of importance samples (IS) used during training
random_state – int random seed
Example
>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("miwae")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])
Reference: “MIWAE: Deep Generative Modelling and Imputation of Incomplete Data”, Pierre-Alexandre Mattei, Jes Frellsen Original code: https://github.com/pamattei/miwae
- _abc_impl = <_abc_data object>
- _fit(X: DataFrame, *args: Any, **kwargs: Any) MIWAEPlugin
- _miwae_impute(iota_x: Tensor, mask: Tensor, L: int) Tensor
- _miwae_loss(iota_x: Tensor, mask: Tensor) Tensor
- _transform(X: DataFrame) DataFrame
- module_relative_path: Optional[Path]
- static name() str
- plugin
alias of
MIWAEPlugin
- weights_init(layer: Any) None
hyperimpute.plugins.imputers.plugin_mean module
- class MeanPlugin(random_state: int = 0)
Bases:
ImputerPlugin
Imputation plugin for completing missing values using the Mean Imputation strategy.
- Method:
The Mean Imputation strategy replaces the missing values using the mean along each column.
Example
>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("mean")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])
- _abc_impl = <_abc_data object>
- _fit(X: DataFrame, *args: Any, **kwargs: Any) MeanPlugin
- _transform(X: DataFrame) DataFrame
- module_relative_path: Optional[Path]
- static name() str
- plugin
alias of
MeanPlugin
hyperimpute.plugins.imputers.plugin_median module
- class MedianPlugin(random_state: int = 0)
Bases:
ImputerPlugin
Imputation plugin for completing missing values using the Median Imputation strategy.
- Method:
The Median Imputation strategy replaces the missing values using the median along each column.
Example
>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("median")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])
     0    1    2    3
0  1.0  1.0  1.0  1.0
1  1.0  2.0  2.0  1.0
2  1.0  2.0  2.0  1.0
3  2.0  2.0  2.0  2.0
- _abc_impl = <_abc_data object>
- _fit(X: DataFrame, *args: Any, **kwargs: Any) MedianPlugin
- _transform(X: DataFrame) DataFrame
- module_relative_path: Optional[Path]
- static name() str
- plugin
alias of
MedianPlugin
Prediction models
Classifiers
hyperimpute.plugins.prediction.classifiers.plugin_logistic_regression module
- class LogisticRegressionPlugin(C: float = 1.0, solver: int = 1, multi_class: int = 0, class_weight: int = 0, max_iter: int = 10000, penalty: str = 'l2', model: Optional[Any] = None, random_state: int = 0, hyperparam_search_iterations: Optional[int] = None, **kwargs: Any)
Bases:
ClassifierPlugin
Classification plugin based on the Logistic Regression classifier.
- Method:
Logistic regression is a linear model for classification rather than regression. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
- Parameters:
C – float Inverse of regularization strength; must be a positive float.
solver – str Algorithm to use in the optimization problem: [‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’]
multi_class – str If the option chosen is ‘ovr’, then a binary problem is fit for each label. For ‘multinomial’ the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary. ‘multinomial’ is unavailable when solver=’liblinear’. ‘auto’ selects ‘ovr’ if the data is binary, or if solver=’liblinear’, and otherwise selects ‘multinomial’.
class_weight – str Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.
max_iter – int Maximum number of iterations taken for the solvers to converge.
Example
>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="classifiers").get("logistic_regression")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y) # returns the probabilities for each class
- _abc_impl = <_abc_data object>
- _fit(X: DataFrame, *args: Any, **kwargs: Any) LogisticRegressionPlugin
- _predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
- _predict_proba(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
- classes = ['auto', 'ovr', 'multinomial']
- module_relative_path: Optional[Path]
- static name() str
- solvers = ['newton-cg', 'lbfgs', 'sag', 'saga']
- weights = ['balanced', None]
- plugin
alias of
LogisticRegressionPlugin
hyperimpute.plugins.prediction.classifiers.plugin_random_forest module
- class RandomForestPlugin(n_estimators: int = 100, criterion: int = 0, max_features: int = 0, min_samples_split: int = 2, min_samples_leaf: int = 1, max_depth: Optional[int] = 3, random_state: int = 0, hyperparam_search_iterations: Optional[int] = None, **kwargs: Any)
Bases:
ClassifierPlugin
Classification plugin based on Random forests.
- Method:
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
- Parameters:
n_estimators – int The number of trees in the forest.
criterion – str The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
max_features – str The number of features to consider when looking for the best split.
min_samples_split – int The minimum number of samples required to split an internal node.
bootstrap – bool Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
min_samples_leaf – int The minimum number of samples required to be at a leaf node.
Example
>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="classifiers").get("random_forest")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y)
- _abc_impl = <_abc_data object>
- _fit(X: DataFrame, *args: Any, **kwargs: Any) RandomForestPlugin
- _predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
- _predict_proba(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
- criterions = ['gini', 'entropy']
- features = ['sqrt', 'log2', None]
- module_relative_path: Optional[Path]
- static name() str
- plugin
alias of
RandomForestPlugin
hyperimpute.plugins.prediction.classifiers.plugin_xgboost module
- class XGBoostPlugin(n_estimators: int = 100, reg_lambda: Optional[float] = None, reg_alpha: Optional[float] = None, colsample_bytree: Optional[float] = None, colsample_bynode: Optional[float] = None, colsample_bylevel: Optional[float] = None, max_depth: Optional[int] = 3, subsample: Optional[float] = None, lr: Optional[float] = None, min_child_weight: Optional[int] = None, max_bin: int = 256, booster: int = 0, grow_policy: int = 0, nthread: int = 1, random_state: int = 0, eta: float = 0.3, hyperparam_search_iterations: Optional[int] = None, **kwargs: Any)
Bases:
ClassifierPlugin
Classification plugin based on the XGBoost classifier.
- Method:
Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler, weaker models. The XGBoost algorithm robustly handles a variety of data types, relationships, and distributions, and exposes a large number of hyperparameters that can be fine-tuned.
- Parameters:
n_estimators – int The maximum number of estimators at which boosting is terminated.
max_depth – int Maximum depth of a tree.
reg_lambda – float L2 regularization term on weights (xgb’s lambda).
reg_alpha – float L1 regularization term on weights (xgb’s alpha).
colsample_bytree – float Subsample ratio of columns when constructing each tree.
colsample_bynode – float Subsample ratio of columns for each split.
colsample_bylevel – float Subsample ratio of columns for each level.
subsample – float Subsample ratio of the training instance.
lr – float Boosting learning rate
booster – str Specify which booster to use: gbtree, gblinear or dart.
min_child_weight – int Minimum sum of instance weight(hessian) needed in a child.
max_bin – int Number of bins for histogram construction.
random_state – float Random number seed.
Example
>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="classifiers").get("xgboost")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y)
- _abc_impl = <_abc_data object>
- _fit(X: DataFrame, *args: Any, **kwargs: Any) XGBoostPlugin
- _predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
- _predict_proba(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
- booster = ['gbtree', 'gblinear', 'dart']
- grow_policy = ['depthwise', 'lossguide']
- module_relative_path: Optional[Path]
- static name() str
- plugin
alias of
XGBoostPlugin
hyperimpute.plugins.prediction.classifiers.plugin_catboost module
- class CatBoostPlugin(n_estimators: Optional[int] = 10, depth: Optional[int] = None, grow_policy: int = 0, model: Optional[Any] = None, hyperparam_search_iterations: Optional[int] = None, random_state: int = 0, l2_leaf_reg: float = 3, learning_rate: float = 0.001, min_data_in_leaf: int = 1, random_strength: float = 1, **kwargs: Any)
Bases:
ClassifierPlugin
Classification plugin based on the CatBoost framework.
- Method:
CatBoost provides a gradient boosting framework which handles categorical features using a permutation-driven alternative to the classical algorithm. It uses Ordered Boosting to overcome overfitting and Symmetric Trees for faster execution.
- Parameters:
learning_rate – float The learning rate used for training.
depth – int
iterations – int
grow_policy – int
Example
>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="classifiers").get("catboost")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y) # returns the probabilities for each class
- _abc_impl = <_abc_data object>
- _fit(X: DataFrame, *args: Any, **kwargs: Any) CatBoostPlugin
- _predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
- _predict_proba(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
- grow_policies: List[Optional[str]] = [None, 'Depthwise', 'SymmetricTree', 'Lossguide']
- static name() str
- plugin
alias of
CatBoostPlugin
hyperimpute.plugins.prediction.classifiers.plugin_neural_nets module
- class BasicNet(n_unit_in: int, categories_cnt: int, n_layers_hidden: int = 1, n_units_hidden: int = 100, nonlin: str = 'relu', lr: float = 0.001, weight_decay: float = 0.001, n_iter: int = 300, batch_size: int = 1024, n_iter_print: int = 10, random_state: int = 0, patience: int = 10, n_iter_min: int = 100, dropout: float = 0.1, clipping_value: int = 1, batch_norm: bool = True, early_stopping: bool = True)
Bases:
Module
Basic neural net.
- Parameters:
n_unit_in (int) – Number of features
categories (int) –
n_layers_hidden (int) – Number of hypothesis layers (n_layers_hidden x n_units_hidden + 1 x Linear layer)
n_units_hidden (int) – Number of hidden units in each hypothesis layer
nonlin (string, default 'relu') – Nonlinearity to use in the NN. Can be 'elu', 'relu', 'selu' or 'leaky_relu'.
lr (float) – learning rate for optimizer. step_size equivalent in the JAX version.
weight_decay (float) – l2 (ridge) penalty for the weights.
n_iter (int) – Maximum number of iterations.
batch_size (int) – Batch size
n_iter_print (int) – Number of iterations after which to print updates and check the validation loss.
random_state (int) – random_state used
val_split_prop (float) – Proportion of samples used for validation split (can be 0)
patience (int) – Number of iterations to wait before early stopping after decrease in validation loss
n_iter_min (int) – Minimum number of iterations to go through before starting early stopping
clipping_value (int, default 1) – Gradients clipping value
- _backward_hooks: Dict[int, Callable]
- _buffers: Dict[str, Optional[Tensor]]
- _check_tensor(X: Tensor) Tensor
- _forward_hooks: Dict[int, Callable]
- _forward_pre_hooks: Dict[int, Callable]
- _is_full_backward_hook: Optional[bool]
- _load_state_dict_post_hooks: Dict[int, Callable]
- _load_state_dict_pre_hooks: Dict[int, Callable]
- _modules: Dict[str, Optional[Module]]
- _non_persistent_buffers_set: Set[str]
- _parameters: Dict[str, Optional[Parameter]]
- _state_dict_hooks: Dict[int, Callable]
- forward(X: Tensor) Tensor
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- train(X: Tensor, y: Tensor) BasicNet
Sets the module in training mode.
This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.
- Parameters:
mode (bool) – whether to set training mode (True) or evaluation mode (False). Default: True.
- Returns:
self
- Return type:
Module
- training: bool
- class NeuralNetsPlugin(n_layers_hidden: int = 1, n_units_hidden: int = 100, nonlin: str = 'relu', lr: float = 0.001, weight_decay: float = 0.001, n_iter: int = 1000, batch_size: int = 128, n_iter_print: int = 10, random_state: int = 0, patience: int = 10, n_iter_min: int = 100, dropout: float = 0.1, clipping_value: int = 1, batch_norm: bool = True, early_stopping: bool = True, hyperparam_search_iterations: Optional[int] = None, **kwargs: Any)
Bases:
ClassifierPlugin
Classification plugin based on Neural networks.
Example
>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="classifiers").get("neural_nets")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y) # returns the probabilities for each class
- _abc_impl = <_abc_data object>
- _fit(X: DataFrame, *args: Any, **kwargs: Any) NeuralNetsPlugin
- _predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
- _predict_proba(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
- module_relative_path: Optional[Path]
- static name() str
- plugin
alias of
NeuralNetsPlugin
Regressors
hyperimpute.plugins.prediction.regression.plugin_linear_regression module
- class LinearRegressionPlugin(solver: int = 0, max_iter: Optional[int] = 10000, tol: float = 0.001, hyperparam_search_iterations: Optional[int] = None, random_state: int = 0, **kwargs: Any)
Bases:
RegressionPlugin
Regression plugin based on the Linear Regression.
Example
>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="regression").get("linear_regression")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y)
- _abc_impl = <_abc_data object>
- _fit(X: DataFrame, *args: Any, **kwargs: Any) LinearRegressionPlugin
- _predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
- module_relative_path: Optional[Path]
- static name() str
- solvers = ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
- plugin
alias of
LinearRegressionPlugin
hyperimpute.plugins.prediction.regression.plugin_random_forest_regressor module
- class RandomForestRegressionPlugin(n_estimators: int = 100, criterion: int = 0, max_features: int = 0, min_samples_split: int = 2, min_samples_leaf: int = 1, max_depth: Optional[int] = 3, hyperparam_search_iterations: Optional[int] = None, random_state: int = 0, **kwargs: Any)
Bases:
RegressionPlugin
Regression plugin based on Random forests.
- Method:
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
- Parameters:
n_estimators – int The number of trees in the forest.
criterion – str The function to measure the quality of a split. Supported criteria are “squared_error”, “absolute_error” and “poisson”.
max_features – str The number of features to consider when looking for the best split.
min_samples_split – int The minimum number of samples required to split an internal node.
bootstrap – bool Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
min_samples_leaf – int The minimum number of samples required to be at a leaf node.
Example
>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="regression").get("random_forest")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y)
- _abc_impl = <_abc_data object>
- _fit(X: DataFrame, *args: Any, **kwargs: Any) RandomForestRegressionPlugin
- _predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
- criterions = ['squared_error', 'absolute_error', 'poisson']
- features = ['sqrt', 'log2', None]
- module_relative_path: Optional[Path]
- static name() str
- plugin
alias of
RandomForestRegressionPlugin
hyperimpute.plugins.prediction.regression.plugin_xgboost_regressor module
- class XGBoostRegressorPlugin(reg_lambda: Optional[float] = None, reg_alpha: Optional[float] = None, colsample_bytree: Optional[float] = None, colsample_bynode: Optional[float] = None, colsample_bylevel: Optional[float] = None, n_estimators: int = 100, max_depth: Optional[int] = 3, lr: Optional[float] = None, random_state: int = 0, subsample: Optional[float] = None, min_child_weight: Optional[int] = None, max_bin: int = 256, booster: int = 0, grow_policy: int = 0, eta: float = 0.3, hyperparam_search_iterations: Optional[int] = None, **kwargs: Any)
Bases:
RegressionPlugin
Regression plugin based on the XGBoostRegressor.
- Method:
Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler, weaker models. The XGBoostRegressor algorithm robustly handles a variety of data types, relationships, and distributions, and exposes a large number of hyperparameters that can be fine-tuned.
- Parameters:
n_estimators – int The maximum number of estimators at which boosting is terminated.
max_depth – int Maximum depth of a tree.
reg_lambda – float L2 regularization term on weights (xgb’s lambda).
reg_alpha – float L1 regularization term on weights (xgb’s alpha).
colsample_bytree – float Subsample ratio of columns when constructing each tree.
colsample_bynode – float Subsample ratio of columns for each split.
colsample_bylevel – float Subsample ratio of columns for each level.
subsample – float Subsample ratio of the training instance.
learning_rate – float Boosting learning rate
booster – str Specify which booster to use: gbtree, gblinear or dart.
min_child_weight – int Minimum sum of instance weight(hessian) needed in a child.
max_bin – int Number of bins for histogram construction.
tree_method – str Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoostRegressor will choose the most conservative option available.
random_state – float Random number seed.
Example
>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="regressors").get("xgboost")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y)
- _abc_impl = <_abc_data object>
- _fit(X: DataFrame, *args: Any, **kwargs: Any) XGBoostRegressorPlugin
- _predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
- grow_policy = ['depthwise', 'lossguide']
- module_relative_path: Optional[Path]
- static name() str
- plugin
alias of
XGBoostRegressorPlugin
hyperimpute.plugins.prediction.regression.plugin_catboost_regressor module
- class CatBoostRegressorPlugin(depth: Optional[int] = None, grow_policy: int = 0, n_estimators: Optional[int] = 10, hyperparam_search_iterations: Optional[int] = None, random_state: int = 0, l2_leaf_reg: float = 3, learning_rate: float = 0.001, min_data_in_leaf: int = 1, random_strength: float = 1, **kwargs: Any)
Bases:
RegressionPlugin
Regression plugin based on the CatBoost framework.
- Method:
CatBoost provides a gradient boosting framework which handles categorical features using a permutation-driven alternative to the classical algorithm. It uses Ordered Boosting to overcome overfitting and Symmetric Trees for faster execution.
Example
>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="regression").get("catboost_regressor")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y)
- _abc_impl = <_abc_data object>
- _fit(X: DataFrame, *args: Any, **kwargs: Any) CatBoostRegressorPlugin
- _predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
- _predict_proba(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
- grow_policies: List[Optional[str]] = [None, 'Depthwise', 'SymmetricTree', 'Lossguide']
- static name() str
- plugin
alias of
CatBoostRegressorPlugin
hyperimpute.plugins.prediction.regression.plugin_neural_nets_regression module
- class BasicNet(n_unit_in: int, n_layers_hidden: int = 1, n_units_hidden: int = 100, nonlin: str = 'relu', lr: float = 0.001, weight_decay: float = 0.001, n_iter: int = 300, batch_size: int = 1024, n_iter_print: int = 10, random_state: int = 0, patience: int = 10, n_iter_min: int = 100, dropout: float = 0.1, clipping_value: int = 1, batch_norm: bool = True, early_stopping: bool = True)
Bases:
Module
Basic neural net.
- Parameters:
n_unit_in (int) – Number of features
n_layers_hidden (int) – Number of hypothesis layers (n_layers_hidden x n_units_hidden + 1 x Linear layer)
n_units_hidden (int) – Number of hidden units in each hypothesis layer
nonlin (string, default 'relu') – Nonlinearity to use in the NN. Can be 'elu', 'relu', 'selu' or 'leaky_relu'.
lr (float) – learning rate for optimizer. step_size equivalent in the JAX version.
weight_decay (float) – l2 (ridge) penalty for the weights.
n_iter (int) – Maximum number of iterations.
batch_size (int) – Batch size
n_iter_print (int) – Number of iterations after which to print updates and check the validation loss.
seed (int) – Seed used
val_split_prop (float) – Proportion of samples used for validation split (can be 0)
patience (int) – Number of iterations to wait before early stopping after decrease in validation loss
n_iter_min (int) – Minimum number of iterations to go through before starting early stopping
clipping_value (int, default 1) – Gradients clipping value
- _backward_hooks: Dict[int, Callable]
- _buffers: Dict[str, Optional[Tensor]]
- _check_tensor(X: Tensor) Tensor
- _forward_hooks: Dict[int, Callable]
- _forward_pre_hooks: Dict[int, Callable]
- _is_full_backward_hook: Optional[bool]
- _load_state_dict_post_hooks: Dict[int, Callable]
- _load_state_dict_pre_hooks: Dict[int, Callable]
- _modules: Dict[str, Optional[Module]]
- _non_persistent_buffers_set: Set[str]
- _parameters: Dict[str, Optional[Parameter]]
- _state_dict_hooks: Dict[int, Callable]
- forward(X: Tensor) Tensor
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- train(X: Tensor, y: Tensor) BasicNet
Sets the module in training mode.
This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.
- Parameters:
mode (bool) – whether to set training mode (True) or evaluation mode (False). Default: True.
- Returns:
self
- Return type:
Module
- training: bool
- class NeuralNetsRegressionPlugin(n_layers_hidden: int = 1, n_units_hidden: int = 100, nonlin: str = 'relu', lr: float = 0.001, weight_decay: float = 0.001, n_iter: int = 1000, batch_size: int = 512, n_iter_print: int = 10, patience: int = 10, n_iter_min: int = 100, dropout: float = 0.1, clipping_value: int = 1, batch_norm: bool = True, early_stopping: bool = True, hyperparam_search_iterations: Optional[int] = None, random_state: int = 0, **kwargs: Any)
Bases:
RegressionPlugin
Regression plugin based on Neural networks.
Example
>>> from hyperimpute.plugins.prediction import Predictions
>>> plugin = Predictions(category="regression").get("neural_nets_regression")
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> plugin.fit_predict(X, y)
- _abc_impl = <_abc_data object>
- _fit(X: DataFrame, *args: Any, **kwargs: Any) NeuralNetsRegressionPlugin
- _predict(X: DataFrame, *args: Any, **kwargs: Any) DataFrame
- module_relative_path: Optional[Path]
- static name() str
- plugin
alias of
NeuralNetsRegressionPlugin
Utils
Utils
hyperimpute.plugins.utils.simulate module
Original code: https://rmisstastic.netlify.app/how-to/python/generate_html/how%20to%20generate%20missing%20values
- MAR_mask(X: ndarray, p: float, p_obs: float, sample_columns: bool = True) ndarray
Missing at random mechanism with a logistic masking model. First, a subset of variables with no missing values is randomly selected. The remaining variables have missing values according to a logistic model with random weights, re-scaled so as to attain the desired proportion of missing values on those variables.
- Parameters:
X – Data for which missing values will be simulated.
p – Proportion of missing values to generate for variables which will have missing values.
p_obs – Proportion of variables with no missing values that will be used for the logistic masking model.
- Returns:
Mask of generated missing values (True if the value is missing).
- Return type:
mask
- MNAR_mask_logistic(X: ndarray, p: float, p_params: float = 0.3, exclude_inputs: bool = True) ndarray
Missing not at random mechanism with a logistic masking model. It implements two mechanisms: (i) Missing probabilities are selected with a logistic model, taking all variables as inputs. Hence, values that are inputs can also be missing. (ii) Variables are split into a set of inputs for a logistic model, and a set whose missing probabilities are determined by the logistic model. These inputs are then masked MCAR (hence, missing values from the second set will depend on masked values). In either case, weights are random and the intercept is selected to attain the desired proportion of missing values.
- Parameters:
X – Data for which missing values will be simulated.
p – Proportion of missing values to generate for variables which will have missing values.
p_params – Proportion of variables that will be used for the logistic masking model (only if exclude_inputs).
exclude_inputs – True: mechanism (ii) is used, False: (i)
- Returns:
Mask of generated missing values (True if the value is missing).
- Return type:
mask
- MNAR_mask_quantiles(X: ndarray, p: float, q: float, p_params: float, cut: str = 'both', MCAR: bool = False) ndarray
Missing not at random mechanism with quantile censorship. First, a subset of variables which will have missing variables is randomly selected. Then, missing values are generated on the q-quantiles at random. Since missingness depends on quantile information, it depends on masked values, hence this is a MNAR mechanism.
- Parameters:
X – Data for which missing values will be simulated.
p – Proportion of missing values to generate for variables which will have missing values.
q – Quantile level at which the cuts should occur
p_params – Proportion of variables that will have missing values
cut – ‘both’, ‘upper’ or ‘lower’. Where the cut should be applied. For instance, if q=0.25 and cut=’upper’, then missing values will be generated in the upper quartiles of selected variables.
MCAR – If true, masks variables that were not selected for quantile censorship with a MCAR mechanism.
- Returns:
Mask of generated missing values (True if the value is missing).
- Return type:
mask
- MNAR_self_mask_logistic(X: ndarray, p: float) ndarray
Missing not at random mechanism with a logistic self-masking model. Variables have missing values probabilities given by a logistic model, taking the same variable as input (hence, missingness is independent from one variable to another). The intercepts are selected to attain the desired missing rate.
- Parameters:
X – Data for which missing values will be simulated.
p – Proportion of missing values to generate for variables which will have missing values.
- Returns:
Mask of generated missing values (True if the value is missing).
- Return type:
mask
- fit_intercepts(X: ndarray, coeffs: ndarray, p: float, self_mask: bool = False) ndarray
- pick_coeffs(X: ndarray, idxs_obs: List[int] = [], idxs_nas: List[int] = [], self_mask: bool = False) ndarray
- simulate_nan(X: ndarray, p_miss: float, mecha: str = 'MCAR', opt: str = 'logistic', p_obs: float = 0.5, q: float = 0, sample_columns: bool = True) dict
Generate missing values for a specific missing-data mechanism and proportion of missing values.
- Parameters:
X – Data for which missing values will be simulated.
p_miss – Proportion of missing values to generate for variables which will have missing values.
mecha – Indicates the missing-data mechanism to be used. “MCAR” by default, “MAR”, “MNAR” or “MNARsmask”
opt – For mecha = “MNAR”, it indicates how the missing-data mechanism is generated: using a logistic regression (“logistic”), a quantile censorship (“quantile”) or logistic regression for generating a self-masked MNAR mechanism (“selfmasked”).
p_obs – If mecha = “MAR”, or mecha = “MNAR” with opt = “logistic” or “quantile”, proportion of variables with no missing values that will be used for the logistic masking model.
q – If mecha = “MNAR” and opt = “quantile”, quantile level at which the cuts should occur.
- Returns:
‘X_init’: the initial data matrix.
’X_incomp’: the data with the generated missing values.
’mask’: a matrix indexing the generated missing values.
- Return type:
A dictionary containing
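A hedged usage sketch based on the signature and the return keys documented above:
import numpy as np
from hyperimpute.plugins.utils.simulate import simulate_nan

X = np.random.rand(100, 4)
res = simulate_nan(X, p_miss=0.3, mecha="MAR", p_obs=0.5)
X_incomp = res["X_incomp"]   # data with the generated missing values
mask = res["mask"]           # indexes the generated missing values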
hyperimpute.utils.benchmarks module
hyperimpute.utils.tester module
- class Eval(metric: str = 'aucroc')
Bases:
object
Helper class for evaluating the performance of the models.
- Parameters:
metric – str, default=”aucroc” The type of metric to use for evaluation. Potential values: [“aucprc”, “aucroc”].
- average_precision_score(y_test: ndarray, y_pred_proba: ndarray) float
- get_metric() str
- roc_auc_score(y_test: ndarray, y_pred_proba: ndarray) float
- score_proba(y_test: ndarray, y_pred_proba: ndarray) float
- evaluate_estimator(estimator: Any, X: DataFrame, Y: Series, n_folds: int = 3, metric: str = 'aucroc', seed: int = 0, pretrained: bool = False, *args: Any, **kwargs: Any) Dict
- evaluate_regression(estimator: Any, X: DataFrame, Y: Series, n_folds: int = 3, seed: int = 0, *args: Any, **kwargs: Any) Dict
- score_classification_model(estimator: Any, X_train: DataFrame, X_test: Series, y_train: DataFrame, y_test: Series) float
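A hedged usage sketch based on the signatures above, combining a prediction plugin with evaluate_estimator (the exact contents of the returned dictionary are not documented here, so it is simply printed):
from sklearn.datasets import load_breast_cancer
from hyperimpute.plugins.prediction import Predictions
from hyperimpute.utils.tester import evaluate_estimator

X, y = load_breast_cancer(as_frame=True, return_X_y=True)
model = Predictions(category="classifiers").get("logistic_regression")
scores = evaluate_estimator(model, X, y, n_folds=3, metric="aucroc")
print(scores)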