hyperimpute.plugins.imputers.plugin_EM module
- class EM(maxit: int = 500, convergence_threshold: float = 1e-08)
Bases:
TransformerMixin
The EM algorithm is an optimization algorithm that assumes a distribution for the partially missing data and tries to maximize the expected complete data log-likelihood under that distribution.
- Steps:
For an input dataset X with missing values, we assume that the values are sampled from distribution N(Mu, Sigma).
We generate the “observed” and “missing” masks from X, and choose some initial values for Mu = Mu0 and Sigma = Sigma0.
The EM loop iteratively refines the (Mu, Sigma) estimates using the conditional distribution of the missing components given the observed ones.
The E step finds the conditional expectation of the “missing” data, given the observed values and current estimates of the parameters. These expectations are then substituted for the “missing” data.
In the M step, maximum likelihood estimates of the parameters are computed as though the missing data had been filled in.
The X_reconstructed contains the approximation after each iteration.
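The steps above can be sketched in plain NumPy. This is a minimal illustration of EM imputation under a multivariate normal assumption, not the library's implementation; the helper name `em_impute` and the ridge-free handling of `Sigma` are assumptions for the sketch.

```python
import numpy as np

def em_impute(X, maxit=500, convergence_threshold=1e-8):
    # Hypothetical sketch of EM imputation assuming X ~ N(Mu, Sigma).
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial estimates Mu0, Sigma0 from the observed values.
    Mu = np.nanmean(X, axis=0)
    X_rec = np.where(missing, Mu, X)  # fill missing cells with column means
    Sigma = np.cov(X_rec, rowvar=False)
    for _ in range(maxit):
        for i in range(X.shape[0]):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():
                # Fully missing row: best guess is the current mean.
                X_rec[i, m] = Mu[m]
                continue
            # E step: conditional expectation of the missing entries
            # given the observed entries and the current (Mu, Sigma).
            Soo = Sigma[np.ix_(o, o)]
            Smo = Sigma[np.ix_(m, o)]
            X_rec[i, m] = Mu[m] + Smo @ np.linalg.solve(Soo, X_rec[i, o] - Mu[o])
        # M step: MLE of the parameters as if the data were complete.
        Mu_new = X_rec.mean(axis=0)
        Sigma_new = np.cov(X_rec, rowvar=False)
        done = (np.linalg.norm(Mu_new - Mu) < convergence_threshold
                and np.linalg.norm(Sigma_new - Sigma) < convergence_threshold)
        Mu, Sigma = Mu_new, Sigma_new
        if done:
            break
    return X_rec
```

Observed entries are never overwritten; only the masked cells are re-estimated on every pass, which mirrors the E/M split described above.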
- Parameters:
maxit – int, default=500 Maximum number of imputation rounds to perform.
convergence_threshold – float, default=1e-08 Minimum ratio difference between iterations before stopping.
Paper: “Maximum Likelihood from Incomplete Data via the EM Algorithm”, A. P. Dempster, N. M. Laird and D. B. Rubin
- _converged(Mu: ndarray, Sigma: ndarray, Mu_new: ndarray, Sigma_new: ndarray) bool
Checks if the EM loop has converged.
- Parameters:
Mu – np.ndarray The previous value of the mean.
Sigma – np.ndarray The previous value of the variance.
Mu_new – np.ndarray The new value of the mean.
Sigma_new – np.ndarray The new value of the variance.
- Returns:
True if the algorithm has converged, False otherwise.
- Return type:
bool
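A convergence check of this shape can be expressed as a norm test on the parameter updates. This is a hedged sketch of the idea, not the library's `_converged`; the threshold argument and norm choice are assumptions.

```python
import numpy as np

def converged(Mu, Sigma, Mu_new, Sigma_new, threshold=1e-8):
    # Stop when both the mean and the covariance updates are
    # smaller (in Frobenius norm) than the threshold.
    return bool(np.linalg.norm(Mu_new - Mu) < threshold
                and np.linalg.norm(Sigma_new - Sigma) < threshold)
```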
- _em(X_reconstructed: ndarray, Mu: ndarray, Sigma: ndarray, observed: ndarray, missing: ndarray) Tuple[ndarray, ndarray, ndarray]
The EM step.
- Parameters:
X_reconstructed – np.ndarray The current imputation approximation.
Mu – np.ndarray The previous value of the mean.
Sigma – np.ndarray The previous value of the variance.
observed – np.ndarray Mask of the observed values in the original input.
missing – np.ndarray Mask of the missing values in the original input.
- Returns:
ndarray: The new approximation of the mean. ndarray: The new approximation of the variance. ndarray: The new imputed dataset.
- Return type:
Tuple[ndarray, ndarray, ndarray]
- _impute_em(X: ndarray) ndarray
The EM imputation core loop.
- Parameters:
X – np.ndarray The dataset with missing values.
- Raises:
RuntimeError – raised if the static checks on the final result fail.
- Returns:
The dataset with imputed values.
- Return type:
ndarray
- fit_transform(**kwargs: Any) Any
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- class EMPlugin(random_state: int = 0, maxit: int = 500, convergence_threshold: float = 1e-08)
Bases:
ImputerPlugin
Imputation plugin for completing missing values using the EM strategy.
The EM algorithm is an optimization algorithm that assumes a distribution for the partially missing data and tries to maximize the expected complete data log-likelihood under that distribution.
- Steps:
For an input dataset X with missing values, we assume that the values are sampled from distribution N(Mu, Sigma).
We generate the “observed” and “missing” masks from X, and choose some initial values for Mu = Mu0 and Sigma = Sigma0.
The EM loop iteratively refines the (Mu, Sigma) estimates using the conditional distribution of the missing components given the observed ones.
The E step finds the conditional expectation of the “missing” data, given the observed values and current estimates of the parameters. These expectations are then substituted for the “missing” data.
In the M step, maximum likelihood estimates of the parameters are computed as though the missing data had been filled in.
The X_reconstructed contains the approximation after each iteration.
- Parameters:
maxit – int, default=500 Maximum number of imputation rounds to perform.
convergence_threshold – float, default=1e-08 Minimum ratio difference between iterations before stopping.
Example
>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("EM")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])
Reference: “Maximum Likelihood from Incomplete Data via the EM Algorithm”, A. P. Dempster, N. M. Laird and D. B. Rubin
- _fit(**kwargs: Any) Any
- _transform(**kwargs: Any) Any
- module_relative_path: Optional[Path]
- static name() str