hyperimpute.plugins.imputers.plugin_EM module

class EM(maxit: int = 500, convergence_threshold: float = 1e-08)

Bases: TransformerMixin

The EM algorithm is an optimization algorithm that assumes a distribution for the partially missing data and tries to maximize the expected complete data log-likelihood under that distribution.

Steps:
  1. For an input dataset X with missing values, we assume that the values are sampled from a multivariate normal distribution N(Mu, Sigma).

  2. We generate the “observed” and “missing” masks from X, and choose some initial values for Mu = Mu0 and Sigma = Sigma0.

  3. The EM loop iteratively refines the (Mu, Sigma) estimates using the conditional distribution of the missing components given the observed ones.

  4. The E step finds the conditional expectation of the “missing” data, given the observed values and current estimates of the parameters. These expectations are then substituted for the “missing” data.

  5. In the M step, maximum likelihood estimates of the parameters are computed as though the missing data had been filled in.

  6. The X_reconstructed contains the approximation after each iteration.
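The steps above can be sketched in NumPy as follows. This is a minimal illustration under the multivariate-normal assumption, not the library's implementation; the function name `em_impute` is hypothetical:

```python
import numpy as np

def em_impute(X, maxit=500, convergence_threshold=1e-8):
    """Minimal EM imputation sketch for data assumed to follow N(Mu, Sigma)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)  # step 2: the "missing" mask; observed is its complement
    # Step 2: initial guesses -- fill with column means, then estimate Mu0, Sigma0.
    X_rec = np.where(missing, np.nanmean(X, axis=0), X)
    Mu, Sigma = X_rec.mean(axis=0), np.cov(X_rec, rowvar=False)
    for _ in range(maxit):  # step 3: the EM loop
        for i in range(X.shape[0]):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            # Step 4 (E step): conditional expectation of the missing block
            # given the observed values and the current (Mu, Sigma).
            Soo_inv = np.linalg.pinv(Sigma[np.ix_(o, o)])
            X_rec[i, m] = Mu[m] + Sigma[np.ix_(m, o)] @ Soo_inv @ (X_rec[i, o] - Mu[o])
        # Step 5 (M step): maximum-likelihood re-estimates as though complete.
        Mu_new, Sigma_new = X_rec.mean(axis=0), np.cov(X_rec, rowvar=False)
        delta = max(np.abs(Mu_new - Mu).max(), np.abs(Sigma_new - Sigma).max())
        Mu, Sigma = Mu_new, Sigma_new
        if delta < convergence_threshold:
            break
    return X_rec  # step 6: the reconstruction after the final iteration
```

Observed entries are never overwritten; only the positions flagged in the missing mask are updated on each pass.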

Parameters:
  • maxit – int, default=500. Maximum number of imputation rounds to perform.

  • convergence_threshold – float, default=1e-08. Minimum ratio difference between iterations before stopping.

Paper: “Maximum Likelihood from Incomplete Data via the EM Algorithm”, A. P. Dempster, N. M. Laird and D. B. Rubin

_converged(Mu: ndarray, Sigma: ndarray, Mu_new: ndarray, Sigma_new: ndarray) bool

Checks if the EM loop has converged.

Parameters:
  • Mu – np.ndarray The previous estimate of the mean.

  • Sigma – np.ndarray The previous estimate of the covariance.

  • Mu_new – np.ndarray The new estimate of the mean.

  • Sigma_new – np.ndarray The new estimate of the covariance.

Returns:

True if the algorithm has converged, False otherwise.

Return type:

bool
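A convergence check of this shape can be sketched as follows. This is an illustration of one plausible criterion; the library's exact test may differ:

```python
import numpy as np

def converged(Mu, Sigma, Mu_new, Sigma_new, threshold=1e-8):
    # Converged once neither the mean nor the covariance estimate
    # moved by more than the threshold in any component.
    return bool(np.abs(Mu_new - Mu).max() < threshold
                and np.abs(Sigma_new - Sigma).max() < threshold)
```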

_em(X_reconstructed: ndarray, Mu: ndarray, Sigma: ndarray, observed: ndarray, missing: ndarray) Tuple[ndarray, ndarray, ndarray]

The EM step.

Parameters:
  • X_reconstructed – np.ndarray The current imputation approximation.

  • Mu – np.ndarray The previous value of the mean.

  • Sigma – np.ndarray The previous estimate of the covariance.

  • observed – np.ndarray Mask of the observed values in the original input.

  • missing – np.ndarray Mask of the missing values in the original input.

Returns:

A tuple containing the new estimate of the mean, the new estimate of the covariance, and the new imputed dataset.

Return type:

Tuple[ndarray, ndarray, ndarray]

_impute_em(X: ndarray) ndarray

The EM imputation core loop.

Parameters:

X – np.ndarray The dataset with missing values.

Raises:

RuntimeError – raised if the static checks on the final result fail.

Returns:

The dataset with imputed values.

Return type:

ndarray
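The final static checks mentioned above might look like the following sketch. The function name `validate_imputation` and the exact checks are assumptions for illustration; the library's validation may differ:

```python
import numpy as np

def validate_imputation(X_imputed):
    # Static check on the final result: every value must be finite,
    # i.e. no NaNs or infinities left over from the imputation loop.
    if not np.isfinite(X_imputed).all():
        raise RuntimeError("EM imputation produced non-finite values")
    return X_imputed
```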

fit_transform(**kwargs: Any) Any

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)
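Because `EM` derives from scikit-learn's `TransformerMixin`, `fit_transform(X)` follows the usual transformer contract: `fit(X, y)` followed by `transform(X)` in a single call. A minimal illustration of that contract with a stand-in transformer (the `MeanImputer` class here is hypothetical, not part of hyperimpute):

```python
import numpy as np
from sklearn.base import TransformerMixin

class MeanImputer(TransformerMixin):
    """Stand-in transformer: fills NaNs with per-column means."""

    def fit(self, X, y=None):
        # Learn per-column means, ignoring missing entries.
        self.means_ = np.nanmean(np.asarray(X, dtype=float), axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.where(np.isnan(X), self.means_, X)

X = np.array([[1.0, np.nan], [3.0, 4.0]])
X_new = MeanImputer().fit_transform(X)  # TransformerMixin supplies fit_transform
```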

class EMPlugin(random_state: int = 0, maxit: int = 500, convergence_threshold: float = 1e-08)

Bases: ImputerPlugin

Imputation plugin for completing missing values using the EM strategy.

The EM algorithm is an optimization algorithm that assumes a distribution for the partially missing data and tries to maximize the expected complete data log-likelihood under that distribution.

Steps:
  1. For an input dataset X with missing values, we assume that the values are sampled from a multivariate normal distribution N(Mu, Sigma).

  2. We generate the “observed” and “missing” masks from X, and choose some initial values for Mu = Mu0 and Sigma = Sigma0.

  3. The EM loop iteratively refines the (Mu, Sigma) estimates using the conditional distribution of the missing components given the observed ones.

  4. The E step finds the conditional expectation of the “missing” data, given the observed values and current estimates of the parameters. These expectations are then substituted for the “missing” data.

  5. In the M step, maximum likelihood estimates of the parameters are computed as though the missing data had been filled in.

  6. The X_reconstructed contains the approximation after each iteration.

Parameters:
  • maxit – int, default=500. Maximum number of imputation rounds to perform.

  • convergence_threshold – float, default=1e-08. Minimum ratio difference between iterations before stopping.

Example

>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("EM")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])

Reference: “Maximum Likelihood from Incomplete Data via the EM Algorithm”, A. P. Dempster, N. M. Laird and D. B. Rubin

_fit(**kwargs: Any) Any
_transform(**kwargs: Any) Any
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
module_relative_path: Optional[Path]
static name() str
plugin

alias of EMPlugin