hyperimpute.plugins.imputers.plugin_EM module

class EM(maxit: int = 500, convergence_threshold: float = 1e-08)

Bases: TransformerMixin

The EM algorithm is an optimization algorithm that assumes a distribution for the partially missing data and tries to maximize the expected complete data log-likelihood under that distribution.

Steps:
  1. For an input dataset X with missing values, we assume that the values are sampled from a multivariate normal distribution N(Mu, Sigma).

  2. We generate the “observed” and “missing” masks from X, and choose some initial values for Mu = Mu0 and Sigma = Sigma0.

  3. The EM loop iteratively refines the (Mu, Sigma) estimates using the conditional distribution of the missing components given the observed ones.

  4. The E step finds the conditional expectation of the “missing” data, given the observed values and current estimates of the parameters. These expectations are then substituted for the “missing” data.

  5. In the M step, maximum likelihood estimates of the parameters are computed as though the missing data had been filled in.

  6. The X_reconstructed contains the approximation after each iteration.
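The steps above can be sketched in NumPy as follows. This is a minimal illustration under the multivariate-normal assumption, not the library's implementation; the function name `em_impute` is hypothetical:

```python
import numpy as np

def em_impute(X, maxit=500, convergence_threshold=1e-8):
    """Minimal EM imputation sketch for data assumed to follow N(Mu, Sigma)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)  # step 2: the "missing" mask; observed is its complement
    # Step 2: initial guesses -- fill with column means, then estimate Mu0, Sigma0.
    X_rec = np.where(missing, np.nanmean(X, axis=0), X)
    Mu, Sigma = X_rec.mean(axis=0), np.cov(X_rec, rowvar=False)
    for _ in range(maxit):  # step 3: the EM loop
        for i in range(X.shape[0]):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            # Step 4 (E step): conditional expectation of the missing block
            # given the observed values and the current (Mu, Sigma).
            Soo_inv = np.linalg.pinv(Sigma[np.ix_(o, o)])
            X_rec[i, m] = Mu[m] + Sigma[np.ix_(m, o)] @ Soo_inv @ (X_rec[i, o] - Mu[o])
        # Step 5 (M step): maximum-likelihood re-estimates as though complete.
        Mu_new, Sigma_new = X_rec.mean(axis=0), np.cov(X_rec, rowvar=False)
        delta = max(np.abs(Mu_new - Mu).max(), np.abs(Sigma_new - Sigma).max())
        Mu, Sigma = Mu_new, Sigma_new
        if delta < convergence_threshold:
            break
    return X_rec  # step 6: the reconstruction after the final iteration
```

Observed entries are never overwritten; only the positions flagged in the missing mask are updated on each pass.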

Parameters:
  • maxit – int, default=500. Maximum number of imputation rounds to perform.

  • convergence_threshold – float, default=1e-08. Minimum ratio difference between iterations before stopping.

Paper: “Maximum Likelihood from Incomplete Data via the EM Algorithm”, A. P. Dempster, N. M. Laird and D. B. Rubin

_converged(Mu: ndarray, Sigma: ndarray, Mu_new: ndarray, Sigma_new: ndarray) bool

Checks if the EM loop has converged.

Parameters:
  • Mu – np.ndarray The previous estimate of the mean.

  • Sigma – np.ndarray The previous estimate of the covariance.

  • Mu_new – np.ndarray The new estimate of the mean.

  • Sigma_new – np.ndarray The new estimate of the covariance.

Returns:

True if the algorithm has converged, False otherwise.

Return type:

bool
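A convergence check of this shape can be sketched as follows. This is an illustration of one plausible criterion; the library's exact test may differ:

```python
import numpy as np

def converged(Mu, Sigma, Mu_new, Sigma_new, threshold=1e-8):
    # Converged once neither the mean nor the covariance estimate
    # moved by more than the threshold in any component.
    return bool(np.abs(Mu_new - Mu).max() < threshold
                and np.abs(Sigma_new - Sigma).max() < threshold)
```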

_em(X_reconstructed: ndarray, Mu: ndarray, Sigma: ndarray, observed: ndarray, missing: ndarray) Tuple[ndarray, ndarray, ndarray]

The EM step.

Parameters:
  • X_reconstructed – np.ndarray The current imputation approximation.

  • Mu – np.ndarray The previous value of the mean.

  • Sigma – np.ndarray The previous estimate of the covariance.

  • observed – np.ndarray Mask of the observed values in the original input.

  • missing – np.ndarray Mask of the missing values in the original input.

Returns:

A tuple containing the new estimate of the mean, the new estimate of the covariance, and the new imputed dataset.

Return type:

Tuple[ndarray, ndarray, ndarray]

_impute_em(X: ndarray) ndarray

The EM imputation core loop.

Parameters:

X – np.ndarray The dataset with missing values.

Raises:

RuntimeError – raised if the static checks on the final result fail.

Returns:

The dataset with imputed values.

Return type:

ndarray
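The final static checks mentioned above might look like the following sketch. The function name `validate_imputation` and the exact checks are assumptions for illustration; the library's validation may differ:

```python
import numpy as np

def validate_imputation(X_imputed):
    # Static check on the final result: every value must be finite,
    # i.e. no NaNs or infinities left over from the imputation loop.
    if not np.isfinite(X_imputed).all():
        raise RuntimeError("EM imputation produced non-finite values")
    return X_imputed
```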

fit_transform(**kwargs: Any) Any

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)
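Because `EM` derives from scikit-learn's `TransformerMixin`, `fit_transform(X)` follows the usual transformer contract: `fit(X, y)` followed by `transform(X)` in a single call. A minimal illustration of that contract with a stand-in transformer (the `MeanImputer` class here is hypothetical, not part of hyperimpute):

```python
import numpy as np
from sklearn.base import TransformerMixin

class MeanImputer(TransformerMixin):
    """Stand-in transformer: fills NaNs with per-column means."""

    def fit(self, X, y=None):
        # Learn per-column means, ignoring missing entries.
        self.means_ = np.nanmean(np.asarray(X, dtype=float), axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.where(np.isnan(X), self.means_, X)

X = np.array([[1.0, np.nan], [3.0, 4.0]])
X_new = MeanImputer().fit_transform(X)  # TransformerMixin supplies fit_transform
```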

class EMPlugin(random_state: int = 0, maxit: int = 500, convergence_threshold: float = 1e-08)

Bases: ImputerPlugin

Imputation plugin for completing missing values using the EM strategy.

The EM algorithm is an optimization algorithm that assumes a distribution for the partially missing data and tries to maximize the expected complete data log-likelihood under that distribution.

Steps:
  1. For an input dataset X with missing values, we assume that the values are sampled from a multivariate normal distribution N(Mu, Sigma).

  2. We generate the “observed” and “missing” masks from X, and choose some initial values for Mu = Mu0 and Sigma = Sigma0.

  3. The EM loop iteratively refines the (Mu, Sigma) estimates using the conditional distribution of the missing components given the observed ones.

  4. The E step finds the conditional expectation of the “missing” data, given the observed values and current estimates of the parameters. These expectations are then substituted for the “missing” data.

  5. In the M step, maximum likelihood estimates of the parameters are computed as though the missing data had been filled in.

  6. The X_reconstructed contains the approximation after each iteration.

Parameters:
  • maxit – int, default=500. Maximum number of imputation rounds to perform.

  • convergence_threshold – float, default=1e-08. Minimum ratio difference between iterations before stopping.

Example

>>> import numpy as np
>>> from hyperimpute.plugins.imputers import Imputers
>>> plugin = Imputers().get("EM")
>>> plugin.fit_transform([[1, 1, 1, 1], [np.nan, np.nan, np.nan, np.nan], [1, 2, 2, 1], [2, 2, 2, 2]])

Reference: “Maximum Likelihood from Incomplete Data via the EM Algorithm”, A. P. Dempster, N. M. Laird and D. B. Rubin

_fit(**kwargs: Any) Any
_transform(**kwargs: Any) Any
static hyperparameter_space(*args: Any, **kwargs: Any) List[Params]
module_relative_path: Optional[Path]
static name() str
plugin

alias of EMPlugin