hyperimpute.plugins.utils.simulate module

Original code: https://rmisstastic.netlify.app/how-to/python/generate_html/how%20to%20generate%20missing%20values

MAR_mask(X: ndarray, p: float, p_obs: float, sample_columns: bool = True) → ndarray

Missing at random mechanism with a logistic masking model. First, a subset of variables with no missing values is randomly selected. The remaining variables have missing values according to a logistic model with random weights, re-scaled so as to attain the desired proportion of missing values on those variables.

Parameters:
  • X – Data for which missing values will be simulated.

  • p – Proportion of missing values to generate for variables which will have missing values.

  • p_obs – Proportion of variables with no missing values that will be used for the logistic masking model.

Returns:

Mask of generated missing values (True if the value is missing).

Return type:

ndarray
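The MAR mechanism described above can be sketched with NumPy alone. This is a minimal illustration, not the module's implementation: the names mar_mask_sketch, fit_intercept and sigmoid are hypothetical, the standardization of the logistic scores is a simplifying assumption, and the intercept is found by a hand-written bisection.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_intercept(scores, p, lo=-50.0, hi=50.0, iters=100):
    """Bisection for b such that mean(sigmoid(scores + b)) is close to p."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sigmoid(scores + mid).mean() > p:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def mar_mask_sketch(X, p, p_obs, seed=0):
    """Hypothetical sketch of a MAR logistic mask (True means missing)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    d_obs = max(int(p_obs * d), 1)  # variables that stay fully observed
    idxs_obs = rng.choice(d, d_obs, replace=False)
    idxs_nas = np.setdiff1d(np.arange(d), idxs_obs)

    mask = np.zeros((n, d), dtype=bool)
    for j in idxs_nas:
        w = rng.standard_normal(d_obs)            # random logistic weights
        scores = X[:, idxs_obs] @ w
        scores = scores / (scores.std() + 1e-12)  # assumption: unit-variance scores
        b = fit_intercept(scores, p)              # intercept giving rate ~ p
        mask[:, j] = rng.random(n) < sigmoid(scores + b)
    return mask
```

Because the missingness probabilities depend only on the fully observed columns, the resulting mask is MAR by construction.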

MNAR_mask_logistic(X: ndarray, p: float, p_params: float = 0.3, exclude_inputs: bool = True) → ndarray

Missing not at random mechanism with a logistic masking model. It implements two mechanisms: (i) Missing probabilities are selected with a logistic model, taking all variables as inputs. Hence, values that are inputs can also be missing. (ii) Variables are split into a set of inputs for a logistic model and a set whose missing probabilities are determined by the logistic model. The inputs are then masked MCAR (hence, missing values from the second set will depend on masked values). In either case, weights are random and the intercept is selected to attain the desired proportion of missing values.

Parameters:
  • X – Data for which missing values will be simulated.

  • p – Proportion of missing values to generate for variables which will have missing values.

  • p_params – Proportion of variables that will be used for the logistic masking model (only if exclude_inputs).

  • exclude_inputs – If True, mechanism (ii) is used; if False, mechanism (i) is used.

Returns:

Mask of generated missing values (True if the value is missing).

Return type:

ndarray
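The two mechanisms can be contrasted in a short NumPy sketch. The function names below (mnar_logistic_sketch, fit_intercepts_sketch) are hypothetical, and the score standardization and bisection bounds are assumptions made for illustration rather than the module's actual choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_intercepts_sketch(scores, p, iters=60):
    """Column-wise bisection: b[j] with mean(sigmoid(scores[:, j] + b[j])) ~ p."""
    lo = np.full(scores.shape[1], -50.0)
    hi = np.full(scores.shape[1], 50.0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        too_high = sigmoid(scores + mid).mean(axis=0) > p
        hi = np.where(too_high, mid, hi)
        lo = np.where(too_high, lo, mid)
    return 0.5 * (lo + hi)

def mnar_logistic_sketch(X, p, p_params=0.3, exclude_inputs=True, seed=0):
    """Hypothetical sketch of mechanisms (i) and (ii); True means missing."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    if exclude_inputs:
        # Mechanism (ii): a subset of variables feeds the logistic model.
        d_params = max(int(p_params * d), 1)
        idxs_params = rng.choice(d, d_params, replace=False)
        idxs_nas = np.setdiff1d(np.arange(d), idxs_params)
    else:
        # Mechanism (i): every variable is both an input and a target.
        idxs_params = np.arange(d)
        idxs_nas = np.arange(d)

    W = rng.standard_normal((len(idxs_params), len(idxs_nas)))
    scores = X[:, idxs_params] @ W
    scores = scores / (scores.std(axis=0) + 1e-12)
    b = fit_intercepts_sketch(scores, p)

    mask = np.zeros((n, d), dtype=bool)
    mask[:, idxs_nas] = rng.random((n, len(idxs_nas))) < sigmoid(scores + b)
    if exclude_inputs:
        # The inputs themselves are then masked completely at random.
        mask[:, idxs_params] = rng.random((n, len(idxs_params))) < p
    return mask
```

In both branches the values that drive the missingness can themselves be masked, which is what makes the mechanism MNAR rather than MAR.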

MNAR_mask_quantiles(X: ndarray, p: float, q: float, p_params: float, cut: str = 'both', MCAR: bool = False) → ndarray

Missing not at random mechanism with quantile censorship. First, a subset of variables which will have missing values is randomly selected. Then, missing values are generated at random within the q-quantile tails of those variables. Since missingness depends on quantile information, it depends on masked values, hence this is a MNAR mechanism.

Parameters:
  • X – Data for which missing values will be simulated.

  • p – Proportion of missing values to generate for variables which will have missing values.

  • q – Quantile level at which the cuts should occur

  • p_params – Proportion of variables that will have missing values

  • cut – ‘both’, ‘upper’ or ‘lower’. Where the cut should be applied. For instance, if q=0.25 and cut=‘upper’, then missing values will be generated in the upper quartile of the selected variables.

  • MCAR – If true, masks variables that were not selected for quantile censorship with a MCAR mechanism.

Returns:

Mask of generated missing values (True if the value is missing).

Return type:

ndarray
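Quantile censorship can be illustrated as follows. This is a sketch under stated assumptions: mnar_quantile_sketch is a hypothetical name, the MCAR option for the unselected variables is left out, and np.quantile stands in for whatever quantile routine the module uses.

```python
import numpy as np

def mnar_quantile_sketch(X, p, q, p_params, cut="both", seed=0):
    """Hypothetical sketch of quantile censorship (True means missing)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    d_na = max(int(p_params * d), 1)  # variables that will receive missing values
    idxs_na = rng.choice(d, d_na, replace=False)
    sub = X[:, idxs_na]

    lower = np.quantile(sub, q, axis=0)
    upper = np.quantile(sub, 1 - q, axis=0)
    if cut == "upper":
        in_tail = sub >= upper
    elif cut == "lower":
        in_tail = sub <= lower
    elif cut == "both":
        in_tail = (sub <= lower) | (sub >= upper)
    else:
        raise ValueError("cut must be 'both', 'upper' or 'lower'")

    # Inside the selected tail(s), each entry goes missing with probability p.
    mask = np.zeros((n, d), dtype=bool)
    mask[:, idxs_na] = in_tail & (rng.random((n, d_na)) < p)
    return mask
```

Note that in this sketch the overall missing rate on a selected variable is roughly p times the tail mass (q for 'upper' or 'lower', 2q for 'both'), not p itself.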

MNAR_self_mask_logistic(X: ndarray, p: float) → ndarray

Missing not at random mechanism with a logistic self-masking model. Each variable's missingness probabilities are given by a logistic model that takes the same variable as input (hence, missingness is independent from one variable to another). The intercepts are selected to attain the desired missing rate.

Parameters:
  • X – Data for which missing values will be simulated.

  • p – Proportion of missing values to generate for variables which will have missing values.

Returns:

Mask of generated missing values (True if the value is missing).

Return type:

ndarray
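A self-masking mechanism can be sketched as below. The name self_mask_sketch is hypothetical, and the fixed positive slope is a simplification chosen for illustration (the description above does not say how the coefficients are picked); each column gets its own intercept, found by bisection.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def self_mask_sketch(X, p, slope=1.0, seed=0):
    """Hypothetical sketch: each value's missingness depends on that value only."""
    rng = np.random.default_rng(seed)
    # Assumption: standardize each column and use one fixed slope.
    Z = slope * (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

    # Column-wise bisection for the intercepts that give a missing rate ~ p.
    lo = np.full(X.shape[1], -50.0)
    hi = np.full(X.shape[1], 50.0)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        too_high = sigmoid(Z + mid).mean(axis=0) > p
        hi = np.where(too_high, mid, hi)
        lo = np.where(too_high, lo, mid)

    probs = sigmoid(Z + 0.5 * (lo + hi))
    return rng.random(X.shape) < probs
```

With a positive slope, larger values are more likely to be masked, so the mean of the observed entries in each column tends to fall below the mean of the complete column. This is the kind of bias that makes self-masked MNAR data hard to impute.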

fit_intercepts(X: ndarray, coeffs: ndarray, p: float, self_mask: bool = False) → ndarray

pick_coeffs(X: ndarray, idxs_obs: List[int] = [], idxs_nas: List[int] = [], self_mask: bool = False) → ndarray

simulate_nan(X: ndarray, p_miss: float, mecha: str = 'MCAR', opt: str = 'logistic', p_obs: float = 0.5, q: float = 0, sample_columns: bool = True) → dict

Generate missing values for a specific missing-data mechanism and proportion of missing values.

Parameters:
  • X – Data for which missing values will be simulated.

  • p_miss – Proportion of missing values to generate for variables which will have missing values.

  • mecha – Missing-data mechanism to be used: “MCAR” (default), “MAR”, “MNAR” or “MNARsmask”.

  • opt – For mecha = “MNAR”, it indicates how the missing-data mechanism is generated: using a logistic regression (“logistic”), a quantile censorship (“quantile”) or logistic regression for generating a self-masked MNAR mechanism (“selfmasked”).

  • p_obs – If mecha = “MAR”, or mecha = “MNAR” with opt = “logistic” or “quantile”, proportion of variables with no missing values that will be used for the logistic masking model.

  • q – If mecha = “MNAR” and opt = “quantile”, quantile level at which the cuts should occur.

Returns:

A dictionary containing:

  • ‘X_init’: the initial data matrix.

  • ‘X_incomp’: the data with the generated missing values.

  • ‘mask’: a matrix indexing the generated missing values.

Return type:

dict
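To make the shape of the returned dictionary concrete, here is a minimal stand-in for the simplest (MCAR) case. The function simulate_mcar_sketch is hypothetical and is not the module's simulate_nan; it only mirrors the three documented keys and assumes that missing entries are encoded as NaN.

```python
import numpy as np

def simulate_mcar_sketch(X, p_miss, seed=0):
    """Hypothetical MCAR stand-in returning the three documented keys."""
    rng = np.random.default_rng(seed)
    # MCAR: every entry is masked independently with probability p_miss.
    mask = rng.random(X.shape) < p_miss
    X_incomp = X.astype(float)  # astype returns a copy; float dtype can hold NaN
    X_incomp[mask] = np.nan
    return {"X_init": X, "X_incomp": X_incomp, "mask": mask}

X = np.arange(20, dtype=float).reshape(5, 4)
out = simulate_mcar_sketch(X, p_miss=0.3)
# The NaN pattern of the incomplete data matches the mask exactly.
print(np.array_equal(np.isnan(out["X_incomp"]), out["mask"]))
```

Keeping the original matrix alongside the mask makes it straightforward to score an imputation only on the entries that were artificially removed.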