hyperimpute.plugins.utils.simulate module
Original code: https://rmisstastic.netlify.app/how-to/python/generate_html/how%20to%20generate%20missing%20values
- MAR_mask(X: ndarray, p: float, p_obs: float, sample_columns: bool = True) → ndarray
Missing at random mechanism with a logistic masking model. First, a subset of variables with no missing values is randomly selected. The remaining variables have missing values according to a logistic model with random weights, re-scaled so as to attain the desired proportion of missing values on those variables.
- Parameters:
X – Data for which missing values will be simulated.
p – Proportion of missing values to generate for variables which will have missing values.
p_obs – Proportion of variables with no missing values that will be used for the logistic masking model.
- Returns:
Mask of generated missing values (True if the value is missing).
- Return type:
mask
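The MAR mechanism described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the hyperimpute implementation: the names `sigmoid`, `fit_intercept` and `mar_mask_sketch` are hypothetical, and the bisection search for the intercept and the standardization of the logits are choices made for this sketch only.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_intercept(z, p, lo=-50.0, hi=50.0, iters=60):
    """Find b such that mean(sigmoid(z + b)) is close to p.

    The mean is increasing in b, so a plain bisection is enough here
    (an assumption of this sketch, not necessarily what the library does).
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sigmoid(z + mid).mean() > p:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def mar_mask_sketch(X, p, p_obs, seed=0):
    """Hypothetical MAR mask: True marks a missing value."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Columns that stay fully observed and drive the logistic model.
    d_obs = max(int(p_obs * d), 1)
    idxs_obs = rng.choice(d, d_obs, replace=False)
    idxs_nas = np.setdiff1d(np.arange(d), idxs_obs)

    mask = np.zeros((n, d), dtype=bool)
    for j in idxs_nas:
        w = rng.standard_normal(d_obs)   # random weights
        z = X[:, idxs_obs] @ w
        z = z / z.std()                  # rescale the logits
        b = fit_intercept(z, p)          # intercept giving rate p
        mask[:, j] = rng.random(n) < sigmoid(z + b)
    return mask, idxs_obs


rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 6))
mask, idxs_obs = mar_mask_sketch(X, p=0.3, p_obs=0.5)
```

In this sketch the columns in `idxs_obs` never receive missing values, so the missingness of the remaining columns depends only on observed data, which is what makes the mechanism MAR.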
- MNAR_mask_logistic(X: ndarray, p: float, p_params: float = 0.3, exclude_inputs: bool = True) → ndarray
Missing not at random mechanism with a logistic masking model. It implements two mechanisms: (i) Missing probabilities are selected with a logistic model, taking all variables as inputs. Hence, values that are inputs can also be missing. (ii) Variables are split into a set of inputs for a logistic model, and a set whose missing probabilities are determined by the logistic model. The inputs are then masked MCAR (hence, missing values from the second set will depend on masked values). In either case, weights are random and the intercept is selected to attain the desired proportion of missing values.
- Parameters:
X – Data for which missing values will be simulated.
p – Proportion of missing values to generate for variables which will have missing values.
p_params – Proportion of variables that will be used for the logistic masking model (only if exclude_inputs).
exclude_inputs – True: mechanism (ii) is used, False: (i)
- Returns:
Mask of generated missing values (True if the value is missing).
- Return type:
mask
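The two mechanisms can be contrasted with a short NumPy sketch. It is an illustration under stated assumptions rather than the library code: `mnar_logistic_sketch` and its helpers are hypothetical names, the intercept is found by bisection, and the logits are standardized for convenience.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_intercept(z, p, lo=-50.0, hi=50.0, iters=60):
    # Bisection on b so that mean(sigmoid(z + b)) is close to p.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sigmoid(z + mid).mean() > p:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def mnar_logistic_sketch(X, p, p_params=0.3, exclude_inputs=True, seed=0):
    """Hypothetical MNAR logistic mask: True marks a missing value."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    if exclude_inputs:
        # Mechanism (ii): separate input columns from masked columns.
        d_params = max(int(p_params * d), 1)
        idxs_params = rng.choice(d, d_params, replace=False)
        idxs_nas = np.setdiff1d(np.arange(d), idxs_params)
    else:
        # Mechanism (i): every column is both an input and a target.
        idxs_params = np.arange(d)
        idxs_nas = np.arange(d)

    mask = np.zeros((n, d), dtype=bool)
    for j in idxs_nas:
        w = rng.standard_normal(len(idxs_params))  # random weights
        z = X[:, idxs_params] @ w
        z = z / z.std()
        mask[:, j] = rng.random(n) < sigmoid(z + fit_intercept(z, p))

    if exclude_inputs:
        # The inputs are themselves masked completely at random, so the
        # other columns end up depending on values that may be missing.
        mask[:, idxs_params] = rng.random((n, len(idxs_params))) < p
    return mask


rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 10))
mask_ii = mnar_logistic_sketch(X, p=0.3, exclude_inputs=True)
mask_i = mnar_logistic_sketch(X, p=0.3, exclude_inputs=False)
```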
- MNAR_mask_quantiles(X: ndarray, p: float, q: float, p_params: float, cut: str = 'both', MCAR: bool = False) → ndarray
Missing not at random mechanism with quantile censorship. First, a subset of variables which will have missing values is randomly selected. Then, missing values are generated on the q-quantiles at random. Since missingness depends on quantile information, it depends on masked values, hence this is a MNAR mechanism.
- Parameters:
X – Data for which missing values will be simulated.
p – Proportion of missing values to generate for variables which will have missing values.
q – Quantile level at which the cuts should occur.
p_params – Proportion of variables that will have missing values.
cut – ‘both’, ‘upper’ or ‘lower’. Where the cut should be applied. For instance, if q=0.25 and cut=‘upper’, then missing values will be generated in the upper quartile of each selected variable.
MCAR – If true, masks variables that were not selected for quantile censorship with a MCAR mechanism.
- Returns:
Mask of generated missing values (True if the value is missing).
- Return type:
mask
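Quantile censorship can be sketched as follows. This is a hedged illustration, not the library code: `mnar_quantile_sketch` is a hypothetical name, the MCAR option is left out, and the use of `np.quantile` with its default interpolation is an assumption of this sketch.

```python
import numpy as np


def mnar_quantile_sketch(X, p, q, p_params, cut="both", seed=0):
    """Hypothetical quantile-censoring mask: True marks a missing value."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Columns selected to receive missing values.
    d_na = max(int(p_params * d), 1)
    idxs_na = rng.choice(d, d_na, replace=False)

    cols = X[:, idxs_na]
    lower = np.quantile(cols, q, axis=0)
    upper = np.quantile(cols, 1 - q, axis=0)
    if cut == "upper":
        eligible = cols >= upper
    elif cut == "lower":
        eligible = cols <= lower
    elif cut == "both":
        eligible = (cols <= lower) | (cols >= upper)
    else:
        raise ValueError("cut must be 'both', 'upper' or 'lower'")

    # Only entries in the selected tail(s) can be masked, each with
    # probability p.
    mask = np.zeros((n, d), dtype=bool)
    mask[:, idxs_na] = eligible & (rng.random((n, d_na)) < p)
    return mask, idxs_na


rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 4))
mask, idxs_na = mnar_quantile_sketch(X, p=0.5, q=0.25, p_params=0.5, cut="upper")
```

In this sketch only the eligible tail is sampled, so with a one-sided cut the realized missing rate in a selected column is roughly p times q rather than p.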
- MNAR_self_mask_logistic(X: ndarray, p: float) → ndarray
Missing not at random mechanism with a logistic self-masking model. Each variable has missing-value probabilities given by a logistic model that takes the same variable as input (hence, missingness is independent from one variable to another). The intercepts are selected to attain the desired missing rate.
- Parameters:
X – Data for which missing values will be simulated.
p – Proportion of missing values to generate for variables which will have missing values.
- Returns:
Mask of generated missing values (True if the value is missing).
- Return type:
mask
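Self-masking can be sketched column by column. The example below is an illustration under stated assumptions, not the hyperimpute implementation: `self_mask_sketch` is a hypothetical name, the slope is fixed at +1 on the standardized column (so larger values are more likely to be missing), and the intercept is found by bisection.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_intercept(z, p, lo=-50.0, hi=50.0, iters=60):
    # Bisection on b so that mean(sigmoid(z + b)) is close to p.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sigmoid(z + mid).mean() > p:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def self_mask_sketch(X, p, seed=0):
    """Hypothetical self-masking MNAR mask: True marks a missing value."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mask = np.zeros((n, d), dtype=bool)
    for j in range(d):
        col = X[:, j]
        # The logit of column j depends on column j only.
        z = (col - col.mean()) / col.std()
        b = fit_intercept(z, p)
        mask[:, j] = rng.random(n) < sigmoid(z + b)
    return mask


rng = np.random.default_rng(3)
X = rng.standard_normal((5000, 3))
mask = self_mask_sketch(X, p=0.2)
```

Because each column's missingness is driven by its own values, the masked entries in this sketch have a higher mean than the observed ones, which is the signature of a self-masked MNAR mechanism.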
- fit_intercepts(X: ndarray, coeffs: ndarray, p: float, self_mask: bool = False) → ndarray
- pick_coeffs(X: ndarray, idxs_obs: List[int] = [], idxs_nas: List[int] = [], self_mask: bool = False) → ndarray
- simulate_nan(X: ndarray, p_miss: float, mecha: str = 'MCAR', opt: str = 'logistic', p_obs: float = 0.5, q: float = 0, sample_columns: bool = True) → dict
Generate missing values for a specific missing-data mechanism and proportion of missing values.
- Parameters:
X – Data for which missing values will be simulated.
p_miss – Proportion of missing values to generate for variables which will have missing values.
mecha – Indicates the missing-data mechanism to be used: “MCAR” (default), “MAR”, “MNAR” or “MNARsmask”.
opt – For mecha = “MNAR”, it indicates how the missing-data mechanism is generated: using a logistic regression (“logistic”), a quantile censorship (“quantile”) or logistic regression for generating a self-masked MNAR mechanism (“selfmasked”).
p_obs – If mecha = “MAR”, or mecha = “MNAR” with opt = “logistic” or “quantile”, proportion of variables with no missing values that will be used for the logistic masking model.
q – If mecha = “MNAR” and opt = “quantile”, quantile level at which the cuts should occur.
- Returns:
A dictionary containing:
‘X_init’: the initial data matrix.
‘X_incomp’: the data with the generated missing values.
‘mask’: a matrix indexing the generated missing values.
- Return type:
dict
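The shape of the returned dictionary can be illustrated with a small stand-in for the default MCAR case. This is a sketch under stated assumptions, not the library function: `simulate_nan_mcar_sketch` is a hypothetical name, and it assumes that missing entries are encoded as NaN and that True in the mask marks a missing value.

```python
import numpy as np


def simulate_nan_mcar_sketch(X, p_miss, seed=0):
    """Hypothetical MCAR-only stand-in returning the documented keys."""
    rng = np.random.default_rng(seed)
    # MCAR: every entry is masked independently with probability p_miss.
    mask = rng.random(X.shape) < p_miss
    X_incomp = X.astype(float)  # astype returns a copy, X is untouched
    X_incomp[mask] = np.nan     # assumption: missing values become NaN
    return {"X_init": X.astype(float), "X_incomp": X_incomp, "mask": mask}


X = np.arange(20, dtype=float).reshape(5, 4)
out = simulate_nan_mcar_sketch(X, p_miss=0.4)

# Entries flagged in the mask are NaN in X_incomp; the rest are unchanged.
n_missing = int(out["mask"].sum())
```

Downstream code would typically pass `out["X_incomp"]` to an imputer and use `out["mask"]` with `out["X_init"]` to score the imputed values on the entries that were removed.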