Page 45 - MSAM-2-1

P. 45

Materials Science in Additive Manufacturing Data imputation strategies of PBF Ti64

variables in a sequential fashion such that prior imputed
values are used as part of the model in predicting
subsequent variables. Hence, each variable can be modeled
conforming to its distribution with continuous variables
modeled using linear regression, while binary variables are
modeled with logistic regression.
To carry out MICE, multiple copies of the dataset have
to be created first. The following steps are then carried out
on each copy of the dataset :
[29]
(i) Missing values for each variable are imputed using
non-missing values from the variable as a placeholder.
(ii) Set the imputed placeholders for one variable back to
missing and model the selected variable as a function
of the other variables. For each variable with missing
values, the IterativeImputer class sets the imputed
values for that variable to missing and models the
Figure 2. Process flow for k-nearest neighbor imputation.
selected variable as a function of the other variables
using ExtraTreesRegressor. The model is trained
are scaled-up . The distance metric used to calculate
[27]
the similarity between samples is the Euclidean distance. on the complete cases, which are the cases where all
When calculating the distance involving missing values, variables are observed.
the coordinates of the missing values are ignored. This (iii) Using the fitted “ExtraTreesRegressor” model, predict,
means that when calculating the distance between two and impute missing values for the selected variable.
samples, only the coordinates where both samples have (iv) Repeat steps (ii) and (iii) for each variable in the
values are considered. The missing values are effectively dataset.
treated as if they do not exist. (v) The imputation cycle is repeated for 10 cycles, with the
imputed values being updated at the end of each cycle.
To account for the missing values, the weights of
the remaining coordinates are scaled-up. This means The imputed copies of each dataset are then analyzed
that the distances between samples are adjusted to and the results combined using rules specific to the
[28]
[30]
account for the missing values, so that samples that are results , calculated using Rubin’s Rules . Rubin’s
similar in the remaining coordinates but have missing Rule states that the estimated variance of the combined
values in different locations are still considered similar. estimate is equal to the average of the within-imputation
The scaling-up of weights is done by multiplying the variance (the variability of the estimate within each
weights of the remaining coordinates by a factor that is imputed dataset) and the between-imputation variance
proportional to the number of non-missing coordinates (the variability of the estimates across the imputed
in the samples being compared. Specifically, for each datasets). To calculate the combined estimate, the point
sample being compared, the weights of the remaining estimates from each imputed dataset are averaged, and the
coordinates are divided by the proportion of non-missing variance is calculated using Rubin’s Rule. This approach
coordinates in that sample. This means that the weights accounts for the uncertainty due to missing data and
of the remaining coordinates are scaled-up by a factor provides estimates that are more accurate than those from
equal to the reciprocal of the proportion of non-missing the traditional complete case analysis.
coordinates in the sample. This adjustment ensures that Imputation was executed using Sckit-learn’s
the distance metric takes into account the missing values IterativeImputer class, with the process flow as shown
in a meaningful way, without allowing the missing values in Figure 3. Its implementation is similar to the R MICE
to dominate the calculation. Each sample’s missing values package but returns only one imputed dataset instead
[28]
are imputed using the mean value from n_neighbors of multiple imputed datasets . The estimator used for
[31]
nearest neighbors, with n_neighbors = 5. the sequential imputation was ExtraTreesRegressor,
which builds an ensemble of regression trees, with
2.1.2. Multivariate imputation by chained equations default hyperparameters. Using ExtraTreesRegressor as
Multivariate imputation by chained equations (MICE) the estimator for the IterativeImputer class, non-linear
[28]
is an imputation technique that iteratively imputes missing relationships between the variables in the dataset can be
data for one variable modeled as a function of the other captured, which can result in improved imputations.

Volume 2 Issue 1 (2023) 4 https://doi.org/10.36922/msam.50

40 41 42 43 44 45 46 47 48 49 50