
Materials Science in Additive Manufacturing                           Data imputation strategies of PBF Ti64



before imputation, as the information is of limited use if there is insufficient data.

  The proportion of missing data is first calculated for each variable, and variables with more than 92% missing data are dropped (Table 1).

  In general, process parameter variables have fewer missing values, as the print parameters are normally reported regardless of the type of mechanical tests being conducted, whereas material property variables have a high number of missing values, as not every study reports the same material properties. The 92% threshold was determined after considering the importance of the variables and the pattern of missing data. It is understood that the accuracy and reliability of the imputations may be lower when a large proportion of the data is missing: in general, imputation methods tend to perform better when the amount of missing data is low and may struggle to accurately impute large amounts of missing data. Therefore, in our case, we attempted alternative methods, such as multiple imputation followed by a median approach, to improve the accuracy. Of the remaining variables, scanning strategy and microstructure are dropped, as they are too varied to be generalized. Duplicate rows in the remaining dataset are then dropped, leaving 401 datapoints.

  There are 18 variables retained: energy density (J/mm³), exposure duration (µs), hatch spacing (µm), laser focus (mm), laser power (W), laser spot (µm), laser type (0 for continuous wavelength [cw], 1 for pulsed wavelength [pw]), layer thickness (µm), point distance (µm), scan speed (mm/s), density (%), elongation (%), microhardness (HV), macrohardness (HV), ultimate tensile strength (MPa), yield strength (MPa), Young's modulus (GPa), and porosity (%). Only two variables, laser power and laser type, have no missing values.

  Cells containing a range entered as a string (e.g., "60 – 180") are replaced with the mean of the range. As the exact value was not given for the variable, using the mean of the range is the only option, although it introduces some degree of uncertainty. Cells with standard deviations (e.g., "0.12 ± 0.03") are replaced so that only the leading numeric value is retained.

2.3. Visualizing relationships in the imputed dataset

After obtaining the imputed dataset using the median of the values produced by the three algorithms, the process-property linkages for SLM Ti64 can be obtained by data mining through a self-organizing map (SOM). A SOM is an unsupervised machine learning model developed by Kohonen that reduces the dimensionality of an input space while maintaining its underlying structure [34]. This is especially useful for visualizing large quantities of high-dimensional data and modeling the relationships between them in a low-dimensional, two-dimensional map, helping to advance the understanding of process-property relationships for materials.

  The implementation of the SOM is from the Python package Tfprop_sompy, developed by Kikugawa and Nishimura, based on the open-source package SOMPY [35]. The training data were normalized by x̃_ij = (x_ij − µ_j) / σ_j, where x_ij is the i-th row of the j-th variable in the data, and µ_j and σ_j are the mean and standard deviation of the j-th variable, respectively. The size of the map was initially set to 50 × 50, with the weights initialized using principal component analysis. Different sizes of the SOM were then attempted, and a final map size of 100 × 100 was chosen such that each node of the map corresponds to at most one datapoint in the dataset [36].

3. Results and discussion

3.1. Validation of the imputation models

Validation of the imputed values can be done through graphical plots that show the distribution of the data, as well as numerical displays such as summary statistics of the imputed dataset [37].

  Graphical evaluation of the imputed datasets is performed through data visualization using three plots: boxplot, kernel density plot with histogram, and cumulative distribution plot. By comparing these statistical visualization plots, one can assess the distribution of the imputed values and determine whether they fall within expected boundaries.

  Figure 5 shows the visualization plots for the energy density of the original dataset (observed) against the complete imputed dataset (imputed) for kNN imputation. The imputed values have distributions close to those of the original dataset and can be said to take reasonable values.

  The three visualization plots for all the incomplete variables are plotted for each of the imputed datasets, and the imputed values for energy density, hatch spacing, laser spot, layer thickness, scan speed, elongation, microhardness, macrohardness, yield strength, and porosity are found, for all three imputed datasets, to be adequately close to the distribution of the original dataset (Figures S1–S3). However, the distributions for exposure duration, laser focus, point distance, and Young's modulus deviate from the original distribution to varying degrees for the different imputation techniques, with MICE (Figure 6) showing the greatest deviation, followed by GINN (Figure 7) and then kNN (Figure 8).
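The variable-screening step described in Section 2.2 (dropping variables above a missing-data threshold, then dropping duplicate rows) can be sketched in pandas. This is a minimal illustration, not the authors' code, and the column names below are hypothetical placeholders rather than the paper's actual schema:

```python
import numpy as np
import pandas as pd

def screen_variables(df: pd.DataFrame, max_missing: float = 0.92) -> pd.DataFrame:
    """Drop columns whose fraction of missing values exceeds `max_missing`,
    then drop exact duplicate rows."""
    frac_missing = df.isna().mean()  # per-column fraction of NaN cells
    keep = frac_missing[frac_missing <= max_missing].index
    return df[keep].drop_duplicates().reset_index(drop=True)

# Toy example with hypothetical column names (not the paper's dataset)
df = pd.DataFrame({
    "laser_power_W": [200, 200, 250, 300],          # fully observed
    "scan_speed_mm_s": [1200, 1200, np.nan, 900],   # 25% missing -> kept
    "all_missing": [np.nan] * 4,                    # 100% missing -> dropped
})
clean = screen_variables(df)
```

Note that `drop_duplicates` treats identical rows containing NaN in the same positions as duplicates, which matches the intent of removing repeated records.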
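The string-cleaning rules in Section 2.2 (replacing a range such as "60 – 180" with its mean, and keeping only the leading numeric value of entries like "0.12 ± 0.03") might be implemented along these lines; the helper below is illustrative, not the authors' implementation:

```python
import re

def to_numeric(cell):
    """Convert a raw cell to a float following the cleaning rules above."""
    if isinstance(cell, (int, float)):
        return float(cell)
    text = str(cell).strip()
    # "0.12 ± 0.03" -> keep only the leading numeric value
    if "±" in text:
        return float(text.split("±")[0].strip())
    # "60 – 180" (two numbers joined by a hyphen or dash) -> mean of the range
    m = re.match(r"^\s*([\d.]+)\s*[-–—]\s*([\d.]+)\s*$", text)
    if m:
        lo, hi = float(m.group(1)), float(m.group(2))
        return (lo + hi) / 2.0
    return float(text)
```

For example, `to_numeric("60 – 180")` yields 120.0 and `to_numeric("0.12 ± 0.03")` yields 0.12.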
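Combining the three imputed datasets by taking the median value per cell, as mentioned at the start of Section 2.3, can be sketched as follows. This is an assumed reading of "the median values obtained from the 3 algorithms": only originally missing cells receive the median, while observed cells are left untouched:

```python
import numpy as np

def median_combine(original, imputed_list):
    """Element-wise median of several imputed copies of an array, applied
    only where the original array had missing values (NaN)."""
    original = np.asarray(original, dtype=float)
    stacked = np.stack([np.asarray(a, dtype=float) for a in imputed_list])
    combined = original.copy()
    mask = np.isnan(original)  # cells that were imputed
    combined[mask] = np.median(stacked, axis=0)[mask]
    return combined

# Toy example: one missing cell, filled differently by three
# hypothetical imputers (standing in for kNN, MICE, and GINN)
orig = np.array([[1.0, np.nan], [3.0, 4.0]])
def _fill(v):
    a = orig.copy()
    a[0, 1] = v
    return a
combined = median_combine(orig, [_fill(2.0), _fill(2.5), _fill(10.0)])
```

The median is robust to a single outlying imputation (here, the 10.0), which is one plausible motivation for this combination step.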
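The column-wise normalization used before SOM training is a standard z-score, x̃_ij = (x_ij − µ_j)/σ_j. A minimal NumPy sketch, using NaN-aware statistics so that still-missing cells do not distort µ_j and σ_j:

```python
import numpy as np

def zscore_columns(X):
    """Normalize each column j to (x_ij - mu_j) / sigma_j, ignoring NaNs
    when computing the per-column mean and standard deviation."""
    mu = np.nanmean(X, axis=0)
    sigma = np.nanstd(X, axis=0)
    return (X - mu) / sigma

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
Z = zscore_columns(X)  # each column now has mean 0 and std 1
```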
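Tfprop_sompy and SOMPY handle SOM training internally; as a rough illustration of what that training does (not the packages' actual implementation), the sketch below runs minimal online updates: find the best-matching unit for a sample, then pull it and its grid neighbors toward the sample with a shrinking Gaussian neighborhood. Map size, learning rate, and iteration count here are arbitrary toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, rows=10, cols=10, iters=500, lr0=0.8, sigma0=3.0):
    """Minimal online SOM: BMU search plus Gaussian-neighborhood update,
    with linearly decaying learning rate and neighborhood radius."""
    n, d = data.shape
    weights = rng.standard_normal((rows, cols, d)) * 0.1
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    for t in range(iters):
        x = data[rng.integers(n)]
        # Best-matching unit (BMU): node whose weight vector is closest to x
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Decay schedule for learning rate and neighborhood radius
        frac = t / iters
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 0.5
        # Gaussian neighborhood on the 2-D grid, centered at the BMU
        g_dist = np.linalg.norm(grid - np.array(bmu), axis=-1)
        h = np.exp(-(g_dist ** 2) / (2.0 * sigma ** 2))[..., None]
        weights += lr * h * (x - weights)
    return weights

# Two well-separated clusters; after training, nodes should cover both.
data = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
                  rng.normal(5.0, 0.1, (50, 2))])
W = train_som(data)
```

The paper's PCA weight initialization is replaced here by small random weights for brevity; SOMPY's batch training algorithm also differs from this online form.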
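A numerical counterpart to the cumulative-distribution comparison in Section 3.1 is the largest vertical gap between the two empirical CDFs (the two-sample Kolmogorov-Smirnov statistic); this is an additional check one could run alongside the plots, not a step the paper itself performs:

```python
import numpy as np

def max_ecdf_gap(a, b):
    """Largest vertical gap between the empirical CDFs of samples a and b
    (the two-sample Kolmogorov-Smirnov statistic)."""
    a, b = np.sort(a), np.sort(b)
    pts = np.concatenate([a, b])  # gaps can only change at sample points
    Fa = np.searchsorted(a, pts, side="right") / a.size
    Fb = np.searchsorted(b, pts, side="right") / b.size
    return np.abs(Fa - Fb).max()

same = max_ecdf_gap(np.arange(100.0), np.arange(100.0))           # 0.0
shifted = max_ecdf_gap(np.arange(100.0), np.arange(100.0) + 50.0)  # 0.5
```

A small gap indicates that the imputed values follow the observed distribution closely; a large gap flags the kind of deviation seen for exposure duration, laser focus, point distance, and Young's modulus.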

            Volume 2 Issue 1 (2023)                         6                        https://doi.org/10.36922/msam.50