
Materials Science in Additive Manufacturing                           Data imputation strategies of PBF Ti64



before imputation, as the information is of limited use if there is insufficient data.

  The proportion of missing data is first calculated for each variable, and variables with more than 92% missing data are dropped (Table 1).

  In general, process parameter variables have fewer missing values, as the print parameters are normally reported regardless of the type of mechanical tests being conducted, whereas material property variables have a high number of missing values, as not every study reports the same material properties. The 92% threshold was determined after considering the importance of the variables and the pattern of missing data. It is understood that the accuracy and reliability of the imputations may be lower when a large proportion of the data is missing: in general, imputation methods tend to perform better when the amount of missing data is low and may struggle to accurately impute large amounts of missing data. Therefore, in our case, we attempted alternative methods, such as multiple imputation followed by a median approach, to improve the accuracy. Of the remaining variables, scanning strategy and microstructure are dropped, as they are too varied to be generalized. Duplicate rows in the remaining dataset are then dropped, leaving 401 datapoints.

  There are 18 variables retained: energy density (J/mm³), exposure duration (µs), hatch spacing (µm), laser focus (mm), laser power (W), laser spot (µm), laser type (0 for continuous wavelength [cw], 1 for pulsed wavelength [pw]), layer thickness (µm), point distance (µm), scan speed (mm/s), density (%), elongation (%), microhardness (HV), macrohardness (HV), ultimate tensile strength (MPa), yield strength (MPa), Young's modulus (GPa), and porosity (%). Only two variables, laser power and laser type, have no missing values.

  Cells containing a range entered as a string (e.g., "60 – 180") are replaced with the mean of the range. As the exact value was not given for the variable, using the mean of the range is the only option, although it introduces some degree of uncertainty. Cells with standard deviations (e.g., "0.12 ± 0.03") are replaced so that only the leading numeric value is retained.

2.3. Visualizing relationships in the imputed dataset

After obtaining the imputed dataset using the median of the values produced by the three algorithms, the process-property linkages for SLM Ti64 can be obtained by data mining through a self-organizing map (SOM). A SOM is an unsupervised machine learning model developed by Kohonen that reduces the dimensionality of an input space while maintaining its underlying structure [34]. This is especially useful for visualizing large quantities of high-dimensional data and modeling the relationships between them in a low-dimensional, two-dimensional map, helping to advance the understanding of process-property relationships for materials.

  The implementation of the SOM is from the Python package Tfprop_sompy, developed by Kikugawa and Nishimura, based on the open-source package SOMPY [35]. The training data were normalized by x̃_ij = (x_ij − µ_j) / σ_j, where x_ij is the i-th row of the j-th variable in the data, and µ_j and σ_j are the mean and standard deviation of the j-th variable, respectively. The size of the map was initially set to 50 × 50, with the weights initialized using principal component analysis. Different sizes of the SOM were then attempted, and a final map size of 100 × 100 was chosen such that each node of the map corresponds to at most one datapoint in the dataset [36].

3. Results and discussion

3.1. Validation of the imputation models

Validation of the imputed values can be done through graphical plots that show the distribution of the data, as well as numerical displays such as summary statistics of the imputed dataset [37].

  Graphical evaluation of the imputed datasets is performed through data visualization using three plots: boxplot, kernel density plot with histogram, and cumulative distribution plot. By comparing these statistical visualization plots, one can assess the distribution of the imputed values and determine whether they fall within expected boundaries.

  Figure 5 shows the visualization plots for the energy density of the original dataset (observed) against the complete imputed dataset (imputed) for kNN imputation. The imputed values have distributions close to those of the original dataset and can be said to take reasonable values.

  The three visualization plots for all the incomplete variables are plotted for each of the imputed datasets, and the imputed values for energy density, hatch spacing, laser spot, layer thickness, scan speed, elongation, microhardness, macrohardness, yield strength, and porosity are found, for all three imputed datasets, to be adequately close to the distribution of the original dataset (Figures S1–S3). However, the distributions for exposure duration, laser focus, point distance, and Young's modulus deviate from the original distribution to varying degrees for the different imputation techniques, with MICE (Figure 6) showing the greatest deviation, followed by GINN (Figure 7) and then kNN (Figure 8).
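The variable-screening step described in Section 2.2 (dropping variables above a missing-data threshold, then dropping duplicate rows) can be sketched in pandas. This is a minimal illustration, not the authors' code, and the column names below are hypothetical placeholders rather than the paper's actual schema:

```python
import numpy as np
import pandas as pd

def screen_variables(df: pd.DataFrame, max_missing: float = 0.92) -> pd.DataFrame:
    """Drop columns whose fraction of missing values exceeds `max_missing`,
    then drop exact duplicate rows."""
    frac_missing = df.isna().mean()  # per-column fraction of NaN cells
    keep = frac_missing[frac_missing <= max_missing].index
    return df[keep].drop_duplicates().reset_index(drop=True)

# Toy example with hypothetical column names (not the paper's dataset)
df = pd.DataFrame({
    "laser_power_W": [200, 200, 250, 300],          # fully observed
    "scan_speed_mm_s": [1200, 1200, np.nan, 900],   # 25% missing -> kept
    "all_missing": [np.nan] * 4,                    # 100% missing -> dropped
})
clean = screen_variables(df)
```

Note that `drop_duplicates` treats identical rows containing NaN in the same positions as duplicates, which matches the intent of removing repeated records.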
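The string-cleaning rules in Section 2.2 (replacing a range such as "60 – 180" with its mean, and keeping only the leading numeric value of entries like "0.12 ± 0.03") might be implemented along these lines; the helper below is illustrative, not the authors' implementation:

```python
import re

def to_numeric(cell):
    """Convert a raw cell to a float following the cleaning rules above."""
    if isinstance(cell, (int, float)):
        return float(cell)
    text = str(cell).strip()
    # "0.12 ± 0.03" -> keep only the leading numeric value
    if "±" in text:
        return float(text.split("±")[0].strip())
    # "60 – 180" (two numbers joined by a hyphen or dash) -> mean of the range
    m = re.match(r"^\s*([\d.]+)\s*[-–—]\s*([\d.]+)\s*$", text)
    if m:
        lo, hi = float(m.group(1)), float(m.group(2))
        return (lo + hi) / 2.0
    return float(text)
```

For example, `to_numeric("60 – 180")` yields 120.0 and `to_numeric("0.12 ± 0.03")` yields 0.12.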
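Combining the three imputed datasets by taking the median value per cell, as mentioned at the start of Section 2.3, can be sketched as follows. This is an assumed reading of "the median values obtained from the 3 algorithms": only originally missing cells receive the median, while observed cells are left untouched:

```python
import numpy as np

def median_combine(original, imputed_list):
    """Element-wise median of several imputed copies of an array, applied
    only where the original array had missing values (NaN)."""
    original = np.asarray(original, dtype=float)
    stacked = np.stack([np.asarray(a, dtype=float) for a in imputed_list])
    combined = original.copy()
    mask = np.isnan(original)  # cells that were imputed
    combined[mask] = np.median(stacked, axis=0)[mask]
    return combined

# Toy example: one missing cell, filled differently by three
# hypothetical imputers (standing in for kNN, MICE, and GINN)
orig = np.array([[1.0, np.nan], [3.0, 4.0]])
def _fill(v):
    a = orig.copy()
    a[0, 1] = v
    return a
combined = median_combine(orig, [_fill(2.0), _fill(2.5), _fill(10.0)])
```

The median is robust to a single outlying imputation (here, the 10.0), which is one plausible motivation for this combination step.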
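The column-wise normalization used before SOM training is a standard z-score, x̃_ij = (x_ij − µ_j)/σ_j. A minimal NumPy sketch, using NaN-aware statistics so that still-missing cells do not distort µ_j and σ_j:

```python
import numpy as np

def zscore_columns(X):
    """Normalize each column j to (x_ij - mu_j) / sigma_j, ignoring NaNs
    when computing the per-column mean and standard deviation."""
    mu = np.nanmean(X, axis=0)
    sigma = np.nanstd(X, axis=0)
    return (X - mu) / sigma

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
Z = zscore_columns(X)  # each column now has mean 0 and std 1
```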
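Tfprop_sompy and SOMPY handle SOM training internally; as a rough illustration of what that training does (not the packages' actual implementation), the sketch below runs minimal online updates: find the best-matching unit for a sample, then pull it and its grid neighbors toward the sample with a shrinking Gaussian neighborhood. Map size, learning rate, and iteration count here are arbitrary toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, rows=10, cols=10, iters=500, lr0=0.8, sigma0=3.0):
    """Minimal online SOM: BMU search plus Gaussian-neighborhood update,
    with linearly decaying learning rate and neighborhood radius."""
    n, d = data.shape
    weights = rng.standard_normal((rows, cols, d)) * 0.1
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    for t in range(iters):
        x = data[rng.integers(n)]
        # Best-matching unit (BMU): node whose weight vector is closest to x
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Decay schedule for learning rate and neighborhood radius
        frac = t / iters
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 0.5
        # Gaussian neighborhood on the 2-D grid, centered at the BMU
        g_dist = np.linalg.norm(grid - np.array(bmu), axis=-1)
        h = np.exp(-(g_dist ** 2) / (2.0 * sigma ** 2))[..., None]
        weights += lr * h * (x - weights)
    return weights

# Two well-separated clusters; after training, nodes should cover both.
data = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
                  rng.normal(5.0, 0.1, (50, 2))])
W = train_som(data)
```

The paper's PCA weight initialization is replaced here by small random weights for brevity; SOMPY's batch training algorithm also differs from this online form.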
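A numerical counterpart to the cumulative-distribution comparison in Section 3.1 is the largest vertical gap between the two empirical CDFs (the two-sample Kolmogorov-Smirnov statistic); this is an additional check one could run alongside the plots, not a step the paper itself performs:

```python
import numpy as np

def max_ecdf_gap(a, b):
    """Largest vertical gap between the empirical CDFs of samples a and b
    (the two-sample Kolmogorov-Smirnov statistic)."""
    a, b = np.sort(a), np.sort(b)
    pts = np.concatenate([a, b])  # gaps can only change at sample points
    Fa = np.searchsorted(a, pts, side="right") / a.size
    Fb = np.searchsorted(b, pts, side="right") / b.size
    return np.abs(Fa - Fb).max()

same = max_ecdf_gap(np.arange(100.0), np.arange(100.0))           # 0.0
shifted = max_ecdf_gap(np.arange(100.0), np.arange(100.0) + 50.0)  # 0.5
```

A small gap indicates that the imputed values follow the observed distribution closely; a large gap flags the kind of deviation seen for exposure duration, laser focus, point distance, and Young's modulus.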

            Volume 2 Issue 1 (2023)                         6                        https://doi.org/10.36922/msam.50