
Materials Science in Additive Manufacturing                           Data imputation strategies of PBF Ti64



Machine learning in 3D printing is growing rapidly and has been used for process and design optimization, anomaly detection, and other tasks [18]. The predictive performance of a machine learning model depends heavily on the dataset used to train it. Given the large body of literature investigating the effects of process parameters on the different properties of SLM Ti6Al4V, there is potential in collating the data and using machine learning to perform data analytics on the dataset to determine the process-structure-properties relationship. The collated SLM Ti6Al4V dataset contains missing values because each property or parameter has typically been studied in isolation, and the quantity of complete data is insufficient for machine learning. The data from the literature are therefore considered incomplete, and imputation of the missing data is required as a pre-processing step, both to bolster the data volume and to allow subsequent analysis to be carried out.

Researchers have utilized various techniques to impute missing data in manufacturing processes. For instance, Steiner et al. aimed to develop real-time predictive models of two key strength properties of a wood composite manufacturing process using real-time process and destructive test data collected from a wood composite manufacturer [19]. However, sensor malfunction and data "send/retrieval" problems led to null fields in the company's data warehouse, which resulted in information loss. To overcome this challenge, two missing data imputation methods, the expectation-maximization (EM) algorithm and multiple imputation (MI) using Markov Chain Monte Carlo (MCMC) simulation, were used to impute the missing data. Predictive models based on the imputed datasets generated more precise prediction results than models based on non-imputed datasets. In addition, Bayesian Additive Regression Tree (BART) produced the most precise prediction results among four predictive modeling methods. In another work, Wang et al. discuss the importance of data mining in intelligent manufacturing and introduce an energy monitoring platform for small- and medium-sized enterprises that records energy consumption data at various levels of granularity [20]. However, incomplete data can lead to an inaccurate portrayal of the system, so Wang et al. propose a novel orthogonal-least-square-based autoencoder to generate new samples for the imputation of missing values. The proposed approach outperforms alternative methods significantly for missing ratios >0.05 based on experimental results using real industrial datasets.

There are many data imputation strategies, from simple statistical methods such as mean imputation and regression imputation to more complex methods such as hot-deck imputation, which imputes the missing data with realistic scores that preserve the variable distribution [21]. Some widely used imputation methods include: imputing using zero, mean, median, or mode; imputing using a randomly selected value; and imputing using a model [22]. These techniques often impute a single, constant value for each variable without capturing or reflecting the relationship among the variables. This will likely result in an incorrect process-properties relationship.

Model-based imputation methods can be categorized into two types: those that make predictions for the missing values based on similar data points, and those that attempt to construct a global model to infer the missing data. The former includes algorithms such as k-nearest neighbors (kNN), while the latter encompasses deep learning neural networks.

The present study focuses on the effect of different model-based imputation techniques on the process-structure relationship of the SLM Ti6Al4V dataset. The results of the imputation were evaluated to determine the best strategy for the dataset. This article first presents the methodology, followed by results and discussion of the different imputation methods, and finally the investigation of the imputed dataset.

2. Methodology

2.1. Imputation methods

2.1.1. k-Nearest neighbors (kNN) imputation

kNN imputation is one of the most common methods to impute missing values. The kNN algorithm is used for both classification and regression problems [23]. The algorithm identifies the k nearest neighboring points using a distance metric and estimates the missing values using the values of these k neighboring observations [24].

The distance metric is generally Euclidean, and the function can be defined as

$$E(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2} \qquad \text{(I)}$$

where $x_i$ and $y_i$ are the $i$-th input variables of the point of interest and of a case point from the dataset, respectively, and $m$ is the number of input variables [25]. The process flow for the imputation is shown in Figure 2.

Since the kNN algorithm is non-parametric [23], there is no underlying assumption on the distribution of data, and hence, kNN is suitable for datasets with varied distributions.

Imputation was done using Scikit-learn's KNNImputer class [26]. For calculation of the distance involving missing values, the coordinates of the missing value are ignored and the weights of the remaining coordinates
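
As a rough illustration of the kNN imputation described in Section 2.1.1, the following standard-library sketch computes the Euclidean distance of Equation (I) over the coordinates observed in both points and fills each missing entry with the mean of that variable over the k nearest rows. It is not the authors' code. The function names `nan_euclidean` and `knn_impute` and the sample data are hypothetical, and the rescaling by (total coordinates / coordinates used) is an assumption taken from the convention documented for Scikit-learn's `nan_euclidean` metric.

```python
import math


def nan_euclidean(x, y):
    """Euclidean distance (Equation I) over coordinates observed in both points.

    Coordinates that are NaN in either point are skipped, and the squared sum
    is rescaled by (total coordinates / coordinates used). This rescaling is an
    assumption borrowed from scikit-learn's documented nan_euclidean metric.
    """
    pairs = [(a, b) for a, b in zip(x, y)
             if not (math.isnan(a) or math.isnan(b))]
    if not pairs:
        return math.inf  # no shared coordinates: treat the points as unreachable
    squared_sum = sum((a - b) ** 2 for a, b in pairs)
    return math.sqrt(len(x) / len(pairs) * squared_sum)


def knn_impute(rows, k=2):
    """Return a copy of rows in which each NaN is replaced by the unweighted
    mean of that column over the k nearest rows that have the column observed."""
    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not math.isnan(value):
                continue
            # Candidate donors: every other row with column j observed.
            donors = [(nan_euclidean(row, other), other[j])
                      for n, other in enumerate(rows)
                      if n != i and not math.isnan(other[j])]
            donors = [d for d in donors if math.isfinite(d[0])]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Hypothetical toy data: four samples, three input variables, two missing entries.
nan = math.nan
X = [[1, 2, nan],
     [3, 4, 3],
     [nan, 6, 5],
     [8, 8, 7]]
print(knn_impute(X, k=2))
# [[1, 2, 4.0], [3, 4, 3], [5.5, 6, 5], [8, 8, 7]]
```

In practice, Scikit-learn's `KNNImputer(n_neighbors=k)` provides this behavior directly. The sketch above only makes the distance computation and neighbor averaging explicit.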


            Volume 2 Issue 1 (2023)                         3                        https://doi.org/10.36922/msam.50