
International Journal of AI for Materials and Design
Sustainable electronics using AI/ML


trees (CARTs), support vector machine (SVM), k-nearest neighbor (kNN), regularized logistic regression (RLR) model, graph neural network (GNN), transformers, and Bayesian optimization (BO). Each of these techniques is briefly discussed below.75

4.1.1. CART technique

CART is a robust and interpretable method for predicting the biodegradability of compounds using decision tree learning concepts; it constructs a binary tree by recursively splitting the dataset based on feature values to maximize the separation of biodegradable and non-biodegradable compounds (in the case of biodegradability prediction).76 Each internal node represents a decision based on a molecular descriptor (such as molecular weight, number of certain atoms, hydrophobicity, or other physicochemical properties), and the leaves represent the final prediction. This method is highly interpretable, allowing researchers to easily understand the decision-making process.77 However, it can be prone to overfitting, which can be mitigated through pruning. CART’s ability to handle non-linear relationships and its simplicity make it a valuable tool for assessing the environmental impact of chemicals.

4.1.2. SVMs

SVM is a powerful ML technique used to predict the biodegradability of compounds by finding the optimal hyperplane that separates biodegradable and non-biodegradable compounds in a high-dimensional feature space. By transforming molecular descriptors into this space, SVM maximizes the margin between the two classes, which supports robust classification even with complex, non-linear relationships.78 Kernel functions, such as radial basis functions, are often employed to handle non-linearity. SVM is valued for its accuracy and its ability to manage high-dimensional data, making it a popular choice for biodegradability prediction in environmental science and chemical informatics.79

4.1.3. kNN

kNN is another technique to determine the biodegradability of compounds. It classifies a compound based on the majority class of its k closest neighbors in the feature space, in which the features are typically molecular descriptors. The choice of k (the number of neighbors) and the distance function are crucial hyperparameters that can be tuned to optimize the performance.80 The distance metric, often Euclidean distance, determines the similarity between the compounds. kNN is intuitive and non-parametric, making no assumptions about the underlying data distribution. However, it can be computationally intensive for large datasets and sensitive to the choice of k and feature scaling. Despite these challenges, kNN remains a valuable tool for biodegradability prediction due to its straightforward implementation and interpretability.81

4.1.4. RLR

The RLR technique models the relationship between input variables and a binary outcome using a logistic function. The logistic function produces an S-shaped curve that maps the input to a value between 0 and 1, representing the predicted probability of a positive outcome. The model estimates the logistic function’s parameters using maximum likelihood estimation. To reduce overfitting and improve generalization, the technique applies regularization, which adds a penalty term to the cost function. This penalty term reduces the magnitude of the coefficients and prevents them from growing too large. The two most popular regularization terms for logistic regression are L1 (Lasso) and L2 (Ridge). The former adds the absolute values of the coefficients to the cost function, which can drive some coefficients to exactly zero; the latter adds the squared values of the coefficients. Typically, the model estimates the probability that a compound is biodegradable based on its features, providing a clear probabilistic interpretation. Regularization helps in managing multicollinearity and helps the model generalize to new data, making it a reliable choice for biodegradability prediction in environmental science and chemical informatics.

4.1.5. GNNs

GNNs are increasingly utilized in predicting biodegradability due to their ability to model the complex, non-linear relationships inherent in chemical structures. By representing molecules as graphs, where atoms are nodes and bonds are edges, GNNs can effectively capture the intricate connectivity and properties of compounds. This allows for accurate predictions of biodegradability by learning from the structural features and patterns within large datasets of chemical compounds. Consequently, GNNs facilitate the design of environmentally friendly chemicals by enabling researchers to identify and optimize biodegradable properties early in the development process.82

4.1.6. Transformers

By utilizing self-attention mechanisms, a transformer effectively captures the relationships within molecular structures, allowing for a detailed analysis of how various substructures within a molecule influence its biodegradability. The transformer model processes the molecule as a sequence of tokens, where
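The recursive splitting described in Section 4.1.1 can be illustrated with a single split step. The following is a minimal sketch, not the method of the cited works: it assumes Gini impurity as the split criterion (one common choice for classification trees), and the descriptor values and labels are hypothetical toy data. A full tree would apply `best_split` recursively to each side and then prune.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Try every feature and threshold; return the lowest weighted impurity.

    Returns a tuple (weighted_gini, feature_index, threshold).
    """
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [lab for r, lab in zip(rows, labels) if r[j] <= thr]
            right = [lab for r, lab in zip(rows, labels) if r[j] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, thr)
    return best


# Hypothetical descriptors: [molecular weight, number of halogen atoms].
# Labels are invented for illustration: 1 = biodegradable, 0 = not.
rows = [[120.0, 0], [150.0, 1], [135.0, 0], [420.0, 3], [390.0, 4], [450.0, 2]]
labels = [1, 1, 1, 0, 0, 0]

impurity, feature, threshold = best_split(rows, labels)
```

On this toy data the first feature already separates the two classes, so the chosen node reads as a single rule on molecular weight, which is the kind of interpretability the section refers to.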


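The kNN procedure from Section 4.1.3 (majority vote among the k nearest compounds under Euclidean distance) can be sketched in a few lines. This is an illustrative sketch only: the descriptors, labels, and the helper names `standardize` and `knn_predict` are assumptions, not taken from the cited studies. Features are standardized first because, as noted in that section, kNN is sensitive to feature scaling.

```python
import math
from collections import Counter


def standardize(rows):
    """Scale each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 when a column is constant to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    scaled = [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]
    return scaled, means, stds


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k nearest training points."""
    dists = sorted((math.dist(x, query), label)
                   for x, label in zip(train_x, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical descriptors: [molecular weight, logP]; labels are invented.
X = [[120.0, 0.5], [150.0, 1.0], [135.0, 0.8],
     [420.0, 5.2], [390.0, 4.8], [450.0, 5.5]]
y = ["biodegradable"] * 3 + ["non-biodegradable"] * 3

X_scaled, means, stds = standardize(X)
# The query must be scaled with the training means and standard deviations.
query = [(v - m) / s for v, m, s in zip([140.0, 0.9], means, stds)]
prediction = knn_predict(X_scaled, y, query, k=3)
```

Every prediction computes the distance to all training points, which is the cost that becomes significant for large datasets.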
Volume 1 Issue 2 (2024)    doi: 10.36922/ijamd.3173
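The penalized cost function described in Section 4.1.4 can be made concrete with a small sketch. It assumes a mean negative log-likelihood cost with a penalty weight `lam`; the function names, the toy data, and the plain gradient-descent solver are illustrative assumptions. Only the L2 fit is shown, since driving coefficients exactly to zero under L1 generally needs a different solver (for example, coordinate descent or a proximal method).

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Predicted probability of the positive class for one sample."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def cost(weights, bias, X, y, lam, kind="l2"):
    """Mean negative log-likelihood plus an L1 or L2 penalty on the weights."""
    nll = 0.0
    for x, t in zip(X, y):
        p = predict_proba(weights, bias, x)
        nll -= t * math.log(p) + (1 - t) * math.log(1 - p)
    if kind == "l1":
        pen = sum(abs(w) for w in weights)   # Lasso: absolute values
    else:
        pen = sum(w * w for w in weights)    # Ridge: squared values
    return nll / len(X) + lam * pen


def fit_l2(X, y, lam, lr=0.1, steps=2000):
    """Plain gradient descent on the L2-penalized cost (bias not penalized)."""
    n = len(X)
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        errors = [predict_proba(weights, bias, x) - t for x, t in zip(X, y)]
        grad_b = sum(errors) / n
        grad_w = [sum(e * x[j] for e, x in zip(errors, X)) / n
                  + 2 * lam * weights[j]
                  for j in range(len(weights))]
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


# Hypothetical standardized descriptors; 1 = biodegradable, 0 = not.
X = [[-1.2, -0.9], [-0.8, -1.1], [-1.0, -0.7],
     [0.9, 1.1], [1.1, 0.8], [1.0, 1.2]]
y = [1, 1, 1, 0, 0, 0]

w_weak, b_weak = fit_l2(X, y, lam=0.001)   # nearly unpenalized
w_strong, b_strong = fit_l2(X, y, lam=1.0)  # heavily penalized
```

Comparing `w_weak` with `w_strong` shows the shrinkage effect: the larger penalty weight yields coefficients of smaller magnitude, while the predicted probabilities still fall on the correct side of 0.5 for this toy data.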