Page 14 - IJAMD-1-2

International Journal of AI for Materials and Design
Sustainable electronics using AI/ML

trees (CARTs), support vector machine (SVM), k-nearest neighbor (kNN), regularized logistic regression (RLR) model, graph neural network (GNN), transformers, and Bayesian optimization (BO). Each of these techniques is briefly discussed below.⁷⁵

4.1.1. CART technique
CART is a robust and interpretable method for predicting the biodegradability of compounds using decision tree learning concepts; it constructs a binary tree by recursively splitting the dataset based on feature values to maximize the separation of biodegradable and non-biodegradable compounds (in the case of biodegradability prediction).⁷⁶ Each node represents a decision based on a molecular descriptor (molecular weight, number of certain atoms, hydrophobicity, and other physicochemical properties), and the leaves represent the final prediction. This method is highly interpretable, allowing researchers to easily understand the decision-making process. However, it can be prone to overfitting,⁷⁷ which can be mitigated through pruning. CART's ability to handle non-linear relationships and its simplicity make it a valuable tool for assessing the environmental impact of chemicals.

4.1.2. SVMs
SVM is a powerful ML technique used to predict the biodegradability of compounds by finding the optimal hyperplane that separates biodegradable and non-biodegradable compounds in a high-dimensional feature space. By transforming molecular descriptors into this space, SVM maximizes the margin between the two classes, ensuring robust classification even with complex, non-linear relationships. Kernel functions, such as radial basis functions,⁷⁸ are often employed to handle non-linearity. SVM is valued for its accuracy and its ability to manage high-dimensional data, making it a popular choice for biodegradability prediction in environmental science and chemical informatics.⁷⁹

4.1.3. kNN
kNN is another technique for determining the biodegradability of compounds. It classifies a compound based on the majority class of its k closest neighbors in the feature space, in which the features are typically molecular descriptors. The choice of k (the number of neighbors) and the distance function are crucial hyperparameters that can be tuned to optimize performance. The distance metric, often Euclidean distance,⁸⁰ determines the similarity between compounds. kNN is intuitive and non-parametric, making no assumptions about the underlying data distribution. However, it can be computationally intensive for large datasets and is sensitive to the choice of k and to feature scaling. Despite these challenges, kNN remains a valuable tool for biodegradability prediction due to its straightforward implementation and interpretability.⁸¹

4.1.4. RLR
The RLR technique models the relationship between input variables and a binary outcome using a logistic function. The logistic function produces an S-shaped curve that maps the input to a value between 0 and 1, representing the predicted probability of a positive outcome. The model estimates the logistic function's parameters using maximum likelihood estimation. Regularization reduces overfitting and improves generalization by adding a penalty term to the cost function. This penalty term reduces the magnitude of the coefficients and prevents them from growing too large. The two most popular regularization terms for logistic regression are L1 (Lasso) and L2 (Ridge). The former adds the absolute values of the coefficients to the cost function, which can drive some coefficients to exactly zero, whereas the latter adds the squared values of the coefficients to the cost function. Typically, the model estimates the probability that a compound is biodegradable based on its features, providing a clear probabilistic interpretation. Regularization helps in managing multicollinearity and helps the model generalize to new data, making RLR a reliable choice for biodegradability prediction in environmental science and chemical informatics.

4.1.5. GNNs
GNNs are increasingly utilized in predicting biodegradability due to their ability to model the complex, non-linear relationships inherent in chemical structures. By representing molecules as graphs, where atoms are nodes and bonds are edges, GNNs can effectively capture the intricate connectivity and properties of compounds. This allows for accurate predictions of biodegradability by learning from the structural features and patterns within large datasets of chemical compounds. Consequently, GNNs facilitate the design of environmentally friendly chemicals by enabling researchers to identify and optimize biodegradable properties early in the development process.⁸²

4.1.6. Transformers
By utilizing self-attention mechanisms, a transformer captures the relationships within molecular structures, allowing for a detailed analysis of how various substructures within a molecule influence its biodegradability. The transformer model processes the molecule as a sequence of tokens, where

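As a rough illustration of the penalty term described in Section 4.1.4, the following minimal sketch fits a logistic regression by gradient descent with an L2 (Ridge) penalty and compares the coefficient magnitudes with and without regularization. It is not taken from the cited works: the function names, the learning rate, the penalty strength `lam`, and the toy dataset (two standardized descriptors per compound, label 1 meaning biodegradable) are all assumptions made for illustration. Only the L2 case is shown, since the L1 penalty is not differentiable at zero and needs a different optimizer.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # avoids overflow for large negative z
    return ez / (1.0 + ez)

def predict_proba(w, b, x):
    """Predicted probability of the positive class for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def fit_logistic_l2(X, y, lam, lr=0.1, epochs=2000):
    """Gradient descent on mean log-loss + (lam / 2) * sum(w_j ** 2)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # The penalty contributes lam * w_j to each coefficient's gradient;
        # the intercept b is left unpenalized.
        w = [wj - lr * (gj / n + lam * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def norm(w):
    """Euclidean length of the coefficient vector."""
    return math.sqrt(sum(wj * wj for wj in w))

# Hypothetical toy data: two standardized descriptors per compound,
# label 1 = biodegradable, 0 = non-biodegradable (made up for this sketch).
X = [[-1.2, 0.8], [-0.9, 1.1], [-0.4, 0.3], [0.2, -0.1],
     [0.7, -0.6], [1.1, -0.9], [-0.6, -0.2], [0.5, 0.4]]
y = [1, 1, 1, 0, 0, 0, 1, 0]

w_free, b_free = fit_logistic_l2(X, y, lam=0.0)  # no regularization
w_l2, b_l2 = fit_logistic_l2(X, y, lam=1.0)      # L2-regularized

print("coefficient norm without penalty:", norm(w_free))
print("coefficient norm with L2 penalty:", norm(w_l2))

# Probability that a new (hypothetical) compound is biodegradable.
p_l2 = predict_proba(w_l2, b_l2, [-1.0, 0.5])
print("predicted probability:", p_l2)
```

With the penalty switched on, the fitted coefficients stay smaller than in the unpenalized fit, which is the shrinkage effect the section attributes to regularization. In practice a library implementation would normally be used instead of a hand-written loop.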
Volume 1 Issue 2 (2024) 8 doi: 10.36922/ijamd.3173
