
International Journal of AI for Materials and Design
ML molecular modeling of Ru: A KAN approach


2.3. Training data processing

To effectively process crystal structures for ML applications, the Python Materials Genomics (pymatgen) library[35] was utilized in conjunction with the Smooth Overlap of Atomic Positions (SOAP) descriptor, implemented through the DScribe Python package,[40] to convert structural data into tensor representations. SOAP provides a robust framework for representing atomic environments, as it ensures rotational, translational, and permutational invariance of the structural descriptors. This invariance means that equivalent atomic configurations yield identical representations regardless of rigid-body transformations or reordering of atoms, which is essential for reliable structure-property predictions in crystalline systems. Each structure's atomic positions were denoted as r_i = (x_i, y_i, z_i), where i denotes the index of the atom. These position vectors were flattened into a one-dimensional tensor for each structure:

T_structure = (x_1, y_1, z_1, …, x_n, y_n, z_n)    (Ⅰ)

where n is the total number of atoms in the structure. The tensors from each structure were then stacked to form a two-dimensional tensor T, representing the entire dataset and making it suitable for ML models:

T = [T_structure,1, …, T_structure,m]^T    (Ⅱ)

where m is the number of structures.

To address the high dimensionality of our data, we applied principal component analysis (PCA). We centered the data by subtracting the mean of each feature to produce a mean-centered data matrix:

X_centered = X − μ    (Ⅲ)

where X is the original data matrix and μ is a vector containing the mean values of each feature.

Next, we computed the covariance matrix C from the mean-centered data:

C = (1/(m − 1)) X_centered^T X_centered    (Ⅳ)

where m is the number of structures.

We then performed eigendecomposition to extract eigenvalues (Λ) and eigenvectors (V), satisfying:

CV = VΛ    (Ⅴ)

where V is the matrix of eigenvectors (principal components [PCs]) and Λ is the diagonal matrix of eigenvalues. The data were then projected onto the PCs to obtain a reduced data matrix:

X_reduced = X_centered V_reduced    (Ⅵ)

where V_reduced includes the eigenvectors corresponding to the largest eigenvalues that capture 95% of the variance.

In the PCA of crystal structures, Figure 2A elucidates the cumulative explained variance ratio as a function of component number. The first PC accounts for 80.5% of the total variance, demonstrating its dominant role in capturing the dataset's primary features. The second PC contributes an additional 12.2%, bringing the cumulative explained variance to 92.7%. Notably, the first three PCs collectively explain 97.6% of the dataset's variance, surpassing the often-used 95% threshold for dimensionality reduction.[41] This rapid accumulation of explained variance, visualized by the steep rise in the cumulative contribution line, exhibits a characteristic elbow-shaped curve. The shape of this curve, with its sharp initial increase followed by a plateau, indicates highly efficient dimensionality reduction, suggesting that the complex crystal structure data can be effectively represented using just these three PCs.

Figure S1 presents a heatmap of structural PCA loadings, revealing the complex relationships between the original structural features and the three PCs. In this context, "loadings" refer to the coefficients that describe how much each original feature contributes to a given PC, with red indicating positive correlations and blue indicating negative correlations. The first PC displays a nuanced pattern of strong positive and negative loadings across features, suggesting that it encapsulates a multifaceted combination of structural attributes. This component exhibits the most variation in loadings, indicating its capacity to capture complex, opposing relationships among the original features. In contrast, the second PC exhibits predominantly positive loadings, with several features displaying very strong positive correlations. Importantly, the magnitude of loadings across all PCs indicates that each original structural feature is well represented in the PC space, suggesting that the PCA effectively captures the key variance in the dataset without substantial loss of information from any feature.

Meanwhile, for the energy and force data extracted from DFT calculations, we employed different preprocessing techniques. The energy data were directly converted into tensor format without further modification. In contrast, the force data, representing a three-dimensional vector per atom with components F_i = (F_ix, F_iy, F_iz) along the x-, y-, and z-directions, required additional processing. We also applied PCA to reduce its dimensionality, as discussed above. Figure 2B
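The flattening and PCA steps of Equations Ⅰ–Ⅵ can be sketched with plain NumPy as below. This is a minimal illustration, not the paper's implementation: the toy coordinates stand in for the actual Ru structures and their SOAP descriptors.

```python
# Minimal NumPy sketch of Equations I-VI: flatten atomic positions into
# per-structure tensors, stack them, and reduce the result with PCA.
# The toy coordinates below are illustrative, not the paper's Ru data.
import numpy as np

# Each structure: an (n_atoms, 3) array of Cartesian positions r_i.
structures = [
    np.array([[0.0, 0.0, 0.0], [1.35, 1.35, 1.35]]),
    np.array([[0.0, 0.0, 0.0], [2.70, 0.0, 0.0]]),
    np.array([[0.1, 0.0, 0.0], [2.60, 0.1, 0.0]]),
    np.array([[0.0, 0.2, 0.0], [2.50, 0.0, 0.2]]),
]

# Equation I: T_structure = (x_1, y_1, z_1, ..., x_n, y_n, z_n)
# Equation II: stack the m flattened tensors into a 2-D matrix X.
X = np.stack([s.reshape(-1) for s in structures])   # shape (m, 3n)
m = X.shape[0]

# Equation III: mean-center each feature.
mu = X.mean(axis=0)
X_centered = X - mu

# Equation IV: covariance matrix C = X_c^T X_c / (m - 1).
C = X_centered.T @ X_centered / (m - 1)

# Equation V: eigendecomposition C V = V Lambda (eigh: C is symmetric).
eigvals, V = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]                   # decreasing variance
eigvals, V = eigvals[order], V[:, order]

# Keep the fewest PCs whose eigenvalues capture 95% of the variance.
ratio = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(ratio, 0.95)) + 1
V_reduced = V[:, :k]

# Equation VI: project onto the retained principal components.
X_reduced = X_centered @ V_reduced                  # shape (m, k)
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because the covariance matrix is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.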


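The loadings shown in the Figure S1 heatmap can be computed from the same eigendecomposition. One common convention, assumed here since the paper does not state its exact definition, scales each eigenvector by the square root of its eigenvalue:

```python
# Sketch of PCA "loadings": the coefficient of each original feature in
# each principal component. Loadings are taken here as eigenvector
# entries scaled by sqrt(eigenvalue) -- a common convention, and an
# assumption, since the paper does not give its exact definition.
# Random data stands in for the structural descriptor matrix.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))                 # 50 samples, 4 features
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / (X.shape[0] - 1)

eigvals, V = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]            # decreasing variance
eigvals, V = eigvals[order], V[:, order]

# Loadings matrix: rows = original features, columns = PCs. Large
# positive/negative entries correspond to the red/blue heatmap cells.
loadings = V * np.sqrt(eigvals)
```

Under this convention the loadings satisfy loadings^T · loadings = Λ, so the squared loadings in each column sum to that PC's eigenvalue.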
Volume 2 Issue 1 (2025) | doi: 10.36922/ijamd.8291