International Journal of AI for
Materials and Design
A unified industrial AI foundation framework

help with knowledge graph construction, while LLMs, such as GPT-3/4, 28,29 Llama1/2, 30,31 PaLM, 32 and DeepSeek, 33 can automate the summarization and extraction of key technical information from technical reports and research articles.

4.1.2. Dataset documentation as industrial knowledge
Dataset documentation is a critical part of building structured industrial knowledge. It involves systematically recording dataset metadata, collection conditions, sensor configurations, labeling schemes, and linking datasets with domain knowledge. This ensures that datasets evolve from isolated assets into long-term, reusable knowledge resources that can be properly understood, reused, and referenced in future work. Platforms such as GitHub and Hugging Face demonstrate best practices in dataset documentation, providing clear descriptions, structured metadata fields, and version histories. In addition, LLMs could enrich dataset documentation. 34

4.1.3. AI/ML development knowledge repository
An AI/ML development knowledge repository provides a centralized space to store reusable code templates, implementation guides, hyperparameter tuning records, and experiment logs. This repository accelerates development by allowing engineers to reuse proven techniques and implementation methods. Examples include maintaining shared code repositories on GitHub combined with experiment documentation platforms and using platforms such as MLflow or Weights and Biases for tracking experiments, packaging code, managing models, and sharing results. Moreover, LLMs also enhance multiple aspects, including code generation, information retrieval, and interactive AI-assisted exploration. 35-38

4.1.4. New knowledge generation and integration
As models are developed and validated, they produce new insights and interpretations that should be continuously integrated into the existing knowledge base. This process involves capturing lessons learned, model interpretation outputs, model results, and deployment feedback, feeding them back into structured documentation and knowledge graphs. This iterative process ensures that the knowledge module remains dynamic and evolves alongside technological progress and deployment experiences.

4.2. Data module
The data module focuses on transforming raw industrial data into AI-ready datasets to improve data usability and reliability, enhanced by the domain knowledge from the knowledge module. In processing data, this module focuses on four key aspects: data preprocessing, data quality, feature engineering, and data visualization.

4.2.1. Data preprocessing
This is a fundamental prerequisite for successful industrial AI applications, as raw data often contains issues such as noise, missing values, anomalies, class imbalances, and labeling inconsistencies. A well-structured data preprocessing pipeline typically involves several key techniques, including outlier detection methods (such as isolation forests, local outlier factor, and statistical thresholding), data imputation methods (including mean, median, multiple imputation, k-nearest neighbors, and regression-based approaches), signal denoising methods (such as wavelet transforms and filtering), and resampling methods (including synthetic minority over-sampling technique, random over-sampling, and under-sampling). 39 In addition, researchers are encouraged to adopt, design, and document reusable preprocessing pipelines using standardized tools such as Pandas, scikit-learn, PyTorch, and TensorFlow to ensure reproducibility and scalability.

4.2.2. Data quality
Data quality directly impacts the reliability of AI models, as poor-quality data can degrade model performance and lead to poor decision-making. Key dimensions of data quality include consistent representation, completeness, uniqueness, feature accuracy, target accuracy, and target class balance. 40 Therefore, developers should adopt systematic data quality checks as part of their pipelines. For instance, CleanML demonstrates how various data quality issues can significantly affect the performance of common ML models. 41 Furthermore, Foroni et al. 42 extended conventional data quality definitions by evaluating not only how data deviate from an ideal clean dataset but also how these deviations influence task outcomes.

4.2.3. Feature engineering
Feature engineering focuses on extracting meaningful representations from raw data, improving model performance, interpretability, and efficiency. 43 In industrial applications, well-designed features can greatly improve model accuracy. Common techniques include domain-specific feature extraction, such as time-domain statistical measures, frequency-domain features derived from Fourier transforms, and time-frequency domain features. Dimensionality reduction methods, including principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and linear discriminant analysis, help reduce complexity while maintaining important information. Feature importance ranking methods are used to identify key variables that most
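
As a minimal, dependency-free sketch of two of the techniques listed for data preprocessing (median imputation and statistical thresholding for outlier detection), the following Python example may help illustrate what a small, documented, reusable preprocessing step could look like. The function names, the sample sensor readings, and the modified z-score settings (a 0.6745 scaling constant and a 3.5 cutoff) are illustrative assumptions and are not taken from the framework itself; in practice, the equivalent Pandas or scikit-learn utilities would likely be preferred.

```python
import math
import statistics


def is_missing(value):
    """Treat None and NaN as missing readings."""
    return value is None or math.isnan(value)


def impute_median(values):
    """Replace missing readings with the median of the observed readings."""
    observed = [v for v in values if not is_missing(v)]
    fill = statistics.median(observed)
    return [fill if is_missing(v) else v for v in values]


def flag_outliers(values, threshold=3.5):
    """Flag points whose modified z-score exceeds the threshold.

    The score uses the median absolute deviation (MAD), which is less
    sensitive to extreme values than the mean and standard deviation.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # All points sit on the median; nothing can be flagged.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


def preprocess(values, threshold=3.5):
    """Impute missing readings, then drop flagged outliers."""
    imputed = impute_median(values)
    mask = flag_outliers(imputed, threshold)
    cleaned = [v for v, is_outlier in zip(imputed, mask) if not is_outlier]
    return cleaned, mask


# Hypothetical temperature readings with two gaps and one spike.
readings = [20.1, 20.3, None, 19.9, 85.0, 20.2, float("nan"), 20.0]
cleaned, mask = preprocess(readings)
```

Here `mask` marks only the 85.0 reading as an outlier, and `cleaned` keeps the remaining seven values with both gaps filled. Imputing before thresholding is a design choice made for brevity; a documented pipeline would state this ordering explicitly, since it affects which points are flagged.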
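
To make the idea of a systematic data quality check more concrete, the sketch below computes three of the dimensions named above (completeness, uniqueness, and target class balance) for a small set of records. It is an illustrative assumption of how such a check might look, not the procedure used by CleanML or by Foroni et al.; the function name, the field names, and the sample records are hypothetical.

```python
from collections import Counter


def quality_report(rows, feature_keys, target_key):
    """Return simple data quality ratios, each in the range 0 to 1."""
    if not rows:
        raise ValueError("rows must not be empty")
    keys = (*feature_keys, target_key)
    n = len(rows)

    # Completeness: share of rows with no missing feature or target value.
    complete = sum(all(r.get(k) is not None for k in keys) for r in rows)

    # Uniqueness: share of distinct rows among all rows.
    distinct = len({tuple(r.get(k) for k in keys) for r in rows})

    # Target class balance: minority class count over majority class count.
    counts = Counter(r[target_key] for r in rows if r.get(target_key) is not None)
    balance = min(counts.values()) / max(counts.values()) if counts else 0.0

    return {
        "completeness": complete / n,
        "uniqueness": distinct / n,
        "class_balance": balance,
    }


# Hypothetical machine records: one duplicate row and one missing value.
records = [
    {"temp": 20.1, "vib": 0.3, "label": "ok"},
    {"temp": 20.4, "vib": 0.2, "label": "ok"},
    {"temp": 20.4, "vib": 0.2, "label": "ok"},
    {"temp": None, "vib": 0.9, "label": "fault"},
]
report = quality_report(records, feature_keys=("temp", "vib"), target_key="label")
```

A check of this kind could run at the start of a pipeline and fail or warn when a ratio drops below a project-specific threshold. Feature accuracy and target accuracy are not covered here, since they require reference values that this sketch does not assume.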

Volume 2 Issue 2 (2025) 60 doi: 10.36922/IJAMD025080006
