
International Journal of AI for Materials and Design
A unified industrial AI foundation framework


            help with knowledge graph construction, while LLMs, such   focuses on four key aspects: data preprocessing, data
            as GPT-3/4, 28,29 Llama1/2, 30,31 PaLM, 32 and DeepSeek, 33   quality, feature engineering, and data visualization.
            can automate the summarization and extraction of key
            technical information from technical reports and research   4.2.1. Data preprocessing
            articles.                                          This is a fundamental prerequisite for successful industrial
                                                               AI applications, as raw data often contains issues such
            4.1.2. Dataset documentation as industrial         as noise, missing values, anomalies, class imbalances,
            knowledge                                          and labeling inconsistencies. A well-structured data
            Dataset documentation is a critical part of building   preprocessing pipeline typically involves several key
            structured industrial knowledge. It involves systematically   techniques, including outlier detection methods (such
            recording dataset metadata, collection conditions, sensor   as isolation forests, local outlier factor, and statistical
            configurations, labeling schemes, and linking datasets   thresholding), data imputation methods (including mean,
            with domain knowledge. This ensures that datasets evolve   median, multiple imputation, k-nearest neighbors, and
            from isolated assets into long-term, reusable knowledge   regression-based approaches), signal denoising methods
            resources that can be properly understood, reused, and   (such as wavelet transforms and filtering), and resampling
            referenced in future work. Platforms such as GitHub   methods (including synthetic minority over-sampling
            and Hugging Face demonstrate best practices in dataset   technique, random over-sampling, and under-sampling). 39
            documentation, providing clear descriptions, structured   In addition, researchers are encouraged to adopt, design,
            metadata fields, and version histories. In addition, LLMs   and document reusable preprocessing pipelines using
            could enrich dataset documentation. 34             standardized tools such as Pandas, scikit-learn, PyTorch,
                                                               and TensorFlow to ensure reproducibility and scalability.
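As an illustration of the preprocessing techniques listed in Section 4.2.1, the following is a minimal, dependency-free sketch of three of them: median imputation, statistical thresholding for outlier detection (here a modified z-score based on the median absolute deviation), and random over-sampling. The function names, the 3.5 threshold, and the data layout are assumptions made for this example and do not come from the article; in practice, library implementations from tools such as scikit-learn would typically replace these hand-written helpers.

```python
import random
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


def flag_outliers_mad(values, threshold=3.5):
    """Statistical thresholding: flag points whose modified z-score
    (based on the median absolute deviation) exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # All deviations are (nearly) identical; nothing can be flagged.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


def random_oversample(samples, labels, seed=0):
    """Random over-sampling: duplicate randomly chosen minority-class samples
    until every class has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(rows) for rows in by_class.values())
    out_x, out_y = [], []
    for y, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        out_x.extend(rows + extra)
        out_y.extend([y] * target)
    return out_x, out_y
```

A typical call order would be `impute_median` on a raw sensor channel, then `flag_outliers_mad` on the filled values, and finally `random_oversample` on the labeled training set only, so that duplicated samples do not leak into validation data.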
            4.1.3. AI/ML development knowledge repository
            An AI/ML development knowledge repository provides   4.2.2. Data quality
            a centralized space to store reusable code templates,   Data quality directly impacts the reliability of AI models,
            implementation guides, hyperparameter tuning records,   as poor-quality data can degrade model performance and
            and experiment logs. This repository accelerates   lead to poor decision-making. Key dimensions of data
            development by allowing engineers to reuse proven   quality include consistent representation, completeness,
            techniques and implementation methods. Examples    uniqueness, feature accuracy, target accuracy, and target
            include maintaining shared code repositories on GitHub   class balance. 40 Therefore, developers should adopt
            combined with experiment documentation platforms and   systematic data quality checks as part of their pipelines. For
            using platforms such as MLflow or Weights and Biases for   instance, CleanML demonstrates how various data quality
            tracking experiments, packaging code, managing models,   issues can significantly affect the performance of common
            and sharing results. Moreover, LLMs can also support several   ML models. 41 Furthermore, Foroni et al. 42 extended
            aspects of development, including code generation, information retrieval,   conventional data quality definitions by evaluating not
            and interactive AI-assisted exploration. 35-38     only how data deviate from an ideal clean dataset but also
                                                               how these deviations influence task outcomes.
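To make the idea of systematic data quality checks in Section 4.2.2 concrete, the sketch below computes simple indicators for three of the dimensions named there: completeness, uniqueness, and target class balance. It is an illustrative assumption rather than the article's method: the function name `data_quality_report`, the record layout (a list of dictionaries with `None` marking missing values), and the choice of a minority-to-majority ratio as the balance indicator are all hypothetical.

```python
from collections import Counter


def data_quality_report(rows, feature_keys, target_key):
    """Compute simple completeness, uniqueness, and class-balance indicators.

    rows: list of dicts, one per record; a value of None counts as missing.
    """
    n = len(rows)
    if n == 0:
        raise ValueError("rows must not be empty")
    keys = list(feature_keys) + [target_key]

    # Completeness: fraction of non-missing cells in each column.
    completeness = {
        k: sum(r.get(k) is not None for r in rows) / n for k in keys
    }

    # Uniqueness: fraction of distinct records (exact duplicates count once).
    distinct = {tuple(r.get(k) for k in keys) for r in rows}
    uniqueness = len(distinct) / n

    # Target class balance: minority-to-majority count ratio (1.0 = balanced).
    counts = Counter(
        r.get(target_key) for r in rows if r.get(target_key) is not None
    )
    balance = min(counts.values()) / max(counts.values()) if counts else 0.0

    return {
        "completeness": completeness,
        "uniqueness": uniqueness,
        "class_balance": balance,
    }
```

A report like this could be run at the start of a pipeline and compared against project-specific thresholds, which would need to be chosen per application.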
            4.1.4. New knowledge generation and integration
            As models are developed and validated, they produce new   4.2.3. Feature engineering
            insights and interpretations that should be continuously   Feature engineering focuses on extracting meaningful
            integrated into the existing knowledge base. This process   representations from raw data, improving model
            involves capturing lessons learned, model interpretation   performance, interpretability, and efficiency. 43 In industrial
            outputs, model results, and deployment feedback, feeding   applications, well-designed features can greatly improve
            them back into structured documentation and knowledge   model accuracy. Common techniques include domain-
            graphs. This iterative process ensures that the knowledge   specific feature extraction, such as time-domain statistical
            module remains dynamic and evolves alongside       measures, frequency-domain features derived from
            technological progress and deployment experiences.  Fourier transforms, and time-frequency domain features.
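The time-domain and frequency-domain feature extraction mentioned in Section 4.2.3 can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions: the function names and the particular features (RMS, peak-to-peak, crest factor, dominant frequency, spectral centroid) are chosen for the example and are not taken from the article, and the naive discrete Fourier transform is used only to keep the sketch self-contained; a fast Fourier transform from a numerical library would normally be used instead.

```python
import cmath
import math


def time_domain_features(signal):
    """Time-domain statistical measures of a 1-D signal."""
    n = len(signal)
    mean = sum(signal) / n
    rms = math.sqrt(sum(x * x for x in signal) / n)
    peak = max(abs(x) for x in signal)
    return {
        "mean": mean,
        "rms": rms,
        "peak_to_peak": max(signal) - min(signal),
        "crest_factor": peak / rms if rms else 0.0,
    }


def magnitude_spectrum(signal, sample_rate):
    """Naive O(n^2) discrete Fourier transform.

    Returns (frequencies_hz, magnitudes) for bins 0 .. n // 2.
    """
    n = len(signal)
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        freqs.append(k * sample_rate / n)
        mags.append(abs(coeff))
    return freqs, mags


def frequency_domain_features(signal, sample_rate):
    """Frequency-domain features derived from the magnitude spectrum."""
    freqs, mags = magnitude_spectrum(signal, sample_rate)
    # Skip the DC bin (index 0) when picking the dominant frequency.
    k = max(range(1, len(mags)), key=mags.__getitem__)
    total = sum(mags)
    centroid = sum(f * m for f, m in zip(freqs, mags)) / total if total else 0.0
    return {"dominant_freq_hz": freqs[k], "spectral_centroid_hz": centroid}
```

For a vibration channel sampled at a known rate, the two feature dictionaries can be merged into one feature vector per time window before model training.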
                                                               Dimensionality reduction methods, including principal
            4.2. Data module                                   component analysis (PCA), t-distributed stochastic
            The data module focuses on transforming raw industrial   neighbor embedding (t-SNE), and linear discriminant
            data into AI-ready datasets to improve data usability   analysis, help reduce complexity while maintaining
            and reliability, enhanced by the domain knowledge from   important information. Feature importance ranking
            the knowledge module. In processing data, this module   methods are used to identify key variables that most


            Volume 2 Issue 2 (2025)                         60                        doi: 10.36922/IJAMD025080006