
Artificial Intelligence in Health                                    Movement detection with sensors and AI



  Numeric features represent columns with numerical values, encompassing both continuous and discrete data. In this context, the dataset comprises 202 numeric features, with the “Predict” feature serving as a categorical target variable (Table 1). The value “True” for preprocessing indicates that preprocessing steps are applied to the data during the setup process. For the current study, the preprocessing steps “LabelEncoder” and “SimpleImputer” were applied. The label encoder is a preprocessing step that converts categorical target variables (if any) into numerical format. It transforms categorical labels into integer values, making them suitable for training the machine learning model. The simple imputer is a preprocessing step that handles missing values in the dataset. It fills in missing values using simple strategies such as the feature’s mean, median, or most frequent value. For numeric features with missing values, the “mean” imputation method is used, replacing each missing numeric value with the mean of the corresponding feature. Conversely, for categorical features with missing values, the “mode” imputation method is used, replacing each missing categorical value with the mode (most frequent category) of the corresponding feature.

  The “Compare Models” function trains and evaluates the performance of all available estimators using cross-validation, providing a scoring grid with average cross-validated scores. To analyze the performance of a trained model on the test set, the “plot_model” function can be used. It offers different plot types, such as the confusion matrix and AUC, for assessing model performance. In certain cases, re-training the model may be required for plotting specific visualizations. Finally, the model, together with the entire pipeline, is saved to disk for future use, especially for prediction on unseen data.

  Hence, the typical workflow in PyCaret for a classification task involves several steps, beginning with the “Setup.” During “Setup,” the user initializes the training environment by defining the dataset (data) and the variable to be predicted (target). In this case, the target refers to the movements “Roll right,” “Roll left,” “Drop right,” “Drop left,” “Breathing,” and “Seizure,” encoded numerically from 0 to 5, respectively. The following is how PyCaret handles the classification workflow:
•   Session ID: In the setup stage, specifying a session ID as a pseudorandom number (e.g., 123) serves as a seed for all randomness within the pipeline, ensuring that the experiment is reproducible. This implies that the random division of data into folds during cross-validation, or the random selection of data points if any undersampling or oversampling is performed, would yield consistent results each time the code is run with the same session ID.
•   Data format and preparation: The input data are provided as a CSV file, a standard, easy-to-work-with format. The data include both the features (e.g., sensor readings) and the target. The features are the inputs that the model will learn from, while the target is the output category that the model is trained to predict.
•   Target variable: The target variable is categorical, meaning that it has no natural order or numerical value; the assigned numbers are merely labels for the classes. As the target is represented numerically, each number corresponds to a discrete category of patient movement, and the model learns to predict these categories.
•   Workflow steps: The typical workflow in PyCaret for classification consists of five steps: setup, compare models, analyze model, save model, and prediction.
    (i)  Setup: This crucial first step initializes the analysis environment by setting up the data and defining the target. It also performs basic processing such as handling missing values, encoding categorical variables, normalizing the data, and, potentially, feature engineering.
    (ii)  Compare models: This step systematically trains and evaluates different machine learning models on the preprocessed data, subsequently ranking them according to a chosen evaluation metric, usually accuracy for classification tasks.
    (iii) Analyze model: For the chosen model, its performance metrics, decision boundary, feature importance, confusion matrix, and other insights are analyzed to understand how well the model works. This step provides information about the classifier’s behavior under various conditions through ROC curves, precision-recall curves, and classification errors, allowing the user to interrogate specific models in depth and identify areas for improvement.
    (iv)  Save and predict model: With the model saved, predictions can be made on new data that the model has not seen before. This is the ultimate goal of the machine learning workflow: applying the constructed model to make accurate classifications on real-world data.

  The training and test datasets are created during the setup, with PyCaret automatically splitting the input data into these subsets. The typical default split allocates 70% of the data for training and 30% for testing. The session ID ensures consistency in any randomization during this
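  The two preprocessing steps named above, “LabelEncoder” and “SimpleImputer,” are the scikit-learn components that PyCaret applies internally. A minimal sketch of what each one does (the toy arrays are illustrative; note that scikit-learn’s LabelEncoder assigns integers in alphabetical order of the labels, which need not match the 0–5 coding listed above):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

# LabelEncoder: map categorical movement labels to integer codes.
labels = ["Roll right", "Roll left", "Drop right",
          "Drop left", "Breathing", "Seizure"]
le = LabelEncoder()
y = le.fit_transform(labels)  # integers assigned in alphabetical order

# SimpleImputer with the "mean" strategy: fill each missing numeric
# value with the mean of its column (i.e., of that feature).
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
X_filled = SimpleImputer(strategy="mean").fit_transform(X)
# Column means are 2.0 and 6.0, so the two NaNs become 2.0 and 6.0.
```

For categorical columns, the analogous call is `SimpleImputer(strategy="most_frequent")`, which is the “mode” imputation described above.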

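  Conceptually, the “Compare Models” step amounts to cross-validating a pool of estimators and ranking them by their average score. A hand-rolled sketch of that idea with scikit-learn (the estimator pool and the synthetic 202-feature, six-class dataset are illustrative assumptions, not the article’s actual data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data shaped like the study: 202 numeric features, 6 classes.
X, y = make_classification(n_samples=300, n_features=202,
                           n_informative=20, n_classes=6,
                           random_state=123)

models = {
    "lr": LogisticRegression(max_iter=1000),
    "dt": DecisionTreeClassifier(random_state=123),
    "knn": KNeighborsClassifier(),
}

# Average cross-validated accuracy per estimator, ranked best-first,
# mirroring the scoring grid that compare_models produces.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="accuracy").mean()
          for name, m in models.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

In PyCaret itself this collapses to `best = compare_models()` after `setup(...)`, with `plot_model(best)`, `save_model(best, "pipeline")`, and `predict_model(best, data=new_df)` covering the remaining workflow steps.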

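  The 70/30 split and the role of the session ID can be illustrated with a seeded split: reusing the same seed reproduces the split exactly, which is what fixing `session_id` guarantees inside PyCaret (the tiny array below is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # 20 toy samples, 2 features
y = np.arange(20) % 2

# A 70/30 split seeded the way PyCaret seeds with session_id:
# the same seed yields the identical split on every run.
X_tr1, X_te1, _, _ = train_test_split(X, y, test_size=0.3, random_state=123)
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.3, random_state=123)
assert (X_te1 == X_te2).all()  # reproducible test set
```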
            Volume 1 Issue 2 (2024)                        136                               doi: 10.36922/aih.2790