
Tumor Discovery                                               Highly accurate gene panels for cancer screening




            Table 2. Summary of classifier genes per tissue
            Set of genes  LIHC       BRCA       COAD       HNSC       KIRC       KIRP       LUAD       LUSC       PRAD       STAD       THCA       UCEC
            Only-T-above  3/23,986ᵃ  6/15,361ᵃ  2/17,536ᵃ  4/13,293ᵃ  4/22,654ᵃ  3/11,447ᵃ  3/20,274ᵃ  2/19,596ᵃ  8/8,093ᵃ   3/13,773ᵃ  5/5,744ᵃ   1/7,825ᵃ
            Only-N-above  11/40      10/739ᵃ    1/876ᵃ     8/1,903ᵃ   3/780ᵃ     1/1,140ᵃ   5/613ᵃ     3/1,198ᵃ   14/1,415ᵃ  5/1,244ᵃ   11/794ᵃ    1/993ᵃ
            Only-T-below  5/3,812ᵃ   6/6,701ᵃ   1/8,418ᵃ   5/2,093ᵃ   3/9,132ᵃ   1/10,263ᵃ  4/8,285ᵃ   2/9,404ᵃ   15/3,865ᵃ  5/1,499ᵃ   6/5,376ᵃ   1/7,443ᵃ
            Only-N-below  5/1,246ᵃ   12/682     2/297ᵃ     6/1,339    8/191      5/214      8/449      3/985ᵃ     15/915     5/2,536ᵃ   17/92      1/506ᵃ
            Note: Each column identifies a cancer type based on The Cancer Genome Atlas terminology. Each row represents a different set of classifier genes
            (see main text for shorthand notation). Within each cell, we show the minimal number of genes that classify the largest number of samples, together
            with the total number of genes of the same sort. ᵃ marks the minimal gene sets that constitute perfect panels.
            Abbreviations: BRCA: Breast invasive carcinoma; COAD: Colon adenocarcinoma; HNSC: Head and neck squamous cell carcinoma; KIRC: Kidney
            renal clear cell carcinoma; KIRP: Kidney renal papillary cell carcinoma; LIHC: Liver hepatocellular carcinoma; LUAD: Lung adenocarcinoma;
            LUSC: Lung squamous cell carcinoma; N: Normal; PRAD: Prostate adenocarcinoma; STAD: Stomach adenocarcinoma; T: Tumor; THCA: Thyroid
            carcinoma; UCEC: Uterine corpus endometrial carcinoma.
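
            To make the "minimal number / total number" entries of Table 2 concrete, the following Python sketch shows one way such a cell could be computed for the only-T-above class. It rests on assumptions that are not taken from the paper: a gene is taken to flag a tumor sample when its expression exceeds the highest value seen in any normal sample (our reading of the T-exclusive interval described in the main text), and the panel is assembled with a greedy set-cover heuristic, which is a stand-in and not necessarily the authors' procedure. The gene names G1 to G3 and all expression values are hypothetical toy data.

```python
from typing import Dict, List, Set


def only_t_above_hits(normal: List[float], tumor: List[float]) -> Set[int]:
    """Indices of tumor samples whose expression exceeds every normal sample.

    Assumption: the T-exclusive interval starts at the maximum normal value.
    """
    ceiling = max(normal)
    return {i for i, value in enumerate(tumor) if value > ceiling}


def greedy_panel(hits: Dict[str, Set[int]], n_tumor: int) -> List[str]:
    """Greedy set cover: repeatedly add the gene flagging most uncovered tumors.

    This heuristic gives a small panel, not a guaranteed minimal one.
    """
    uncovered = set(range(n_tumor))
    panel: List[str] = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        gene = max(sorted(hits), key=lambda g: len(hits[g] & uncovered))
        gained = hits[gene] & uncovered
        if not gained:  # no remaining gene flags any uncovered tumor
            break
        panel.append(gene)
        uncovered -= gained
    return panel


# Hypothetical toy data: 3 normal and 4 tumor samples per gene.
normal = {"G1": [1.0, 1.2, 0.9], "G2": [2.0, 2.1, 1.8], "G3": [0.5, 0.6, 0.4]}
tumor = {
    "G1": [3.0, 1.1, 2.5, 1.0],
    "G2": [1.9, 4.0, 2.0, 5.0],
    "G3": [0.9, 0.5, 0.3, 0.7],
}

hits = {g: only_t_above_hits(normal[g], tumor[g]) for g in normal}
panel = greedy_panel(hits, n_tumor=4)
covered = set().union(*(hits[g] for g in panel))
is_perfect = covered == set(range(4))  # every tumor flagged, no normal flagged

# Report in the style of a Table 2 cell: panel size / genes of this sort.
n_class = sum(1 for h in hits.values() if h)
print(f"{len(panel)}/{n_class}", "perfect" if is_perfect else "partial", panel)
```

            In this toy case, all three genes belong to the class, yet two of them already flag every tumor sample, which corresponds to a cell of the form 2/3 carrying the perfect-panel mark. Because no normal sample can exceed its own maximum, specificity is maximal by construction under the stated assumption.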

              In cancer research, differential expression is often deemed significant only when the deviation from
            normal expression is substantial, consistent (i.e., always upregulated or downregulated), and present
            across most tumors.⁵⁵ A common practice is to define the lower and upper bounds of normal gene
            expression as 0.5× and 2× a reference level, respectively.⁵⁶ Therefore, a gene can be differentially
            expressed only if most tumors express it at more than 2× or less than 0.5× the reference value. In
            turn, any such gene is considered differentially expressed when its expression level crosses the
            specific threshold above or below which most tumors lie.

              From the outset, we contend that gene expression dysregulation comprises broader patterns than
            conventional differential expression. Certain dysregulation forms do not conform to either our
            definition of differential expression or the conventional one used in the field. As a result, these
            patterns are often overlooked in the analysis of gene expression data.

              For example, consider a gene with a bimodal expression distribution under normal conditions, such as
            those governed by circadian oscillations. If these oscillations are lost in tumor tissue, the gene may
            fall into the only-T-inside category. While such genes were identified through our data mining, they
            are not reported in this paper. Other underreported categories, like the only-T-outside genes, were
            also encountered.

              Conversely, what we term non-differential dysregulation, corresponding to N-genes, is typically
            overlooked. In our study, we focused on the only-N-above and only-N-below classes, although the
            only-N-outside and only-N-inside groups may likewise be present in specific tissues.

              It is worth emphasizing that in single-cell RNA-seq expression analyses,³ gene markers are routinely
            identified for individual cell types under normal conditions. However, to the best of our knowledge,
            this is the first time that markers are introduced specifically for whole normal tissue samples.

            4.2. Panel validation
            We provide two examples of panel validation using other datasets. The first involves the SCARA5 gene
            in COAD. Microarray readings from Khamas et al.⁵⁷ demonstrate the perfect classification capability of
            SCARA5 (data available at NCBI GEO,⁵⁸ accession GDS4382). Notably, this gene has also been
            independently identified as a biomarker for colorectal cancer.⁵⁹

              The second case concerns the LUAD dataset from a comprehensive study of a Chinese cohort,⁶⁰ which
            includes RNA-seq profiles from 51 tumor and 49 control samples. We evaluated the performance of our
            perfect only-T-above panel on this dataset. As shown in Figure S1, the genes TRIM27, PYCR1, and
            ALDH18A1 fall within the only-T-above class, as they exhibit significantly populated T-exclusive
            intervals above the shared N–T expression range. The histogram in Figure S1 confirms that the panel
            remains perfect, achieving both maximal sensitivity and specificity in classification. However, within
            this particular cohort, the TRIM27 gene proves redundant and can be removed without any loss in
            classification accuracy. This finding raises an important question regarding the minimal number of
            genes required to assemble a perfect panel, and the extent to which that number remains robust to
            variations in cohort size.

            4.3. The minimal number of genes needed to identify a tumor
            The LUAD dataset⁶⁰ is particularly noteworthy, not only because its cohort differs markedly from that
            of TCGA but also due to its substantially smaller size, approximately an order of magnitude fewer
            samples. Specifically, the TCGA LUAD dataset comprises 59 normal and 535 tumor samples. This prompts
            the question: how does the number

            Volume 4 Issue 3 (2025)                         63                           doi: 10.36922/TD025190035