
Tumor Discovery                                               Highly accurate gene panels for cancer screening



The remainder of the article is structured as follows. Section 2 (Materials and Methods) provides a detailed and thorough account of our methodology. In Section 3 (Results), we illustrate our workflow for a case in point (namely, LUAD) and summarize our findings for the other selected cancer types. In addition, we provide a validation analysis of our gene panels in different datasets. Section 4 (Discussion) addresses gene dysregulation as conceptualized in this study, highlighting how it enables us to better understand homeostasis and cancer. We further examine potential applications of the proposed gene panels and their role in tumorigenesis. Section 5 (Conclusion) summarizes the key findings and offers an outlook for future translational research based on this framework.

2. Materials and methods

2.1. Data

TCGA is a publicly accessible database of gene expression profiles drawn from cohort studies involving hundreds of normal tissue and solid tumor biopsy samples, classified by histopathological techniques.⁶ Expression profiles were obtained through RNA-seq, capturing 60,483 genes per sample. TCGA reports gene expression values using the standard units of fragments per kilobase of transcript per million mapped reads. The size of the dataset varies with the cancer type and is consistently skewed toward tumor samples.

We selected 12 cancer types from TCGA for a systematic analysis (Table 1). These cancers manifest as solid tumors, particularly affecting the liver, breast, colon, head and neck, kidneys, lungs, prostate, stomach, thyroid, and uterus. We included five (out of six) of the most common cancer types (breast, lung, colon, prostate, and stomach), each with an incidence of over a million cases in 2020. Among the selected cancer types, there were also the most common causes of cancer death (lung, colon, liver, stomach, and breast), each accounting for over half a million deaths in 2020 (worldwide statistics reported by the World Health Organization⁴⁶).

The selection of cancer types for our systematic study was motivated by the number of normal samples available in the data. For the cases under study, TCGA reports more than 20 normal samples per cancer type. Notably, achieving a reliable discrimination between normal and tumor tissues based on gene expression profiles required both normal and tumor samples to be adequately represented in the datasets.

2.2. Pre-processing of data

Gene expression distributions tend to be heavy-tailed, with many low-frequency outliers.⁴⁷ RNA-seq is known to be inaccurate at detecting low expression levels and may produce spurious null readings for genes that are nearly silenced.⁴⁸ To avoid artifacts associated with the low-expression region, we set all values below 0.1 fragments per kilobase of transcript per million mapped reads to zero. Moreover, we excluded all genes with non-zero expression in fewer than 5% of normal samples and fewer than 10% of tumor samples from the analysis.
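The two pre-processing steps of Section 2.2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function name `preprocess_fpkm`, its parameter names, the dictionary layout (gene identifier mapped to a list of per-sample values), and the toy gene labels are all hypothetical. The exclusion rule is read here as a conjunction, i.e., a gene is dropped only when it is non-zero in fewer than 5% of normal samples and also in fewer than 10% of tumor samples.

```python
def preprocess_fpkm(normal, tumor, floor=0.1,
                    min_frac_normal=0.05, min_frac_tumor=0.10):
    """Apply the low-expression floor, then filter rarely expressed genes.

    normal, tumor: dicts mapping a gene ID to a list of expression values
    (one value per sample). This layout is an assumption of the sketch.
    Returns two dicts with the same layout, restricted to retained genes.
    """
    def apply_floor(values):
        # Step 1: values below the floor are treated as zero.
        return [0.0 if v < floor else v for v in values]

    def nonzero_fraction(values):
        # Fraction of samples in which the gene has non-zero expression.
        return sum(1 for v in values if v > 0) / len(values)

    normal_out, tumor_out = {}, {}
    for gene in normal:
        n_vals = apply_floor(normal[gene])
        t_vals = apply_floor(tumor[gene])
        # Step 2: drop the gene only if it is rarely expressed in BOTH classes
        # (assumed reading of the exclusion rule).
        rarely_in_normal = nonzero_fraction(n_vals) < min_frac_normal
        rarely_in_tumor = nonzero_fraction(t_vals) < min_frac_tumor
        if rarely_in_normal and rarely_in_tumor:
            continue
        normal_out[gene] = n_vals
        tumor_out[gene] = t_vals
    return normal_out, tumor_out


# Toy data with hypothetical gene labels (not TCGA values).
normal = {
    "gene_a": [0.05, 0.02, 0.0],   # below the floor in every normal sample
    "gene_b": [5.0, 0.09, 7.5],    # one reading falls below the floor
    "gene_c": [0.0, 0.0, 0.0],     # silent in normal tissue
}
tumor = {
    "gene_a": [0.0, 0.08, 0.01, 0.0],
    "gene_b": [3.2, 4.1, 0.0, 9.9],
    "gene_c": [0.0, 12.0, 0.3, 0.0],  # expressed in half of the tumors
}

normal_f, tumor_f = preprocess_fpkm(normal, tumor)
print(sorted(normal_f))    # ['gene_b', 'gene_c']
print(normal_f["gene_b"])  # [5.0, 0.0, 7.5]
```

In this toy input, `gene_a` is removed because it is effectively silent in both classes, whereas `gene_c` is retained: a gene that is silent in normal tissue but expressed in tumors is exactly the kind of candidate the filter should not discard.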
Table 1. Cancer types, The Cancer Genome Atlas abbreviations, and the number of samples

Cancer type                              Abbreviation   Normal samples   Tumor samples
Breast invasive carcinoma                BRCA           112              1,096
Colon adenocarcinoma                     COAD           41               473
Head and neck squamous cell carcinoma    HNSC           44               502
Kidney renal clear cell carcinoma        KIRC           74               539
Kidney renal papillary cell carcinoma    KIRP           32               289
Liver hepatocellular carcinoma           LIHC           50               374
Lung adenocarcinoma                      LUAD           59               535
Lung squamous cell carcinoma             LUSC           49               502
Prostate adenocarcinoma                  PRAD           52               499
Stomach adenocarcinoma                   STAD           32               375
Thyroid carcinoma                        THCA           58               510
Uterine corpus endometrial carcinoma     UCEC           23               552

2.3. Expression dysregulation patterns

We searched for genes that exhibit specific dysregulation patterns. In our framework, a gene conforms to a “differential expression” pattern if all normal samples express it in a certain manner (specified below), while a significant number of tumor samples exhibit a distinctly different expression. Conversely, a gene conforms to a “non-differential dysregulation” pattern if all tumor samples express it in a certain way, while a substantial number of normal samples express it differently. Non-differential dysregulation can be interpreted as the dual category of differential expression, achieved by swapping the roles of normal and tumor samples. By monitoring the expression values of a differentially expressed or non-differentially dysregulated gene, we can classify samples with no type I errors – i.e., no false positives for tumors in the case of differential expression and no false positives for normal samples in the case of non-differential dysregulation.

For simplicity, this study focuses on four types of gene sets, each named to reflect the classificatory potential of its individual gene members. Let x represent a class of samples,


            Volume 4 Issue 3 (2025)                         60                           doi: 10.36922/TD025190035