Tumor Discovery Highly accurate gene panels for cancer screening
The remainder of the article is structured as follows. Section 2 (Materials and Methods) provides a detailed and thorough account of our methodology. In Section 3 (Results), we illustrate our workflow for a case in point (namely, LUAD) and summarize our findings for the other selected cancer types. In addition, we provide a validation analysis of our gene panels in different datasets. Section 4 (Discussion) addresses gene dysregulation as conceptualized in this study, highlighting how it enables us to better understand homeostasis and cancer. We further examine potential applications of the proposed gene panels and their role in tumorigenesis. Section 5 (Conclusion) summarizes the key findings and offers an outlook for future translational research based on this framework.

2. Materials and methods

2.1. Data

TCGA is a publicly accessible database of gene expression profiles drawn from cohort studies involving hundreds of normal tissue and solid tumor biopsy samples, classified by histopathological techniques.[6] Expression profiles were obtained through RNA-seq, capturing 60,483 genes per sample. TCGA reports gene expression values using the standard units of fragments per kilobase of transcript per million mapped reads. The size of the dataset varies with the cancer type and is consistently skewed toward tumor samples.

We selected 12 cancer types from TCGA for a systematic analysis (Table 1). These cancers manifest as solid tumors, particularly affecting the liver, breast, colon, head and neck, kidneys, lungs, prostate, stomach, thyroid, and uterus. We included five (out of six) of the most common cancer types (breast, lung, colon, prostate, and stomach), each with an incidence of over a million cases in 2020. Among the selected cancer types, there were also the most common causes of cancer death (lung, colon, liver, stomach, and breast), each accounting for over half a million deaths in 2020 (worldwide statistics reported by the World Health Organization[46]).

The selection of cancer types for our systematic study was motivated by the number of normal samples available in the data. For the cases under study, TCGA reports more than 20 normal samples per cancer type. Notably, achieving a reliable discrimination between normal and tumor tissues based on gene expression profiles required both normal and tumor samples to be adequately represented in the datasets.

2.2. Pre-processing of data

Gene expression distributions tend to be heavy-tailed, with many low-frequency outliers.[47] RNA-seq is known to be inaccurate at detecting low expression levels and may produce spurious null readings for genes that are nearly silenced.[48] To avoid artifacts associated with the low-expression region, we set all values below 0.1 fragments per kilobase of transcript per million mapped reads to zero. Moreover, we excluded all genes with non-zero expression in fewer than 5% of normal samples and fewer than 10% of tumor samples from the analysis.
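
The pre-processing step of Section 2.2 can be illustrated with a minimal sketch. It is not the authors' code: the dictionary-based data layout, the function names, and the constant names are assumptions made for illustration. In particular, the sketch reads the exclusion rule as a logical "and", so that a gene is dropped only when it is rarely expressed in both normal and tumor samples.

```python
from typing import Dict, List, Tuple

# Thresholds taken from the text; the names are illustrative assumptions.
FPKM_FLOOR = 0.1        # values below this are set to zero
MIN_FRAC_NORMAL = 0.05  # minimum fraction of normal samples with non-zero expression
MIN_FRAC_TUMOR = 0.10   # minimum fraction of tumor samples with non-zero expression

# Assumed layout: gene name -> expression values (FPKM), one per sample.
Profiles = Dict[str, List[float]]


def floor_low_expression(values: List[float], floor: float = FPKM_FLOOR) -> List[float]:
    """Set every value below the floor to zero."""
    return [0.0 if v < floor else v for v in values]


def nonzero_fraction(values: List[float]) -> float:
    """Fraction of samples in which the gene has non-zero expression."""
    if not values:
        return 0.0
    return sum(1 for v in values if v > 0.0) / len(values)


def preprocess(normal: Profiles, tumor: Profiles) -> Tuple[Profiles, Profiles]:
    """Apply the low-expression floor, then drop rarely expressed genes."""
    kept_normal: Profiles = {}
    kept_tumor: Profiles = {}
    for gene, normal_values in normal.items():
        if gene not in tumor:
            continue  # only genes measured in both classes are considered
        n = floor_low_expression(normal_values)
        t = floor_low_expression(tumor[gene])
        rare_in_normal = nonzero_fraction(n) < MIN_FRAC_NORMAL
        rare_in_tumor = nonzero_fraction(t) < MIN_FRAC_TUMOR
        # Assumed reading of the rule: exclude only if both conditions hold.
        if rare_in_normal and rare_in_tumor:
            continue
        kept_normal[gene] = n
        kept_tumor[gene] = t
    return kept_normal, kept_tumor
```

Under this reading, a gene that is silent in every normal sample but expressed in at least 10% of tumor samples is retained, so the filter does not discard genes that are switched on only in tumors.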

Table 1. Cancer types, The Cancer Genome Atlas abbreviations, and the number of samples

Cancer type                             Abbreviation   Normal samples   Tumor samples
Breast invasive carcinoma               BRCA           112              1,096
Colon adenocarcinoma                    COAD           41               473
Head and neck squamous cell carcinoma   HNSC           44               502
Kidney renal clear cell carcinoma       KIRC           74               539
Kidney renal papillary cell carcinoma   KIRP           32               289
Liver hepatocellular carcinoma          LIHC           50               374
Lung adenocarcinoma                     LUAD           59               535
Lung squamous cell carcinoma            LUSC           49               502
Prostate adenocarcinoma                 PRAD           52               499
Stomach adenocarcinoma                  STAD           32               375
Thyroid carcinoma                       THCA           58               510
Uterine corpus endometrial carcinoma    UCEC           23               552

2.3. Expression dysregulation patterns

We searched for genes that exhibit specific dysregulation patterns. In our framework, a gene conforms to a “differential expression” pattern if all normal samples express it in a certain manner (specified below), while a significant number of tumor samples exhibit a distinctly different expression. Conversely, a gene conforms to a “non-differential dysregulation” pattern if all tumor samples express it in a certain way, while a substantial number of normal samples express it differently. Non-differential dysregulation can be interpreted as the dual category of differential expression, achieved by swapping the roles of normal and tumor samples. By monitoring the expression values of a differentially expressed or non-differentially dysregulated gene, we can classify samples with no type I errors – i.e., no false positives for tumors in the case of differential expression and no false positives for normal samples in the case of non-differential dysregulation.

For simplicity, this study focuses on four types of gene sets, each named to reflect the classificatory potential of its individual gene members. Let x represent a class of samples,
Volume 4 Issue 3 (2025) 60 doi: 10.36922/TD025190035

