
Global Translational Medicine                                     TEs link to Parkinson’s risk and progression



We used two different conservation scores, PhastCons [46] and PhyloP [47], to assess the conservation of TE insertion regions. The UCSC Genome Browser (https://genome.ucsc.edu/) provides genome conservation score annotations, namely, PhastCons100way and PhyloP100way. These scores are derived from multiple sequence alignments of the human genome with 99 different species using PhastCons and PhyloP, respectively. The conservation score of a specific region in the human genome is calculated based on PhastCons100way and PhyloP100way [48]. In this study, regions with PhyloP scores >0.76 or PhastCons scores >0.2 were considered highly conserved, while the remaining regions were categorized as non-conserved regions, as described in Qiao et al. [49].

2.3. LD between TEs and SNPs

The AMP-PD database employed the GATK best practices workflow and the GATK joint genotyping model [38] to generate SNP genotype data in the transformed PLINK binary format. Our original TE VCF file was converted into the PLINK binary format, and the SNP genotype data and TE data from corresponding subjects were then integrated. The LD between each TE and its associated SNPs within a 1 Mb window was calculated using PLINK. The 90 independent PD risk variants were obtained from the GWAS conducted by Nalls et al. [15].

2.4. Reproducibility of TEs in the 1KGP and gnomAD databases

The reproducibility of this study was evaluated by examining whether the 500 bp window upstream and downstream of the identified TE regions contained the same type of TE as annotated in the 1KGP [41] and Genome Aggregation Database (gnomAD) databases [50]. The public TE annotation data from the 1KGP (dbVar: nstd144) and the structural variation annotation data from gnomAD (dbVar: nstd166) were downloaded from the dbVar database (https://www.ncbi.nlm.nih.gov/dbvar/). The Bedtools (version 2.26.0) [51] intersect function was used with a window parameter set to 500 bp to determine whether the TE insertions belong to the same TE insertion category.

2.5. TE Genome-wide Association Study

TE Genome-Wide Association Study (TE-GWAS) was performed using the logistic regression function in R software to investigate the genetic association between TE insertion and the risk of PD. In consideration of the uniqueness of TE insertions, TE genotypes in this study were classified as having a TE insertion event (coded as “1”) or no TE insertion event (coded as “0”). This classification included both heterozygous (0/1) and homozygous (1/1) TE insertions at specific genomic regions as TE insertion events. Principal component analysis (PCA) of TEs was performed using PLINK, with the number of principal components calculated using the EIGENSOFT package (version 7.2.1) [52]. The significant PC1 was used as a covariate to correct for population structure. Finally, PC1, sex, age, and cohort information were included as covariates in the logistic regression model for TE-GWAS analysis. The original P-values were corrected using the false discovery rate (FDR), with an FDR <0.05 set as the significance threshold. A detailed description of the QC process before TE-GWAS analysis is shown in Section S4.

2.6. TE-linear mixed model

We used a linear mixed model (LMM) to investigate the correlation between TE polymorphisms and the progression of PD. Nine years of clinical follow-up data from the PPMI cohort and 5 years of clinical follow-up data from the PDBP cohort were combined. The analysis included six different clinical scales: MoCA score, MDS-UPDRS score (Part I to Part IV), and Hoehn-Yahr staging scale. Details of the QC of the clinical scales are presented in Section S5.

To construct the TE-LMM model, we used the lmer function from the lme4 package (version 1.1-34) [53]. Each model included fixed effects such as PC1, sex, age, study name, and the interaction between TE and years in the study. In addition, education level was incorporated as a fixed effect in the MoCA score model and the MDS-UPDRS Part I score model. Individual subjects were included as random terms in each of the six LMMs. The original P-values of the TE-LMM model were calculated using the lmerTest package (version 3.1-3) [54] based on the Satterthwaite algorithm and subsequently corrected using the FDR approach.

2.7. Transcriptome data processing and TE-eQTL mapping

The peripheral blood transcriptome (total RNA-seq) dataset was available for the subjects in the PPMI, PDBP, and BioFIND cohorts. Subjects with both total RNA-seq and WGS were retained for analysis. We used the quality-controlled and normalized gene expression transcripts per million (TPM) data for subsequent TE-eQTL analysis. The TPM data were generated from the transcriptome’s raw FASTQ files using the Salmon workflow, which is based on the genome annotation file Gencode.v29 (https://www.gencodegenes.org/). We evaluated the association between TE polymorphism and gene expression following the TE expression quantitative trait locus (TE-eQTL) analysis workflow proposed by Wang et al. [55]. TE polymorphisms were classified as no TE insertion (coded as “0”) or TE insertion (coded as “1”), similar to the TE-GWAS study. Due to the significantly longer length of TE sequences compared


            Volume 2 Issue 3 (2023)                         4                        https://doi.org/10.36922/gtm.1583