Global Translational Medicine TEs link to Parkinson’s risk and progression
We used two different conservation scores, PhastCons [46] and PhyloP [47], to assess the conservation of TE insertion regions. The UCSC Genome Browser (https://genome.ucsc.edu/) provides genome conservation score annotations, namely, PhastCons100way and PhyloP100way. These scores are derived from multiple sequence alignments of the human genome with 99 different species using PhastCons and PhyloP, respectively. The conservation score of a specific region in the human genome is calculated based on PhastCons100way and PhyloP100way [48]. In this study, regions with PhyloP scores >0.76 or PhastCons scores >0.2 were considered highly conserved, while the remaining regions were categorized as non-conserved regions, as described in Qiao et al. [49].

2.3. LD between TEs and SNPs

The AMP-PD database employed the GATK best practices workflow and the GATK joint genotyping model to generate SNP genotype data in the transformed PLINK binary format [38]. Our original TE VCF file was converted into the PLINK binary format, followed by integrating SNP genotype data and TE data from corresponding subjects. The LD between each TE and its associated SNPs within a 1 Mb window was calculated using PLINK. The 90 independent PD risk variants were obtained from the GWAS conducted by Nalls et al. [15].

2.4. Reproducibility of TEs in the 1KGP and gnomAD database

The reproducibility of this study was evaluated by examining whether the 500 bp window upstream and downstream of the identified TE regions contained the same type of TE as annotated in the 1KGP [41] and Genome Aggregation Database (gnomAD) [50] databases. The public TE annotation data from the 1KGP (dbVar: nstd144) and the structural variation annotation data from gnomAD (dbVar: nstd166) were downloaded from the dbVar database (https://www.ncbi.nlm.nih.gov/dbvar/). The Bedtools (version 2.26.0) [51] intersect function was used with a window parameter set to 500 bp to determine whether the TE insertions belong to the same TE insertion category.

2.5. TE Genome-wide Association Study

TE Genome-Wide Association Study (TE-GWAS) was performed using the logistic regression function in R software to investigate the genetic association between TE insertion and the risk of PD. In consideration of the uniqueness of TE insertions, TE genotypes in this study were classified as having a TE insertion event (coded as “1”) or no TE insertion event (coded as “0”). This classification included both heterozygous (0/1) and homozygous (1/1) TE insertions at specific genomic regions as TE insertion events. Principal component analysis (PCA) of TEs was performed using PLINK, with principal component numbers calculated using the EIGENSOFT package (version 7.2.1) [52]. The significant PC1 was used as a covariate to correct for population structure. Finally, PC1, sex, age, and cohort information were included as covariates in the logistic regression model for TE-GWAS analysis. The original P-values were corrected using the false discovery rate (FDR), with an FDR <0.05 set as the significance threshold. A detailed description of the QC process before TE-GWAS analysis is shown in Section S4.

2.6. TE-linear mixed model

We used a linear mixed model (LMM) to investigate the correlation between TE polymorphisms and the progression of PD. Nine years of clinical follow-up data from the PPMI cohort and 5 years of clinical follow-up data from the PDBP cohort were combined. The analysis included six different clinical scales: MoCA score, MDS-UPDRS score (Part I to Part IV), and Hoehn-Yahr staging scale. Details of the QC of the clinical scale are presented in Section S5.

To construct the TE-LMM model, we used the lmer function from the lme4 package (version 1.1-34) [53]. The model included fixed effects such as PC1, sex, age, study name, and interaction between TE and years in the study in each model. In addition, education level was incorporated as a fixed effect in the MoCA score model and the MDS-UPDRS Part I score model. Individual subjects were included as random terms in each of the six LMM models. The original P-values of the TE-LMM model were calculated using the lmerTest package (version 3.1-3) [54] based on the Satterthwaite algorithm and subsequently corrected using the FDR approach.

2.7. Transcriptome data processing and TE-eQTL mapping

The peripheral blood transcriptome (total RNA-seq) dataset was available for the subjects in the PPMI, PDBP, and BioFIND cohorts. Subjects with both total RNA-seq and WGS were retained for analysis. We used the quality-controlled and normalized gene expression transcripts per million (TPM) data for subsequent TE-eQTL analysis. The TPM data were generated from the transcriptome’s raw FASTQ files using the Salmon workflow, which is based on the genome annotation file Gencode.v29 (https://www.gencodegenes.org/). We evaluated the association between TE polymorphism and gene expression following the TE expression quantitative trait locus (TE-eQTL) analysis workflow proposed by Wang et al. [55]. TE polymorphisms were classified as no TE insertion (coded as “0”) or TE insertion (coded as “1”), similar to the TE-GWAS study. Due to the significantly longer length of TE sequences compared
Volume 2 Issue 3 (2023) 4 https://doi.org/10.36922/gtm.1583
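The presence/absence coding of TE genotypes used in Sections 2.5 and 2.7 can be illustrated with a minimal Python sketch. The function name `code_te_genotype`, the acceptance of phased separators (`|`), and the choice to return `None` for missing calls are assumptions made for illustration; the text only specifies that 0/1 and 1/1 are coded as "1" and no insertion as "0".

```python
from typing import Optional


def code_te_genotype(gt: str) -> Optional[int]:
    """Map a VCF-style genotype string to a binary TE insertion code.

    Returns 1 if at least one allele carries the insertion (0/1 or 1/1),
    0 if no allele carries it (0/0), and None if any allele is missing.
    Treating partially missing calls as missing is an assumption.
    """
    # Normalize phased ("|") and unphased ("/") separators.
    alleles = gt.replace("|", "/").split("/")
    if any(a in (".", "") for a in alleles):
        return None
    return 1 if any(a != "0" for a in alleles) else 0


genotypes = ["0/0", "0/1", "1/1", "1|0", "./."]
codes = [code_te_genotype(g) for g in genotypes]
print(codes)  # [0, 1, 1, 1, None]
```

Collapsing heterozygous and homozygous insertions into one category makes the model test insertion presence versus absence rather than allele dosage.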

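The FDR correction applied to the TE-GWAS and TE-LMM P-values (Sections 2.5 and 2.6) can be sketched as follows. The text does not name the FDR procedure, so the Benjamini-Hochberg step-up method is assumed here; the function name `benjamini_hochberg` and the example P-values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P-values in the input order.

    Assumption: the study's "FDR" refers to the BH procedure.
    """
    m = len(pvalues)
    # Indices of P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw P-values
adj = benjamini_hochberg(raw)
# Apply the FDR < 0.05 significance threshold stated in Section 2.5.
significant = [q < 0.05 for q in adj]
print(adj, significant)
```

With the FDR <0.05 threshold from Section 2.5, a TE would be reported as significant when its adjusted value falls below 0.05.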
