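In my Jianshu article "[r<-package|dataset|public database] Introduction to the UCSCXenaTools package: searching and downloading datasets from public databases such as TCGA, GDC, and ICGC", I explained the general approach to downloading UCSC Xena data with the UCSCXenaTools package. In practice it still takes some R programming background to use: selecting datasets flexibly requires some familiarity with regular expressions, and you also need to understand how the datasets you want are organized on the Xena servers.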

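I wrote this package because I did not want to keep clicking through the browser to download files, and because I wanted to download and save data in batches. Having written it, I wanted to make it simpler to use, so that anyone with R and the package installed could get going. Could it be simpler still? I thought about it for a long time, and so the person who wanted to be lazy ended up diligently writing code.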

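I use TCGA data often, as most people do, so that is the problem I tackled first. Files on the Xena servers follow some patterns, but their naming is very messy. I had to work out the patterns, match and parse the names with regular expressions, and debug until the inputs and outputs were correct, and then I tucked the result away. If you want to take a look, click through and scroll to the bottom of the page.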

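Based on the file names Xena provides, I parse each dataset into a combination of two main attributes: data type and file type. The data type indicates whether a file holds copy number data, mutation data, gene expression data, and so on. The file type is more specific: for example, for high-throughput sequencing gene expression data, Xena provides a within-dataset normalized version, a Pancan normalized version, and a percentile-rank version (for comparison with non-TCGA data).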
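
As a rough illustration of the idea, the sketch below maps a dataset name to a (data type, file type) pair with regular expressions. The `rules` table, the patterns, and the `classify_dataset` helper are hypothetical and written only for this example; they are not the package's actual rules, and the `HiSeqV2` dataset names are assumed for illustration. Only the clinical dataset name and the type labels are taken from the output shown in this post.

```r
# Hypothetical lookup table: regular expression -> (DataType, FileType).
# The patterns are illustrative guesses, not the rules used by UCSCXenaTools.
rules <- data.frame(
  pattern  = c("clinicalMatrix$", "HiSeqV2_PANCAN$",
               "HiSeqV2_percentile$", "HiSeqV2$"),
  DataType = c("Phenotype", rep("Gene Expression RNASeq", 3)),
  FileType = c("Clinical Information",
               "IlluminaHiSeq RNASeqV2 pancan normalized",
               "IlluminaHiSeq RNASeqV2 in percentile rank",
               "IlluminaHiSeq RNASeqV2"),
  stringsAsFactors = FALSE
)

# Return the first matching (DataType, FileType) pair, or NA if none matches.
classify_dataset <- function(dataset) {
  for (i in seq_len(nrow(rules))) {
    if (grepl(rules$pattern[i], dataset)) {
      return(c(DataType = rules$DataType[i], FileType = rules$FileType[i]))
    }
  }
  c(DataType = NA_character_, FileType = NA_character_)
}

clin <- classify_dataset("TCGA.OV.sampleMap/OV_clinicalMatrix")
expr <- classify_dataset("TCGA.OV.sampleMap/HiSeqV2_PANCAN")  # assumed name
none <- classify_dataset("something/unrecognised")
```

A table of patterns keeps the messy naming rules in one place, so adding a new file type means adding a row rather than another branch of code.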

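On top of this I wrote the downloadTCGA function. As long as you know the data type and file type of the data you want, you can download it easily (though this is still a bit hard for beginners).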

Let's take a quick look. First install and load the package:

install.packages("UCSCXenaTools", dependencies = TRUE)

library(UCSCXenaTools)

Check the arguments of downloadTCGA:

args(downloadTCGA)
#> function (project = NULL, data_type = NULL, file_type = NULL, 
#>     destdir = tempdir(), force = FALSE, ...) 
#> NULL

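There are four main arguments: the cancer type (project), the data type (data_type), the file type (file_type), and the local download directory (destdir). If an argument is unclear, use the help function or type ??downloadTCGA; I hope the English I wrote there is understandable.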

On first use it may not be clear what values these arguments accept. availTCGA can help:

availTCGA()
#> Note not all projects have listed data types and file types, you can use showTCGA function to check if exist
#> $ProjectID
#>  [1] "LAML"     "ACC"      "CHOL"     "BLCA"     "BRCA"     "CESC"    
#>  [7] "COADREAD" "COAD"     "UCEC"     "ESCA"     "FPPP"     "GBM"     
#> [13] "HNSC"     "KICH"     "KIRC"     "KIRP"     "DLBC"     "LIHC"    
#> [19] "LGG"      "GBMLGG"   "LUAD"     "LUNG"     "LUSC"     "SKCM"    
#> [25] "MESO"     "UVM"      "OV"       "PANCAN"   "PAAD"     "PCPG"    
#> [31] "PRAD"     "READ"     "SARC"     "STAD"     "TGCT"     "THYM"    
#> [37] "THCA"     "UCS"     
#> 
#> $DataType
#>  [1] "DNA Methylation"                       
#>  [2] "Gene Level Copy Number"                
#>  [3] "Somatic Mutation"                      
#>  [4] "Gene Expression RNASeq"                
#>  [5] "miRNA Mature Strand Expression RNASeq" 
#>  [6] "Gene Somatic Non-silent Mutation"      
#>  [7] "Copy Number Segments"                  
#>  [8] "Exon Expression RNASeq"                
#>  [9] "Phenotype"                             
#> [10] "PARADIGM Pathway Activity"             
#> [11] "Protein Expression RPPA"               
#> [12] "Transcription Factor Regulatory Impact"
#> [13] "Gene Expression Array"                 
#> [14] "Signatures"                            
#> [15] "iCluster"                              
#> 
#> $FileType
#>  [1] "Methylation27K"                            
#>  [2] "Methylation450K"                           
#>  [3] "Gistic2"                                   
#>  [4] "wustl hiseq automated"                     
#>  [5] "IlluminaGA RNASeq"                         
#>  [6] "IlluminaHiSeq RNASeqV2 in percentile rank" 
#>  [7] "IlluminaHiSeq RNASeqV2 pancan normalized"  
#>  [8] "IlluminaHiSeq RNASeqV2"                    
#>  [9] "After remove germline cnv"                 
#> [10] "PANCAN AWG analyzed"                       
#> [11] "Clinical Information"                      
#> [12] "wustl automated"                           
#> [13] "Gistic2 thresholded"                       
#> [14] "Before remove germline cnv"                
#> [15] "Use only RNASeq"                           
#> [16] "Use RNASeq plus Copy Number"               
#> [17] "bcm automated"                             
#> [18] "IlluminaHiSeq RNASeq"                      
#> [19] "bcm curated"                               
#> [20] "broad curated"                             
#> [21] "RPPA"                                      
#> [22] "bsgsc automated"                           
#> [23] "broad automated"                           
#> [24] "bcgsc automated"                           
#> [25] "ucsc automated"                            
#> [26] "RABIT Use IlluminaHiSeq RNASeqV2"          
#> [27] "RABIT Use IlluminaHiSeq RNASeq"            
#> [28] "RPPA normalized by RBN"                    
#> [29] "RABIT Use Agilent 244K Microarray"         
#> [30] "wustl curated"                             
#> [31] "Use Microarray plus Copy Number"           
#> [32] "Use only Microarray"                       
#> [33] "Agilent 244K Microarray"                   
#> [34] "IlluminaGA RNASeqV2"                       
#> [35] "bcm SOLiD"                                 
#> [36] "RABIT Use IlluminaGA RNASeqV2"             
#> [37] "RABIT Use IlluminaGA RNASeq"               
#> [38] "RABIT Use Affymetrix U133A Microarray"     
#> [39] "Affymetrix U133A Microarray"               
#> [40] "MethylMix"                                 
#> [41] "bcm SOLiD curated"                         
#> [42] "Gene Expression Subtype"                   
#> [43] "Platform-corrected PANCAN12 dataset"       
#> [44] "Batch effects normalized"                  
#> [45] "MC3 Public Version"                        
#> [46] "TCGA Sample Type and Primary Disease"      
#> [47] "RPPA pancan normalized"                    
#> [48] "Tumor copy number"                         
#> [49] "Genome-wide DNA Damage Footprint HRD Score"
#> [50] "TCGA Molecular Subtype"                    
#> [51] "iCluster cluster assignments"              
#> [52] "iCluster latent variables"                 
#> [53] "RNA based StemnessScore"                   
#> [54] "DNA methylation based StemnessScore"       
#> [55] "Pancan Gene Programs"                      
#> [56] "Immune Model Based Subtype"                
#> [57] "Immune Signature Scores"

All of these values correspond to what Xena shows. If you are not familiar with them, go to https://xenabrowser.net/datapages/, pick any TCGA dataset, and click around.

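Note that Xena provides some combined tumor types, such as COADREAD, as well as PANCAN, which covers all of TCGA. Also, not every project includes all the data types and file types shown above. If you are not sure whether a combination exists, you can search with the Shiny app currently provided: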

XenaShiny()

I have only just started learning Shiny and there is a lot I do not understand yet, so I will improve this later.
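
The same kind of check can also be sketched without a GUI. The example below assumes you have a table with ProjectID, DataType, and FileType columns (the column names that appear in the package's DataInfo output). The `info` table here is a tiny hand-made stand-in and `has_combo` is a hypothetical helper, not a function exported by UCSCXenaTools.

```r
# Toy stand-in for a dataset metadata table (hand-made for this example).
info <- data.frame(
  ProjectID = c("OV", "LUAD", "LUSC"),
  DataType  = c("Phenotype", "Gene Expression RNASeq", "Gene Expression RNASeq"),
  FileType  = c("Clinical Information",
                "IlluminaHiSeq RNASeqV2 pancan normalized",
                "IlluminaHiSeq RNASeqV2 pancan normalized"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if the project has a dataset with this
# data type / file type combination in the given table.
has_combo <- function(info, project, data_type, file_type) {
  any(info$ProjectID == project &
        info$DataType == data_type &
        info$FileType == file_type)
}

ov_has_clinical <- has_combo(info, "OV", "Phenotype", "Clinical Information")
ov_has_rppa     <- has_combo(info, "OV", "Protein Expression RPPA", "RPPA")
```

In this toy table the first check is TRUE and the second is FALSE; with real metadata the answer depends on what the Xena hub actually hosts.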

With this background we can now download data, for example the clinical data for OV:

downloadTCGA(project = "OV", data_type = "Phenotype", file_type = "Clinical Information", destdir = tempdir())
#> We will download files to directory /var/folders/mx/rfkl27z90c96wbmn3_kjk8c80000gn/T//RtmpCgzBHS.
#> Downloading TCGA.OV.sampleMap__OV_clinicalMatrix.gz
#> Note fileNames transfromed from datasets name and / chracter all changed to __ character.

Because the dataset names contain the / character, I replace every / with __ in the downloaded file names.
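
The renaming itself is a one-line string substitution. Here is a minimal sketch in base R; `dataset_to_filename` and `filename_to_dataset` are hypothetical helper names for this example, and the input and output match the dataset and file names shown in the log above.

```r
# Turn a Xena dataset name into a flat local file name:
# replace every "/" with "__" and append the ".gz" suffix.
dataset_to_filename <- function(dataset) {
  paste0(gsub("/", "__", dataset, fixed = TRUE), ".gz")
}

# Reverse direction: strip ".gz" and turn "__" back into "/".
# This assumes the original dataset name contained no "__" of its own.
filename_to_dataset <- function(filename) {
  gsub("__", "/", sub("\\.gz$", "", filename), fixed = TRUE)
}

fname <- dataset_to_filename("TCGA.OV.sampleMap/OV_clinicalMatrix")
fname
#> [1] "TCGA.OV.sampleMap__OV_clinicalMatrix.gz"
back <- filename_to_dataset(fname)
```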

Now let's try gene expression, selecting the Pancan-normalized data for LUAD and LUSC:

luad_lusc = downloadTCGA(project = c("LUAD", "LUSC"), data_type = "Gene Expression RNASeq", 
                         file_type = "IlluminaHiSeq RNASeqV2 pancan normalized", force = TRUE)

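Remember that the download functions in UCSCXenaTools return a result that you can assign to a variable, and through it you can load the data straight into R, like this: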

XenaPrepare(luad_lusc)
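
For a single downloaded file, a similar result can be obtained in base R, since the downloads are gzip-compressed tab-separated tables. The sketch below writes a tiny toy file and reads it back. The rows are made up for illustration, and whether XenaPrepare works this way internally is not something this post shows.

```r
# Write a tiny gzip-compressed, tab-separated toy file (made-up rows).
path <- tempfile(fileext = ".gz")
con <- gzfile(path, "w")
writeLines(c("sampleID\tOS.time\tOS",
             "S1\t100\t0",
             "S2\t250\t1"), con)
close(con)

# Read it back: gzfile() decompresses on the fly, and read.delim()
# parses the tab-separated content with a header row.
demo <- read.delim(gzfile(path), stringsAsFactors = FALSE)
str(demo)
```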

downloadTCGA is simpler, but you still have to remember the type names. I wanted something simpler still, hence the following function:

args(getTCGAdata)
#> function (project = NULL, clinical = TRUE, download = FALSE, 
#>     forceDownload = FALSE, destdir = tempdir(), mRNASeq = FALSE, 
#>     mRNAArray = FALSE, mRNASeqType = "normalized", miRNASeq = FALSE, 
#>     exonRNASeq = FALSE, RPPAArray = FALSE, ReplicateBaseNormalization = FALSE, 
#>     Methylation = FALSE, MethylationType = c("27K", "450K"), 
#>     GeneMutation = FALSE, SomaticMutation = FALSE, GisticCopyNumber = FALSE, 
#>     Gistic2Threshold = TRUE, CopyNumberSegment = FALSE, RemoveGermlineCNV = TRUE, 
#>     ...) 
#> NULL

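This function covers fewer datasets than the previous one; I left out some rarely used data. It exists to make downloading commonly used omics data simple: you only need to set the familiar project option, and almost everything else is a matter of TRUE or FALSE.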

Remember that clinical = TRUE by default, so the function fetches clinical information unless you turn it off:

getTCGAdata(project = 'OV', download = TRUE)
#> We will download files to directory /var/folders/mx/rfkl27z90c96wbmn3_kjk8c80000gn/T//RtmpCgzBHS.
#> /var/folders/mx/rfkl27z90c96wbmn3_kjk8c80000gn/T//RtmpCgzBHS/TCGA.OV.sampleMap__OV_clinicalMatrix.gz, the file has been download!
#> Note fileNames transfromed from datasets name and / chracter all changed to __ character.

Because the file was just downloaded, it is not downloaded again. We can force a re-download:

getTCGAdata(project = 'OV', download = TRUE, forceDownload = TRUE)
#> We will download files to directory /var/folders/mx/rfkl27z90c96wbmn3_kjk8c80000gn/T//RtmpCgzBHS.
#> Downloading TCGA.OV.sampleMap__OV_clinicalMatrix.gz
#> Note fileNames transfromed from datasets name and / chracter all changed to __ character.
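
The skip-unless-forced behaviour seen above can be sketched in a few lines. `fetch_once` is a hypothetical helper written for this example, not the package's implementation. It takes the download function as an argument so that the demo can run without a network connection; `fake_fetch` and the example.org URL are placeholders.

```r
# Download a file only if it is missing, unless force = TRUE.
# `fetch` defaults to utils::download.file; any function taking
# (url, destfile) can be swapped in.
fetch_once <- function(url, destfile, force = FALSE,
                       fetch = utils::download.file) {
  if (file.exists(destfile) && !force) {
    message("Skipping existing file ", basename(destfile))
    return(invisible(FALSE))   # nothing was downloaded
  }
  message("Downloading ", basename(destfile))
  fetch(url, destfile)
  invisible(TRUE)              # a download happened
}

# Offline demo: a stand-in "download" that just writes a file.
fake_fetch <- function(url, destfile) writeLines("demo", destfile)
dest <- tempfile(fileext = ".gz")

first  <- fetch_once("https://example.org/demo", dest, fetch = fake_fetch)
second <- fetch_once("https://example.org/demo", dest, fetch = fake_fetch)
third  <- fetch_once("https://example.org/demo", dest, force = TRUE,
                     fetch = fake_fetch)
```

Here the first call downloads, the second is skipped because the file already exists, and the third downloads again because of force = TRUE.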

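The default is download = FALSE, so nothing is downloaded automatically. I did this in case you are worried about downloading the wrong thing. We can check first: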

getTCGAdata(project = 'OV')
#> $Xena
#> class: XenaHub 
#> hosts():
#>   https://tcga.xenahubs.net
#> cohorts() (1 total):
#>   TCGA Ovarian Cancer (OV)
#> datasets() (1 total):
#>   TCGA.OV.sampleMap/OV_clinicalMatrix
#> 
#> $DataInfo
#>                   XenaHosts XenaHostNames              XenaCohorts
#> 1 https://tcga.xenahubs.net          TCGA TCGA Ovarian Cancer (OV)
#>                          XenaDatasets ProjectID  DataType
#> 1 TCGA.OV.sampleMap/OV_clinicalMatrix        OV Phenotype
#>               FileType
#> 1 Clinical Information

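You can see that it contains only one dataset, and that its name matches what is shown at https://xenabrowser.net/datapages/?dataset=TCGA.OV.sampleMap%2FOV_clinicalMatrix&host=https%3A%2F%2Ftcga.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu (apart from the .gz suffix on the downloaded file).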

Loading the data is just as easy as before. Chaining download and import takes only two statements:

ov = getTCGAdata(project = 'OV', download = TRUE, forceDownload = TRUE)
#> We will download files to directory /var/folders/mx/rfkl27z90c96wbmn3_kjk8c80000gn/T//RtmpCgzBHS.
#> Downloading TCGA.OV.sampleMap__OV_clinicalMatrix.gz
#> Note fileNames transfromed from datasets name and / chracter all changed to __ character.
ov_clinical = XenaPrepare(ov)

Inspect the data:

head(ov_clinical)
#> # A tibble: 6 x 111
#>   sampleID `_EVENT` `_INTEGRATION` OS.time    OS OS.unit `_PANCAN_CNA_PA…
#>   <chr>       <int> <chr>            <int> <int> <chr>   <chr>           
#> 1 TCGA-01…       NA TCGA-01-0628-…      NA    NA days    <NA>            
#> 2 TCGA-01…       NA TCGA-01-0629-…      NA    NA days    <NA>            
#> 3 TCGA-01…       NA TCGA-01-0630-…      NA    NA days    <NA>            
#> 4 TCGA-01…       NA TCGA-01-0631-…      NA    NA days    <NA>            
#> 5 TCGA-01…       NA TCGA-01-0633-…      NA    NA days    <NA>            
#> 6 TCGA-01…       NA TCGA-01-0636-…      NA    NA days    <NA>            
#> # ... with 104 more variables: `_PANCAN_Cluster_Cluster_PANCAN` <chr>,
#> #   `_PANCAN_DNAMethyl_PANCAN` <chr>, `_PANCAN_RPPA_PANCAN_K8` <chr>,
#> #   `_PANCAN_UNC_RNAseq_PANCAN_K16` <chr>, `_PANCAN_miRNA_PANCAN` <chr>,
#> #   `_PANCAN_mirna_OV` <chr>, `_PANCAN_mutation_PANCAN` <chr>,
#> #   `_PATIENT` <chr>, RFS.time <int>, RFS <int>, RFS.unit <chr>,
#> #   `_TIME_TO_EVENT` <int>, `_TIME_TO_EVENT_UNIT` <chr>, `_cohort` <chr>,
#> #   `_primary_disease` <chr>, `_primary_site` <chr>,
#> #   additional_pharmaceutical_therapy <chr>,
#> #   additional_radiation_therapy <chr>,
#> #   age_at_initial_pathologic_diagnosis <int>,
#> #   anatomic_neoplasm_subdivision <chr>, bcr_followup_barcode <chr>,
#> #   bcr_patient_barcode <chr>, bcr_sample_barcode <chr>,
#> #   clinical_stage <chr>, days_to_birth <int>, days_to_collection <int>,
#> #   days_to_death <int>, days_to_initial_pathologic_diagnosis <int>,
#> #   days_to_last_followup <int>,
#> #   days_to_new_tumor_event_additional_surgery_procedure <int>,
#> #   days_to_new_tumor_event_after_initial_treatment <int>,
#> #   eastern_cancer_oncology_group <int>,
#> #   followup_case_report_form_submission_reason <chr>,
#> #   followup_treatment_success <chr>, form_completion_date <chr>,
#> #   gender <chr>, histological_type <chr>,
#> #   history_of_neoadjuvant_treatment <chr>, icd_10 <chr>,
#> #   icd_o_3_histology <chr>, icd_o_3_site <chr>,
#> #   informed_consent_verified <chr>,
#> #   initial_pathologic_diagnosis_method <chr>, initial_weight <int>,
#> #   intermediate_dimension <dbl>, is_ffpe <chr>,
#> #   karnofsky_performance_score <int>, longest_dimension <dbl>,
#> #   lost_follow_up <chr>, lymphatic_invasion <chr>,
#> #   neoplasm_histologic_grade <chr>, new_neoplasm_event_type <chr>,
#> #   new_tumor_event_additional_surgery_procedure <chr>,
#> #   new_tumor_event_after_initial_treatment <chr>, oct_embedded <chr>,
#> #   other_dx <chr>, pathology_report_file_name <chr>, patient_id <chr>,
#> #   performance_status_scale_timing <chr>,
#> #   person_neoplasm_cancer_status <chr>, postoperative_rx_tx <chr>,
#> #   primary_therapy_outcome_success <chr>,
#> #   progression_determined_by <chr>, radiation_therapy <chr>,
#> #   residual_tumor <chr>, sample_type <chr>, sample_type_id <chr>,
#> #   shortest_dimension <dbl>,
#> #   tissue_prospective_collection_indicator <chr>,
#> #   tissue_retrospective_collection_indicator <chr>,
#> #   tissue_source_site <chr>, tumor_residual_disease <chr>,
#> #   tumor_tissue_site <chr>, venous_invasion <chr>, vial_number <chr>,
#> #   vital_status <chr>, year_of_initial_pathologic_diagnosis <int>,
#> #   `_GENOMIC_ID_TCGA_OV_PDMRNAseq` <chr>,
#> #   `_GENOMIC_ID_data/public/TCGA/OV/miRNA_HiSeq_gene` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_mutation_bcm_solid_gene` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_exp_u133a` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_hMethyl450` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_miRNA_HiSeq` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_mutation_curated_bcm_solid_gene` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_hMethyl27` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_mutation_wustl_hiseq_gene` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_RPPA_RBN` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_mutation_wustl_gene` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_exp_HiSeqV2_percentile` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_gistic2thd` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_PDMarray` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_RPPA` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_exp_HiSeq` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_gistic2` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_exp_HiSeqV2_PANCAN` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_exp_HiSeq_exon` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_exp_HiSeqV2` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_mutation_broad_gene` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_PDMarrayCNV` <chr>,
#> #   `_GENOMIC_ID_TCGA_OV_PDMRNAseqCNV` <chr>, …

More features will be added over time.

If you have questions, take a look at the documentation at https://shixiangwang.github.io/UCSCXenaTools/; it should not be difficult.