Download: TCGA

Annotated variant files are downloaded from the ISB-CGC BigQuery repository.

Fields

Field descriptions for the BigQuery output are available in field_names_descriptions.csv.

Additional field descriptions are available in the GDC docs.

Studies

TCGA Study ID    TCGA Study Name
ACC              Adrenocortical carcinoma
BLCA             Bladder Urothelial Carcinoma
BRCA             Breast invasive carcinoma
CESC             Cervical squamous cell carcinoma and endocervical adenocarcinoma
CHOL             Cholangiocarcinoma
COAD             Colon adenocarcinoma
DLBC             Lymphoid Neoplasm Diffuse Large B-cell Lymphoma
ESCA             Esophageal carcinoma
GBM              Glioblastoma multiforme
HNSC             Head and Neck squamous cell carcinoma
KICH             Kidney Chromophobe
KIRC             Kidney renal clear cell carcinoma
KIRP             Kidney renal papillary cell carcinoma
LAML             Acute Myeloid Leukemia
LGG              Brain Lower Grade Glioma
LIHC             Liver hepatocellular carcinoma
LUAD             Lung adenocarcinoma
LUSC             Lung squamous cell carcinoma
MESO             Mesothelioma
OV               Ovarian serous cystadenocarcinoma
PAAD             Pancreatic adenocarcinoma
PCPG             Pheochromocytoma and Paraganglioma
PRAD             Prostate adenocarcinoma
READ             Rectum adenocarcinoma
SARC             Sarcoma
SKCM             Skin Cutaneous Melanoma
STAD             Stomach adenocarcinoma
TGCT             Testicular Germ Cell Tumors
THCA             Thyroid carcinoma
THYM             Thymoma
UCEC             Uterine Corpus Endometrial Carcinoma
UCS              Uterine Carcinosarcoma
UVM              Uveal Melanoma

Downloading through BigQuery

For complete documentation, see the ISB-CGC Read the Docs pages

Step 1 - Obtain All Access Requirements

Contact Dr. Fabian Seidle and ask for access to the ISB-CGC BigQuery repository
  • Example: for the Spring-Summer 2022 run, my (Ned’s) personal GWU account was added to the project ‘isb-cgc-training-001’

  • All users have up to 1 TB of downloads free; our usage is well under this limit, so we should not need to pay

Gain access to dbGaP data
  • Apply for access to controlled data at this website

  • You will need to be approved by a PI who already has access to dbGaP controlled data

For further information see the ISB-CGC documentation on gaining access

Step 2 - Run the downloader R script using RStudio

TCGA_mutation_download.R

Run the script line by line rather than all at once.

Loading library(bigrquery) and then calling bq_project_query() (later in the script) opens a browser window to log in with Google credentials
  • Use the Google account that is registered with Fabian for an ISB-CGC project and has dbGaP authorization

  • After logging in, a token is saved so that later logins can happen through RStudio instead of the browser

This script will download all mutation data for TCGA.

Running this script caused problems in the past because the downloaded file was so large.

If that happens, run the following scripts in the folder mutation_download_subscripts instead:
  • TCGA_mutation_download_part1.R

  • TCGA_mutation_download_part2.R

  • TCGA_mutation_download_part3.R

  • TCGA_mutation_download_part4.R

Each of these scripts downloads a subset of the TCGA studies, so each downloaded file is smaller.
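
The split into four parts can be pictured with the sketch below. It is written in Python purely for illustration (the actual downloaders are R scripts), and it does not reproduce how the part1 to part4 scripts group the studies. The table name, the study column name and the 'TCGA-' prefix are assumptions, not values taken from those scripts.

```python
# Hypothetical placeholders: replace with the table and column that the
# real downloader scripts query. These names are assumptions.
MUTATION_TABLE = "your-project.your_dataset.your_mutation_table"
STUDY_COLUMN = "project_short_name"
STUDY_PREFIX = "TCGA-"  # assumed form of the stored value, e.g. 'TCGA-ACC'

# The 33 study IDs listed in the Studies section above.
TCGA_STUDIES = [
    "ACC", "BLCA", "BRCA", "CESC", "CHOL", "COAD", "DLBC", "ESCA", "GBM",
    "HNSC", "KICH", "KIRC", "KIRP", "LAML", "LGG", "LIHC", "LUAD", "LUSC",
    "MESO", "OV", "PAAD", "PCPG", "PRAD", "READ", "SARC", "SKCM", "STAD",
    "TGCT", "THCA", "THYM", "UCEC", "UCS", "UVM",
]


def split_into_parts(items, n_parts):
    """Split items into n_parts contiguous groups whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_parts)
    parts = []
    start = 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts


def build_query(studies):
    """Build a query string restricted to the given study IDs."""
    quoted = ", ".join(f"'{STUDY_PREFIX}{study}'" for study in studies)
    return f"SELECT * FROM `{MUTATION_TABLE}` WHERE {STUDY_COLUMN} IN ({quoted})"


parts = split_into_parts(TCGA_STUDIES, 4)
for number, studies in enumerate(parts, start=1):
    print(f"part {number}: {len(studies)} studies")
    print(build_query(studies))
```

Each query string covers only its own group of studies, so each result file is roughly a quarter of the full download. The sketch only builds the query text; it does not contact BigQuery.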

Additional Information

Go to MyBinder

For ‘GitHub repository name or URL’ enter https://github.com/isb-cgc/ISB-CGC-Demos, then click ‘Launch’.

The methods in this tutorial were used to write the R scripts that download the data.

get_field_names.R

Downloads a list of all field names for the mutation data; many of these fields are excluded in the mutation downloader script.
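
How get_field_names.R retrieves the list is not described here. One common way to list a table's fields in BigQuery is to query the dataset's INFORMATION_SCHEMA.COLUMN_FIELD_PATHS view, as in the Python sketch below. The project, dataset and table names in the usage line are hypothetical placeholders, and the sketch only builds the query text.

```python
def build_field_names_query(project, dataset, table):
    """Return a BigQuery SQL string listing field names, types and descriptions.

    The project, dataset and table arguments are supplied by the caller;
    nothing here is taken from get_field_names.R.
    """
    return (
        "SELECT column_name, data_type, description\n"
        f"FROM `{project}.{dataset}.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`\n"
        f"WHERE table_name = '{table}'\n"
        "ORDER BY column_name"
    )


# Hypothetical placeholder names, for illustration only.
print(build_field_names_query("your-project", "your_dataset", "your_mutation_table"))
```

The result of such a query can be compared with the fields selected in the mutation downloader script to see which ones are excluded.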

TCGA_clinical_info_download.R

Downloads clinical information for all patients included in the mutation file download.

The downloaded file contains both the sample ID and the patient ID, which can be mapped to each other in the final mutation file to calculate the patient frequency of mutations from TCGA.
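
A minimal sketch of that mapping, in Python and with invented toy rows (not real TCGA records): each mutation is counted once per patient, even when the patient has several samples. The IDs, mutation labels and the choice of denominator (all patients in the clinical mapping) are assumptions made for illustration.

```python
from collections import defaultdict

# Toy clinical mapping, invented for illustration: sample ID -> patient ID.
sample_to_patient = {
    "S1": "P1",
    "S2": "P1",  # second sample from the same patient
    "S3": "P2",
    "S4": "P3",
}

# Toy mutation rows, invented for illustration: (sample ID, mutation label).
mutation_rows = [
    ("S1", "GENE_A:variant1"),
    ("S2", "GENE_A:variant1"),
    ("S3", "GENE_A:variant1"),
    ("S4", "GENE_B:variant2"),
]


def patient_frequency(rows, sample_map):
    """Return, per mutation, the fraction of patients carrying it."""
    patients_by_mutation = defaultdict(set)
    for sample_id, mutation in rows:
        # Map the sample to its patient so repeated samples count once.
        patients_by_mutation[mutation].add(sample_map[sample_id])
    total_patients = len(set(sample_map.values()))
    return {
        mutation: len(patients) / total_patients
        for mutation, patients in patients_by_mutation.items()
    }


frequencies = patient_frequency(mutation_rows, sample_to_patient)
for mutation, frequency in sorted(frequencies.items()):
    print(f"{mutation}: {frequency:.3f}")
```

Counting by sample instead would report GENE_A:variant1 in three of four samples, while counting by patient reports it in two of three patients, which is why the sample-to-patient mapping matters.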

get_field_names_clinical_info.R

Downloads a list of all field names for the corresponding clinical data; many of these fields are excluded in the clinical information downloader script.