Download: TCGA

Annotated variant files are downloaded from the ISB-CGC BigQuery repository.

Fields

Field descriptions for the BigQuery output are available in field_names_descriptions.csv.

Additional field descriptions are available in the GDC docs.

Studies

TCGA Study ID	TCGA Study Name
ACC	Adrenocortical carcinoma
BLCA	Bladder Urothelial Carcinoma
BRCA	Breast invasive carcinoma
CESC	Cervical squamous cell carcinoma and endocervical adenocarcinoma
CHOL	Cholangiocarcinoma
COAD	Colon adenocarcinoma
DLBC	Lymphoid Neoplasm Diffuse Large B-cell Lymphoma
ESCA	Esophageal carcinoma
GBM	Glioblastoma multiforme
HNSC	Head and Neck squamous cell carcinoma
KICH	Kidney Chromophobe
KIRC	Kidney renal clear cell carcinoma
KIRP	Kidney renal papillary cell carcinoma
LAML	Acute Myeloid Leukemia
LGG	Brain Lower Grade Glioma
LIHC	Liver hepatocellular carcinoma
LUAD	Lung adenocarcinoma
LUSC	Lung squamous cell carcinoma
MESO	Mesothelioma
OV	Ovarian serous cystadenocarcinoma
PAAD	Pancreatic adenocarcinoma
PCPG	Pheochromocytoma and Paraganglioma
PRAD	Prostate adenocarcinoma
READ	Rectum adenocarcinoma
SARC	Sarcoma
SKCM	Skin Cutaneous Melanoma
STAD	Stomach adenocarcinoma
TGCT	Testicular Germ Cell Tumors
THCA	Thyroid carcinoma
THYM	Thymoma
UCEC	Uterine Corpus Endometrial Carcinoma
UCS	Uterine Carcinosarcoma
UVM	Uveal Melanoma

Downloading through Big Query

For complete documentation, see the ISB-CGC Read the Docs pages

Step 1 - Gain All Access Requirements

Contact Dr. Fabian Seidle and ask for access to the ISB-CGC BigQuery repository.

Example: For the run in Spring-Summer 2022, my (Ned’s) personal GWU account was added to the project ‘isb-cgc-training-001’.
All users get up to 1 TB of downloads free. Our usage is well under this limit, so we should not need to pay.

Gain access to dbGaP data

Apply for access to controlled data at this website
You will need to be approved by a PI who already has access to dbGaP controlled data.

For further information see the ISB-CGC documentation on gaining access

Step 2 - Run the downloader R script using RStudio

TCGA_mutation_download.R

Run the script one line at a time rather than all at once.

Loading the package with library(bigrquery) and then calling bq_project_query() (later in the script) will open a browser window to log in with Google credentials.

Use the Google account that is registered with Fabian for an ISB-CGC project and that has dbGaP authorization.
After you log in, a token is saved so that later sessions can authenticate from RStudio without the browser.
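As a rough illustration of the login and query flow described above, here is a minimal sketch using bigrquery. It is not the content of TCGA_mutation_download.R. The email address, billing project ID, and table name are placeholders, and the helper name run_bq_query is hypothetical.

```r
# Minimal sketch, assuming the bigrquery package is installed.
# All identifiers below are placeholders, not values taken from the real script.

run_bq_query <- function(billing_project, sql, email = NULL) {
  # On first use this opens a browser for Google login; afterwards the
  # cached token is reused, so no browser is needed.
  bigrquery::bq_auth(email = email)

  # Submit the query; the result is a reference to a temporary table.
  job <- bigrquery::bq_project_query(billing_project, sql)

  # Download the query result into a data frame.
  bigrquery::bq_table_download(job)
}

# Placeholder query: replace the table with the ISB-CGC table you were given.
example_sql <- "
  SELECT *
  FROM `your-project.your_dataset.your_mutation_table`
  LIMIT 10
"

# Not run here because it needs credentials and network access:
# preview <- run_bq_query("your-billing-project", example_sql,
#                         email = "your.name@example.edu")
```

Testing with a small LIMIT first is a cheap way to confirm that the login and project access work before requesting the full data.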

TCGA_mutation_download.R downloads all mutation data for TCGA.

Running it caused problems because the downloaded file was very large.

If that happens, run the following scripts in the folder mutation_download_subscripts instead:

TCGA_mutation_download_part1.R
TCGA_mutation_download_part2.R
TCGA_mutation_download_part3.R
TCGA_mutation_download_part4.R

Each of these scripts downloads a subset of the TCGA studies, so each downloaded file is smaller.
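The idea of splitting the download by study can be sketched as follows. This is an illustration only: the actual grouping of studies in the four part scripts may differ, the table name is a placeholder, and the column name project_short_name (with values such as 'TCGA-BRCA') is an assumption about the table schema.

```r
# Sketch: split the 33 TCGA study IDs into four batches and build one
# query per batch. Table and column names are assumptions, not confirmed.

studies <- c(
  "ACC", "BLCA", "BRCA", "CESC", "CHOL", "COAD", "DLBC", "ESCA", "GBM",
  "HNSC", "KICH", "KIRC", "KIRP", "LAML", "LGG", "LIHC", "LUAD", "LUSC",
  "MESO", "OV", "PAAD", "PCPG", "PRAD", "READ", "SARC", "SKCM", "STAD",
  "TGCT", "THCA", "THYM", "UCEC", "UCS", "UVM"
)

# Consecutive batches of equal size (the last one may be shorter).
n_batches <- 4
batch_size <- ceiling(length(studies) / n_batches)
batches <- split(studies, ceiling(seq_along(studies) / batch_size))

# Build a query restricted to the studies in one batch.
build_batch_query <- function(batch,
                              table = "your-project.your_dataset.your_mutation_table") {
  projects <- paste0("'TCGA-", batch, "'", collapse = ", ")
  sprintf("SELECT * FROM `%s` WHERE project_short_name IN (%s)", table, projects)
}

batch_queries <- vapply(batches, build_batch_query, character(1))
```

Each element of batch_queries could then be passed to bq_project_query() in turn, with each result written to its own file.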

Additional Information

Go to MyBinder

For ‘Github repository name or URL’ enter https://github.com/isb-cgc/ISB-CGC-Demos, then click ‘Launch’.

The methods in this tutorial were used to generate the R scripts used to download the data.

get_field_names.R

Downloads a list of all field names for the mutation data. Many of these fields are excluded in the mutation downloader script.

TCGA_clinical_info_download.R

Downloads clinical information for all patients included in the mutation file download.

The downloaded file contains both the sample ID and the patient ID. Mapping sample IDs to patient IDs in the final mutation file makes it possible to calculate the patient frequency of each TCGA mutation.
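The sample-to-patient mapping can be sketched with toy data. The column names (sample_id, patient_id, mutation) and the values are made up for illustration and are not the field names in the downloaded files. The denominator used here, all patients in the clinical table, is also an assumption.

```r
# Toy mutation table: one row per (sample, mutation).
mutations <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  mutation  = c("chr1:100A>T", "chr1:100A>T", "chr1:100A>T", "chr2:200G>C"),
  stringsAsFactors = FALSE
)

# Toy clinical table: samples S1 and S2 come from the same patient.
clinical <- data.frame(
  sample_id  = c("S1", "S2", "S3", "S4"),
  patient_id = c("P1", "P1", "P2", "P3"),
  stringsAsFactors = FALSE
)

# Attach the patient ID to every mutation row.
mapped <- merge(mutations, clinical, by = "sample_id")

# Count each patient once per mutation, even with several samples.
per_patient <- unique(mapped[, c("patient_id", "mutation")])
counts <- table(per_patient$mutation)

n_patients <- length(unique(clinical$patient_id))
frequency <- data.frame(
  mutation = names(counts),
  patients = as.integer(counts),
  fraction = as.numeric(counts) / n_patients,
  stringsAsFactors = FALSE
)
```

In this toy data, chr1:100A>T appears in three samples but only two patients, which is why the rows are de-duplicated by patient before counting.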

get_field_names_clinical_info.R

Downloads a list of all field names for the corresponding clinical data. Many of these fields are excluded in the clinical information downloader script.