Convert: TCGA
Scripts
process_tcga_download.py
Procedure
Run process_tcga_download.py
Summary
- The python script process_tcga_download.py will take the output of the TCGA download step and:
- Map the data to:
uniprot accession
doid parent terms
Rename fields
- Reformat fields
amino acid change and position
chromosome id
Filter out unnecessary fields
Script Specifications
The script must be called from the command line and takes specific command line arguments
Input
-i : A path to the input csv to reformat
-m : A path to the folder containing mapping files
-d : A path to the tcga study to doid mapping file
-e : A path to the ENSP to uniprot mapping file
-o : A path to the output folder
Output
A data report comparing new AA sites to old AA sites for Biomuta
Usage
Additional Notes
All the mapping files are available in the repository folder pipeline/convert_step2/mapping
The mapping files used for converting TCGA are:
DOID: tcga_doid_mapping.csv
TCGA Projects were mapped to DOID parent terms using the following table (generated from previous Biomuta mapping):
DO_slim_id |
DO_slim_name |
TCGA_project |
|---|---|---|
DOID:5041 |
esophageal cancer |
TCGA-ESCA |
DOID:2531 |
hematologic cancer |
TCGA-DLBC |
DOID:9256 |
colorectal cancer |
TCGA-READ |
DOID:1319 |
brain cancer |
TCGA-GBM |
DOID:1319 |
brain cancer |
TCGA-LGG |
DOID:1781 |
thyroid cancer |
TCGA-THCA |
DOID:11054 |
urinary bladder cancer |
TCGA-BLCA |
DOID:363 |
uterine cancer |
TCGA-UCEC |
DOID:169 |
neuroendocrine tumor |
TCGA-PCPG |
DOID:4362 |
cervical cancer |
TCGA-CESC |
DOID:363 |
uterine cancer |
TCGA-UCS |
DOID:3277 |
thymus cancer |
TCGA-THYM |
DOID:3571 |
liver cancer |
TCGA-LIHC |
DOID:11934 |
head and neck cancer |
TCGA-HNSC |
DOID:2174 |
ocular cancer |
TCGA-UVM |
DOID:4159 |
skin cancer |
TCGA-SKCM |
DOID:9256 |
colorectal cancer |
TCGA-COAD |
DOID:3953 |
adrenal gland cancer |
TCGA-ACC |
DOID:1793 |
pancreatic cancer |
TCGA-PAAD |
DOID:2994 |
germ cell cancer |
TCGA-TGCT |
DOID:1324 |
lung cancer |
TCGA-LUSC |
DOID:1790 |
malignant mesothelioma |
TCGA-MESO |
DOID:2394 |
ovarian cancer |
TCGA-OV |
DOID:1115 |
sarcoma |
TCGA-SARC |
DOID:263 |
kidney cancer |
TCGA-KIRP |
DOID:10534 |
stomach cancer |
TCGA-STAD |
DOID:2531 |
hematologic cancer |
TCGA-LAML |
DOID:10283 |
prostate cancer |
TCGA-PRAD |
DOID:1324 |
lung cancer |
TCGA-LUAD |
DOID:1612 |
breast cancer |
TCGA-BRCA |
DOID:263 |
kidney cancer |
TCGA-KIRC |
DOID:263 |
kidney cancer |
TCGA-KICH |
Uniprot Accession: human_protein_transcriptlocus.csv
Peptide ID (starts with ENSP) was mapped to uniprot isoform accession
Mapping was NOT performed to uniprot canonical accession as this resulted in an issue with the final dataset in which a mutation for the same canonical accession would be listed with different amino acid changes