Convert: COSMIC
Scripts
map_cosmic_tsv.py
Procedure
Run map_cosmic_tsv.py
Summary
- The python script map_cosmic_tsv.py will take the output of the TCGA download step and:
- Map the data to:
uniprot accessions
doid parent terms
Rename fields
- Reformat fields
amino acid change and position
chromosome id
genomic location
nucleotide change
Script Specifications
The script must be called from the command line and takes specific command line arguments
Input
-i : A path to the cosmic tsv mutation file
-m : A path to the folder containing mapping files
-d : The name of the doid to cosmic cancer type mapping file
-e : The name of the enst to uniprot accession mapping file
-o : A path to the the folder to export the final mapped mutations
Output
A mutation file with COSMIC mutations mapped to doid terms and uniprot accessions
Usage
map_cosmic_tsv -h
*Gives a description of the neccessary commands
python map_cosmic_tsv.py -i <path/cosmic_file_name.tsv> -m <path/mapping_folder> -d <doid_mapping_file_name> -e <enst_mapping_file_name> -o <path/output_folder>
*Runs the script with the given input tsv and outputs a tsv with Biomuta formatting.
Additional Notes
All the mapping files are available in the scripts repository in the folder pipeline/convert_step2/mapping
The mapping files used for converting the COSMIC tsv are:
DOID: cosmic_doid_mapping.csv
COSMIC tissue site terms were mapped to DOID parent terms using the following table (generated from previous Biomuta mapping):
Primary Site |
Top_Level_Organ_system |
|---|---|
NS |
NA |
adrenal_gland |
DOID:3953 / adrenal gland cancer |
autonomic_ganglia |
NA |
biliary_tract |
DOID:4606 / bile duct cancer |
bone |
DOID:184 / bone cancer |
breast |
DOID:1612 / breast cancer |
central_nervous_system |
DOID:1319 / brain cancer |
cervix |
DOID:4362 / cervical cancer |
endometrium |
DOID:363 / uterine cancer |
eye |
DOID:2174 / ocular cancer |
fallopian_tube |
DOID:1964 / fallopian tube cancer |
female_genital_tract_(site_indeterminate) |
|
female_genitourinary_system |
NA |
gastrointestinal_tract_(site_indeterminate) |
DOID:3119 / gastrointestinal system cancer |
genital_tract |
NA |
haematopoietic_and_lymphoid_tissue |
DOID:2531 / hematologic cancer |
kidney |
DOID:263 / kidney cancer |
large_intestine |
DOID:9256 / colorectal cancer |
liver |
DOID:3571 / liver cancer |
lung |
DOID:1324 / lung cancer |
mediastinum |
NA |
meninges |
DOID:3565 / meningioma |
oesophagus |
DOID:5041 / esophageal cancer |
ovary |
DOID:2394 / ovarian cancer |
pancreas |
DOID:1793 / pancreatic cancer |
paratesticular_tissues |
NA |
parathyroid |
DOID:1540 / parathyroid carcinoma |
penis |
DOID:11615 / penile cancer |
pericardium |
NA |
perineum |
DOID:4045 / muscle cancer |
peritoneum |
DOID:1725 / peritoneum cancer |
pituitary |
DOID:1785 / pituitary cancer |
placenta |
DOID:2021 / placenta cancer |
pleura |
DOID:5158 / pleural cancer |
prostate |
DOID:10283 / prostate cancer |
retroperitoneum |
DOID:5875 / retroperitoneal cancer |
salivary_gland |
DOID:8618 / oral cavity cancer |
skin |
DOID:4159 / skin cancer |
small_intestine |
DOID:9253 / gastrointestinal stromal tumor |
soft_tissue |
NA |
stomach |
DOID:10534 / stomach cancer |
testis |
DOID:2998 / testicular cancer |
thymus |
DOID:3277 / thymus cancer |
thyroid |
DOID:1781 / thyroid gland cancer |
upper_aerodigestive_tract |
DOID:8618 / oral cavity cancer |
urinary_tract |
DOID:11054 / urinary bladder cancer |
uterine_adnexa |
NA |
vagina |
DOID:119 / vaginal cancer |
vulva |
DOID:1245 / vulva cancer |
Uniprot Accession: human_protein_transcriptlocus.csv
Transcript ID (starts with ENST) was mapped to uniprot isoform accession
Mapping was NOT performed to uniprot canonical accession as this resulted in an issue with the final dataset in which a mutation for the same canonical accession would be listed with different amino acid changes