Convert: CIVIC
Scripts
genomic liftover > convert_civic_vcf.py > map_civic_csv.py
Procedure
Perform liftover of mutations from GRCh37 to GRCh38
Summary
The most recent data release for CIVIC is aligned to the GRCh37 human reference genome. For this update, however, we are using the GRCh38 human reference genome.
To convert coordinates between the two reference genomes, we use a ‘liftover’ tool to remap the genomic coordinates. The CIVIC file is very small in size, so we can use the ENSEMBL online liftover tool: https://useast.ensembl.org/Homo_sapiens/Tools/AssemblyConverter?db=core
Run the downloaded VCF through the tool with the default parameters (change the file type to VCF).
Download the converted VCF and use it as the input for the next step.
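Assembly conversion may fail to remap some records, so it can be worth comparing the two files before moving on. The following is a minimal sketch and not part of the pipeline scripts: the function names are hypothetical, the records are made up, and it assumes the ID column (third column) of each VCF record is populated and preserved by the converter.

```python
def read_variant_ids(vcf_lines):
    """Collect the ID column (third field) of every non-header VCF record."""
    ids = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        ids.add(fields[2])
    return ids


def dropped_by_liftover(original_lines, lifted_lines):
    """Return IDs present in the GRCh37 file but missing after liftover."""
    return sorted(read_variant_ids(original_lines) - read_variant_ids(lifted_lines))


# Toy records (made up for illustration, not real CIVIC data).
grch37 = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\tv1\tA\tT\t.\t.\t.\n",
    "2\t200\tv2\tG\tC\t.\t.\t.\n",
]
grch38 = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t150\tv1\tA\tT\t.\t.\t.\n",
]
print(dropped_by_liftover(grch37, grch38))
```

With real files, the two lists would be replaced by open file handles for the downloaded and converted VCFs.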
Run convert_civic_vcf.py
Summary
The python script convert_civic_vcf.py will convert the vcf formatted file to a csv file.
With the vcf format, each mutation line in the file can contain multiple annotations and annotation-specific information.
The output csv format will contain only one annotation per line with associated annotation-specific information.
So that the script knows how the mutation and annotation fields are structured, a schema describing the fields is provided to it.
Example Line Transformation
Input VCF lines
mutation A info | mutation A annotation 1 info | mutation A annotation 2 info
mutation B info | mutation B annotation 1 info | mutation B annotation 2 info | mutation B annotation 3 info
Output CSV lines
mutation A info,annotation 1 info
mutation A info,annotation 2 info
mutation B info,annotation 1 info
mutation B info,annotation 2 info
mutation B info,annotation 3 info
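The transformation above can be sketched as follows. This is an illustration under stated assumptions, not the code in convert_civic_vcf.py: the function name `explode_annotations` is hypothetical, and the layout assumed here (annotations stored in an INFO entry named `CSQ`, separated by commas, with pipe-separated fields) may differ from what the schema actually describes.

```python
import csv
import io


def explode_annotations(vcf_line, info_key="CSQ"):
    """Yield one row per annotation for a single VCF record."""
    fields = vcf_line.rstrip("\n").split("\t")
    mutation = fields[:5]  # CHROM, POS, ID, REF, ALT
    info = fields[7]
    # INFO is a semicolon-separated list of KEY=VALUE entries.
    entries = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    # Assumed layout: annotations separated by commas, fields by pipes.
    for annotation in entries.get(info_key, "").split(","):
        if annotation:
            yield mutation + annotation.split("|")


# Made-up record with two annotations for the same mutation.
line = "1\t100\tv1\tA\tT\t.\t.\tCSQ=GENE1|p.A1T,GENE1|p.A1S\n"
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(explode_annotations(line))
print(buffer.getvalue())
```

Each output row repeats the mutation columns and carries exactly one annotation, matching the "Output CSV lines" layout shown above.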
Script Specifications
The script must be called from the command line and takes specific command line arguments
Input
-i : A path to the CIVIC .vcf file
-p : A prefix used for naming the output files
-o : A path to the output folder, where the mutation data csv will go
Output
A .csv file with mutation data
Usage
Run map_civic_csv.py
Summary
The python script map_civic_csv.py will take the output of convert_civic_vcf.py and:
- Map the data to:
  - uniprot accessions
  - doid parent terms
- Rename fields
- Reformat fields:
  - amino acid change and position
  - chromosome id
  - genomic location
  - nucleotide change
- Remove indels
- Transform NA values
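Two of the steps above, removing indels and transforming NA values, can be sketched as below. This is a hedged illustration rather than the script's actual logic: the column names (`ref`, `alt`, `disease`), the set of placeholder spellings, and the choice of an empty string as the replacement are all assumptions. Note that the filter keeps only single-base substitutions, so it would also drop multi-base substitutions.

```python
NA_TOKENS = {"", ".", "na", "n/a", "none", "null"}  # assumed placeholder spellings


def normalize_na(value, replacement=""):
    """Replace any assumed NA placeholder with a single consistent value."""
    return replacement if value.strip().lower() in NA_TOKENS else value


def is_single_base_substitution(ref, alt):
    """True when both alleles are exactly one base (no insertion or deletion)."""
    return len(ref) == 1 and len(alt) == 1 and ref != "-" and alt != "-"


def clean_rows(rows):
    """Drop indel rows, then normalize NA placeholders in the remaining rows."""
    cleaned = []
    for row in rows:
        if not is_single_base_substitution(row["ref"], row["alt"]):
            continue
        cleaned.append({key: normalize_na(value) for key, value in row.items()})
    return cleaned


# Made-up rows for illustration.
rows = [
    {"ref": "A", "alt": "T", "disease": "N/A"},
    {"ref": "AT", "alt": "A", "disease": "Melanoma"},  # deletion, dropped
    {"ref": "G", "alt": "GC", "disease": "Melanoma"},  # insertion, dropped
]
print(clean_rows(rows))
```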
Script Specifications
The script must be called from the command line and takes specific command line arguments
Input
-i : A path to the CIVIC .csv file
-m : A path to the folder containing mapping files
-d : The name of the doid mapping file
-e : The name of the ENST (transcript ID) to uniprot accession mapping file
-o : A path to the output folder
Output
A .csv file with mutation data mapped to doid terms and uniprot accessions
Usage
python map_civic_csv.py -h
*Gives a description of the necessary commands
python map_civic_csv.py -i <path/input_file.csv> -m <path/mapping_folder> -d <doid_mapping_file_name> -e <enst_mapping_file_name> -o <path/>
*Runs the script with the given input csv and outputs a csv with mutations mapped to doid terms and uniprot accessions
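Because -m names a folder while -d and -e name files, the script presumably joins them to locate each mapping file. The sketch below shows one way such arguments could be parsed and combined; it is not the script's actual code, and every file and folder name in it is a hypothetical placeholder.

```python
import argparse
import os


def build_parser():
    """Build a parser for the five options listed under Input."""
    parser = argparse.ArgumentParser(description="Map a CIVIC mutation csv (sketch).")
    parser.add_argument("-i", required=True, help="path to the CIVIC .csv file")
    parser.add_argument("-m", required=True, help="folder containing mapping files")
    parser.add_argument("-d", required=True, help="name of the doid mapping file")
    parser.add_argument("-e", required=True, help="name of the accession mapping file")
    parser.add_argument("-o", required=True, help="path to the output folder")
    return parser


# Hypothetical arguments, passed as a list instead of being read from sys.argv.
args = build_parser().parse_args(
    ["-i", "civic.csv", "-m", "mapping", "-d", "doid.csv", "-e", "enst.csv", "-o", "out"]
)
# Assumed behaviour: mapping file names are resolved relative to the -m folder.
doid_path = os.path.join(args.m, args.d)
enst_path = os.path.join(args.m, args.e)
print(doid_path, enst_path)
```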
Additional Notes
All the mapping files are available in the scripts repository in the folder pipeline/convert_step2/mapping
The mapping files used for converting CIVIC are:
DOID: tcga_doid_mapping.csv
CIVIC DOID child terms were mapped to DOID parent terms using the following table (generated from the OncoMX DOID mapping table):
| CIViC Entity Disease | DOID Term |
|---|---|
| Acral_Lentiginous_Melanoma_(DOID_6367) | DOID:4159 / skin cancer |
| Acute_Lymphoblastic_Leukemia_(DOID_9952) | DOID:2531 / hematologic cancer |
| Acute_Myeloid_Leukemia_(DOID_9119) | DOID:2531 / hematologic cancer |
| Acute_Promyelocytic_Leukemia_(DOID_0060318) | DOID:2531 / hematologic cancer |
| Adrenal_Gland_Pheochromocytoma_(DOID_0050892) | DOID:3953 / adrenal gland cancer |
| Angiosarcoma_(DOID_0001816) | NA |
| Basal_Cell_Carcinoma_(DOID_2513) | DOID:4159 / skin cancer |
| Biliary_Tract_Cancer_(DOID_4607) | NA |
| Bladder_Carcinoma_(DOID_4007) | DOID:11054 / urinary bladder cancer |
| Bladder_Urothelial_Carcinoma_(DOID_4006) | DOID:11054 / urinary bladder cancer |
| Bone_Marrow_Cancer_(DOID_4960) | DOID:2531 / hematologic cancer |
| Brain_Glioma_(DOID_0060108) | DOID:1319 / brain cancer |
| Breast_Cancer_(DOID_1612) | DOID:1612 / breast cancer |
| Breast_Carcinoma_(DOID_3459) | DOID:1612 / breast cancer |
| Bronchiolo-alveolar_Adenocarcinoma_(DOID_4926) | NA |
| Cancer_(DOID_162) | NA |
| Cervical_Cancer_(DOID_4362) | DOID:4362 / cervical cancer |
| Cervix_Carcinoma_(DOID_2893) | DOID:4362 / cervical cancer |
| Childhood_Acute_Lymphocytic_Leukemia_(DOID_0080144) | DOID:2531 / hematologic cancer |
| Childhood_Low-grade_Glioma_(DOID_0080830) | NA |
| Childhood_Pilocytic_Astrocytoma_(DOID_6812) | NA |
| Cholangiocarcinoma_(DOID_4947) | DOID:4606 / bile duct cancer |
| Chromophobe_Renal_Cell_Carcinoma_(DOID_4471) | DOID:263 / kidney cancer |
| Chronic_Leukemia_(DOID_1036) | DOID:2531 / hematologic cancer |
| Chronic_Lymphocytic_Leukemia_(DOID_1040) | DOID:2531 / hematologic cancer |
| Chronic_Myeloid_Leukemia_(DOID_8552) | DOID:2531 / hematologic cancer |
| Chronic_Neutrophilic_Leukemia_(DOID_0080187) | DOID:2531 / hematologic cancer |
| Chuvash_Polycythemia_(DOID_0060474) | DOID:2531 / hematologic cancer |
| Clear_Cell_Renal_Cell_Carcinoma_(DOID_4467) | DOID:263 / kidney cancer |
| Colon_Cancer_(DOID_219) | DOID:9256 / colorectal cancer |
| Colon_Mucinous_Adenocarcinoma_(DOID_3029) | DOID:9253 / gastrointestinal stromal tumor |
| Colorectal_Adenocarcinoma_(DOID_0050861) | DOID:9256 / colorectal cancer |
| Colorectal_Cancer_(DOID_9256) | DOID:9256 / colorectal cancer |
| Desmoid_Tumor_(DOID_0080366) | NA |
| Diffuse_Midline_Glioma_H3_K27M-mutant_(DOID_0080684) | NA |
| Endometrial_Adenocarcinoma_(DOID_2870) | DOID:363 / uterine cancer |
| Endometrial_Cancer_(DOID_1380) | DOID:363 / uterine cancer |
| Endometrial_Hyperplasia_(DOID_0080365) | NA |
| Endometrioid_Ovary_Carcinoma_(DOID_5828) | NA |
| Epithelial_Ovarian_Cancer_(DOID_2152) | DOID:2394 / ovarian cancer |
| Esophageal_Cancer_(DOID_5041) | DOID:5041 / esophageal cancer |
| Esophagus_Squamous_Cell_Carcinoma_(DOID_3748) | DOID:5041 / esophageal cancer |
| Estrogen-receptor_Positive_Breast_Cancer_(DOID_0060075) | DOID:1612 / breast cancer |
| Ewing_Sarcoma_Of_Bone_(DOID_3368) | DOID:184 / bone cancer |
| Follicular_Lymphoma_(DOID_0050873) | DOID:2531 / hematologic cancer |
| Gastrointestinal_Neuroendocrine_Tumor_(DOID_0050626) | DOID:9253 / gastrointestinal stromal tumor |
| Gastrointestinal_Stromal_Tumor_(DOID_9253) | DOID:9253 / gastrointestinal stromal tumor |
| Glioblastoma_(DOID_3068) | DOID:1319 / brain cancer |
| Hairy_Cell_Leukemia_(DOID_285) | DOID:2531 / hematologic cancer |
| Head_And_Neck_Cancer_(DOID_11934) | DOID:11934 / head and neck cancer |
| Head_And_Neck_Squamous_Cell_Carcinoma_(DOID_5520) | DOID:11934 / head and neck cancer |
| Hematologic_Cancer_(DOID_2531) | DOID:2531 / hematologic cancer |
| Hepatocellular_Carcinoma_(DOID_684) | DOID:3571 / liver cancer |
| Her2-receptor_Positive_Breast_Cancer_(DOID_0060079) | DOID:1612 / breast cancer |
| High_Grade_Glioma_(DOID_3070) | DOID:1319 / brain cancer |
| Inflammatory_Myofibroblastic_Tumor_(DOID_0050905) | NA |
| Intrahepatic_Cholangiocarcinoma_(DOID_4928) | DOID:4606 / bile duct cancer |
| Langerhans_Cell_Sarcoma_(DOID_7146) | DOID:2531 / hematologic cancer |
| Laryngeal_Squamous_Cell_Carcinoma_(DOID_2876) | DOID:2596 / larynx cancer |
| Leukemia_(DOID_1240) | DOID:2531 / hematologic cancer |
| Li-Fraumeni_Syndrome_(DOID_3012) | NA |
| Lung_Adenocarcinoma_(DOID_3910) | DOID:1324 / lung cancer |
| Lung_Cancer_(DOID_1324) | DOID:1324 / lung cancer |
| Lung_Carcinoma_(DOID_3905) | DOID:1324 / lung cancer |
| Lung_Non-small_Cell_Carcinoma_(DOID_3908) | DOID:1324 / lung cancer |
| Lung_Small_Cell_Carcinoma_(DOID_5409) | DOID:1324 / lung cancer |
| Lymphoid_Leukemia_(DOID_1037) | DOID:2531 / hematologic cancer |
| Lymphoma_(DOID_0060058) | DOID:2531 / hematologic cancer |
| Malignant_Exocrine_Pancreas_Neoplasm_(DOID_1795) | DOID:1793 / pancreatic cancer |
| Malignant_Mesothelioma_(DOID_1790) | DOID:1790 / malignant mesothelioma |
| Mammary_Analogue_Secretory_Carcinoma_(DOID_0080808) | NA |
| Medulloblastoma_(DOID_0050902) | DOID:1319 / brain cancer |
| Melanoma_(DOID_1909) | DOID:4159 / skin cancer |
| Merkel_Cell_Carcinoma_(DOID_3965) | DOID:4159 / skin cancer |
| Mucosal_Melanoma_(DOID_0050929) | DOID:4159 / skin cancer |
| Multiple_Myeloma_(DOID_9538) | DOID:2531 / hematologic cancer |
| Myelodysplastic_Syndrome_(DOID_0050908) | DOID:2531 / hematologic cancer |
| Myeloid_And_Lymphoid_Neoplasms_With_Eosinophilia_And_Abnormalities_Of_PDGFRA_PDGFRB_And_FGFR1_(DOID_0080164) | DOID:2531 / hematologic cancer |
| Neuroblastoma_(DOID_769) | NA |
| Oligodendroglioma_(DOID_3181) | DOID:3070 / malignant glioma |
| Osteosarcoma_(DOID_3347) | DOID:184 / bone cancer |
| Ovarian_Cancer_(DOID_2394) | DOID:2394 / ovarian cancer |
| Ovarian_Clear_Cell_Carcinoma_(DOID_0050934) | DOID:2394 / ovarian cancer |
| Ovarian_Granulosa_Cell_Tumor_(DOID_2999) | DOID:2394 / ovarian cancer |
| Ovarian_Serous_Carcinoma_(DOID_0050933) | DOID:2394 / ovarian cancer |
| Ovarian_Sex-cord_Stromal_Tumor_(DOID_0080369) | DOID:2394 / ovarian cancer |
| Ovary_Serous_Adenocarcinoma_(DOID_5744) | DOID:2394 / ovarian cancer |
| PTEN_Hamartoma_Tumor_Syndrome_(DOID_0080191) | NA |
| Pancreatic_Adenocarcinoma_(DOID_4074) | DOID:1793 / pancreatic cancer |
| Pancreatic_Cancer_(DOID_1793) | DOID:1793 / pancreatic cancer |
| Pancreatic_Carcinoma_(DOID_4905) | DOID:1793 / pancreatic cancer |
| Pancreatic_Ductal_Adenocarcinoma_(DOID_3498) | DOID:1793 / pancreatic cancer |
| Pancreatic_Ductal_Carcinoma_(DOID_3587) | DOID:1793 / pancreatic cancer |
| Paraganglioma_(DOID_0050773) | DOID:3953 / adrenal gland cancer |
| Parietal_Lobe_Ependymoma_(DOID_0050903) | NA |
| Peritoneal_Mesothelioma_(DOID_1788) | DOID:1725 / peritoneum cancer |
| Polycythemia_Vera_(DOID_8997) | DOID:2531 / hematologic cancer |
| Prostate_Cancer_(DOID_10283) | DOID:10283 / prostate cancer |
| Rectum_Cancer_(DOID_1993) | DOID:9256 / colorectal cancer |
| Renal_Carcinoma_(DOID_4451) | DOID:263 / kidney cancer |
| Renal_Cell_Carcinoma_(DOID_4450) | DOID:263 / kidney cancer |
| Rhabdomyosarcoma_(DOID_3247) | DOID:4045 / muscle cancer |
| Sertoli-Leydig_Cell_Tumor_(DOID_2997) | NA |
| Skin_Melanoma_(DOID_8923) | DOID:4159 / skin cancer |
| Skin_Squamous_Cell_Carcinoma_(DOID_3151) | DOID:4159 / skin cancer |
| Solid_Tumor | NA |
| Spindle_Cell_Rhabdomyosarcoma_(DOID_3260) | DOID:4045 / muscle cancer |
| Stomach_Cancer_(DOID_10534) | DOID:10534 / stomach cancer |
| Stomach_Carcinoma_(DOID_5517) | DOID:10534 / stomach cancer |
| Systemic_Mastocytosis_(DOID_349) | DOID:2531 / hematologic cancer |
| T-cell_Acute_Lymphoblastic_Leukemia_(DOID_5603) | NA |
| Thymic_Carcinoma_(DOID_3284) | DOID:3277 / thymus cancer |
| Thyroid_Gland_Anaplastic_Carcinoma_(DOID_0080522) | DOID:1781 / thyroid gland cancer |
| Thyroid_Gland_Cancer_(DOID_1781) | DOID:1781 / thyroid gland cancer |
| Thyroid_Gland_Carcinoma_(DOID_3963) | DOID:1781 / thyroid gland cancer |
| Thyroid_Gland_Hurthle_Cell_Carcinoma_(DOID_8161) | DOID:1781 / thyroid gland cancer |
| Thyroid_Gland_Medullary_Carcinoma_(DOID_3973) | DOID:1781 / thyroid gland cancer |
| Thyroid_Gland_Papillary_Carcinoma_(DOID_3969) | DOID:1781 / thyroid gland cancer |
| Transitional_Cell_Carcinoma_(DOID_2671) | NA |
| Tuberous_Sclerosis_(DOID_13515) | NA |
| Villous_Adenoma | NA |
| Von_Hippel-Lindau_Disease_(DOID_14175) | NA |
| Waldenstroem’s_Macroglobulinemia_(DOID_0060901) | DOID:2531 / hematologic cancer |
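As an illustration of how a CIViC disease label could be resolved against this table, the sketch below extracts the DOID from the label and looks up the parent term. It is only a sketch: the function name `parent_term`, the in-memory dictionary (holding three rows copied from the table), and the regular expression are assumptions, and the real script reads its mapping from the DOID mapping file, whose layout is not shown here.

```python
import re

# Three rows copied from the table above; the full mapping has many more.
PARENT_TERMS = {
    "DOID:6367": "DOID:4159 / skin cancer",
    "DOID:9119": "DOID:2531 / hematologic cancer",
    "DOID:3910": "DOID:1324 / lung cancer",
}

# CIViC labels in the table end with "_(DOID_<digits>)" when a DOID is present.
DOID_PATTERN = re.compile(r"\(DOID_(\d+)\)$")


def parent_term(civic_disease):
    """Return the parent DOID term for a CIViC disease label, or "NA"."""
    match = DOID_PATTERN.search(civic_disease)
    if match is None:
        return "NA"  # labels such as Solid_Tumor carry no DOID
    return PARENT_TERMS.get("DOID:" + match.group(1), "NA")


print(parent_term("Lung_Adenocarcinoma_(DOID_3910)"))
print(parent_term("Solid_Tumor"))
```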
Uniprot Accession: human_protein_transcriptlocus.csv
Transcript ID (starts with ENST) was mapped to uniprot isoform accession
Mapping to the uniprot canonical accession was NOT performed, because it caused an issue in the final dataset: a mutation for the same canonical accession would be listed with different amino acid changes.
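The issue described above can be reproduced on toy data. The sketch below is hypothetical throughout: the accessions, positions, column names, and the helper `conflicting_changes` are made up for illustration. It groups rows by accession and genomic position and reports any group that ends up with more than one amino acid change.

```python
from collections import defaultdict


def conflicting_changes(rows, accession_key):
    """Report (accession, position) groups with more than one amino acid change."""
    seen = defaultdict(set)
    for row in rows:
        seen[(row[accession_key], row["genomic_pos"])].add(row["aa_change"])
    return {key: sorted(changes) for key, changes in seen.items() if len(changes) > 1}


# Made-up rows: one variant annotated on two transcripts of the same gene.
rows = [
    {"isoform": "P00000-1", "canonical": "P00000", "genomic_pos": "1:100", "aa_change": "A10T"},
    {"isoform": "P00000-2", "canonical": "P00000", "genomic_pos": "1:100", "aa_change": "A25T"},
]
print(conflicting_changes(rows, "canonical"))  # one conflicting group
print(conflicting_changes(rows, "isoform"))    # no conflicts
```

Keyed by canonical accession, the two transcript-level annotations collapse into a single entry with two different amino acid changes; keyed by isoform accession, each row stays distinct.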