Convert: CIVIC

Scripts

genomic liftover > convert_civic_vcf.py > map_civic_csv.py

Procedure

Perform liftover of mutations from GRCh37 to GRCh38

Summary

The most recent CIVIC data release is aligned to the GRCh37 human reference genome. For this update, however, we use the GRCh38 human reference genome.

To convert coordinates between the two reference genomes, we use a ‘liftover’ tool to remap the genomic coordinates. The CIVIC file is small, so we can use the Ensembl online liftover tool: https://useast.ensembl.org/Homo_sapiens/Tools/AssemblyConverter?db=core

Run the downloaded VCF through the tool with the default parameters (change the file type to VCF).

Download the converted VCF and use it for the next step.
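
Liftover can drop records that have no equivalent position in the target assembly, so it may be worth comparing record counts before and after conversion. The following is a minimal sketch, assuming both files are uncompressed VCFs; the function names are hypothetical and not part of the pipeline scripts.

```python
def count_vcf_records(lines):
    """Count data records in a VCF, skipping header lines ('#...') and blank lines."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


def report_liftover_loss(original_path, lifted_path):
    """Return (original_count, lifted_count, dropped) for two VCF files."""
    with open(original_path) as original, open(lifted_path) as lifted:
        n_original = count_vcf_records(original)
        n_lifted = count_vcf_records(lifted)
    return n_original, n_lifted, n_original - n_lifted
```

A non-zero third value means some mutations were not carried over to GRCh38 and may need to be reviewed.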

Run convert_civic_vcf.py

Summary

The python script convert_civic_vcf.py converts the VCF formatted file to a CSV file.

In the VCF format, each mutation line can contain multiple annotations, each with its own annotation-specific information.

The output CSV format contains only one annotation per line, together with the associated annotation-specific information.

A schema describing the mutation and annotation fields is provided to the script so that it knows how the information in those fields is structured.

Example Line Transformation

Input VCF lines

mutation A info | mutation A annotation 1 info | mutation A annotation 2 info

mutation B info | mutation B annotation 1 info | mutation B annotation 2 info | mutation B annotation 3 info

Output CSV lines

mutation A info,annotation 1 info

mutation A info,annotation 2 info

mutation B info,annotation 1 info

mutation B info,annotation 2 info

mutation B info,annotation 3 info
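
The transformation above can be sketched in Python. This is an illustration only, not the code in convert_civic_vcf.py: it assumes the annotations sit under a single INFO key (named CSQ here, as in VEP-style VCFs) as comma-separated entries, and the function name is hypothetical.

```python
import csv
import io


def explode_vcf_line(line, annotation_key="CSQ"):
    """Yield one row per annotation for a single tab-separated VCF data line.

    Assumes the INFO column (8th field) holds 'KEY=value' pairs separated by ';'
    and that annotations are comma-separated under `annotation_key`.
    """
    fields = line.rstrip("\n").split("\t")
    mutation_info = fields[:7]  # CHROM, POS, ID, REF, ALT, QUAL, FILTER
    info = dict(pair.split("=", 1) for pair in fields[7].split(";") if "=" in pair)
    for annotation in info.get(annotation_key, "").split(","):
        if annotation:  # skip the empty string produced when the key is absent
            yield mutation_info + [annotation]


# Two annotations on one input line become two output rows.
example = "1\t100\t.\tA\tT\t.\t.\tCSQ=annotation 1 info,annotation 2 info"
buffer = io.StringIO()
csv.writer(buffer).writerows(explode_vcf_line(example))
print(buffer.getvalue())
```

Each output row repeats the mutation columns, which is what lets the later mapping step treat every annotation independently.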

Script Specifications

The script must be called from the command line and takes the following command line arguments:

Input
  • -i : A path to the CIVIC .vcf file

  • -s : A path to the schema file (.json) describing the mutation and annotation fields

  • -p : A prefix used for naming the output files

  • -o : A path to the output folder, where the mutation data csv will go

Output
  • A .csv file with mutation data

Usage
  • python convert_civic_vcf.py -h

*Gives a description of the necessary commands

  • python convert_civic_vcf.py -i <path/input_file.vcf> -s <path/schema.json> -o <path/>

*Runs the script with the given input vcf and outputs a csv file.
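
For orientation, a command-line interface with the flags mentioned in this section could be declared as follows. This is a hedged sketch using argparse, not the script's actual source: the destination names and help strings are assumptions, and both -s and -p are included because each appears in this section.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags described for convert_civic_vcf.py."""
    parser = argparse.ArgumentParser(
        description="Convert a CIVIC VCF file to a CSV with one annotation per line."
    )
    parser.add_argument("-i", dest="input_vcf", required=True, help="Path to the CIVIC .vcf file")
    parser.add_argument("-s", dest="schema", help="Path to the schema describing the fields")
    parser.add_argument("-p", dest="prefix", help="Prefix used for naming the output files")
    parser.add_argument("-o", dest="output_dir", required=True, help="Path to the output folder")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["-i", "civic.vcf", "-s", "schema.json", "-o", "out/"])
print(args.input_vcf, args.schema, args.output_dir)
```

argparse supplies the -h option automatically, which matches the first usage line above.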

Run map_civic_csv.py

Summary

The python script map_civic_csv.py will take the output of the convert_civic_vcf.py step and:
  • Map the data to:
    • uniprot accessions

    • doid parent terms

  • Rename fields

  • Reformat fields
    • amino acid change and position

    • chromosome id

    • genomic location

    • nucleotide change

    • remove indels

    • transform NA values
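
A few of the reformatting steps listed above can be illustrated with a short sketch. The column names, the set of NA spellings, the 'chr' prefix handling, and the indel rule (reference and alternate alleles of different lengths) are all assumptions made for the example; map_civic_csv.py may define them differently.

```python
NA_VALUES = {"", "NA", "N/A", "."}  # assumed spellings of a missing value


def reformat_row(row):
    """Return a cleaned copy of `row`, or None if the row describes an indel.

    `row` is a dict with hypothetical keys such as 'chr', 'pos', 'ref', 'alt'.
    """
    # Transform NA values into None.
    cleaned = {key: (None if value in NA_VALUES else value) for key, value in row.items()}
    ref, alt = cleaned.get("ref"), cleaned.get("alt")
    # Remove indels: alleles of different lengths are dropped.
    if ref and alt and len(ref) != len(alt):
        return None
    # Normalize the chromosome id by removing a leading 'chr', if present.
    chrom = cleaned.get("chr")
    if chrom and chrom.startswith("chr"):
        cleaned["chr"] = chrom[3:]
    # Express the nucleotide change as REF>ALT when both alleles are known.
    cleaned["nucleotide_change"] = f"{ref}>{alt}" if ref and alt else None
    return cleaned
```

Returning None for indels keeps the filtering decision in one place, so the caller can simply skip rows that come back empty.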

Script Specifications

The script must be called from the command line and takes the following command line arguments:

Input
  • -i : A path to the CIVIC .csv file

  • -m : A path to the folder containing mapping files

  • -d : The name of the doid mapping file

  • -e : The name of the enst to uniprot accession mapping file

  • -o : A path to the output folder

Output
  • A .csv file with mutation data mapped to doid terms and uniprot accessions

Usage
  • python map_civic_csv.py -h

*Gives a description of the necessary commands

  • python map_civic_csv.py -i <path/input_file.csv> -m <path/mapping_folder> -d <doid_mapping_file_name> -e <enst_mapping_file_name> -o <path/>

*Runs the script with the given input csv and outputs a csv with mutations mapped to doid terms and uniprot accessions

Additional Notes

All the mapping files are available in the scripts repository, in the folder pipeline/convert_step2/mapping.

The mapping files used for converting CIVIC are:

DOID: tcga_doid_mapping.csv

CIVIC DOID child terms were mapped to DOID parent terms using the following table (generated from the OncoMX DOID mapping table):

CIViC Entity Disease | DOID Term
Acral_Lentiginous_Melanoma_(DOID_6367) | DOID:4159 / skin cancer
Acute_Lymphoblastic_Leukemia_(DOID_9952) | DOID:2531 / hematologic cancer
Acute_Myeloid_Leukemia_(DOID_9119) | DOID:2531 / hematologic cancer
Acute_Promyelocytic_Leukemia_(DOID_0060318) | DOID:2531 / hematologic cancer
Adrenal_Gland_Pheochromocytoma_(DOID_0050892) | DOID:3953 / adrenal gland cancer
Angiosarcoma_(DOID_0001816) | NA
Basal_Cell_Carcinoma_(DOID_2513) | DOID:4159 / skin cancer
Biliary_Tract_Cancer_(DOID_4607) | NA
Bladder_Carcinoma_(DOID_4007) | DOID:11054 / urinary bladder cancer
Bladder_Urothelial_Carcinoma_(DOID_4006) | DOID:11054 / urinary bladder cancer
Bone_Marrow_Cancer_(DOID_4960) | DOID:2531 / hematologic cancer
Brain_Glioma_(DOID_0060108) | DOID:1319 / brain cancer
Breast_Cancer_(DOID_1612) | DOID:1612 / breast cancer
Breast_Carcinoma_(DOID_3459) | DOID:1612 / breast cancer
Bronchiolo-alveolar_Adenocarcinoma_(DOID_4926) | NA
Cancer_(DOID_162) | NA
Cervical_Cancer_(DOID_4362) | DOID:4362 / cervical cancer
Cervix_Carcinoma_(DOID_2893) | DOID:4362 / cervical cancer
Childhood_Acute_Lymphocytic_Leukemia_(DOID_0080144) | DOID:2531 / hematologic cancer
Childhood_Low-grade_Glioma_(DOID_0080830) | NA
Childhood_Pilocytic_Astrocytoma_(DOID_6812) | NA
Cholangiocarcinoma_(DOID_4947) | DOID:4606 / bile duct cancer
Chromophobe_Renal_Cell_Carcinoma_(DOID_4471) | DOID:263 / kidney cancer
Chronic_Leukemia_(DOID_1036) | DOID:2531 / hematologic cancer
Chronic_Lymphocytic_Leukemia_(DOID_1040) | DOID:2531 / hematologic cancer
Chronic_Myeloid_Leukemia_(DOID_8552) | DOID:2531 / hematologic cancer
Chronic_Neutrophilic_Leukemia_(DOID_0080187) | DOID:2531 / hematologic cancer
Chuvash_Polycythemia_(DOID_0060474) | DOID:2531 / hematologic cancer
Clear_Cell_Renal_Cell_Carcinoma_(DOID_4467) | DOID:263 / kidney cancer
Colon_Cancer_(DOID_219) | DOID:9256 / colorectal cancer
Colon_Mucinous_Adenocarcinoma_(DOID_3029) | DOID:9253 / gastrointestinal stromal tumor
Colorectal_Adenocarcinoma_(DOID_0050861) | DOID:9256 / colorectal cancer
Colorectal_Cancer_(DOID_9256) | DOID:9256 / colorectal cancer
Desmoid_Tumor_(DOID_0080366) | NA
Diffuse_Midline_Glioma_H3_K27M-mutant_(DOID_0080684) | NA
Endometrial_Adenocarcinoma_(DOID_2870) | DOID:363 / uterine cancer
Endometrial_Cancer_(DOID_1380) | DOID:363 / uterine cancer
Endometrial_Hyperplasia_(DOID_0080365) | NA
Endometrioid_Ovary_Carcinoma_(DOID_5828) | NA
Epithelial_Ovarian_Cancer_(DOID_2152) | DOID:2394 / ovarian cancer
Esophageal_Cancer_(DOID_5041) | DOID:5041 / esophageal cancer
Esophagus_Squamous_Cell_Carcinoma_(DOID_3748) | DOID:5041 / esophageal cancer
Estrogen-receptor_Positive_Breast_Cancer_(DOID_0060075) | DOID:1612 / breast cancer
Ewing_Sarcoma_Of_Bone_(DOID_3368) | DOID:184 / bone cancer
Follicular_Lymphoma_(DOID_0050873) | DOID:2531 / hematologic cancer
Gastrointestinal_Neuroendocrine_Tumor_(DOID_0050626) | DOID:9253 / gastrointestinal stromal tumor
Gastrointestinal_Stromal_Tumor_(DOID_9253) | DOID:9253 / gastrointestinal stromal tumor
Glioblastoma_(DOID_3068) | DOID:1319 / brain cancer
Hairy_Cell_Leukemia_(DOID_285) | DOID:2531 / hematologic cancer
Head_And_Neck_Cancer_(DOID_11934) | DOID:11934 / head and neck cancer
Head_And_Neck_Squamous_Cell_Carcinoma_(DOID_5520) | DOID:11934 / head and neck cancer
Hematologic_Cancer_(DOID_2531) | DOID:2531 / hematologic cancer
Hepatocellular_Carcinoma_(DOID_684) | DOID:3571 / liver cancer
Her2-receptor_Positive_Breast_Cancer_(DOID_0060079) | DOID:1612 / breast cancer
High_Grade_Glioma_(DOID_3070) | DOID:1319 / brain cancer
Inflammatory_Myofibroblastic_Tumor_(DOID_0050905) | NA
Intrahepatic_Cholangiocarcinoma_(DOID_4928) | DOID:4606 / bile duct cancer
Langerhans_Cell_Sarcoma_(DOID_7146) | DOID:2531 / hematologic cancer
Laryngeal_Squamous_Cell_Carcinoma_(DOID_2876) | DOID:2596 / larynx cancer
Leukemia_(DOID_1240) | DOID:2531 / hematologic cancer
Li-Fraumeni_Syndrome_(DOID_3012) | NA
Lung_Adenocarcinoma_(DOID_3910) | DOID:1324 / lung cancer
Lung_Cancer_(DOID_1324) | DOID:1324 / lung cancer
Lung_Carcinoma_(DOID_3905) | DOID:1324 / lung cancer
Lung_Non-small_Cell_Carcinoma_(DOID_3908) | DOID:1324 / lung cancer
Lung_Small_Cell_Carcinoma_(DOID_5409) | DOID:1324 / lung cancer
Lymphoid_Leukemia_(DOID_1037) | DOID:2531 / hematologic cancer
Lymphoma_(DOID_0060058) | DOID:2531 / hematologic cancer
Malignant_Exocrine_Pancreas_Neoplasm_(DOID_1795) | DOID:1793 / pancreatic cancer
Malignant_Mesothelioma_(DOID_1790) | DOID:1790 / malignant mesothelioma
Mammary_Analogue_Secretory_Carcinoma_(DOID_0080808) | NA
Medulloblastoma_(DOID_0050902) | DOID:1319 / brain cancer
Melanoma_(DOID_1909) | DOID:4159 / skin cancer
Merkel_Cell_Carcinoma_(DOID_3965) | DOID:4159 / skin cancer
Mucosal_Melanoma_(DOID_0050929) | DOID:4159 / skin cancer
Multiple_Myeloma_(DOID_9538) | DOID:2531 / hematologic cancer
Myelodysplastic_Syndrome_(DOID_0050908) | DOID:2531 / hematologic cancer
Myeloid_And_Lymphoid_Neoplasms_With_Eosinophilia_And_Abnormalities_Of_PDGFRA_PDGFRB_And_FGFR1_(DOID_0080164) | DOID:2531 / hematologic cancer
Neuroblastoma_(DOID_769) | NA
Oligodendroglioma_(DOID_3181) | DOID:3070 / malignant glioma
Osteosarcoma_(DOID_3347) | DOID:184 / bone cancer
Ovarian_Cancer_(DOID_2394) | DOID:2394 / ovarian cancer
Ovarian_Clear_Cell_Carcinoma_(DOID_0050934) | DOID:2394 / ovarian cancer
Ovarian_Granulosa_Cell_Tumor_(DOID_2999) | DOID:2394 / ovarian cancer
Ovarian_Serous_Carcinoma_(DOID_0050933) | DOID:2394 / ovarian cancer
Ovarian_Sex-cord_Stromal_Tumor_(DOID_0080369) | DOID:2394 / ovarian cancer
Ovary_Serous_Adenocarcinoma_(DOID_5744) | DOID:2394 / ovarian cancer
PTEN_Hamartoma_Tumor_Syndrome_(DOID_0080191) | NA
Pancreatic_Adenocarcinoma_(DOID_4074) | DOID:1793 / pancreatic cancer
Pancreatic_Cancer_(DOID_1793) | DOID:1793 / pancreatic cancer
Pancreatic_Carcinoma_(DOID_4905) | DOID:1793 / pancreatic cancer
Pancreatic_Ductal_Adenocarcinoma_(DOID_3498) | DOID:1793 / pancreatic cancer
Pancreatic_Ductal_Carcinoma_(DOID_3587) | DOID:1793 / pancreatic cancer
Paraganglioma_(DOID_0050773) | DOID:3953 / adrenal gland cancer
Parietal_Lobe_Ependymoma_(DOID_0050903) | NA
Peritoneal_Mesothelioma_(DOID_1788) | DOID:1725 / peritoneum cancer
Polycythemia_Vera_(DOID_8997) | DOID:2531 / hematologic cancer
Prostate_Cancer_(DOID_10283) | DOID:10283 / prostate cancer
Rectum_Cancer_(DOID_1993) | DOID:9256 / colorectal cancer
Renal_Carcinoma_(DOID_4451) | DOID:263 / kidney cancer
Renal_Cell_Carcinoma_(DOID_4450) | DOID:263 / kidney cancer
Rhabdomyosarcoma_(DOID_3247) | DOID:4045 / muscle cancer
Sertoli-Leydig_Cell_Tumor_(DOID_2997) | NA
Skin_Melanoma_(DOID_8923) | DOID:4159 / skin cancer
Skin_Squamous_Cell_Carcinoma_(DOID_3151) | DOID:4159 / skin cancer
Solid_Tumor | NA
Spindle_Cell_Rhabdomyosarcoma_(DOID_3260) | DOID:4045 / muscle cancer
Stomach_Cancer_(DOID_10534) | DOID:10534 / stomach cancer
Stomach_Carcinoma_(DOID_5517) | DOID:10534 / stomach cancer
Systemic_Mastocytosis_(DOID_349) | DOID:2531 / hematologic cancer
T-cell_Acute_Lymphoblastic_Leukemia_(DOID_5603) | NA
Thymic_Carcinoma_(DOID_3284) | DOID:3277 / thymus cancer
Thyroid_Gland_Anaplastic_Carcinoma_(DOID_0080522) | DOID:1781 / thyroid gland cancer
Thyroid_Gland_Cancer_(DOID_1781) | DOID:1781 / thyroid gland cancer
Thyroid_Gland_Carcinoma_(DOID_3963) | DOID:1781 / thyroid gland cancer
Thyroid_Gland_Hurthle_Cell_Carcinoma_(DOID_8161) | DOID:1781 / thyroid gland cancer
Thyroid_Gland_Medullary_Carcinoma_(DOID_3973) | DOID:1781 / thyroid gland cancer
Thyroid_Gland_Papillary_Carcinoma_(DOID_3969) | DOID:1781 / thyroid gland cancer
Transitional_Cell_Carcinoma_(DOID_2671) | NA
Tuberous_Sclerosis_(DOID_13515) | NA
Villous_Adenoma | NA
Von_Hippel-Lindau_Disease_(DOID_14175) | NA
Waldenstroem’s_Macroglobulinemia_(DOID_0060901) | DOID:2531 / hematologic cancer
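
The lookup described by the table can be sketched as follows. The dictionary holds only three rows copied from the table above, and the function names and regular expression are illustrative assumptions rather than the logic in map_civic_csv.py.

```python
import re

# A small excerpt of the table: CIViC child DOID -> (parent DOID, parent name).
PARENT_TERMS = {
    "DOID:1909": ("DOID:4159", "skin cancer"),   # Melanoma
    "DOID:3068": ("DOID:1319", "brain cancer"),  # Glioblastoma
    "DOID:3910": ("DOID:1324", "lung cancer"),   # Lung_Adenocarcinoma
}

# Matches the trailing '(DOID_<digits>)' part of a CIViC disease string.
DOID_PATTERN = re.compile(r"\(DOID_(\d+)\)$")


def extract_doid(entity):
    """Pull 'DOID:<id>' out of a string such as 'Melanoma_(DOID_1909)'."""
    match = DOID_PATTERN.search(entity)
    return f"DOID:{match.group(1)}" if match else None


def parent_term(entity):
    """Return the (parent DOID, name) pair, or None when unmapped (NA in the table)."""
    return PARENT_TERMS.get(extract_doid(entity))
```

Entries without a DOID suffix, such as Solid_Tumor, fall through to None, which corresponds to the NA rows of the table.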

Uniprot Accession: human_protein_transcriptlocus.csv

Transcript IDs (starting with ENST) were mapped to uniprot isoform accessions.

Mapping to the uniprot canonical accession was NOT performed, because in the final dataset a mutation for the same canonical accession would then be listed with different amino acid changes.
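
The issue can be seen with a toy example. All transcript IDs, accessions, and amino acid changes below are invented placeholders, not real records, and the function name is hypothetical.

```python
from collections import defaultdict

# (transcript, isoform accession, amino acid change) for one genomic mutation
# annotated on two transcripts of the same gene.
records = [
    ("ENST_A", "P00000-1", "V600E"),
    ("ENST_B", "P00000-2", "V560E"),
]


def group_changes(records, to_canonical):
    """Group amino acid changes by accession, optionally collapsing isoforms."""
    grouped = defaultdict(set)
    for _transcript, accession, change in records:
        # Dropping the '-N' suffix turns an isoform accession into the canonical one.
        key = accession.split("-")[0] if to_canonical else accession
        grouped[key].add(change)
    return dict(grouped)


print(group_changes(records, to_canonical=False))  # one change per isoform
print(group_changes(records, to_canonical=True))   # two changes under one accession
```

Keeping the isoform accession preserves a one-to-one relation between accession and amino acid change for a given mutation.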