Convert: ICGC

Scripts

genomic liftover (mapvcf_copySA.py) > convert_icgc_vcf.py > map_icgc.py

Procedure

Perform liftover of mutations from GRCh37 to GRCh38 (mapvcf_copySA)

Summary

The most recent data release for CIVIC is aligned to the GRCh37 human reference genome. For this update, however, we are using the human reference genome GRCh38.

To convert coordinates between the two reference genomes, we use a ‘liftover’ tool to remap the genomic coordinates.

Seun performed the liftover and provided the notes listed below.

Genomic Liftover Notes

Step performed and notes provided by Seun Agbaje

VCF

A VCF (Variant Call Format) file is a text file used to store gene sequence variations. The files usually start with lines of metadata, followed by a header line naming the columns, and then one line per variant. Because the standards for formatting and relaying genomic data keep evolving, there are numerous versions of the VCF format and of the reference files and dependencies that VCF files rely on.

Fields

Common fields for VCF files include:

| Field | Description |
|---|---|
| Chrom | chromosome that the variation is being called on |
| Pos | 1-based position of the variant |
| ID | identifier of the variant |
| Ref | reference base(s) at the position of the variant |
| Alt | alternate allele(s) at the position |
| Qual | quality score of the given alleles |
| Filter | indicates which filters the variant passed or failed |
| Info | descriptions of the variation |
| Format | (optional) fields describing the sample |
| Samples | values, for each sample, of the fields listed under Format |
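To make the field layout concrete, here is a minimal Python sketch that splits one tab-delimited VCF data line into the fields listed above. It is not part of the pipeline scripts: the function names `parse_vcf_line` and `read_vcf` are hypothetical, and the sketch assumes a well-formed line with at least the eight fixed columns.

```python
# Column names follow the field list above; real VCF header lines spell them in upper case.
VCF_COLUMNS = ["Chrom", "Pos", "ID", "Ref", "Alt", "Qual", "Filter", "Info"]


def parse_vcf_line(line):
    """Split one tab-delimited VCF data line into a dict keyed by field name."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, parts[:8]))
    record["Pos"] = int(record["Pos"])        # 1-based coordinate
    record["Alt"] = record["Alt"].split(",")  # multiple alternate alleles are comma-separated

    # Info is a semicolon-separated list of key=value pairs or bare flags.
    info = {}
    for item in record["Info"].split(";"):
        if not item or item == ".":
            continue
        key, _, value = item.partition("=")
        info[key] = value if "=" in item else True
    record["Info"] = info

    # Format and the per-sample columns are optional.
    record["Format"] = parts[8].split(":") if len(parts) > 8 else []
    record["Samples"] = [sample.split(":") for sample in parts[9:]]
    return record


def read_vcf(lines):
    """Yield parsed records, skipping metadata ('##') and header ('#') lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)
```

Passing an open file handle to `read_vcf` yields one dictionary per variant line.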

Converting with CrossMap

CrossMap is a program that can convert genome coordinates between different assemblies, such as hg18 (NCBI36) to hg19 (GRCh37). It is written in Python and is offered either as a web tool hosted by Ensembl, with limited capacity, or as a local script with full functionality. To get the extra customizability and the option to convert files over 50 MB, it is necessary to run CrossMap locally.

CrossMap documentation: http://crossmap.sourceforge.net/

Requirements

  • Python 2 or Python 3 installed on a Linux server

  • Chain file - describes a pairwise alignment between two reference assemblies
    • Chain files can be found through UCSC, Ensembl, and other sources

    • Compressed files are allowed

    • hg19ToHg38.over.chain gave the best results in testing

  • Target (input) file - the file to be converted, in a format compatible with CrossMap
    • CrossMap supports VCF, BAM/CRAM/SAM, MAF, and other formats used to relay genomic data

    • Compressed files are allowed

  • Reference file - FASTA file of the target genome assembly

Other files used

  • mapvcf is the script from the CrossMap package that does the conversion. Attached is the version I used. I believe commenting out lines 100:109 is what allowed it to work

  • hg19ToHg38 is the chain file that I used

  • This is the command I used to get the assembly file, which is from UCSC: “wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz &”

  • This is the command I used to unzip the assembly file: “gzip -dk hg38.fa.gz”

  • The exact command I ran to create the file is: “python3 ./.local/bin/CrossMap.py vcf /mnt/d/hg19ToHg38.over.chain.gz /mnt/d/icgc_missense_mutations.vcf hg38.fa /mnt/d/icgc_missense_mutations_38_hgfz.vcf”

  • Of note, there are numerous other assembly and chain files. I tried 3 or 4 of each, and the ones linked here were the best. I judged "best" by both what the script reports and how large the final VCF file was
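When several chain and assembly files are being compared, it can help to build the CrossMap call in one place. The sketch below only reassembles the argument order of the command quoted above (`vcf`, chain file, input VCF, reference FASTA, output VCF). The helper names `crossmap_vcf_command` and `run_liftover` are hypothetical, and the default script location is simply the one used in these notes.

```python
import subprocess


def crossmap_vcf_command(chain, input_vcf, ref_fasta, output_vcf,
                         crossmap="./.local/bin/CrossMap.py"):
    """Return the argument list for a CrossMap VCF liftover, in the order used in these notes."""
    return ["python3", crossmap, "vcf", chain, input_vcf, ref_fasta, output_vcf]


def run_liftover(chain, input_vcf, ref_fasta, output_vcf):
    """Run the liftover and raise CalledProcessError if CrossMap exits with an error."""
    command = crossmap_vcf_command(chain, input_vcf, ref_fasta, output_vcf)
    subprocess.run(command, check=True)
```

For example, `crossmap_vcf_command("/mnt/d/hg19ToHg38.over.chain.gz", "/mnt/d/icgc_missense_mutations.vcf", "hg38.fa", "/mnt/d/icgc_missense_mutations_38_hgfz.vcf")` reproduces the command listed above as an argument list.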

Output

Two output files were generated from the liftover and stored on the OncoMX-tst server at /software/pipeline/integrator/downloads/biomuta/v-5.0/icgc/

  • icgc_missense_mutations_38.vcf - All mutations with converted coordinates

  • icgc_missense_mutations_38_fail.vcf - Mutations whose coordinates could not be converted

Only the mutations whose coordinates were successfully converted were carried forward in the pipeline.
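A quick sanity check on a liftover is to count how many records ended up in each of the two output files. The sketch below is an illustration rather than part of the pipeline: `count_records` and `liftover_summary` are hypothetical helpers, and they assume that every non-empty line not starting with "#" is one mutation record.

```python
def count_records(path):
    """Count data lines in a VCF-like file, ignoring blank lines and '#' header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip() and not line.startswith("#"))


def liftover_summary(mapped_path, failed_path):
    """Summarise how many mutations were converted and how many could not be converted."""
    mapped = count_records(mapped_path)
    failed = count_records(failed_path)
    total = mapped + failed
    return {
        "mapped": mapped,
        "failed": failed,
        "fraction_mapped": mapped / total if total else 0.0,
    }
```

Calling `liftover_summary("icgc_missense_mutations_38.vcf", "icgc_missense_mutations_38_fail.vcf")` in the output folder gives the two counts and the fraction of mutations carried forward.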

Run convert_icgc_vcf.py

Summary

The Python script convert_icgc_vcf.py converts the VCF-formatted mutation file to a CSV file.

In the VCF format, each mutation line in the file can contain multiple annotations and annotation-specific information.

The output CSV format contains only one annotation per line, with the associated annotation-specific information.

So that the script knows how the information in the mutation and annotation fields is structured, a schema describing the fields is provided to the script.

Example Line Transformation

Input VCF lines

mutation A info | mutation A annotation 1 info | mutation A annotation 2 info

mutation B info | mutation B annotation 1 info | mutation B annotation 2 info | mutation B annotation 3 info

Output CSV lines

mutation A info,annotation 1 info

mutation A info,annotation 2 info

mutation B info,annotation 1 info

mutation B info,annotation 2 info

mutation B info,annotation 3 info
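The transformation above can be sketched in a few lines of Python. This is a simplified illustration of the one-row-per-annotation idea, not the code of convert_icgc_vcf.py: it assumes the schematic layout shown above (mutation information first, then annotations separated by "|"), and the function names are hypothetical. The real script also uses the schema to split each part into named fields.

```python
import csv
import io


def explode_annotations(line, delimiter="|"):
    """Turn one 'mutation | annotation | annotation ...' line into one row per annotation."""
    mutation, *annotations = [part.strip() for part in line.split(delimiter)]
    return [[mutation, annotation] for annotation in annotations]


def lines_to_csv(lines):
    """Write every (mutation, annotation) pair as its own CSV row and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in lines:
        if line.strip():
            writer.writerows(explode_annotations(line))
    return buffer.getvalue()
```

A line with two annotations produces two CSV rows that repeat the mutation information, and a line with three annotations produces three.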

Script Specifications

The script must be called from the command line and takes specific command line arguments.

Input
  • -i : A path to the ICGC .vcf file

  • -s : A schema file listing the field names found in the annotations, which are also used as the field names of the output file

  • -o : A path to the output folder, where the transformed CSV data will go

Output
  • A .csv file with mutation data where each row contains one mutation and one unique annotation

Usage
  • python convert_icgc_vcf.py -h

*Gives a description of the necessary commands

  • python convert_icgc_vcf.py -i <path/input_file.vcf> -s <path/schema.json> -o <path/>

*Runs the script with the given input vcf and schema json and outputs a csv file
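For readers unfamiliar with how such flags are usually declared, the following is a minimal argparse sketch of a command line interface with the same three options. It is an assumption about how the interface could be written, not a copy of convert_icgc_vcf.py: the `dest` names and help strings are illustrative.

```python
import argparse


def build_parser():
    """Declare the -i, -s and -o options described above (names of dest values are illustrative)."""
    parser = argparse.ArgumentParser(
        description="Convert an ICGC VCF file to a CSV file with one annotation per row."
    )
    parser.add_argument("-i", dest="input_vcf", required=True,
                        help="path to the ICGC .vcf file")
    parser.add_argument("-s", dest="schema", required=True,
                        help="path to the schema file describing the fields")
    parser.add_argument("-o", dest="output_dir", required=True,
                        help="path to the output folder for the CSV data")
    return parser
```

With this parser, `build_parser().parse_args()` reads the options from the command line, and `-h` prints the generated help text.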

Run map_icgc.py

Summary

The Python script map_icgc.py takes the output of the VCF converter script and:
  • Map the data to:
    • uniprot accessions

    • doid parent terms

  • Rename fields

  • Reformat fields
    • amino acid change and position

    • chromosome id

    • genomic location

    • nucleotide change

Script Specifications

The script must be called from the command line and takes specific command line arguments.

Input
  • -i : A path to the ICGC .csv file

  • -m : A path to the folder containing mapping files

  • -d : The name of the doid mapping file

  • -e : The name of the ENST (transcript ID) to UniProt accession mapping file

  • -o : A path to the output folder

Output
  • A .csv file with mutation data formatted to the biomuta field structure

Usage
  • python map_icgc.py -h

*Gives a description of the necessary commands

  • python map_icgc.py -i <path/input_file.csv> -m <path/> -d doid_mapping_file.csv -e enst_mapping_file.csv -o <path/>

*Runs the script with the given csv file and outputs a csv file formatted for the final biomuta master file
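Note that -m takes a folder while -d and -e take file names inside that folder. A small sketch of how the three values combine into full paths is given below. The helper names are hypothetical and the sketch is not taken from map_icgc.py.

```python
import os


def resolve_mapping_files(mapping_dir, doid_name, enst_name):
    """Join the mapping folder (-m) with the DOID (-d) and ENST (-e) mapping file names."""
    return {
        "doid": os.path.join(mapping_dir, doid_name),
        "enst": os.path.join(mapping_dir, enst_name),
    }


def missing_files(paths):
    """Return the keys of any mapping files that do not exist on disk."""
    return sorted(name for name, path in paths.items() if not os.path.isfile(path))
```

Checking `missing_files(...)` before processing gives an early, readable error when a mapping file name is mistyped.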

Additional Notes

All the mapping files are available in the scripts repository in the folder pipeline/convert_step2/mapping

The mapping files used for converting the ICGC csv are:

DOID: tcga_doid_mapping.csv

ICGC uses TCGA study terms, so the same TCGA to DOID parent terms are used for mapping (generated from previous Biomuta mapping):

| DO_slim_id | DO_slim_name | TCGA_project |
|---|---|---|
| DOID:5041 | esophageal cancer | TCGA-ESCA |
| DOID:2531 | hematologic cancer | TCGA-DLBC |
| DOID:9256 | colorectal cancer | TCGA-READ |
| DOID:1319 | brain cancer | TCGA-GBM |
| DOID:1319 | brain cancer | TCGA-LGG |
| DOID:1781 | thyroid cancer | TCGA-THCA |
| DOID:11054 | urinary bladder cancer | TCGA-BLCA |
| DOID:363 | uterine cancer | TCGA-UCEC |
| DOID:169 | neuroendocrine tumor | TCGA-PCPG |
| DOID:4362 | cervical cancer | TCGA-CESC |
| DOID:363 | uterine cancer | TCGA-UCS |
| DOID:3277 | thymus cancer | TCGA-THYM |
| DOID:3571 | liver cancer | TCGA-LIHC |
| DOID:11934 | head and neck cancer | TCGA-HNSC |
| DOID:2174 | ocular cancer | TCGA-UVM |
| DOID:4159 | skin cancer | TCGA-SKCM |
| DOID:9256 | colorectal cancer | TCGA-COAD |
| DOID:3953 | adrenal gland cancer | TCGA-ACC |
| DOID:1793 | pancreatic cancer | TCGA-PAAD |
| DOID:2994 | germ cell cancer | TCGA-TGCT |
| DOID:1324 | lung cancer | TCGA-LUSC |
| DOID:1790 | malignant mesothelioma | TCGA-MESO |
| DOID:2394 | ovarian cancer | TCGA-OV |
| DOID:1115 | sarcoma | TCGA-SARC |
| DOID:263 | kidney cancer | TCGA-KIRP |
| DOID:10534 | stomach cancer | TCGA-STAD |
| DOID:2531 | hematologic cancer | TCGA-LAML |
| DOID:10283 | prostate cancer | TCGA-PRAD |
| DOID:1324 | lung cancer | TCGA-LUAD |
| DOID:1612 | breast cancer | TCGA-BRCA |
| DOID:263 | kidney cancer | TCGA-KIRC |
| DOID:263 | kidney cancer | TCGA-KICH |
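A lookup from TCGA project to DOID parent term can be sketched as follows. The sketch assumes that tcga_doid_mapping.csv is comma-separated with a header row using the three column names listed above. The function names are hypothetical and this is not the code of map_icgc.py.

```python
import csv


def load_doid_mapping(path):
    """Read the mapping file into a dict: TCGA project -> (DO slim id, DO slim name)."""
    with open(path, newline="") as handle:
        return {
            row["TCGA_project"]: (row["DO_slim_id"], row["DO_slim_name"])
            for row in csv.DictReader(handle)
        }


def map_project(mapping, tcga_project):
    """Return the DOID parent term for a project, or None when the project is not in the mapping."""
    return mapping.get(tcga_project)
```

Because several projects share one parent term (for example TCGA-GBM and TCGA-LGG both map to DOID:1319), the dictionary is keyed by project rather than by DOID.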

Uniprot Accession: human_protein_transcriptlocus.csv

The transcript ID (starting with ENST) was mapped to the UniProt annotation accession.

Mapping was NOT performed to the UniProt canonical accession because this caused an issue in the final dataset: a mutation for the same canonical accession would be listed with different amino acid changes.
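The conflict described above can be illustrated with a toy example. The sketch assumes, purely for illustration, that annotation accessions look like ACCESSION-N and that the canonical accession is the part before the hyphen. The mutation identifier, accessions, and amino acid changes are made up and do not come from the ICGC data.

```python
from collections import defaultdict


def canonical(accession):
    """Collapse an isoform-style accession (assumed form ACCESSION-N) to its base accession."""
    return accession.split("-")[0]


def conflicting_changes(rows, key=lambda accession: accession):
    """Find (mutation, accession) pairs that are listed with more than one amino acid change."""
    seen = defaultdict(set)
    for mutation_id, accession, aa_change in rows:
        seen[(mutation_id, key(accession))].add(aa_change)
    return {pair: sorted(changes) for pair, changes in seen.items() if len(changes) > 1}


# Made-up rows: one mutation annotated on two transcripts of the same protein.
rows = [
    ("mut1", "X00001-1", "R100H"),
    ("mut1", "X00001-2", "R58H"),
]
```

With the accessions kept as they are, `conflicting_changes(rows)` finds no conflict. With `key=canonical`, both rows collapse onto the same accession and the mutation appears with two different amino acid changes, which is the situation the notes describe.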