Convert: ICGC
Scripts
genomic liftover (mapvcf_copySA.py) > convert_icgc_vcf.py > map_icgc.py
Procedure
Perform liftover of mutations from GRCh37 to GRCh38 (mapvcf_copySA)
Summary
The most recent data release for ICGC is aligned to the GRCh37 human reference genome. For this update, however, we are using the GRCh38 human reference genome.
To convert coordinates between the two reference genomes, we use a ‘liftover’ tool to remap the genomic coordinates.
Seun performed the liftover and provided the notes listed below.
Genomic Liftover Notes
Step performed and notes provided by Seun Agbaje
VCF
A VCF (Variant Call Format) file is a text file used to store gene sequence variations. The files usually start with lines of metadata, followed by a header line naming the columns that describe each variant. Because the standard for formatting and relaying genomic data is always evolving, there are numerous versions of the VCF format and of the reference files and dependencies that VCF files rely on.
Fields
Common fields for VCF files include:
| Field | Description |
|---|---|
| Chrom | chromosome that the variation is being called on |
| Pos | 1-based position of the variant |
| ID | identifier of the variant |
| Ref | reference base at the position of the variant |
| Alt | alternate alleles at the position |
| Qual | quality score of the given alleles |
| Filter | indicates which set of filters failed or passed |
| Info | descriptions of the variation |
| Format | (optional) fields describing the sample |
| Samples | values for each of the samples listed under format |
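To make the field layout concrete, here is a minimal sketch of splitting one VCF data line into the eight fixed columns from the table above. The example line and its INFO keys are hypothetical and only illustrate the format; this is not the parsing code used by the pipeline scripts.

```python
from typing import Dict, List, NamedTuple


class VcfRecord(NamedTuple):
    chrom: str
    pos: int
    id: str
    ref: str
    alt: List[str]
    qual: str
    filter: str
    info: Dict[str, str]


def parse_vcf_line(line: str) -> VcfRecord:
    """Split one tab-delimited VCF data line into its eight fixed fields."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = cols[:8]
    info_dict: Dict[str, str] = {}
    for entry in info.split(";"):
        if not entry or entry == ".":
            continue
        # INFO entries are KEY=value pairs; flag entries have no value.
        key, _, value = entry.partition("=")
        info_dict[key] = value
    # Multiple alternate alleles are comma-separated in the ALT column.
    return VcfRecord(chrom, int(pos), vid, ref, alt.split(","), qual, flt, info_dict)


# Hypothetical data line, for illustration only.
example = "1\t100\tMU1\tG\tA,T\t.\t.\tNS=3;DB"
record = parse_vcf_line(example)
print(record.chrom, record.pos, record.alt, record.info)
```

Metadata lines (starting with `##`) and the column header line (starting with `#`) should be skipped before calling the function.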
Converting with CrossMap
CrossMap is a program that converts genome coordinates between different assemblies, such as hg18 (NCBI36) to hg19 (GRCh37). It is written in Python and is offered either as a web tool by Ensembl, in limited capacity, or as a local script. For full functionality, including extra customizability and the option to convert files over 50 MB, it is necessary to run a local installation of CrossMap.
Crossmap Documentation: http://crossmap.sourceforge.net/
Requirements
- Python 2 or Python 3 installed on a Linux server
- Chain file - describes a pairwise alignment between two reference assemblies
  - Chain files can be found through UCSC, Ensembl, and other sources
  - Compressed files are allowed
  - hg19ToHg38.over.chain performed best in testing
- Target (input) file - the file to be converted, in a format compatible with CrossMap
  - CrossMap supports VCF, BAM/CRAM/SAM, MAF, and other formats used to relay genomic data
  - Compressed files are allowed
- Reference file - FASTA file of the desired genome assembly
Other files used
mapvcf is the script from the package that does the conversion. Attached is the version I used. I believe commenting out lines 100:109 is what allowed it to work.
hg19ToHg38 is the chain file that I used.
This is the command I used to get the assembly file, which is from UCSC: “wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz &”
This is the command I used to unzip the assembly file: “gzip -dk hg38.fa.gz”
The exact command I ran to create the file is this: “python3 ./.local/bin/CrossMap.py vcf /mnt/d/hg19ToHg38.over.chain.gz /mnt/d/icgc_missense_mutations.vcf hg38.fa /mnt/d/icgc_missense_mutations_38_hgfz.vcf”
Of note, there are numerous other assembly and chain files. I tried 3 or 4 of each, and the ones linked here were the best. I judged this by both what the script reports and how large the final VCF file was.
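The positional argument order of the CrossMap command quoted above (chain file, input VCF, reference FASTA, output VCF) is easy to get wrong, so the sketch below assembles the same command from named parameters. The function name `build_crossmap_vcf_command` is hypothetical, and the paths are the ones from the notes; the sketch only builds and prints the command rather than running it.

```python
import shlex


def build_crossmap_vcf_command(chain, input_vcf, ref_fasta, output_vcf,
                               crossmap="./.local/bin/CrossMap.py"):
    """Return the argument list for a CrossMap VCF liftover (hypothetical helper)."""
    # Order follows the command in the notes: chain, input, reference, output.
    return ["python3", crossmap, "vcf", chain, input_vcf, ref_fasta, output_vcf]


cmd = build_crossmap_vcf_command(
    chain="/mnt/d/hg19ToHg38.over.chain.gz",
    input_vcf="/mnt/d/icgc_missense_mutations.vcf",
    ref_fasta="hg38.fa",
    output_vcf="/mnt/d/icgc_missense_mutations_38_hgfz.vcf",
)
print(shlex.join(cmd))
# To execute the liftover, one could pass the list to subprocess.run(cmd, check=True).
```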
Output
Two output files were generated from the liftover and stored on the OncoMX-tst server at /software/pipeline/integrator/downloads/biomuta/v-5.0/icgc/
- icgc_missense_mutations_38.vcf - All mutations with converted coordinates
- icgc_missense_mutations_38_fail.vcf - Mutations whose coordinates could not be converted
Only the mutations whose coordinates were successfully converted were carried forward in the pipeline
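One way to check how much data is carried forward is to count the variant records in the converted and failed files. The following is a minimal sketch that assumes both files are plain or gzip-compressed VCF text; the function names are hypothetical and the demo uses small in-memory stand-ins rather than the real files.

```python
import gzip
import io


def count_vcf_records(handle):
    """Count data lines, skipping blank lines and '#' header lines."""
    return sum(1 for line in handle if line.strip() and not line.startswith("#"))


def count_vcf_file(path):
    """Open a plain or gzip-compressed VCF file and count its records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_vcf_records(handle)


# In-memory stand-ins for the converted and failed files (made-up content).
mapped = io.StringIO("##fileformat=VCFv4.1\n#CHROM\tPOS\n1\t100\n1\t200\n2\t300\n")
failed = io.StringIO("1\t400\n")

n_mapped = count_vcf_records(mapped)
n_failed = count_vcf_records(failed)
print(f"{n_mapped} converted, {n_failed} failed")
```

With the real files, `count_vcf_file` would be called on the two output paths listed above.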
Run convert_icgc_vcf.py
Summary
The python script convert_icgc_vcf.py will convert the vcf formatted mutation file to a csv file.
With the vcf format, each mutation line in the file can contain multiple annotations and annotation-specific information.
The output csv format will contain only one annotation per line with associated annotation-specific information.
In order to know how the information for the mutation and annotation fields is structured, a schema describing the fields is provided to the script.
Example Line Transformation
Input VCF lines
mutation A info | mutation A annotation 1 info | mutation A annotation 2 info
mutation B info | mutation B annotation 1 info | mutation B annotation 2 info | mutation B annotation 3 info
Output CSV lines
mutation A info,annotation 1 info
mutation A info,annotation 2 info
mutation B info,annotation 1 info
mutation B info,annotation 2 info
mutation B info,annotation 3 info
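The transformation above can be sketched in a few lines of Python. This follows the simplified illustration only: it assumes each input line is a mutation block followed by annotation blocks separated by `|`, which is a stand-in for the real VCF layout described by the schema. The function name `explode_annotations` is hypothetical and is not taken from convert_icgc_vcf.py.

```python
import csv
import io


def explode_annotations(line, sep="|"):
    """Return one (mutation, annotation) pair per annotation on the line."""
    parts = [part.strip() for part in line.split(sep)]
    mutation, annotations = parts[0], parts[1:]
    return [(mutation, annotation) for annotation in annotations]


input_lines = [
    "mutation A info | annotation 1 info | annotation 2 info",
    "mutation B info | annotation 1 info | annotation 2 info | annotation 3 info",
]

rows = []
for line in input_lines:
    rows.extend(explode_annotations(line))

# Write the exploded rows as CSV text, one annotation per line.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
print(buffer.getvalue(), end="")
```

The mutation-level information is repeated on every output row, which is what makes each CSV line self-contained.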
Script Specifications
The script must be called from the command line and takes specific command line arguments
Input
-i : A path to the ICGC .vcf file
-s : A path to the schema file listing the annotation field names, which are also used as field names in the output file
-o : A path to the output folder, where the transformed CSV data will go
Output
A .csv file with mutation data where each row contains one mutation and one unique annotation
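For orientation, the three flags listed above could be handled with `argparse` along the following lines. This is an illustrative sketch, not the parser in convert_icgc_vcf.py: the destination names, the choice to make every flag required, and the example file names passed to `parse_args` are all assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the -i, -s and -o flags described above (sketch)."""
    parser = argparse.ArgumentParser(
        description="Convert an ICGC VCF file to CSV (illustrative sketch)."
    )
    parser.add_argument("-i", dest="input_vcf", required=True,
                        help="path to the ICGC .vcf file")
    parser.add_argument("-s", dest="schema", required=True,
                        help="schema file with the annotation field names")
    parser.add_argument("-o", dest="output_dir", required=True,
                        help="folder where the transformed CSV data will go")
    return parser


# Hypothetical argument values, passed explicitly so the sketch runs standalone.
args = build_parser().parse_args(
    ["-i", "icgc_missense_mutations_38.vcf", "-s", "schema_file", "-o", "out/"]
)
print(args.input_vcf, args.schema, args.output_dir)
```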
Usage
Run map_icgc.py
Summary
The python script map_icgc.py will take the output of the vcf converter script and:
- Map the data to:
  - uniprot accessions
  - doid parent terms
- Rename fields
- Reformat fields:
  - amino acid change and position
  - chromosome id
  - genomic location
  - nucleotide change
Script Specifications
The script must be called from the command line and takes specific command line arguments
Input
-i : A path to the ICGC .csv file
-m : A path to the folder containing mapping files
-d : The name of the doid mapping file
-e : The name of the ensp to uniprot accession mapping file
-o : A path to the output folder
Output
A .csv file with mutation data formatted to the biomuta field structure
Usage
Additional Notes
All the mapping files are available in the scripts repository in the folder pipeline/convert_step2/mapping
The mapping files used for converting the ICGC csv are:
DOID: tcga_doid_mapping.csv
ICGC uses TCGA study terms, so the same TCGA to DOID parent terms are used for mapping (generated from previous Biomuta mapping):
| DO_slim_id | DO_slim_name | TCGA_project |
|---|---|---|
| DOID:5041 | esophageal cancer | TCGA-ESCA |
| DOID:2531 | hematologic cancer | TCGA-DLBC |
| DOID:9256 | colorectal cancer | TCGA-READ |
| DOID:1319 | brain cancer | TCGA-GBM |
| DOID:1319 | brain cancer | TCGA-LGG |
| DOID:1781 | thyroid cancer | TCGA-THCA |
| DOID:11054 | urinary bladder cancer | TCGA-BLCA |
| DOID:363 | uterine cancer | TCGA-UCEC |
| DOID:169 | neuroendocrine tumor | TCGA-PCPG |
| DOID:4362 | cervical cancer | TCGA-CESC |
| DOID:363 | uterine cancer | TCGA-UCS |
| DOID:3277 | thymus cancer | TCGA-THYM |
| DOID:3571 | liver cancer | TCGA-LIHC |
| DOID:11934 | head and neck cancer | TCGA-HNSC |
| DOID:2174 | ocular cancer | TCGA-UVM |
| DOID:4159 | skin cancer | TCGA-SKCM |
| DOID:9256 | colorectal cancer | TCGA-COAD |
| DOID:3953 | adrenal gland cancer | TCGA-ACC |
| DOID:1793 | pancreatic cancer | TCGA-PAAD |
| DOID:2994 | germ cell cancer | TCGA-TGCT |
| DOID:1324 | lung cancer | TCGA-LUSC |
| DOID:1790 | malignant mesothelioma | TCGA-MESO |
| DOID:2394 | ovarian cancer | TCGA-OV |
| DOID:1115 | sarcoma | TCGA-SARC |
| DOID:263 | kidney cancer | TCGA-KIRP |
| DOID:10534 | stomach cancer | TCGA-STAD |
| DOID:2531 | hematologic cancer | TCGA-LAML |
| DOID:10283 | prostate cancer | TCGA-PRAD |
| DOID:1324 | lung cancer | TCGA-LUAD |
| DOID:1612 | breast cancer | TCGA-BRCA |
| DOID:263 | kidney cancer | TCGA-KIRC |
| DOID:263 | kidney cancer | TCGA-KICH |
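A lookup from TCGA project to DOID parent term can be sketched as follows. The sketch assumes tcga_doid_mapping.csv is comma-delimited with a header row using the three column names shown in the table; the real file layout may differ. The function name `load_doid_mapping` is hypothetical, and the sample rows are copied from the table.

```python
import csv
import io

# A few rows from the table above, in the assumed CSV layout.
SAMPLE_MAPPING = """DO_slim_id,DO_slim_name,TCGA_project
DOID:5041,esophageal cancer,TCGA-ESCA
DOID:1319,brain cancer,TCGA-GBM
DOID:1319,brain cancer,TCGA-LGG
"""


def load_doid_mapping(handle):
    """Map each TCGA project code to its (DOID, name) parent term."""
    return {
        row["TCGA_project"]: (row["DO_slim_id"], row["DO_slim_name"])
        for row in csv.DictReader(handle)
    }


mapping = load_doid_mapping(io.StringIO(SAMPLE_MAPPING))
print(mapping["TCGA-LGG"])
# Projects missing from the mapping return None instead of raising an error.
print(mapping.get("TCGA-XXXX"))
```

Keying on the TCGA project works because each project appears once in the table, while several projects can share the same DOID parent term.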
Uniprot Accession: human_protein_transcriptlocus.csv
Transcript ID (starts with ENST) was mapped to uniprot annotation accession
Mapping was NOT performed to uniprot canonical accession as this resulted in an issue with the final dataset in which a mutation for the same canonical accession would be listed with different amino acid changes