Step 3: Combine

All resources are combined into a single master dataset.

Scripts

combine_csv.py

Procedure

Run combine_csv.py

Summary

In the convert step, the mutation data from each source was converted to a standardized data structure.

Now these separate csv files (one per source) are combined into a single master csv.

All csv files to be combined should be together in one folder that contains no other csv files.
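The combination described above can be sketched roughly as follows. This is not the actual code of combine_csv.py, only a minimal illustration using the Python standard library. The function name `combine_csv_files` is hypothetical, and the sketch assumes every input file starts with the same header row.

```python
import csv
from pathlib import Path


def combine_csv_files(input_dir, output_path):
    """Concatenate every .csv file in input_dir into one csv at output_path.

    Hypothetical helper, not the real combine_csv.py. Assumes all input
    files share an identical header row. Returns the number of data rows
    written.
    """
    # List the inputs before the output file is created, so the output is
    # not picked up as an input when it does not exist yet.
    input_files = sorted(Path(input_dir).glob("*.csv"))
    if not input_files:
        raise FileNotFoundError(f"no csv files found in {input_dir}")

    header = None
    rows_written = 0
    with open(output_path, "w", newline="") as out_handle:
        writer = csv.writer(out_handle)
        for csv_file in input_files:
            with open(csv_file, newline="") as in_handle:
                reader = csv.reader(in_handle)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip completely empty files
                if header is None:
                    # First file: write the header once.
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"{csv_file.name} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Because the sketch picks up every `*.csv` file in the folder, any stray csv file would end up in the master file, which is why the input folder should hold only the files to be combined.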

Script Specifications

The script must be called from the command line and takes the command line arguments listed below.

Input

-i : The folder containing csv mutation files to combine

-o : The folder to output the combined mutation file

Output

A csv file combining all csv files in a given folder

Usage

python combine_csv.py -h

*Gives a description of the necessary commands

python combine_csv.py -i <path/> -o <path/>

*Runs the script with the given folder and combines all csv files in that folder
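For readers unfamiliar with how such flags are usually handled, the sketch below shows one way a `-i`/`-o` interface could be defined with Python's `argparse`. It is an assumption about the general shape, not the parser used in combine_csv.py; the function name `build_parser`, the attribute names `input_dir` and `output_dir`, and the example folder names are all hypothetical.

```python
import argparse


def build_parser():
    """Build a parser with the two flags described above (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        description="Combine all csv mutation files in a folder into one csv."
    )
    parser.add_argument(
        "-i", dest="input_dir", required=True,
        help="folder containing csv mutation files to combine",
    )
    parser.add_argument(
        "-o", dest="output_dir", required=True,
        help="folder to output the combined mutation file",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the example is
# reproducible; the folder names are placeholders.
args = build_parser().parse_args(["-i", "converted/", "-o", "combined/"])
print(args.input_dir, args.output_dir)
```

With `argparse`, the `-h` flag is generated automatically from the `help` strings.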

Additional Notes

Final fields

Field	Description
sample_name	Sample ID provided by the original resource (for v-5.0 only applies to TCGA and COSMIC)
chr_id	Chromosome number only (no ‘chr’)
start_pos	Genomic coordinates (For v-5.0 these are all with ref GRCh38)
end_pos	Identical to the start position because all mutations affect a single position
ref_nt	Reference nucleotide
alt_nt	Nucleotide mutation
aa_pos	Amino acid position of the amino acid change in the human_protein_transcriptlocus
ref_aa	Reference amino acid
alt_aa	Amino acid variation caused by the mutation
do_name	DO parent term
uniprot_canonical_ac	UniProt accession for the specific ENST or ENSP listed from the source
source	Original data source of the mutation
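A quick sanity check on the combined file is to confirm that its header contains every field in the table above. The sketch below is one possible way to do that. It assumes the master csv has a header row using exactly these field names, and the function name `missing_fields` is hypothetical.

```python
import csv

# Field names copied from the table above.
FINAL_FIELDS = [
    "sample_name", "chr_id", "start_pos", "end_pos", "ref_nt", "alt_nt",
    "aa_pos", "ref_aa", "alt_aa", "do_name", "uniprot_canonical_ac", "source",
]


def missing_fields(csv_path):
    """Return the expected fields that are absent from the csv header row."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle), [])
    return [field for field in FINAL_FIELDS if field not in header]
```

An empty list means all expected columns are present; anything else names the columns to investigate before moving on to the next step.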