Step 3: Combine
All resources are combined into a single master dataset.
Scripts
combine_csv.py
Procedure
Run combine_csv.py
Summary
In the convert step, the mutation data from each source was converted to a standardized data structure.
Now these separate csv files (one per source) are combined into a single master csv file.
All csv files to be combined should be in one folder that contains no other csv files.
Script Specifications
The script must be called from the command line and takes the following command line arguments.
Input
-i : The folder containing csv mutation files to combine
-o : The folder in which to write the combined mutation file
Output
A csv file combining all csv files in a given folder
Usage
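Given the `-i` and `-o` arguments listed above, an invocation would presumably look like `python combine_csv.py -i <input_folder> -o <output_folder>`, where both folder names are placeholders. The sketch below is not the actual `combine_csv.py`; it only illustrates, under stated assumptions, how a script with this interface could concatenate the per-source files. It assumes every input file shares the same header row, and the function names and the output file name `combined.csv` are hypothetical.

```python
import argparse
import csv
from pathlib import Path


def combine_csv_folder(input_dir, output_dir, output_name="combined.csv"):
    """Concatenate every .csv file in input_dir into one file in output_dir.

    Assumption: all input files share an identical header row.
    output_name is a hypothetical default, not the real script's file name.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    output_path = output_dir / output_name

    header = None
    with open(output_path, "w", newline="") as out_handle:
        writer = csv.writer(out_handle)
        for csv_path in sorted(input_dir.glob("*.csv")):
            # Skip the output file itself if both folders are the same.
            if csv_path.resolve() == output_path.resolve():
                continue
            with open(csv_path, newline="") as in_handle:
                reader = csv.reader(in_handle)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # empty file, nothing to add
                if header is None:
                    # Write the header once, taken from the first file.
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"{csv_path.name} has a different header")
                writer.writerows(reader)
    return output_path


def main(argv=None):
    """Parse -i and -o, then combine the files."""
    parser = argparse.ArgumentParser(description="Combine mutation csv files")
    parser.add_argument("-i", dest="input_dir", required=True,
                        help="folder containing csv mutation files to combine")
    parser.add_argument("-o", dest="output_dir", required=True,
                        help="folder in which to write the combined file")
    args = parser.parse_args(argv)
    print(combine_csv_folder(args.input_dir, args.output_dir))
```

To use the sketch from the command line, call `main()` under an `if __name__ == "__main__":` guard. Because it picks up every `*.csv` file in the input folder, it also shows why that folder should contain no unrelated csv files.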
Additional Notes
Final fields
| Field | Description |
|---|---|
| sample_name | Sample ID provided by the original resource (for v-5.0, applies only to TCGA and COSMIC) |
| chr_id | Chromosome number only (no ‘chr’ prefix) |
| start_pos | Genomic coordinate of the mutation (for v-5.0, all coordinates use reference GRCh38) |
| end_pos | Identical to the start position because every mutation affects a single nucleotide position |
| ref_nt | Reference nucleotide |
| alt_nt | Alternate (mutated) nucleotide |
| aa_pos | Position of the amino acid change in the human_protein_transcriptlocus |
| ref_aa | Reference amino acid |
| alt_aa | Amino acid variation caused by the mutation |
| do_name | Disease Ontology (DO) parent term |
| uniprot_canonical_ac | UniProt accession for the specific ENST or ENSP listed by the source |
| source | Original data source of the mutation |
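
The field descriptions above imply a few checks that can be run on the combined file. The following is a minimal sketch of such a check, not part of the pipeline itself; it assumes the csv header uses exactly the field names in the table, and the names `FINAL_FIELDS`, `find_row_problems`, and `check_file` are hypothetical.

```python
import csv

# Field names taken from the table above; assumed to match the csv header.
FINAL_FIELDS = [
    "sample_name", "chr_id", "start_pos", "end_pos", "ref_nt", "alt_nt",
    "aa_pos", "ref_aa", "alt_aa", "do_name", "uniprot_canonical_ac", "source",
]


def find_row_problems(rows):
    """Return (row_number, message) pairs for rows that break the table's rules.

    rows is any iterable of dicts, for example a csv.DictReader.
    """
    problems = []
    for row_number, row in enumerate(rows, start=1):
        missing = [field for field in FINAL_FIELDS if field not in row]
        if missing:
            problems.append((row_number, "missing fields: " + ", ".join(missing)))
            continue
        # chr_id should hold the chromosome number only.
        if row["chr_id"].lower().startswith("chr"):
            problems.append((row_number, "chr_id has a 'chr' prefix"))
        # end_pos is documented as identical to start_pos.
        if row["start_pos"] != row["end_pos"]:
            problems.append((row_number, "start_pos and end_pos differ"))
    return problems


def check_file(path):
    """Run the row checks on a combined csv file."""
    with open(path, newline="") as handle:
        return find_row_problems(csv.DictReader(handle))
```

An empty list from `check_file` means no row violated these particular rules; it does not validate coordinates, accessions, or disease terms.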