Download: COSMIC

There are three COSMIC mutation datasets for coding mutations:

  • COSMIC Complete Mutation Data (Targeted Screens)

    A tab separated table of the complete curated COSMIC dataset (targeted screens) from the current release. It includes all coding point mutations, and the negative data set.

  • COSMIC Mutation Data (Genome Screens)

    A tab separated table of coding point mutations from genome wide screens (including whole exome sequencing).

  • COSMIC Mutations Data

    A tab separated table of all COSMIC coding point mutations from targeted and genome wide screens from the current release.

The COSMIC Mutations Data set was chosen because it combines the Targeted Screens and Genome Screens data.

Downloaded File: COSMIC_SNPs_June_2022.tsv

NOTE: Downloading the mutation datasets requires a COSMIC login. An account can be created for free with an academic email address.

Fields

The COSMIC dataset contains a large number of fields, many of which were filtered out in order to speed up processing in subsequent steps.

A ‘simplified’ version of the file was created by selecting specific columns from the original downloaded file with the command line tool awk.
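The exact awk command is not recorded here. As an illustration of the same column selection, the following is a minimal Python sketch rather than the command that was actually run. It assumes the header spellings match the field names listed in this document, and the function name `select_columns` and the demo row's gene name are hypothetical.

```python
import io

# Columns kept in the simplified file. The header spellings are an assumption
# taken from this document; check them against the real file's header line.
KEEP = [
    "Accession Number",
    "Sample name",
    "Primary site",
    "Mutation CDS",
    "Mutation AA",
    "Mutation genome position",
]


def select_columns(src, dst, keep=KEEP):
    """Copy only the `keep` columns from one tab-separated stream to another."""
    header = next(src).rstrip("\r\n").split("\t")
    # list.index raises ValueError if an expected column is missing.
    indexes = [header.index(name) for name in keep]
    dst.write("\t".join(keep) + "\n")
    for line in src:
        fields = line.rstrip("\r\n").split("\t")
        dst.write("\t".join(fields[i] for i in indexes) + "\n")


# Small in-memory demonstration; "GENE1" is a made-up placeholder value.
demo = io.StringIO(
    "Gene name\tAccession Number\tSample name\tPrimary site\t"
    "Mutation CDS\tMutation AA\tMutation genome position\n"
    "GENE1\tENST00000404621.5\tH_LV-3334-1316090\tbreast\t"
    "c.644C>G\tp.S215*\t12:124466234-124466234\n"
)
out = io.StringIO()
select_columns(demo, out)
simplified = out.getvalue()
```

For the real data, the same function could be called with two open file handles, for example the downloaded COSMIC_SNPs_June_2022.tsv as input and an output file of your choosing. Lines are split on tabs only, which mirrors awk with a tab field separator.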

Fields in Simplified Version

Field Name                  Example
Accession Number            ENST00000404621.5
Sample name                 H_LV-3334-1316090
Primary site                breast
Mutation CDS                c.644C>G
Mutation AA                 p.S215*
Mutation genome position    12:124466234-124466234
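The example values above follow regular patterns, so they can be split into their parts for later processing. The sketch below is an illustration only: the function names are hypothetical, and the two regular expressions cover only simple single-base and single-residue substitutions like the examples shown, not the full HGVS syntax. Anything else returns None.

```python
import re
from typing import NamedTuple, Optional, Tuple


class GenomePosition(NamedTuple):
    chromosome: str
    start: int
    end: int


def parse_genome_position(text: str) -> GenomePosition:
    """Split 'chromosome:start-end' into its three parts."""
    chromosome, _, span = text.partition(":")
    start, _, end = span.partition("-")
    return GenomePosition(chromosome, int(start), int(end))


# Simple substitutions only, e.g. c.644C>G and p.S215* (assumed patterns).
_CDS_SUB = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")
_AA_SUB = re.compile(r"^p\.([A-Z*])(\d+)([A-Z*])$")


def parse_cds_substitution(text: str) -> Optional[Tuple[int, str, str]]:
    """Return (position, reference base, alternate base), or None."""
    match = _CDS_SUB.match(text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2), match.group(3)


def parse_aa_substitution(text: str) -> Optional[Tuple[str, int, str]]:
    """Return (reference residue, position, alternate residue), or None."""
    match = _AA_SUB.match(text)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), match.group(3)


position = parse_genome_position("12:124466234-124466234")
cds = parse_cds_substitution("c.644C>G")   # (644, "C", "G")
aa = parse_aa_substitution("p.S215*")      # ("S", 215, "*")
```

Returning None for unrecognised forms keeps deletions, insertions and other complex descriptions from being silently misparsed; they would need their own handling.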

All Fields from COSMIC and Field Descriptions

Taken from the ‘File Description’ drop-down menu below ‘Cosmic Mutation Data’ on the downloads page.

Gene name

The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC identifier.

Accession Number

The transcript identifier of the gene.

Gene CDS length

Length of the gene's coding sequence (CDS) in base pairs.

HGNC id

If the gene is in HGNC, this id links it to its HGNC entry.

Sample name

(Described together with the Sample id and Id tumour fields.) A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour, and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids if the same sample has been entered into the database multiple times from different papers.

Primary Site

The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from COSMIC. COSMIC uses a standard classification system for tissue types and subtypes because these vary widely between papers.

Site Subtype 1

Further subclassification (level 1) of the sample's tissue of origin.

Site Subtype 2

Further subclassification (level 2) of the sample's tissue of origin.

Site Subtype 3

Further subclassification (level 3) of the sample's tissue of origin.

Primary Histology

The histological classification of the sample.

Histology Subtype 1

Further histological classification (level 1) of the sample.

Histology Subtype 2

Further histological classification (level 2) of the sample.

Histology Subtype 3

Further histological classification (level 3) of the sample.

Genome-wide screen

Indicates whether the entire genome or exome was sequenced.

GENOMIC_MUTATION_ID

Genomic mutation identifier (COSV) to indicate the definitive position of the variant on the genome. This identifier is trackable and stable between different versions of the release.

LEGACY_MUTATION_ID

Legacy mutation identifier (COSM) that will represent existing COSM mutation identifiers.

MUTATION_ID

An internal mutation identifier to uniquely represent each mutation on a specific transcript on a given assembly build.

Mutation CDS

The change that has occurred in the nucleotide sequence. Formatting is identical to the method used for the peptide sequence.

Mutation AA

The change that has occurred in the peptide sequence. Formatting is based on the recommendations made by the Human Genome Variation Society. A description of each type can be found on the COSMIC Mutation Overview page.

Mutation Description

Type of mutation at the amino acid level (substitution, deletion, insertion, complex, fusion, unknown, etc.).

Mutation zygosity

Information on whether the mutation was reported to be homozygous, heterozygous, or unknown within the sample.

LOH

Information on whether the gene was reported to have loss of heterozygosity in the sample: yes, no, or unknown.

GRCh

The coordinate system used: 37 = GRCh37/Hg19 and 38 = GRCh38/Hg38

Mutation genome position

The genomic coordinates of the mutation.

Mutation strand

Positive or negative.

Resistance Mutation

Indicates whether the mutation confers drug resistance (see CosmicResistanceMutations.tsv.gz for gene/drug details).

Mutation somatic status

Information on whether the sample was reported to be Confirmed Somatic, Previously Reported or Variant of unknown origin:

  • Confirmed Somatic = the mutation has been confirmed to be somatic in the experiment by sequencing both the tumour and a matched normal from the same patient.

  • Variant of unknown origin = the mutation is known to be somatic but the tumour was sequenced without a matched normal.

  • Previously observed = the mutation has been reported as somatic previously but not in the current paper.

Pubmed_PMID

The PubMed ID of the paper in which the sample was reported, linking to PubMed for more details of the publication.

Id Study

Lists the unique Ids of studies that have involved this sample.

Sample Type

(Described together with the Tumour origin field.) Describes where the sample originated, including the tumour type.

Age

Age of the sample (if this information is provided with the publications).

HGVSP

Human Genome Variation Society peptide syntax.

HGVSC

Human Genome Variation Society coding DNA sequence syntax (CDS).

HGVSG

Human Genome Variation Society genomic syntax (3’ shifted).