Loading a VCF File

Using VCF file on ProteinPaint

Current implementation works for SNV/indel data only.

Preparing a VCF file

Input file: file.vcf

bgzip file.vcf
tabix -p vcf file.vcf.gz

If successful, this will produce two files, "file.vcf.gz" and "file.vcf.gz.tbi". The latter is the TBI index file. A CSI index also works.

Hosting a VCF file

Host it on the ProteinPaint server:

  1. Put both the .gz file and the index file in the same directory, inside the TP directory of the ProteinPaint server.

    a. The TP directory is specified in the file "serverconfig.json".

  2. Obtain the path to the .gz file relative to the TP directory, without including the TP directory itself.

    a. E.g. for [TP directory]/path/to/file.vcf.gz, the path would be "path/to/file.vcf.gz"
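
The path derivation in step 2 can be sketched as below. This is only an illustration: the function name and the example directory "/data/tp" are hypothetical, and the real location of the server's data directory (the TP directory) comes from "serverconfig.json".

```javascript
// Hypothetical helper: derive the path to submit from an absolute file path.
// tpDir stands for the server data directory named in serverconfig.json.
function relativeToTp(tpDir, absolutePath) {
  // Normalize so that "/data/tp" and "/data/tp/" behave the same.
  const base = tpDir.endsWith('/') ? tpDir : tpDir + '/'
  if (!absolutePath.startsWith(base)) {
    throw new Error('file is not inside the server data directory: ' + absolutePath)
  }
  return absolutePath.slice(base.length)
}

console.log(relativeToTp('/data/tp', '/data/tp/path/to/file.vcf.gz'))
// prints "path/to/file.vcf.gz"
```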

Host it on a web server:

  1. Put the .gz and index files on the web server. Obtain the URL of the .gz file for submission. In most cases ProteinPaint expects to find the index file at the same location as the .gz file.

  2. If the index file does not share its URL with the .gz file, provide the index URL via the .indexURL attribute in the JSON definition.

Showing a VCF file on ProteinPaint

Via URL parameters


http://server?block=on&genome=[genome]&vcffile=[name],[file path]
http://server?block=on&genome=[genome]&vcfurl=[name],[file URL]

The "vcfurl" parameter cannot accept a separate URL for the index file. In that case, use the embedding API instead.

This will launch a genome browser view showing the VCF track.
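
As a rough sketch, the two URL forms above can be composed with a small helper. The function name is hypothetical, "http://server" and "hg19" are placeholder values, and the comma check is a precaution assumed here because the parameter value uses a comma to separate the name from the location.

```javascript
// Sketch: compose a ProteinPaint URL that shows one VCF track.
// source is either {file: path relative to the server data directory}
// or {url: URL of the .gz file}.
function vcfTrackUrl(server, genome, name, source) {
  const param = source.file ? 'vcffile' : 'vcfurl'
  const location = source.file || source.url
  if (!location) throw new Error('provide either file or url')
  // Assumed precaution: the value is "[name],[location]", so a comma in the name would be ambiguous.
  if (name.includes(',')) throw new Error('track name must not contain a comma')
  return `${server}?block=on&genome=${genome}&${param}=${name},${location}`
}

console.log(vcfTrackUrl('http://server', 'hg19', 'myVCF', { file: 'path/to/file.vcf.gz' }))
// prints "http://server?block=on&genome=hg19&vcffile=myVCF,path/to/file.vcf.gz"
```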

Via embedding API

If the index file is at a different URL from the .gz file, the embedding API must be used. The attribute "indexURL" then provides the additional URL for the index file.
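
A minimal sketch of such a track object follows. Only "indexURL" is named in this document; the attribute names "type", "name" and "url", the helper function, and the example.org URLs are assumptions made for illustration. The resulting object would be placed in the list of tracks passed to the embedding API.

```javascript
// Sketch of a VCF track object whose index lives at a different URL.
// "indexURL" is the attribute described above; "type", "name" and "url" are assumed names.
function vcfTrackWithIndex(name, gzUrl, indexUrl) {
  const track = { type: 'vcf', name: name, url: gzUrl }
  // Only set indexURL when the index does not sit next to the .gz file.
  if (indexUrl) track.indexURL = indexUrl
  return track
}

const separateIndexTrack = vcfTrackWithIndex(
  'my VCF',
  'https://example.org/data/file.vcf.gz',
  'https://example.org/index/file.vcf.gz.tbi'
)
console.log(JSON.stringify(separateIndexTrack, null, 2))
```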

General introduction of embedding API.

General introduction on JSON definition for tracks.

See section Summary of VCF track object for an overview of the VCF track object.

Variant annotation formats

This section describes supported formats for functional annotation of variants. The annotation is optional and mostly for coding variants. The annotation will color-code variants using ProteinPaint's built-in protein mutation classes.

A shortcoming is that the built-in classes are hardcoded and cannot be customized.

ProteinPaint fully supports custom-defined classes through an alternative method. See the section on highlighting and filtering with INFO fields.

VEP format

VCF output from VEP is accepted by ProteinPaint. Details are in the "VCF output" section of http://useast.ensembl.org/info/docs/tools/vep/vep_formats.html.

The annotation is encoded in an INFO field called CSQ. It supports functional annotation of multiple isoforms on a given allele.

The Description of the CSQ field must follow the format below, so that the names and order of the sub-fields can be identified. One example:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|ALLELE_NUM|DISTANCE|STRAND|SYMBOL|SYMBOL_SOURCE|HGNC_ID|BIOTYPE|CANONICAL|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|SIFT|PolyPhen|EXON|INTRON|DOMAINS|HGVSc|HGVSp|GMAF|AFR_MAF|AMR_MAF|ASN_MAF|EUR_MAF|AA_MAF|EA_MAF|CLIN_SIG|SOMATIC|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF_info|LoF_flags|LoF_filter|LoF">
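
To illustrate why the Description format matters, here is a minimal sketch of how the sub-field names and their order could be read from the meta line. The function name is hypothetical and this is not necessarily how ProteinPaint parses the header; the meta line in the example is shortened.

```javascript
// Sketch: extract CSQ sub-field names, in order, from the INFO meta line.
// Relies on the 'Format: A|B|C' convention inside Description.
function csqSubfields(metaLine) {
  const m = metaLine.match(/Description="[^"]*Format:\s*([^"]+)"/)
  if (!m) return null // the Description does not follow the required format
  return m[1].split('|').map(s => s.trim())
}

const csqMeta = '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence">'
const csqNames = csqSubfields(csqMeta)
console.log(csqNames.indexOf('Consequence')) // prints 4
```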

Required CSQ sub-fields:

  • Feature_type

    • When the value of Feature_type is "Transcript" for a variant, the value of "Feature" should be an isoform name, e.g. a RefSeq or ENSEMBL isoform. If multiple isoform names are joined by commas, the first one is used.
  • Consequence

  • HGVSp

  • Protein_position

  • Amino_acids

  • HGVSc

  • Existing_variation

    • If any of the above fields is available, its value is used as the label of the variant.

VEP notation                        ProteinPaint notation  Note
transcript_ablation                 deletion, intragenic   Not yet supported.
splice_acceptor_variant             splice
splice_donor_variant                splice
stop_gained                         nonsense
frameshift_variant                  frameshift
stop_lost                           nonsense
start_lost                          nonsense
transcript_amplification            nonstandard            Not yet supported.
inframe_insertion                   proteinins
inframe_deletion                    proteindel
missense_variant                    missense
protein_altering_variant            nonsense               Not explicitly supported.
splice_region_variant               splice_region
incomplete_terminal_codon_variant   nonsense
stop_retained_variant               silent                 Not explicitly supported.
synonymous_variant                  silent
coding_sequence_variant             nonstandard            Not explicitly supported.
mature_miRNA_variant                exon                   Not yet supported.
5_prime_UTR_variant                 utr_5
3_prime_UTR_variant                 utr_3
non_coding_transcript_exon_variant  exon
intron_variant                      intron
NMD_transcript_variant              silent                 Not equivalent.
non_coding_transcript_variant       exon
upstream_gene_variant               noncoding              Not explicitly supported.
downstream_gene_variant             noncoding              Not explicitly supported.
TFBS_ablation                       noncoding              Not yet supported.
TFBS_amplification                  noncoding              Not yet supported.
TF_binding_site_variant             noncoding              Not yet supported.
regulatory_region_ablation          noncoding              Not yet supported.
regulatory_region_amplification     noncoding              Not yet supported.
feature_elongation                  noncoding              Not yet supported.
regulatory_region_variant           noncoding              Not yet supported.
feature_truncation                  noncoding              Not yet supported.
intergenic_variant                  noncoding              Not yet supported.
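
The conversion in the table above can be sketched as a lookup. The snippet copies only a few rows of the table, and the handling of terms joined by '&' (taking the first recognized term) is an assumption of this sketch, not a documented ProteinPaint rule. The names in the example item (GENE1, TX1) are made up.

```javascript
// A few rows copied from the table above; the lookup covers only these.
const vep2class = {
  splice_acceptor_variant: 'splice',
  stop_gained: 'nonsense',
  frameshift_variant: 'frameshift',
  missense_variant: 'missense',
  splice_region_variant: 'splice_region',
  synonymous_variant: 'silent',
  intron_variant: 'intron'
}

// names: CSQ sub-field names in header order; item: one '|'-joined CSQ entry.
function classifyCsqItem(names, item) {
  const values = item.split('|')
  const consequence = values[names.indexOf('Consequence')] || ''
  // VEP may join several terms with '&'; take the first one this lookup knows.
  for (const term of consequence.split('&')) {
    if (vep2class[term]) return vep2class[term]
  }
  return null // term not covered by this partial lookup
}

const exampleNames = ['Allele', 'Gene', 'Feature', 'Feature_type', 'Consequence']
console.log(classifyCsqItem(exampleNames, 'T|GENE1|TX1|Transcript|missense_variant'))
// prints "missense"
```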

ANN format

As generated by SnpEff: http://snpeff.sourceforge.net/VCFannotationformat_v1.0.pdf

This format uses the "ANN" INFO field. The meta line:

##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' ">

Example annotation data:


The "Annotation" values are converted to the built-in mutation classes using the same rules as for VEP.

In-house format

This format uses multiple INFO fields to provide per-alt allele values:

##INFO=<ID=AV_POS,Number=A,Type=Integer,Description="Annovar+: adjusted allele position in order of ALT column">
##INFO=<ID=AV_REF,Number=A,Type=String,Description="Annovar+: adjusted reference allele in order of ALT column">
##INFO=<ID=AV_ALT,Number=A,Type=String,Description="Annovar+: adjusted variant allele in order of ALT column">
##INFO=<ID=AV_ACC,Number=A,Type=String,Description="Annovar+: isoform accession in order of ALT column">
##INFO=<ID=AV_AA,Number=A,Type=String,Description="Annovar+: amino acid change in order of ALT column">
##INFO=<ID=AV_CLASS,Number=A,Type=String,Description="Annovar+: variant class in order of ALT column">

Example annotation:


Combining multiple VCF files in one track

This is only achievable by declaring the tracks in JSON and calling ProteinPaint's embedding API.

The following example shows the JSON part of a VCF track which combines multiple VCF files.

"name":"track name",
"tracks":[
    ... more files ...
]

In the above example, the "tracks" attribute provides a list of member tracks.

Each member track can have zero or more samples.

Highlighting and filtering variants with INFO fields

Variants from a VCF track can be highlighted by other types of annotation, in addition to the default mutation class.

To recreate this display, save the following text into an HTML file and open it in a web browser.

<!DOCTYPE html>
<head><meta charset="utf-8"></head>
<script src="https://proteinpaint.stjude.org/bin/proteinpaint.js" charset="utf-8"></script>
<div id=aaa style="margin:10px"></div>
<script>
runproteinpaint({
    host:'https://proteinpaint.stjude.org',
    holder:document.getElementById('aaa'),
    ... genome and view settings ...
    tracks:[   {
        ... VCF track type, name and file ...
        vcfinfofilter:{
            lst:[
              {
                name:'Clinical significance',
                ... INFO field key ...
                categories:{
                    'Uncertain_significance':{color:'#aaa',label:'Uncertain significance',textcolor:'white'},
                    'not_provided': {color:'#ccc',label:'Not provided'},
                    '_not_provided':{color:'#ccc',label:'Not provided'},
                    'Benign/Likely_benign':{color:'#43ac6a',label:'Benign/Likely benign',textcolor:'white'},
                    'Likely_benign':{color:'#5bc0de',label:'Likely benign',textcolor:'white'},
                    'Likely_pathogenic':{color:'#e99002',label:'Likely pathogenic',textcolor:'white'},
                    'Pathogenic/Likely_pathogenic':{color:'#f04124',label:'Pathogenic/Likely pathogenic',textcolor:'white'},
                    'drug_response':{color:'gold',label:'Drug response',textcolor:'white'},
                    '_drug_response':{color:'gold',label:'Drug response',textcolor:'white'},
                    'Conflicting_interpretations_of_pathogenicity':{color:'#90C3D4',label:'Conflicting interpretations of pathogenicity'},
                    'risk_factor':{color:'#ccc',label:'Risk factor'},
                    '_risk_factor':{color:'#ccc',label:'Risk factor'},
                }
              },
              {
                name:'ExAC frequency',
                locusinfo:{ key:'AF_EXAC' },
                numericfilter:[ {side:'<',value:0.0001}, {side:'<',value:0.001},  {side:'<',value:.01} ],
              },
              {
                name:'GO-ESP frequency',
                locusinfo:{ key:'AF_ESP' },
                numericfilter:[ {side:'<',value:0.0001}, {side:'<',value:0.001},  {side:'<',value:.01} ],
              },
              {
                name:'1000 Genomes frequency',
                locusinfo:{ key:'AF_TGP' },
                numericfilter:[ {side:'<',value:0.0001}, {side:'<',value:0.001},  {side:'<',value:.01} ],
              }
            ]
        }
    } ]
})
</script>

The example declares the VCF track as a JavaScript object (with the attribute "vcfinfofilter" explained below), and submits it to the public ProteinPaint server through the embedding API.

The attribute "vcfinfofilter" provides instructions about what INFO fields can be used to highlight variants. It must contain the "lst" attribute which is an array, and each element is an object describing an INFO field.

In each object describing an INFO field, use "locusinfo" to indicate that this field provides a value for each locus, and "altalleleinfo" to indicate that this field provides one value for each alternative allele, for a given locus.

In a VCF file, an INFO field may describe categorical information about variants. In this case, the attribute "categories" lists a set of keys from that categorical information and, for each key, provides the color and text label. Keys found in the VCF file but not listed here will be shown in black. Variants not annotated by this INFO field will be aggregated under the category "Not annotated" and shown in black.

A numerical INFO field can be used to provide cutoff values and filter variants, through "numericfilter". The attribute "categories" should not be used for a numerical INFO field.

Categorical INFO field, locus-level

    name:'Variant origin',

Its META line:

##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both">

Annotation to a variant:

1       1014143 .     C       T       .       .       ...;SAO=1;...

Categorical INFO field, alt-allele-level

    name:'Review status',
        'no_assertion':{color:'#545454',label:'no assertion'},
        'no_criteria':{color:'#858585',label:'no criteria'},
        'mult':{color:'blue',label:'multiple submitters no conflicts'},
        'guide':{color:'green',label:'Practice guideline'}

Its META line:

##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="no_assertion - No assertion provided, no_criteria - No assertion criteria provided, single - Criteria provided single submitter, mult - Criteria provided multiple submitters no conflicts, conf - Criteria provided conflicting interpretations, exp - Reviewed by expert panel, guideline - Practice guideline">

Annotation to a variant:

1       1051336 rs141463750     C       G,T     .       .       ...;CLNREVSTAT=single;...

(this may not be a good example of alt-allele-level INFO given the particular way that ClinVar selects an allele to describe, see the CLNALLE field)

Categorical field without predefined categories and automatic color assignment

This can apply to both locus-level and alt-allele level INFO fields.


When "autocategory" is set to true, "categories{}" is no longer required to predefine the list of types and colors. These are assigned dynamically each time the mutation data is loaded. A fixed color list is used to assign colors to the categories being shown, in order of abundance.

A future version may allow supplying a custom color list.

Numerical INFO field

The value of a numerical INFO field can be used for filtering variants in the style of population frequency. As with a categorical INFO field, it can be either alt-allele-level or locus-level.

    name:'Population frequency filter',
    numericfilter:[.000001, 0.00001, 0.0001, 0.001],

The META line:

##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">

Annotation to a variant:

1       13528   .       C       G,T     1771.54 .      ...;AF=5.438e-04,3.738e-04;...

By default the filter will retain variants with values lower than the cutoff. To change the side of comparison, use the optional "side" attribute:

    numericfilter:[ {side:">",value:1}, {side:">", value:10} ... ]

To set the default cutoff value for a numeric filter, use the "numericCutoff" and "numericCutoffSide" attributes, as shown in the above example.

Missing value: if the numeric INFO field is missing for a variant, its value will default to 0. This may work for the case of population frequency (AF), but may not make sense for other types of data. Possible solutions:

  • Specify a value that a missing variant will default to.

  • Underneath the Y scale, show a new row labeled "missing" to bin all the missing variants.
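
The filtering rule described in this section can be sketched as follows. The function is hypothetical: it accepts a cutoff either as a bare number (default side '<') or as a {side, value} object, and its "missingValue" parameter illustrates the first proposed solution above rather than an existing option.

```javascript
// cutoff: a number (default side '<') or {side, value} as in "numericfilter".
// infoValue: the parsed INFO value, or undefined when the field is missing.
function passesNumericFilter(infoValue, cutoff, missingValue = 0) {
  const side = typeof cutoff === 'number' ? '<' : cutoff.side || '<'
  const limit = typeof cutoff === 'number' ? cutoff : cutoff.value
  // A missing value falls back to missingValue (0 by default, as described above).
  const v = infoValue === undefined ? missingValue : infoValue
  return side === '>' ? v > limit : v < limit
}

console.log(passesNumericFilter(5.438e-04, 0.001)) // prints true
console.log(passesNumericFilter(5.438e-04, 0.0001)) // prints false
console.log(passesNumericFilter(undefined, 0.0001)) // prints true, since missing defaults to 0
```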

Additional features

These features will draw certain charts based on numeric attributes of the underlying samples.

Population frequency filter

The population frequency filter can be applied to any multi-sample VCF track, and requires appropriate configuration.

As above, this requires declaring the track as a JSON object and working with the embedding API. The following is one example:

    name:'SCD 143 samples',
         name:"Population frequency filter",
         lst:[0.01, 0.05, 0.1]

The attribute "populationfrequencyfilter" defines a set of fraction values between 0 and 1. Each value serves as a cutoff, e.g. the value 0.01 selects variants with less than 1% frequency in the population (the denominator is the total number of samples in the VCF track).
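
A minimal sketch of the cutoff test, assuming the frequency of a variant is the number of samples carrying it divided by the total number of samples in the track. The function name is hypothetical, and 143 is taken from the track name in the example above.

```javascript
// Sketch: does a variant fall below a population frequency cutoff?
// carrierCount: number of samples carrying the variant (assumed definition).
function belowPopulationCutoff(carrierCount, totalSamples, cutoff) {
  if (totalSamples <= 0) throw new Error('totalSamples must be positive')
  return carrierCount / totalSamples < cutoff
}

console.log(belowPopulationCutoff(1, 143, 0.01)) // prints true  (1/143 is about 0.7%)
console.log(belowPopulationCutoff(2, 143, 0.01)) // prints false (2/143 is about 1.4%)
```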

In ProteinPaint, it's displayed as below:

The cutoff values (1%, 5%, 10%) are listed in the legend section at the bottom. Clicking '<5%' enables filtering, which in this example leaves 28 out of 35 variants visible in the track display. Click the same cutoff value again to disable filtering.

Coverage-VAF plot

The coverage-VAF plot is available for any multi-sample track with allele read count information. The plot is generated upon clicking a mutation that is present in two or more samples.

This plot requires the AD FORMAT field:

##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
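
For illustration, the two plotted quantities can be derived from one sample's AD value as sketched below. The function name is hypothetical, and taking coverage as the sum of all listed allele depths is an assumption of this sketch.

```javascript
// adValue: the AD entry of one sample, e.g. "12,8" (ref,alt) or "12,8,0" for two ALT alleles.
// altIndex: 1 for the first ALT allele, 2 for the second, and so on (0 is REF).
function coverageAndVaf(adValue, altIndex = 1) {
  const depths = adValue.split(',').map(Number)
  // Missing values such as "." become NaN and yield no data point.
  if (depths.some(Number.isNaN) || altIndex >= depths.length) return null
  const coverage = depths.reduce((a, b) => a + b, 0)
  if (coverage === 0) return null
  return { coverage, vaf: depths[altIndex] / coverage }
}

console.log(coverageAndVaf('12,8')) // prints { coverage: 20, vaf: 0.4 }
```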

Declare the track as the following JSON object with the "vaf2coverageplot" attribute.

    name:'SCD 143 samples',

Optionally, specify the following attributes in vaf2coverageplot to distinguish samples by color in the plot.


Sample line plots

The track shows a numerical Y axis. For each variant, all samples are shown as strokes.

Each sample has one numerical value, as provided by a metadata attribute.

                     "axislabel":"HbF level"

Genotype by numeric value

The track shows a numerical Y axis. For each variant, one circle is shown for each genotype. Each circle represents a group of samples carrying that genotype. The vertical position of a circle is determined by a consensus value (mean) across the samples. This value should be provided as INFO fields. Circle sizes are scaled by the number of samples in that group.

The following configuration will require a multi-sample VCF file for getting the total number of samples for each genotype.

                "axislabel":"Genotype HbF",

The following configuration will also provide the sample count in each genotype group as an INFO field. This requires two INFO fields for each genotype: one to provide the Y-axis value, the other to provide the number of samples for sizing the circle.

          "axislabel":"Genotype HbF",
                  "genotyepCountInfokey": STR,
                  "genotyepCountInfokey": STR,
                  "genotyepCountInfokey": STR,

Boxplots by genotype

Upon clicking a variant, this will show a set of boxplots, one for each group of samples carrying one genotype from this variant. The boxplot reflects distribution of samples with respect to a specified numeric attribute.

      "axislabel": STR,
      "sampleannotationkey": STR,
      "boxplotvaluekey": STR  // alternative to sampleannotationkey

If "sampleannotationkey" is set, "sampleannotation" must also be provided. See the section on "sample metadata annotation".

If "boxplotvaluekey" is provided, its value should be the name of an INFO field that provides the box-and-whisker values for creating boxplots for each genotype.

The INFO field should be defined as:

##INFO=<ID=boxplotbygenotype,Number=1,Type=String,Description="For each possible genotype, define a boxplot with 5 numbers: low whisker, 1st quartile, median, 3rd quartile, high whisker. Genotypes are represented as refref, refalt, altalt. Separate genotypes by |, separate values by /">

Example value to this INFO field:


Sample metadata annotation

Sample annotation is an optional feature. If provided, it can be used to improve the coverage-VAF plot, and to divide samples into groups by the "sampletype" attribute so that a separate coverage-VAF plot is made for each group.

To provide sample annotation, add a nested attribute "sampleannotation.annotation" to the VCF track object:

    ... VCF definitions ... 

         ... more samples ... 

"annotation" provides a hash, where keys are sample names and values are the annotations for that sample.

"key4annotation" defines the attribute name used as the key into "annotation". It lets the user specify whether the annotation is at the sample level or the patient level.



Additional enhancements/description can be added to "sampleannotation":


"levels" provides a hierarchy of attributes for stratifying samples. Levels are arranged from the highest to the lowest level of the hierarchy. In each level, the string values of "k" and "full" should be found as attributes in "sampleannotation.annotation" of any sample.

"variantsunburst" enables a sunburst chart of samples to be shown when clicking on a variant.

"key2color" is a temporary fix to define the color for an annotation value from any level. In the example, "SCD" is the actual value of the first level, and it must have the prefix "root...". This defines the display color in the sunburst chart.

Live example: https://proteinpaint.stjude.org/pcgp.html in which the track "PCGP somatic mutation" is a multi-sample VCF track and is equipped with metadata annotation to its samples. Its track legend shows the annotation which can be used to filter samples:

All the metadata annotation and the track JSON declaration can be found in the source of the HTML page, around the following lines:

Speeding up multi-sample VCF track loading

Given a multi-sample VCF track (cohort_VCF), prepare a variant-only version of this track (variant_VCF), and combine them in the track definition like below:

    name:'lots of samples',

When displaying this track in ProteinPaint, the variant_VCF will be loaded for showing all variants in the view range. Upon clicking a variant, the cohort_VCF will be queried to retrieve sample information on this variant, for rendering coverage-VAF plot or other uses.

This design cuts down data transfer and can greatly improve performance when loading large-cohort VCF tracks.

The "samplenamemap()" function can be applied in such track setup. Add this function in the "vcfcohortrack" object.

Linking an image to each variant

    name:'lots of samples',

Setting "variant2img" links an image to each variant. An "Image" button will appear in the details panel of the variant; clicking the button displays the image associated with this variant below.

For each variant, the image is located in the following way; note that the position is 1-based, and the path is relative to the TP directory.

variant2img.path + chromosome.position.REFallele.ALTallele + . + variant2img.ext
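
The rule above can be written out as a small function. This is a sketch: the function name and the example values are hypothetical, and it assumes "variant2img.path" already ends with a slash, since the rule concatenates it directly with the file name.

```javascript
// Sketch: compose the image path for one variant following the rule above.
// pos is the 1-based position; the result is relative to the TP directory.
function variantImagePath(variant2img, chr, pos, ref, alt) {
  const key = [chr, pos, ref, alt].join('.')
  return variant2img.path + key + '.' + variant2img.ext
}

console.log(variantImagePath({ path: 'path/to/images/', ext: 'png' }, 'chr1', 1014143, 'C', 'T'))
// prints "path/to/images/chr1.1014143.C.T.png"
```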

Table display of multi-column INFO Field

For a complex INFO field such as:

##INFO=<ID=INFOKEY,Number=.,Type=String,Description="Consequence annotations. Format: Allele|Consequence|SYMBOL|Entrez Gene ID|Feature_type|Feature|HGVSc|HGVSp|CIViC Variant Name|CIViC Variant ID|CIViC Variant Aliases|CIViC HGVS|Allele Registry ID|ClinVar IDs|CIViC Variant Evidence Score|CIViC Entity Type|CIViC Entity ID|CIViC Entity URL|CIViC Entity Source|CIViC Entity Variant Origin|CIViC Entity Status">

Through configuration, contents of such INFO field for a variant can be displayed as a table:

CIViC Entity Type  CIViC Entity ID  CIViC Entity URL                         CIViC Entity Source  CIViC Entity Variant Origin  CIViC Entity Status
evidence           3766             https://civicdb.org/links/evidence/3766  PUBMED 22621641      Somatic                      submitted
evidence           7485             https://civicdb.org/links/evidence/7485  PUBMED 31151904      Somatic                      submitted
evidence           7077             https://civicdb.org/links/evidence/7077  PUBMED 25527633      Somatic                      submitted
evidence           3768             https://civicdb.org/links/evidence/3768  PUBMED 23278307      Somatic                      submitted
evidence           3781             https://civicdb.org/links/evidence/3781  PUBMED 23733758      Somatic                      submitted


Key: the name of the CSQ-like INFO field.

Value: an object {} with the keys "col_separator" and "fields" (an array).

The "fields" array should contain all the sub-fields in the INFO field, including hidden ones. Each element provides the display and formatting instructions for one sub-field.

info2table: {
    INFOKEY: {
       col_separator: '|',
       separate_tables:[ ... ], // optional
       fields: [
           { name: 'Allele' },
           { name: 'Consequence' },
           { name: 'SYMBOL' },
           { name: 'Entrez Gene ID' },
           { name: 'Feature_type' },
           { name: 'Feature' },
           { name: 'HGVSc' },
           { name: 'HGVSp' },
           ... more fields ... 
       ]
    }
}
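
As a rough sketch of what such a configuration describes, the function below turns one INFO value into table rows. The function name is hypothetical; it assumes the items of the INFO value are separated by commas (as in a VCF field with Number=.) and that sub-fields within an item follow the order of the "fields" array.

```javascript
// Sketch: convert a multi-column INFO value into an array of row objects.
function infoToRows(infoValue, config) {
  return infoValue.split(',').map(item => {
    const values = item.split(config.col_separator)
    const row = {}
    // Pair each value with the sub-field name at the same position.
    config.fields.forEach((field, i) => {
      row[field.name] = values[i] || ''
    })
    return row
  })
}

const tableConfig = { col_separator: '|', fields: [{ name: 'Allele' }, { name: 'Consequence' }] }
const tableRows = infoToRows('T|missense_variant,T|intron_variant', tableConfig)
console.log(tableRows[1].Consequence) // prints "intron_variant"
```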


The "info2singletable" attribute takes the same setting as "info2table".

Only the first item of the INFO field is parsed. This is useful when the given sub-fields are duplicated across all items.

A "vertical" two-column table is displayed, with one row for each sub-field.

Allele          T
Consequence     missense_variant
Entrez Gene ID  673
Feature_type    transcript
Feature         ENST00000288602.6
HGVSc           ENST00000288602.6:c.1799T>A

If set, the field will be hidden from the table.


The "name" attribute is required. It gives the display name of the field.


All '&' characters in the value are converted to HTML line breaks.


The column value is a full URL and will be shown as a link.

.appendUrl: STR

Appends the value to the end of the given partial URL and displays the composed URL.


Example config:

urlMatchLst: {
   separator: '_',
   idIndex: 0,
   types: [
      { type: 'pubmed', appendUrl: 'https://www.ncbi.nlm.nih.gov/pubmed/' },
      { type: 'asco', appendUrl: 'https://meetinglibrary.asco.org/record/' }
   ]
}

This allows the following values to be parsed into their respective URL links.


The "separate_tables" attribute is optional and applies only to "info2table", not "info2singletable".

When one INFO field contains multiple "rows", the rows are divided into groups based on the criteria in the "groupers" array and shown in separate tables. Each table has the same set of columns, as defined in the "fields[]" array.

separate_tables: [
   {
       headhtml: 'Accepted evidence',
       groupers: [
           { field: 'CIViC Entity Type', value: 'evidence' },
           { field: 'CIViC Entity Status', value: 'accepted' }
       ]
   },
   { ... another table ... }
]
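
The grouping can be sketched as below. The function name is hypothetical, and the sketch assumes a row belongs to a table only when it matches every entry in that table's "groupers" array.

```javascript
// Sketch: divide parsed rows into tables according to "separate_tables".
// rows: array of objects keyed by sub-field name.
function splitIntoTables(rows, separateTables) {
  return separateTables.map(table => ({
    headhtml: table.headhtml,
    // Assumption: all groupers must match for a row to be included.
    rows: rows.filter(row => table.groupers.every(g => row[g.field] === g.value))
  }))
}

const civicRows = [
  { 'CIViC Entity Type': 'evidence', 'CIViC Entity Status': 'accepted' },
  { 'CIViC Entity Type': 'evidence', 'CIViC Entity Status': 'submitted' }
]
const civicTables = splitIntoTables(civicRows, [
  {
    headhtml: 'Accepted evidence',
    groupers: [
      { field: 'CIViC Entity Type', value: 'evidence' },
      { field: 'CIViC Entity Status', value: 'accepted' }
    ]
  }
])
console.log(civicTables[0].rows.length) // prints 1
```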

A URL can be generated for SNV/indel variants from a VCF track, so that when the user clicks on a variant to show its details, a link to the variant on another website is displayed.

To enable this, add the "url4variant" attribute in the track definition:

            return 'https://pecan.stjude.org/#/variant/view/'

Each element in "url4variant" describes a method to produce a URL for the variant. This allows multiple URL links to be produced on the same variant.

Each element has the "makeurl" attribute, a JavaScript function that takes in the variant object as input, and returns the fully-formed URL string, parameterized by the attributes of the variant.
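
A minimal sketch of one "url4variant" element follows. The variant attribute names (chr, pos, ref, alt), the example.org address and the URL layout are assumptions made for illustration; the actual attributes depend on the variant object that ProteinPaint passes in.

```javascript
// Sketch: one URL-producing method; more elements would yield more links.
const url4variant = [
  {
    // makeurl receives the variant object and returns a complete URL string.
    makeurl: m => 'https://example.org/variant/' + [m.chr, m.pos, m.ref, m.alt].join('-')
  }
]

const clickedVariant = { chr: 'chr1', pos: 1014143, ref: 'C', alt: 'T' }
console.log(url4variant.map(u => u.makeurl(clickedVariant)))
// prints [ 'https://example.org/variant/chr1-1014143-C-T' ]
```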

The method may not work for mutation types other than SNV/indel.

Summary of VCF track object

A schematic overview of all available options that you can use to customize a VCF track.