GA4GH Variant Annotation Specification

The Variant Annotation Specification (VA-Spec) is a standard developed by the Global Alliance for Genomics and Health (GA4GH) to facilitate and improve sharing of knowledge about any form of genetic or molecular variation.

Introduction

Maximizing the personal, public, research, and clinical value of genomic information will require that clinicians, researchers, and testing laboratories widely and reliably exchange genetic variation data and knowledge. The Variant Annotation Specification (VA-Spec), written by a partnership of national information resource providers, major public initiatives, and diagnostic testing laboratories, is a modeling framework of processes and formal specifications to standardize the exchange of variation knowledge. It consists of four primary components and several supporting resources that guide their implementation.

Specification Components

  1. Domain Analysis Model: a conceptual framework for understanding the processes and participants involved in evidence-based knowledge generation.

  2. Core Information Model: a domain-agnostic information model for structuring knowledge statements and their supporting evidence and provenance.

  3. Profile Catalog: a repository of sharable and extensible Profiles, including Implementation Profiles tailored for specific data systems, and GA4GH Standard Profiles for broader community use.

  4. Reference Implementation: a library of software and services that demonstrate the creation, validation, and exchange of standards-compliant data.

Implementation Support

  1. Profiling Methodology: guidance and tools for executing the profiling process to produce implementation profiles and application models.

  2. Implementation Sandbox: a community testbed for defining, adapting, and sharing implementation profiles.

  3. Standards Development Process: a formally defined process through which GA4GH Standard Profiles evolve from implementation models through collaboration and community consensus.

The machine readable schema definitions and example code are available online at the va-spec repository (https://github.com/ga4gh/va-spec) (coming soon).

Readers may wish to view some examples of framework implementation and data models before reading the specification (coming soon).

Specification Components

The VA Specification comprises four key components:

1. Domain Analysis Model

  • A conceptual framework for understanding the processes and participants involved in evidence-based knowledge generation

  • Provides a foundational terminology and understanding of the domain, and shared perspective from which to approach data creation and modeling tasks

2. Core Information Model

  • A domain-agnostic information model for structuring knowledge statements and their supporting evidence and provenance.

  • Provides the foundation on which to build ‘Profiles’ that specialize the Core IM for specific statement types or data applications.

3. Profile Catalog

  • A repository of sharable and extensible Profiles, including Implementation Profiles tailored for specific data systems, and GA4GH Standard Profiles for broader community use.

  • Facilitates coordinated development of models by enabling implementers to re-use and build on existing work.

4. Reference Implementation

  • A library of software and services that demonstrate the creation, validation, and exchange of standards-compliant data.

  • Provides implementers with working code to apply or adapt for their applications.

Implementation Support

Tools, processes, and formal guidance to facilitate adoption of the VA Modeling Framework, sharing and reuse of data models, and emergence of community standards through an implementation-driven development approach.

1. Profiling Methodology

  • Detailed guidance and tools for executing the profiling process to produce implementation profiles and application models.

2. Implementation Sandbox

  • A community testbed for defining, adapting, and sharing implementation profiles.

3. Standards Development Process

  • A formally defined process through which GA4GH standards emerge from implementation models through community collaboration and consensus.

This minimal list will be expanded in the near future with additional guidance and tooling support for generating implementation models and community standards using the VA-Specification Framework.

Data Examples

Terminology & Information Model

When biologists define terms in order to describe phenomena and observations, they rely on a background of human experience and intelligence for interpretation. Definitions may be abstract, perhaps correctly reflecting uncertainty of our understanding at the time. Unfortunately, such terms are not readily translatable into an unambiguous representation of knowledge.

For example, “allele” might refer to “an alternative form of a gene or locus” [Wikipedia], “one of two or more forms of the DNA sequence of a particular gene” [ISOGG], or “one of a set of coexisting sequence alleles of a gene” [Sequence Ontology]. Even for human interpretation, these definitions are inconsistent: does the definition precisely describe a specific change on a specific sequence, or, rather, a more general change on an undefined sequence? In addition, all three definitions are inconsistent with the practical need for a way to describe sequence changes outside regions associated with genes.

The computational representation of biological concepts requires translating precise biological definitions into information models and data structures that may be used in software. This translation should result in a representation of information that is consistent with conventional biological understanding and, ideally, be able to accommodate future data as well. The resulting computational representation of information should also be cognizant of computational performance, the minimization of opportunities for misunderstanding, and ease of manipulating and transforming data.

Accordingly, for each term we define below, we begin by describing the term as used by the genetics and/or bioinformatics communities as available. When a term has multiple such definitions, we explicitly choose one of them for the purposes of computational modelling. We then define the computational definition that reformulates the community definition in terms of information content. Finally, we translate each of these computational definitions into precise specifications for the information model. Terms are ordered “bottom-up” so that definitions depend only on previously-defined terms.

Note

The keywords “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Information Model Principles

  • VRS objects are minimal value objects. Two objects are considered equal if and only if their respective attributes are equal. As value objects, VRS objects are used as primitive types and MUST NOT be used as containers for related data, such as primary database accessions, representations in particular formats, or links to external data. Instead, related data should be associated with VRS objects through identifiers. See Computed Identifiers.

  • VRS uses polymorphism. VRS uses polymorphism extensively in order to provide a coherent top-down structure for variation while enabling precise models for variation data. For example, Allele is a kind of Variation, SequenceLocation is a kind of Location, and SequenceState is a kind of State. See Future Plans for the roadmap of VRS data classes and relationships. All VRS objects contain a type attribute, which is used to discriminate polymorphic objects.

  • Error handling is intentionally unspecified and delegated to implementation. VRS provides foundational data types that enable significant flexibility. Except where required by this specification, implementations may choose whether and how to validate data. For example, implementations MAY choose to validate that particular combinations of objects are compatible, but such validation is not required.

  • VRS uses snake_case to represent compound words. Although the schema is currently JSON-based (which would typically use camelCase), VRS itself is intended to be neutral with respect to languages and databases.

  • Optional attributes start with an underscore. Optional attributes are not part of the value object. Such attributes are not considered when evaluating equality or creating computed identifiers. The _id attribute is available to identifiable objects, and MAY be used by an implementation to store the identifier for a VRS object. If used, the stored _id element MUST be a CURIE. If used for creating a Truncated Digest (sha512t24u) for parent objects, the stored element must be a GA4GH Computed Identifier. Implementations MUST ignore attributes beginning with an underscore and they SHOULD NOT transmit objects containing them.
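
The value-object and underscore rules above can be illustrated with a minimal Python sketch. It assumes VRS objects are held as plain dictionaries parsed from JSON; the helper names `value_of` and `vrs_equal` and the `example:1` identifier are hypothetical and not part of the specification.

```python
def value_of(obj):
    """Return a copy of a VRS-style object without underscore-prefixed attributes."""
    if isinstance(obj, dict):
        # Optional attributes (leading underscore) are not part of the value object.
        return {k: value_of(v) for k, v in obj.items() if not k.startswith("_")}
    if isinstance(obj, list):
        return [value_of(item) for item in obj]
    return obj


def vrs_equal(a, b):
    """Two objects are equal if and only if their non-optional attributes are equal."""
    return value_of(a) == value_of(b)


# "example:1" is a made-up CURIE used only for illustration.
with_id = {"_id": "example:1", "type": "Text", "definition": "MSI High"}
without_id = {"type": "Text", "definition": "MSI High"}
print(vrs_equal(with_id, without_id))  # True
```

Stripping optional attributes before comparison keeps equality and any later digest computation independent of locally stored identifiers.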

Variation

In the genetics community, variation is often used to mean sequence variation, describing the differences observed in DNA or AA bases among individuals, and typically with respect to a common reference sequence.

In VRS, the Variation class is the conceptual root of all types of biomolecular variation, and the Variation abstract class is the top-level object in the Current Variation Representation Specification Schema. Variation types are broadly categorized as Molecular Variation, Systemic Variation, or a utility subclass. Types of variation are widely varied, and there are several Variation Classes currently under consideration to capture this diversity.

Information Model

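Field | Type   | Limits | Description
------+--------+--------+--------------------------------------------------------
_id   | CURIE  | 0..1   | Variation Id. MUST be unique within document.
type  | string | 1..1   | The Variation class type. MUST match child class type.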

Molecular Variation

Allele

Note

The terms allele and variant are often used interchangeably, although this use may mask subtle distinctions made by some users. Specifically, while allele connotes a specific sequence state, variant connotes a change between states.

This distinction makes it awkward to use variant to represent an unchanged (reference-agreement) state at a Sequence Location. This was a primary factor for choosing to use allele over variant when designing VRS. Read more about this design decision: Using Allele Rather than Variant.

An allele may refer to a number of alternative forms of the same gene or same genetic locus. In the genetics community, allele may also refer to a specific haplotype. In the context of biological sequences, “allele” refers to a distinct state of a molecule at a location.

Implementation Guidance

  • The Sequence Expression and Location subclasses respectively represent diverse kinds of sequence changes and mechanisms for describing the locations of those changes, including varying levels of precision of sequence location and categories of sequence changes.

  • Implementations MUST enforce values interval.end ≤ sequence_length when the Sequence length is known.

  • Alleles are equal only if the component fields are equal: at the same location and with the same state.

  • Alleles MAY have multiple related representations on the same Sequence type due to normalization differences.

  • Implementations SHOULD normalize Alleles using fully-justified normalization whenever possible to facilitate comparisons of variation in regions of representational ambiguity.

  • Implementations SHOULD preferentially represent Alleles using LiteralSequenceExpression; however, there are cases where another Sequence Expression class is more appropriate. See Using Sequence Expressions for guidance.

  • When the alternate Sequence is the same length as the interval, the lengths of the reference Sequence and imputed Sequence are the same. (Here, imputed sequence means the sequence derived by applying the Allele to the reference sequence.) When the replacement Sequence is shorter than the length of the interval, the imputed Sequence is shorter than the reference Sequence, and conversely for replacements that are larger than the interval.

  • When the state is a LiteralSequenceExpression of "" (the empty string), the Allele refers to a deletion at this location.

  • The Allele entity is based on Sequence and is intended to be used for intragenic and extragenic variation. Alleles are not explicitly associated with genes or other features.

  • Biologically, referring to Alleles is typically meaningful only in the context of empirical alternatives. For modelling purposes, Alleles MAY exist as a result of biological observation or computational simulation, i.e., virtual Alleles.

  • “Single, contiguous” refers to the representation of the Allele, not the biological mechanism by which it was created. For instance, two non-adjacent single residue Alleles could be represented by a single contiguous multi-residue Allele.

  • When a trait has a known genetic basis, it is typically represented computationally as an association with an Allele.

  • This specification’s definition of Allele applies to any Location, including locations on RNA or protein Sequence.
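
The relationship between the interval, the replacement Sequence, and the imputed Sequence can be sketched as follows. This is an illustration only: it assumes inter-residue coordinates and a literal replacement string, and the function name `impute_sequence` and the toy reference are hypothetical.

```python
def impute_sequence(reference, start, end, replacement):
    """Apply a literal replacement to reference[start:end] (inter-residue coordinates)."""
    # Mirrors the guidance that interval.end must not exceed the sequence length.
    if not 0 <= start <= end <= len(reference):
        raise ValueError("interval must satisfy 0 <= start <= end <= sequence length")
    return reference[:start] + replacement + reference[end:]


reference = "ACGTACGT"  # toy sequence, not real data
print(impute_sequence(reference, 2, 3, "T"))   # same length as the interval: ACTTACGT
print(impute_sequence(reference, 2, 4, ""))    # empty replacement, a deletion: ACACGT
print(impute_sequence(reference, 4, 4, "GG"))  # zero-width interval, an insertion: ACGTGGACGT
```

A replacement shorter than the interval shortens the imputed Sequence, and a longer one lengthens it, as described in the guidance above.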

Examples

An Allele corresponding to rs7412 C>T on GRCh38:

{
  "location": {
    "interval": {
      "end": {
        "type": "Number",
        "value": 44908822
      },
      "start": {
        "type": "Number",
        "value": 44908821
      },
      "type": "SequenceInterval"
    },
    "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
    "type": "SequenceLocation"
  },
  "state": {
    "sequence": "T",
    "type": "SequenceState"
  },
  "type": "Allele"
}

Sources

Haplotype

Haplotypes are a specific combination of Alleles that are in-cis: occurring on the same physical molecule. Haplotypes are commonly described with respect to locations on a gene, a set of nearby genes, or other physically proximal genetic markers that tend to be transmitted together.

Implementation Guidance

  • Haplotypes are an assertion of Alleles known to occur “in cis” or “in phase” with each other.

  • All Alleles in a Haplotype MUST be defined on the same reference sequence or chromosome.

  • Alleles within a Haplotype MUST NOT overlap (“overlap” is defined in Interval).

  • The locations of Alleles within the Haplotype MUST be interpreted independently. Alleles that create a net insertion or deletion of sequence MUST NOT change the location of “downstream” Alleles.

  • The members attribute is required and MUST contain at least two Alleles.
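
A minimal sketch of how an implementation might check these constraints is shown below. It assumes inline Alleles whose locations use numeric start and end values, and it assumes that two intervals overlap when they share at least one residue; the formal definition of overlap lives in the Interval section and may differ. The names `check_haplotype` and `make_allele` are hypothetical.

```python
def check_haplotype(haplotype):
    """Return a list of problems found in a Haplotype with inline Alleles."""
    problems = []
    members = haplotype.get("members", [])
    if len(members) < 2:
        problems.append("members must contain at least two Alleles")
    sequence_ids = {m["location"]["sequence_id"] for m in members}
    if len(sequence_ids) > 1:
        problems.append("Alleles are defined on different reference sequences")
    spans = sorted(
        (m["location"]["interval"]["start"]["value"],
         m["location"]["interval"]["end"]["value"])
        for m in members
    )
    # After sorting by start, any overlap shows up between neighbouring spans.
    for (start_a, end_a), (start_b, end_b) in zip(spans, spans[1:]):
        if start_b < end_a:
            problems.append(f"[{start_a}, {end_a}) overlaps [{start_b}, {end_b})")
    return problems


def make_allele(start, end, base, sequence_id="ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl"):
    """Build an inline Allele in the shape used by the examples in this section."""
    return {
        "type": "Allele",
        "location": {
            "type": "SequenceLocation",
            "sequence_id": sequence_id,
            "interval": {
                "type": "SequenceInterval",
                "start": {"type": "Number", "value": start},
                "end": {"type": "Number", "value": end},
            },
        },
        "state": {"type": "LiteralSequenceExpression", "sequence": base},
    }


haplotype = {
    "type": "Haplotype",
    "members": [make_allele(44908821, 44908822, "C"), make_allele(44908683, 44908684, "C")],
}
print(check_haplotype(haplotype))  # []
```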

Sources

  • ISOGG: Haplotype — A haplotype is a combination of alleles (DNA sequences) at different places (loci) on the chromosome that are transmitted together. A haplotype may be one locus, several loci, or an entire chromosome depending on the number of recombination events that have occurred between a given set of loci.

  • SequenceOntology: haplotype (SO:0001024) — A haplotype is one of a set of coexisting sequence variants of a haplotype block.

  • GENO: Haplotype (GENO:0000871) - A set of two or more sequence alterations on the same chromosomal strand that tend to be transmitted together.

Examples

An APOE ε2 Haplotype with inline Alleles:

{
  "members": [
    {
      "location": {
        "interval": {
          "end": {
            "type": "Number",
            "value": 44908822
          },
          "start": {
            "type": "Number",
            "value": 44908821
          },
          "type": "SequenceInterval"
        },
        "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "C",
        "type": "LiteralSequenceExpression"
      },
      "type": "Allele"
    },
    {
      "location": {
        "interval": {
          "end": {
            "type": "Number",
            "value": 44908684
          },
          "start": {
            "type": "Number",
            "value": 44908683
          },
          "type": "SequenceInterval"
        },
        "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "C",
        "type": "LiteralSequenceExpression"
      },
      "type": "Allele"
    }
  ],
  "type": "Haplotype"
}

The same APOE ε2 Haplotype with referenced Alleles:

{
  "members": [
    "ga4gh:VA.-kUJh47Pu24Y3Wdsk1rXEDKsXWNY-68x",
    "ga4gh:VA.Z_rYRxpUvwqCLsCBO3YLl70o2uf9_Op1"
  ],
  "type": "Haplotype"
}

The GA4GH computed identifier for these Haplotypes is ga4gh:VH.i8owCOBHIlRCPtcw_WzRFNTunwJRy99-, regardless of whether the Variation objects are inlined or referenced, and regardless of order. See Computed Identifiers for more information.
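
Order independence can be illustrated with a short sketch of the truncated digest (sha512t24u: SHA-512, truncated to 24 bytes, base64url-encoded) applied to a sorted list of member identifiers. The serialization used here is a deliberate simplification and an assumption of this sketch; it is not the canonical form described in Computed Identifiers, so it will not reproduce the identifier quoted above. The function `members_digest` is hypothetical.

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """Truncated digest: SHA-512, first 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def members_digest(member_ids):
    """Illustrative only: digest member identifiers independently of their order."""
    # Sorting before serializing is what removes the dependence on order.
    blob = json.dumps(sorted(member_ids), separators=(",", ":")).encode("utf-8")
    return sha512t24u(blob)


ids = [
    "ga4gh:VA.-kUJh47Pu24Y3Wdsk1rXEDKsXWNY-68x",
    "ga4gh:VA.Z_rYRxpUvwqCLsCBO3YLl70o2uf9_Op1",
]
print(members_digest(ids) == members_digest(list(reversed(ids))))  # True
```

Replacing inline objects with their identifiers before digesting is what makes the inlined and referenced forms yield the same result.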

Systemic Variation

AbsoluteCopyNumber

Absolute Copy Number Variation captures the copies of a molecule within a genome, and can be used to express concepts such as amplification and copy loss. Copy Number Variation has conflated meanings in the genomics community, and can mean either (or both) the notion of copy number in a genome or copy number on a molecule. VRS separates the concerns of these two types of statements; this concept is a type of Systemic Variation and so describes the number of copies in a genome. The related Molecular Variation concept can be expressed as an Allele with a RepeatedSequenceExpression.

Examples

Three or more total copies of APOE (ncbigene:348):

{
  "copies": {
    "comparator": ">=",
    "type": "IndefiniteRange",
    "value": 3
  },
  "subject": {
    "gene_id": "ncbigene:348",
    "type": "Gene"
  },
  "type": "AbsoluteCopyNumber"
}
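
As a sketch of how a consumer might interpret the copies attribute, the function below tests an observed integer copy count against an IndefiniteRange. Only the ">=" comparator appears in the example above; support for "<=" is an assumption of this sketch, and the name `copies_match` is hypothetical.

```python
def copies_match(copies, observed):
    """Return True if an observed integer copy count satisfies an IndefiniteRange."""
    if copies.get("type") != "IndefiniteRange":
        raise ValueError("this sketch only handles IndefiniteRange")
    comparator, bound = copies["comparator"], copies["value"]
    if comparator == ">=":
        return observed >= bound
    if comparator == "<=":  # assumed counterpart of ">="
        return observed <= bound
    raise ValueError(f"unsupported comparator: {comparator!r}")


copies = {"comparator": ">=", "type": "IndefiniteRange", "value": 3}
print([n for n in range(6) if copies_match(copies, n)])  # [3, 4, 5]
```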

RelativeCopyNumber

Relative Copy Number Variation captures a classification of copies of a molecule within a system, relative to a baseline. These types of Variation are common outputs from CNV callers, particularly in the somatic domain where Absolute Copy Counts are difficult to estimate and less useful in practice than relative statements.

Examples

Low-level copy gain of APOE (ncbigene:348):

{
  "relative_copy_class": "low-level gain",
  "subject": {
    "gene_id": "ncbigene:348",
    "type": "Gene"
  },
  "type": "RelativeCopyNumber"
}

Genotype

A genotype is a representation of the variants present at a given genomic locus, and may be referred to either by individual nucleotide representations (e.g. GT representation in VCF files) or symbolically (e.g. A/B/O blood type reporting). To support these use cases, VRS genotypes enable representation of genotypes using either Allele objects (as commonly done in VCF records) or larger Haplotype objects (which would otherwise be represented using symbolic shorthand).

Implementation Guidance

  • Haplotypes or Alleles in GenotypeMember objects MAY occur at different locations or on different reference sequences. For example, an individual may have haplotypes on two population-specific references.

Notes

  • The term “genotype” has two related definitions in common use. The narrower definition is a set of alleles observed at a single location and often with a ploidy of two, such as a pair of single residue variants on an autosome. The broader, generalized definition is a set of alleles at multiple locations and/or with ploidy other than two. The VRS Genotype entity is based on this broader definition.

  • The term “diplotype” is often used to refer to two in-trans haplotypes at a locus. The VRS Genotype entity subsumes the conventional definition of diplotype, though it describes no explicit in-trans phase relationship. Therefore, VRS does not include an explicit entity for diplotypes. See this note for a discussion.

  • VRS makes no assumptions about ploidy of an organism or individual nor any polysomy affecting a locus. The genotype.count attribute explicitly captures the total count of molecules associated with a genomic locus represented by the Genotype.

  • In diploid organisms, there are typically two instances of each autosomal chromosome, and therefore two instances of sequence at a particular locus. Thus, Genotypes will often list two GenotypeMembers each based on a distinct Haplotype or Allele. In the case of haploid chromosomes or haploinsufficiency, the Genotype consists of a single GenotypeMember.

  • A specific (heterozygous) diplotype SHOULD be represented as a Genotype of two GenotypeMember instances each containing a constituent Haplotype. A homozygous diplotype SHOULD be represented as a Genotype of one constituent GenotypeMember (with GenotypeMember.count=2).

  • A consequence of the computational definition is that in-cis Haplotypes at overlapping or adjacent intervals MUST be merged into a single Haplotype for the same Genotype.

  • A GenotypeMember.variation value MUST be unique among Genotype Members within a Genotype. When more than one Genotype Member would have the same variation value (e.g. in the case of a homozygous variant), this would be represented as a Genotype Value with a corresponding count (i.e. for a diploid homozygous variant, GenotypeMember.count = 2).

  • The rationale for permitting Genotypes with Haplotypes defined on different reference sequences is to enable the accurate representation of segments of DNA with the most appropriate population-specific reference sequence.

  • Deletion of sequence at locus would be represented by the presence of Alleles of deleted sequence, not absence of Alleles; therefore Genotypes MAY NOT have count < 1.
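
The uniqueness and count rules above can be sketched as a small builder that collapses repeated observations of the same variation into one GenotypeMember with a count. This is an illustration under assumptions: variations are given as identifier strings, counts are plain integers (the specification may wrap them in Number or range objects), and the function name `build_genotype` and the `example:` identifiers are hypothetical.

```python
from collections import Counter


def build_genotype(observed):
    """Build a Genotype-like dict from one variation identifier per observed molecule."""
    if not observed:
        raise ValueError("a Genotype needs at least one observed molecule")
    tally = Counter(observed)
    # One GenotypeMember per distinct variation; repeats are expressed through count.
    members = [
        {"type": "GenotypeMember", "variation": variation, "count": count}
        for variation, count in sorted(tally.items())
    ]
    return {"type": "Genotype", "members": members, "count": len(observed)}


homozygous = build_genotype(["example:allele-C", "example:allele-C"])
print(len(homozygous["members"]), homozygous["members"][0]["count"])  # 1 2
```

A heterozygous pair passed to the same function would produce two members, each with a count of 1, and a Genotype count of 2.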

Sources

SO: Genotype (SO:0001027) — A genotype is a variant genome, complete or incomplete.

Note

VRS defines Genotypes using a list of GenotypeMembers defined by Haplotypes or Alleles. In essence, Haplotypes and Genotypes represent two distinct dimensions of containment: Haplotypes represent the “in phase” relationship of Alleles while Genotypes represent sets of Haplotypes of arbitrary ploidy.

There are two important consequences of these definitions:

  • There is no single-location Genotype. Users of SNP data will be familiar with representations like rs7412 C/C, which indicates the diploid state at a position. In VRS, this is merely a special case of a Genotype with one GenotypeMember, defined by a single Allele with two copies.

  • VRS does not define a diplotype class. A diplotype is a special case of a VRS Genotype with count = 2. In practice, software data types that assume a ploidy of 2 make it very difficult to represent haploid states, copy number loss, and copy number gain, all of which occur when representing human data. In addition, inferred ploidy = 2 makes software incompatible with organisms with other ploidy. VRS requires explicit definition of the count of molecules associated with a genomic locus using the count attribute, though this count may be inexact (e.g. a DefiniteRange or IndefiniteRange).

Utility Variation

Text

A free-text description of variation that is intended for interpretation by humans.

Important

Text variation should be used sparingly. The Text type is provided as an option of last resort for systems that need to represent human-readable descriptions of complex genetic phenomena or variation for which VRS does not yet have a data type. Structured data types should be preferred over Text.

Implementation Guidance

  • An implementation MUST represent Variation with subclasses other than Text if possible.

  • Because the Text type can be easily abused, implementations are NOT REQUIRED to provide it. If it is provided, implementations SHOULD consider applying access controls.

  • Implementations SHOULD upgrade Text variation to structured data types when available. A future version of VRS will provide additional guidance regarding upgrade mechanisms.

  • Additional Variation subclasses are continually under consideration. Please open a GitHub issue if you would like to propose a Variation subclass to cover a needed variation representation.

Examples

{
  "definition": "MSI High",
  "type": "Text"
}

VariationSet

Sets of variation are used widely, such as sets of variants in dbSNP or ClinVar that might be related by function.

Implementation Guidance

  • The VariationSet identifier MAY be computed as described in Computed Identifiers, in which case the identifier effectively refers to a static set because a different set of members would generate a different identifier.

  • members may be specified as Variation objects or CURIE identifiers.

  • CURIEs MAY refer to entities outside the ga4gh namespace. However, objects that use non-ga4gh identifiers MAY NOT use the Computed Identifiers mechanism.

  • VariationSet identifiers computed using the GA4GH Computed Identifiers process do not depend on whether the Variation objects are inlined or referenced, and do not depend on the order of members.

  • Elements of members must be subclasses of Variation, which permits sets to be nested.

  • Recursive sets are not meaningful and are not supported.

  • VariationSets may be empty.

Examples

Example VariationSet with inline Alleles:

{
  "members": [
    {
      "location": {
        "interval": {
          "end": {
            "type": "Number",
            "value": 44908822
          },
          "start": {
            "type": "Number",
            "value": 44908821
          },
          "type": "SequenceInterval"
        },
        "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "C",
        "type": "LiteralSequenceExpression"
      },
      "type": "Allele"
    },
    {
      "location": {
        "interval": {
          "end": {
            "type": "Number",
            "value": 44908684
          },
          "start": {
            "type": "Number",
            "value": 44908683
          },
          "type": "SequenceInterval"
        },
        "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "C",
        "type": "LiteralSequenceExpression"
      },
      "type": "Allele"
    }
  ],
  "type": "VariationSet"
}

The same VariationSet with referenced Alleles:

{
  "members": [
    "ga4gh:VA.-kUJh47Pu24Y3Wdsk1rXEDKsXWNY-68x",
    "ga4gh:VA.Z_rYRxpUvwqCLsCBO3YLl70o2uf9_Op1"
  ],
  "type": "VariationSet"
}

The GA4GH computed identifier for these sets is ga4gh:VS.QLQXSNSIFlqNYWmQbw-YkfmexPi4NeDE, regardless of whether the Variation objects are inlined or referenced, and regardless of order. See Computed Identifiers for more information.

Locations and Intervals

Location

As used by biologists, the precision of “location” (or “locus”) varies widely, ranging from precise start and end numerical coordinates defining a Location, to bounded regions of a sequence, to conceptual references to named genomic features (e.g., chromosomal bands, genes, exons) as proxies for the Locations on an implied reference sequence.

The most common and concrete Location is a SequenceLocation, i.e., a Location based on a named sequence and an Interval on that sequence. Another common Location is a ChromosomeLocation, specifying a location from cytogenetic coordinates of stained metaphase chromosomes. Additional Intervals and Locations may also be conceptual or symbolic locations, such as a cytoband region or a gene. Any of these may be used as the Location for Variation.

Implementation Guidance

  • Location refers to a position. Although it MAY imply a sequence, the two concepts are not interchangeable, especially when the location is non-specific (e.g., specified by an IndefiniteRange). To represent a sequence derived from a Location, see DerivedSequenceExpression.

ChromosomeLocation

Chromosomal locations based on named features, including named landmarks, cytobands, and regions observed from chromosomal staining techniques.

Implementation Guidance

  • ChromosomeLocation is intended to enable the representation of cytogenetic results from karyotyping or low-resolution molecular methods, particularly those found in older scientific literature. Precise SequenceLocation should be preferred when nucleotide-scale location is known.

  • species is specified using the NCBI taxonomy. The CURIE prefix MUST be “taxonomy”, corresponding to the NCBI taxonomy prefix at identifiers.org, and the CURIE reference MUST be an NCBI taxonomy identifier (e.g., 9606 for Homo sapiens).

  • ChromosomeLocation is intended primarily for human chromosomes. Support for other species is possible and will be considered based on community feedback.

  • chromosome is an archetypal chromosome name. Valid values for, and the syntactic structure of, chromosome depend on the species. chromosome MUST be an official sequence name from NCBI Assembly. For humans, valid chromosome names are 1..22, X, Y (case-sensitive). NOTE: A `chr` prefix is NOT part of the chromosome and MUST NOT be included.

  • interval refers to a contiguous region specified by named markers, which are presumed to exist on the specified chromosome. See CytobandInterval for additional information.

  • The conversion of ChromosomeLocation instances to SequenceLocation instances is out-of-scope for VRS. When converting start and end to SequenceLocations, the positions MUST be interpreted as inclusive ranges that cover the maximal extent of the region.

  • Data for converting cytogenetic bands to precise sequence coordinates are available at NCBI GDP, UCSC GRCh37 (hg19), UCSC GRCh38 (hg38), and bioutils (Python).

  • See also the rationale for Not using External Chromosome Declarations.

Examples

{
  "chr": "19",
  "interval": {
    "end": "q13.32",
    "start": "q13.32",
    "type": "CytobandInterval"
  },
  "species_id": "taxonomy:9606",
  "type": "ChromosomeLocation"
}
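
The species and chromosome rules in the guidance can be checked mechanically. The sketch below uses the field names from the example above (`chr`, `species_id`) and only knows the human chromosome names listed in the guidance; the function name `check_chromosome_location` is hypothetical.

```python
HUMAN_TAXON = "taxonomy:9606"
HUMAN_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}


def check_chromosome_location(location):
    """Return a list of problems found in a ChromosomeLocation-like dict."""
    problems = []
    prefix, sep, reference = location.get("species_id", "").partition(":")
    if prefix != "taxonomy" or not sep or not reference.isdigit():
        problems.append("species_id must be a CURIE of the form taxonomy:<NCBI taxonomy id>")
    chromosome = location.get("chr", "")
    if chromosome.lower().startswith("chr"):
        problems.append("chromosome name must not carry a 'chr' prefix")
    elif location.get("species_id") == HUMAN_TAXON and chromosome not in HUMAN_CHROMOSOMES:
        problems.append("human chromosome must be one of 1..22, X, Y (case-sensitive)")
    return problems


location = {
    "chr": "19",
    "interval": {"end": "q13.32", "start": "q13.32", "type": "CytobandInterval"},
    "species_id": "taxonomy:9606",
    "type": "ChromosomeLocation",
}
print(check_chromosome_location(location))  # []
```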

SequenceLocation

A Sequence Location is a specified subsequence of a reference Sequence. The reference is typically a chromosome, transcript, or protein sequence.

Implementation Guidance

  • For a Sequence of length n:
    • 0 ≤ interval.start ≤ interval.end ≤ n

    • inter-residue coordinate 0 refers to the point before the start of the Sequence

    • inter-residue coordinate n refers to the point after the end of the Sequence.

  • Coordinates MUST refer to a valid Sequence. VRS does not support referring to intronic positions within a transcript sequence, extrapolations beyond the ends of sequences, or other implied sequence.

Important

HGVS permits variants that refer to non-existent sequence. Examples include coordinates extrapolated beyond the bounds of a transcript and intronic sequence. Such variants are not representable using VRS and MUST be projected to a genomic reference in order to be represented.
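
Inter-residue coordinates can be related to the more familiar 1-based residue numbering with a short sketch. The conversion rule shown (subtract one from the first residue, keep the last) is a common convention and an assumption here rather than a statement from this section; the function names are hypothetical.

```python
def to_interresidue(first, last):
    """Convert a 1-based, inclusive residue range to inter-residue (start, end)."""
    if not 1 <= first <= last:
        raise ValueError("need 1 <= first <= last")
    return first - 1, last


def is_valid_interval(start, end, sequence_length):
    """Check the constraint 0 <= start <= end <= n for a Sequence of length n."""
    return 0 <= start <= end <= sequence_length


# A single residue at 1-based position 44908822 spans one inter-residue unit.
print(to_interresidue(44908822, 44908822))  # (44908821, 44908822)
print(is_valid_interval(0, 0, 10))          # True: zero-width point before the Sequence
print(is_valid_interval(8, 12, 10))         # False: extends beyond the end
```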

Examples

{
  "interval": {
    "end": 44908822,
    "start": 44908821,
    "type": "SimpleInterval"
  },
  "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
  "type": "SequenceLocation"
}

SequenceInterval

SequenceInterval is intended to be compatible with a “region” in Sequence Ontology (SO:0000001), with the exception that the GA4GH VRS SequenceInterval may be zero-width. The SO definition of region has an “extent greater than zero”.

Sources

Examples

{
  "end": {
    "type": "Number",
    "value": 44908822
  },
  "start": {
    "type": "Number",
    "value": 44908821
  },
  "type": "SequenceInterval"
}

CytobandInterval

Important

VRS currently supports only human cytobands and cytoband intervals. Implementers wishing to use VRS for other cytogenetic systems are encouraged to open a GitHub issue.

Cytobands refer to regions of chromosomes that are identified by visible patterns on stained metaphase chromosomes. They provide a convenient, memorable, and low-resolution shorthand for chromosomal segments.

Implementation Guidance

  • When using CytobandInterval to refer to human cytogenetic bands, the following conventions MUST be used. Bands are denoted by the arm (“p” or “q”) and position (e.g., “22”, “22.3”, or the symbolic values “cen” or “ter”) per ISCN conventions [1]. These conventions identify cytobands in order from the centromere towards the telomeres. In VRS, we order cytoband coordinates in the p-ter → cen → q-ter orientation, analogous to sequence coordinates. This has the consequence that bands on the p-arm are represented in descending numerical order when selecting cytobands for start and end.

Examples

{
  "end": "q13.32",
  "start": "q13.32",
  "type": "CytobandInterval"
}

Sequence Expression

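VRS provides several syntaxes for expressing a sequence, collectively referred to as Sequence Expressions. Those described in this section are LiteralSequenceExpression, DerivedSequenceExpression, and RepeatedSequenceExpression.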

Some SequenceExpression instances may appear to resolve to the same sequence, but are intended to be semantically distinct. There MAY be reasons to select or enforce one form over another that SHOULD be managed by implementations. See discussion on Equivalence Between Concepts.

LiteralSequenceExpression

A LiteralSequenceExpression “wraps” a string representation of a sequence for parallelism with other SequenceExpressions.

Examples

{
  "sequence": "ACGT",
  "type": "LiteralSequenceExpression"
}

DerivedSequenceExpression

Certain mechanisms of variation result from relocating and transforming sequence from another location in the genome. A derived sequence is a mechanism for expressing (typically large) reference subsequences specified by a SequenceLocation.

Examples

{
  "location": {
    "interval": {
      "end": {
        "type": "Number",
        "value": 44908822
      },
      "start": {
        "type": "Number",
        "value": 44908821
      },
      "type": "SequenceInterval"
    },
    "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
    "type": "SequenceLocation"
  },
  "reverse_complement": false,
  "type": "DerivedSequenceExpression"
}

RepeatedSequenceExpression

Repeated Sequence is a class of sequence expression where a specified subsequence is repeated multiple times in tandem. Microsatellites are an example of a common class of repeated sequence, but repeated sequence can also be used to describe larger subsequence repeats, up to and including large-scale tandem duplications.

Examples

{
  "count": {
    "comparator": ">=",
    "type": "IndefiniteRange",
    "value": 6
  },
  "seq_expr": {
    "location": {
      "interval": {
        "end": {
          "type": "Number",
          "value": 44908822
        },
        "start": {
          "type": "Number",
          "value": 44908821
        },
        "type": "SequenceInterval"
      },
      "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
      "type": "SequenceLocation"
    },
    "reverse_complement": false,
    "type": "DerivedSequenceExpression"
  },
  "type": "RepeatedSequenceExpression"
}

Feature

Gene

A gene is a basic and fundamental unit of heritability. Genes are functional regions of heritable DNA or RNA that include transcript coding regions, regulatory elements, and other functional sequence domains. Because of the complex nature of these many components comprising a gene, the interpretation of a gene depends on context.

Implementation guidance

  • Gene symbols (e.g., “BRCA1”) are unreliable keys. Implementations MUST NOT use a gene symbol to define a Gene.

  • A gene is specific to a species. Gene orthologs have distinct records in the recommended databases. For example, the BRCA1 gene in humans and the Brca1 gene in mouse are orthologs and have distinct records in the recommended gene databases.

  • Implementations MUST use authoritative gene namespaces available from identifiers.org whenever possible. Examples include:

  • The hgnc namespace is RECOMMENDED for human variation in order to improve interoperability. When using the hgnc namespace, the optional “HGNC:” prefix MUST NOT be used.

  • Gene MAY be converted to one or more Locations using external data. The source of such data and mechanism for implementation is not defined by this specification.

  • See discussion on Equivalence Between Concepts.

Examples

The following example refers to the human APOE gene:

{
  "gene_id": "hgnc:613",
  "type": "Gene"
}

Sources

  • SequenceOntology: gene (SO:0000704) — A region (or regions) that includes all of the sequence elements necessary to encode a functional transcript. A gene may include regulatory regions, transcribed regions and/or other functional sequence regions.

Basic Types

Basic types are data structures that represent general concepts and that may be applicable in multiple parts of VRS.

Number

Examples

{
  "type": "Number",
  "value": 55
}

DefiniteRange

Examples

{
  "max": 33,
  "min": 22,
  "type": "DefiniteRange"
}

IndefiniteRange

Examples

This value is equivalent to the concept of “equal to or greater than 22”:

{
  "comparator": ">=",
  "type": "IndefiniteRange",
  "value": 22
}
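
The three count-like types above (Number, DefiniteRange, IndefiniteRange) can be tested against a concrete value with a few lines of code. The following is a minimal sketch, not part of the specification: the function name `satisfies` is hypothetical, DefiniteRange bounds are assumed to be inclusive, and only the ">=" and "<=" comparators are assumed to occur.

```python
def satisfies(value: int, spec: dict) -> bool:
    """Return True if value is consistent with a Number, DefiniteRange,
    or IndefiniteRange given as a plain dict (hypothetical helper)."""
    kind = spec["type"]
    if kind == "Number":
        return value == spec["value"]
    if kind == "DefiniteRange":
        # Assumption: min and max are both inclusive.
        return spec["min"] <= value <= spec["max"]
    if kind == "IndefiniteRange":
        # Assumption: the comparator is either ">=" or "<=".
        if spec["comparator"] == ">=":
            return value >= spec["value"]
        if spec["comparator"] == "<=":
            return value <= spec["value"]
        raise ValueError(f"unsupported comparator: {spec['comparator']!r}")
    raise ValueError(f"unsupported type: {kind!r}")


at_least_22 = {"comparator": ">=", "type": "IndefiniteRange", "value": 22}
print(satisfies(30, at_least_22))
print(satisfies(21, at_least_22))
```

With the IndefiniteRange example above, 30 satisfies "equal to or greater than 22" and 21 does not.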

GenotypeMember

Primitives

Primitives represent simple values with syntactic or other constraints. They enable correctness for values stored in VRS.

CURIE

Implementation Guidance

  • All identifiers in VRS MUST be a valid CURIE, regardless of whether the identifier refers to GA4GH VRS objects or external data.

  • For GA4GH VRS objects, this specification RECOMMENDS using globally unique Computed Identifiers for use within and between systems.

  • For external data, CURIE-formatted identifiers MUST be used. When an appropriate namespace exists at identifiers.org, that namespace MUST be used. When an appropriate namespace does not exist at identifiers.org, support is implementation-dependent. That is, implementations MAY choose whether and how to support informal or local namespaces.

  • Implementations MUST use CURIE identifiers verbatim. Implementations MUST NOT modify CURIEs in any way (e.g., case-folding).

Examples

Identifiers for GRCh38 chromosome 19:

ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl
refseq:NC_000019.10
grch38:19

See Identifier Construction for examples of CURIE-based identifiers for VRS objects.

HumanCytoband

Cytobands are any of a pattern of stained bands, formed on chromosomes of cells undergoing metaphase, that serve to identify particular chromosomes. Human cytobands are predominantly specified by the International System for Human Cytogenomic Nomenclature (ISCN) [1].

Information Model

A string constrained to match the regular expression ^cen|[pq](ter|([1-9][0-9]*(\.[1-9][0-9]*)?))$, derived from the ISCN guidelines [1].
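
As an illustration, the regular expression above can be applied with Python's standard `re` module. This is a minimal sketch; the helper name `is_human_cytoband` is hypothetical. It uses `re.fullmatch` so that the whole string must match, which sidesteps any question about how the `^` and `$` anchors bind to the two alternatives in the pattern.

```python
import re

# Pattern copied verbatim from the Information Model text above.
HUMAN_CYTOBAND_RE = re.compile(r"^cen|[pq](ter|([1-9][0-9]*(\.[1-9][0-9]*)?))$")


def is_human_cytoband(value: str) -> bool:
    """Return True if the whole string is a valid HumanCytoband (hypothetical helper)."""
    return HUMAN_CYTOBAND_RE.fullmatch(value) is not None


for candidate in ["q13.32", "p22.3", "cen", "pter", "13.32", "q13.", "centromere"]:
    print(candidate, is_human_cytoband(candidate))
```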

Examples

"q13.32" (string)

Residue

A residue refers to a specific monomer within the polymeric chain of a protein or nucleic acid (Source: Wikipedia Residue page).

Sequence

A sequence is a character string representation of a contiguous, linear polymer of nucleic acid or amino acid Residues. Sequences are the prevalent representation of these polymers, particularly in the domain of variant representation.

Information Model

A string constrained to match the regular expression ^[A-Z*\-]*$, derived from the IUPAC one-letter nucleic acid and amino acid codes.

Implementation Guidance

  • Sequences MAY be empty (zero-length) strings. Empty sequences are used as the replacement Sequence for deletion Alleles.

  • Sequences MUST consist of only uppercase IUPAC abbreviations, including ambiguity codes.

  • A Sequence provides a stable coordinate system by which an Allele MAY be located and interpreted.

  • A Sequence MAY have several roles. A “reference sequence” is any Sequence used to define an Allele. A Sequence that replaces another Sequence is called a “replacement sequence”.

  • In some contexts outside VRS, “reference sequence” may refer to a member of the set of sequences that comprise a genome assembly. In the VRS specification, any sequence may be a “reference sequence”, including those in a genome assembly.

  • For the purposes of representing sequence variation, it is not necessary that Sequences be explicitly “typed” (i.e., DNA, RNA, or AA).
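
The Sequence constraint can be checked in the same way as other string primitives. The sketch below is illustrative only; `is_valid_sequence` is a hypothetical helper name, and the pattern is the one given in the Information Model above.

```python
import re

# Pattern copied verbatim from the Sequence Information Model.
SEQUENCE_RE = re.compile(r"^[A-Z*\-]*$")


def is_valid_sequence(value: str) -> bool:
    """Return True if value satisfies the Sequence pattern (hypothetical helper).

    fullmatch is used so that a trailing newline is not silently accepted.
    """
    return SEQUENCE_RE.fullmatch(value) is not None


print(is_valid_sequence("ACGT"))   # uppercase IUPAC codes
print(is_valid_sequence(""))       # empty sequences are permitted
print(is_valid_sequence("acgt"))   # lowercase is rejected
```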

Examples

"ACGT" (string)

Deprecated and Obsolete Classes

SimpleInterval

Warning

DEPRECATED. Use SequenceInterval instead. SimpleInterval will be removed in VRS 2.0.

Implementation Guidance

  • Implementations MUST enforce values 0 ≤ start ≤ end. In the case of double-stranded DNA, this constraint holds even when a feature is on the complementary strand.

  • VRS uses Inter-residue coordinates because they provide conceptual consistency that is not possible with residue-based systems (see rationale). Implementations will need to convert between inter-residue and 1-based inclusive residue coordinates familiar to most human users.

  • Inter-residue coordinates start at 0 (zero).

  • The length of an interval is end - start.

  • An interval in which start == end is a zero width point between two residues.

  • An interval of length == 1 MAY be colloquially referred to as a position.

  • Two intervals are equal if their start and end coordinates are equal.

  • Two intervals intersect if the start or end coordinate of one is strictly between the start and end coordinates of the other. That is, if:

    • b.start < a.start < b.end OR

    • b.start < a.end < b.end OR

    • a.start < b.start < a.end OR

    • a.start < b.end < a.end

  • Two intervals a and b coincide if they intersect or if they are equal (the equality condition is REQUIRED to handle the case of two identical zero-width SimpleIntervals).

  • <start, end>=<0,0> refers to the point with width zero before the first residue.

  • <start, end>=<i,i+1> refers to the i+1th (1-based) residue.

  • <start, end>=<N,N> refers to the position after the last residue for Sequence of length N.

  • See example notebooks in GA4GH VRS Python Implementation.
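
The interval rules in the list above translate directly into code. The following is a minimal sketch under stated assumptions: the `Interval` class and the function names are hypothetical and are not taken from the VRS Python implementation; the intersect test is a literal transcription of the four conditions listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Inter-residue interval (hypothetical illustration class)."""
    start: int
    end: int

    def __post_init__(self):
        # Enforce 0 <= start <= end, as required above.
        if not 0 <= self.start <= self.end:
            raise ValueError("intervals require 0 <= start <= end")

    @property
    def length(self) -> int:
        return self.end - self.start


def intersects(a: Interval, b: Interval) -> bool:
    """True if an endpoint of one interval lies strictly inside the other."""
    return (b.start < a.start < b.end
            or b.start < a.end < b.end
            or a.start < b.start < a.end
            or a.start < b.end < a.end)


def coincides(a: Interval, b: Interval) -> bool:
    """True if the intervals intersect or are equal."""
    return intersects(a, b) or a == b


def from_residue_position(position: int) -> Interval:
    """Convert a 1-based residue position to the inter-residue interval <i, i+1>."""
    return Interval(position - 1, position)


print(from_residue_position(44908822))
print(coincides(Interval(5, 5), Interval(5, 5)))
```

Note that, read literally, the intersect conditions are all strict, so two identical intervals do not "intersect" under this definition; the equality clause in coincide covers that case, including identical zero-width intervals.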

Examples

{
  "end": 44908822,
  "start": 44908821,
  "type": "SimpleInterval"
}

SequenceState

Warning

DEPRECATED. Use LiteralSequenceExpression instead. SequenceState will be removed in VRS 2.0.

Deprecated since version 1.2.

Examples

{
  "sequence": "T",
  "type": "SequenceState"
}

State

Warning

OBSOLETE. State was an abstract class that was intended for future growth. It was replaced by SequenceExpressions, which subsumes the functionality envisioned for State. Because State was abstract, and therefore purely an internal concept, it was made obsolete at the same time that SequenceState was deprecated.

Deprecated since version 1.2.

Computational Definition

State objects are one of two primary components specifying a VRS Allele (in addition to Location), and the designated components for representing change (or non-change) of the features indicated by the Allele Location. As an abstract class, State currently encompasses single and contiguous Sequence changes (see SequenceState).

Schema

Overview

_images/schema-current.png

Current Variation Representation Specification Schema

Legend: The VRS information model consists of several interdependent data classes, including both concrete classes and abstract superclasses (indicated by <<abst>> stereotype in header). These classes may be broadly categorized as conceptual representations of Variation (green boxes), Feature (blue boxes), Location (light blue boxes), SequenceExpression (purple boxes), and General Purpose Types (gray boxes). The general purpose types support the primary classes, including intervals, ranges, Number and GA4GH Sequence strings (not shown). While all VRS objects are Value Objects, only some objects are intended to be identifiable (Variation, Location, and Sequence). Conceptual inheritance relationships between classes are indicated by connecting lines. [source]

Machine Readable Specifications

The machine readable VRS is written using JSON Schema.

The schema itself is written in YAML (vrs.yaml) and converted to JSON (vrs.json).

Contributions to the schema MUST be written in the YAML document.

Implementation Guide

This section describes the data and algorithmic components that are REQUIRED for implementations of VRS.

  • Required External Data: All implementations will require access to sequences and sequence accessions. The Required External Data section provides guidance on the abstract functionality that is required in order to implement VRS.

  • Normalization: Expands Alleles to the maximal region of representational ambiguity.

  • Computed Identifiers: Generate globally unique identifiers based solely on the variation definition.

Required External Data

All VRS implementations will require external data regarding sequences and sequence metadata. The choices of data sources and access methods are left to implementations. This section provides guidance about how to implement required data and helps implementers estimate effort. This section is descriptive only: it is not intended to impose requirements on interface to, or sources of, external data. For clarity and completeness, this section also describes the contexts in which external data are used.

Contexts

  • Conversion from other variant formats: When converting from other variation formats, implementations MUST translate primary database accessions or identifiers (e.g., NM_000551.3 or refseq:NM_000551.3) to a GA4GH VRS sequence identifier (e.g., ga4gh:SQ.v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_).

  • Conversion to other variant formats: When converting to other variation formats, implementations SHOULD translate GA4GH VRS sequence identifiers (e.g., ga4gh:SQ.v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_) to primary database identifiers (e.g., refseq:NM_000551.3) that will be more readily recognized by users.

  • Normalization: During normalization, implementations will need access to sequence length and sequence contexts.

Data Services

The following table summarizes the data required in the above contexts:

Data Service Descriptions

Data Service | Description | Contexts
sequence | For a given sequence identifier and range, return the corresponding subsequence. | normalization
sequence length | For a given sequence identifier, return the length of the sequence. | normalization
identifier translation | For a given sequence identifier and target namespace, return all identifiers in the target namespace that are equivalent to the given identifier. | conversion to/from other formats

Note

Construction of the GA4GH computed identifier for a sequence is described in Computed Identifiers.

Suggested Implementation

In order to maximize portability and to insulate implementations from decisions about external data sources, implementers should consider writing an abstract data proxy interface to define a service, and then implement this interface for each data backend to be supported. The data proxy interface defines three methods:

  • get_sequence(identifier, start, end): Given a sequence identifier and start and end coordinates, return the corresponding sequence segment.

  • get_metadata(identifier): Given a sequence identifier, return a dictionary of length, alphabet, and known aliases.

  • translate_sequence_identifier(identifier, namespace): Given a sequence identifier, return all aliases in the specified namespace. Zero or more aliases may be returned.

The DataProxy class in vrs-python (the GA4GH VRS Python Implementation) provides an example of this design pattern and sample replies. vrs-python implements the DataProxy interface using a local SeqRepo instance backend and using a SeqRepo REST Service backend. A GA4GH refget implementation has been started, but is pending interface changes to support lookup using primary database accessions.
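
To make the suggested pattern concrete, the sketch below defines an abstract interface with the three methods listed above and one toy backend. It is an illustration under stated assumptions, not the vrs-python code: the class names `SequenceDataProxy` and `InMemoryDataProxy`, the constructor arguments, and the example identifiers are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class SequenceDataProxy(ABC):
    """Abstract data service interface (hypothetical name)."""

    @abstractmethod
    def get_sequence(self, identifier: str, start: Optional[int] = None,
                     end: Optional[int] = None) -> str: ...

    @abstractmethod
    def get_metadata(self, identifier: str) -> dict: ...

    @abstractmethod
    def translate_sequence_identifier(self, identifier: str, namespace: str) -> List[str]: ...


class InMemoryDataProxy(SequenceDataProxy):
    """Toy backend holding sequences and aliases in dictionaries."""

    def __init__(self, sequences: Dict[str, str], aliases: Dict[str, List[str]]):
        self._sequences = sequences  # canonical identifier -> sequence string
        self._aliases = aliases      # canonical identifier -> other CURIEs

    def _resolve(self, identifier: str) -> str:
        """Map any known identifier to its canonical identifier."""
        if identifier in self._sequences:
            return identifier
        for canonical, names in self._aliases.items():
            if identifier in names:
                return canonical
        raise KeyError(identifier)

    def _all_names(self, canonical: str) -> List[str]:
        return [canonical] + list(self._aliases.get(canonical, []))

    def get_sequence(self, identifier, start=None, end=None):
        # Python slicing already uses inter-residue (0-based, half-open) coordinates.
        return self._sequences[self._resolve(identifier)][start:end]

    def get_metadata(self, identifier):
        canonical = self._resolve(identifier)
        seq = self._sequences[canonical]
        return {"length": len(seq),
                "alphabet": "".join(sorted(set(seq))),
                "aliases": self._all_names(canonical)}

    def translate_sequence_identifier(self, identifier, namespace):
        canonical = self._resolve(identifier)
        return [n for n in self._all_names(canonical)
                if n.split(":", 1)[0] == namespace]


dp = InMemoryDataProxy({"ga4gh:SQ.example": "ACGTACGT"},
                       {"ga4gh:SQ.example": ["example:chr1"]})
print(dp.get_sequence("example:chr1", 2, 5))
print(dp.translate_sequence_identifier("example:chr1", "ga4gh"))
```

Code written against the abstract class does not need to change when a different backend, such as a local database or a REST service, is substituted.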

Examples

The following examples are taken from VRS Python Notebooks:

from ga4gh.vrs.dataproxy import SeqRepoRESTDataProxy

# URL of a SeqRepo REST service (here, one running locally)
seqrepo_rest_service_url = "http://localhost:5000/seqrepo"
dp = SeqRepoRESTDataProxy(base_url=seqrepo_rest_service_url)

def get_sequence(identifier, start=None, end=None):
    """Return the sequence for the given identifier, optionally limited
    to the inter-residue <start, end> interval."""
    return dp.get_sequence(identifier, start, end)

def get_sequence_length(identifier):
    """Return the length of the sequence with the given identifier."""
    return dp.get_metadata(identifier)["length"]

def translate_sequence_identifier(identifier, namespace):
    """Return a *list* of identifiers in the given namespace that are
    equivalent to the given identifier."""
    return dp.translate_sequence_identifier(identifier, namespace)

get_sequence_length("ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl")
# 58617616

start, end = 44908821-25, 44908822+25
get_sequence("ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl", start, end)
# 'CCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGTACCAGGCCGGGGC'

translate_sequence_identifier("GRCh38:19", "ga4gh")
# ['ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl']

translate_sequence_identifier("ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl", "GRCh38")
# ['GRCh38:19', 'GRCh38:chr19']

Normalization

In VRS, “normalization” refers to the process of rewriting an ambiguous representation of variation into a canonical form. Normalization eliminates a class of ambiguity that impedes comparison of variation across systems.

In the sequencing community, “normalization” refers to the process of converting a given sequence variant into a canonical form, typically by left- or right-shuffling insertion/deletion variants. VRS normalization extends this concept to all classes of VRS Variation objects.

Implementations MUST provide a normalize function that accepts any Variation object and returns a normalized Variation. Guidelines for these functions are below.

General Normalization Rules

  • Object types that do not have explicit VRS normalization rules below are returned as-is. That is, all types of Variation MUST be supported, even if such objects are unchanged.

  • VRS normalization functions are idempotent: Normalizing a previously-normalized object returns an equivalent object.

  • VRS normalization functions are not necessarily homomorphic: That is, the input and output objects may be of different types.

Allele Normalization

Certain insertion or deletion alleles may have ambiguous representations when using conventional sequence normalization, resulting in significant challenges when comparing such alleles.

VRS uses a “fully-justified” normalization algorithm adapted from NCBI’s Variant Overprecision Correction Algorithm [1]. Fully-justified normalization expands such ambiguous representation over the entire region of ambiguity, resulting in an unambiguous representation that may be readily compared with other alleles.

This algorithm was designed for Allele instances in which the Reference Allele Sequence and Alternate Allele Sequence are precisely known and intended to be normalized. In some instances, this may not be desired, e.g., when faithfully maintaining a sequence represented as a repeating subsequence through a RepeatedSequenceExpression object. We anticipate that these edge cases will not be common, and encourage adopters to use the VRS Allele Normalization Algorithm whenever possible.

LiteralSequenceExpression Alleles

When normalizing an Allele with a LiteralSequenceExpression state, the following normalization rules apply:

  1. Start with an unnormalized Allele, with corresponding reference and alternate Allele Sequences.

    1. The Reference Allele Sequence refers to the subsequence at the Allele SequenceLocation.

    2. The Alternate Allele Sequence refers to the Sequence described by the Allele state attribute.

    3. Let start and end initially be the start and end of the Allele SequenceLocation.

  2. Trim common flanking sequence from Allele sequences.

    1. Trim common suffix sequence (if any) from both of the Allele Sequences and decrement end by the length of the trimmed suffix.

    2. Trim common prefix sequence (if any) from both of the Allele Sequences and increment start by the length of the trimmed prefix.

  3. Compare the two Allele sequences, if:

    1. both are empty, the input Allele is a reference Allele. Return the input Allele unmodified.

    2. both are non-empty, the input Allele has been normalized to a substitution. Return a new Allele with the modified start, end, and Alternate Allele Sequence.

    3. one is empty, the input Allele is an insertion (empty reference sequence) or a deletion (empty alternate sequence). Continue to step 4.

  4. Determine bounds of ambiguity.

    1. Left roll: Set a left_roll_bound equal to start. While the terminal base of the non-empty Allele sequence is equal to the base preceding the left_roll_bound, decrement left_roll_bound and circularly permute the Allele sequence by removing the last character of the Allele sequence, then prepending the character to the resulting Allele sequence.

    2. Right roll: Set a right_roll_bound equal to end. While the first base of the non-empty Allele sequence is equal to the base following the right_roll_bound, increment right_roll_bound and circularly permute the Allele sequence by removing the first character of the Allele sequence, then appending the character to the resulting Allele sequence.

  5. Construct a new Allele covering the entire region of ambiguity.

    1. Prepend characters from left_roll_bound to start to both Allele Sequences.

    2. Append characters from end to right_roll_bound to both Allele Sequences.

    3. Set start to left_roll_bound and end to right_roll_bound, and return a new Allele with the modified start, end, and Alternate Allele Sequence.

_images/normalize.png

A demonstration of fully justifying an insertion allele.

Reproduced from [2]
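
The LiteralSequenceExpression normalization steps can be sketched in a few lines of Python. This is an illustrative sketch, not the reference implementation: the function name `fully_justify` and its tuple return value are hypothetical, the reference is passed in as a plain string rather than fetched through a data service, and two assumptions are made explicit in the comments (the right roll starts at end, and each roll starts from the trimmed, unpermuted Allele sequence).

```python
from typing import Tuple


def fully_justify(reference: str, start: int, end: int, alt: str) -> Tuple[int, int, str]:
    """Return (start, end, alternate sequence) after fully-justified normalization.

    Coordinates are inter-residue. Hypothetical helper for illustration.
    """
    original = (start, end, alt)
    ref = reference[start:end]

    # Step 2: trim common suffix, then common prefix.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt, end = ref[:-1], alt[:-1], end - 1
    while ref and alt and ref[0] == alt[0]:
        ref, alt, start = ref[1:], alt[1:], start + 1

    # Step 3: classify the trimmed allele.
    if not ref and not alt:
        return original              # reference allele: returned unmodified
    if ref and alt:
        return (start, end, alt)     # substitution

    # Step 4: determine bounds of ambiguity for an insertion or deletion.
    seq = ref or alt                 # the non-empty allele sequence
    n = len(seq)

    # Left roll; indexing modulo n stands in for the circular permutation.
    left, k = start, 0
    while left > 0 and reference[left - 1] == seq[-(k % n) - 1]:
        left, k = left - 1, k + 1

    # Right roll; assumption: it starts at end and from the unpermuted sequence.
    right, k = end, 0
    while right < len(reference) and reference[right] == seq[k % n]:
        right, k = right + 1, k + 1

    # Step 5: expand the alternate sequence over the region of ambiguity.
    return (left, right, reference[left:start] + alt + reference[end:right])


# Two different descriptions of the same inserted CAG repeat unit.
print(fully_justify("TCAGCAGCT", 4, 4, "CAG"))
print(fully_justify("TCAGCAGCT", 5, 5, "AGC"))
```

Both calls describe the same resulting sequence and normalize to the same result, (1, 8, 'CAGCAGCAGC'), which is the property that makes fully-justified Alleles directly comparable.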

RepeatedSequenceExpression Alleles

When normalizing an Allele with a RepeatedSequenceExpression state, normalization is similar to that of LiteralSequenceExpression, expanding the Reference Allele Sequence to capture the entire region of ambiguity. Unlike LiteralSequenceExpression normalization, however, the region of ambiguity is defined by full-length repeat subunits. The Alternate Allele Sequence is also expanded in this way, but is represented by altering the RepeatedSequenceExpression.count attribute, rather than the seq_expr attribute.

The above only applies if RepeatedSequenceExpression.seq_expr is set to a LiteralSequenceExpression object. If the RepeatedSequenceExpression.seq_expr is instead a DerivedSequenceExpression, the Allele SHOULD be returned as-is.

References

Computed Identifiers

VRS provides an algorithmic solution to deterministically generate a globally unique identifier from a VRS object itself. All valid implementations of the VRS Computed Identifier will generate the same identifier when the objects are identical, and will generate different identifiers when they are not. The VRS Computed Digest algorithm obviates centralized registration services, allows computational pipelines to generate “private” ids efficiently, and makes it easier for distributed groups to share data.

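A VRS Computed Identifier for a VRS concept is computed by normalizing the object where applicable, serializing it (see Digest Serialization), computing a truncated digest of the serialization (see Truncated Digest (sha512t24u)), and constructing a CURIE-formatted identifier from that digest (see Identifier Construction).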

Important

Normalizing objects is STRONGLY RECOMMENDED for interoperability. While normalization is not strictly required, automated validation mechanisms are anticipated that will likely disqualify Variation that is not normalized. See Implementations should normalize Alleles for a rationale.

The following diagram depicts the operations necessary to generate a computed identifier. These operations are described in detail in the subsequent sections.

_images/id-dig-ser.png

Serialization, Digest, and Computed Identifier Operations

Entities are shown in gray boxes. Functions are denoted by bold italics. The yellow, green, and blue boxes, corresponding to the sha512t24u, ga4gh_digest, and ga4gh_identify functions respectively, depict the dependencies among functions. SHA512 is SHA-512 truncated to 24 bytes (192 bits), using the SHA-512 initialization vector. base64url is the official name of the variant of Base64 encoding that uses a URL-safe character set. [figure source]

Note

Most implementation users will need only the ga4gh_identify function. We describe the ga4gh_serialize, ga4gh_digest, and sha512t24u functions here primarily for implementers.

Requirements

Implementations MUST adhere to the following requirements:

  • Implementations MUST use the normalization, serialization, and digest mechanisms described in this section when generating GA4GH Computed Identifiers. Implementations MUST NOT use any other normalization, serialization, or digest mechanism to generate a GA4GH Computed Identifier.

  • Implementations MUST ensure that all nested objects are identified with GA4GH Computed Identifiers. Implementations MUST NOT reference nested objects using identifiers in any namespace other than ga4gh.

Note

The GA4GH schema MAY be used with identifiers from any namespace. For example, a SequenceLocation may be defined using a sequence_id = refseq:NC_000019.10. However, an implementation of the Computed Identifier algorithm MUST first translate sequence accessions to GA4GH SQ accessions to be compliant with this specification.

Digest Serialization

Digest serialization converts a VRS object into a binary representation in preparation for computing a digest of the object. The Digest Serialization specification ensures that all implementations serialize variation objects identically, and therefore that the digests will also be identical. VRS provides validation tests to ensure compliance.

Important

Do not confuse Digest Serialization with JSON serialization or other serialization forms. Although Digest Serialization and JSON serialization appear similar, they are NOT interchangeable and will generate different GA4GH Digests.

Although several proposals exist for serializing arbitrary data in a consistent manner ([Gibson], [OLPC], [JCS]), none have been ratified. As a result, VRS defines a custom serialization format that is consistent with these proposals but does not rely on them for definition; it is hoped that a future ratified standard will be forward compatible with the process described here.

The first step in serialization is to generate message content. If the object is a string representing a Sequence, the serialization is the UTF-8 encoding of the string. Because this is a common operation, implementations are strongly encouraged to precompute GA4GH sequence identifiers as described in Required External Data.

If the object is an instance of a VRS class, implementations MUST:

  • ensure that objects are referenced with identifiers in the ga4gh namespace

  • replace each nested identifiable object with its corresponding digest. (Note: Attributes of some objects, such as CopyNumber, permit a mix of identifiable and non-identifiable values.)

  • order arrays of digests and ids by Unicode Character Set values

  • filter out fields that start with underscore (e.g., _id)

  • filter out fields with null values

The second step is to JSON serialize the message content with the following REQUIRED constraints:

  • encode the serialization in UTF-8

  • exclude insignificant whitespace, as defined in RFC8259§2

  • order all keys by Unicode Character Set values

  • use two-char escape codes when available, as defined in RFC8259§7

The criterion for the digest serialization method was that it must be relatively easy and reliable to implement in any common computer language.
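
The filtering and JSON constraints above map closely onto Python's standard `json` module. The sketch below is a simplified illustration, not a compliant implementation: the function name `serialize_sketch` is hypothetical, and it assumes that nested identifiable objects have already been replaced by their digests and that any arrays of digests are already ordered.

```python
import json


def serialize_sketch(obj: dict) -> bytes:
    """Simplified digest serialization of a dict-shaped object (hypothetical helper)."""

    def clean(value):
        if isinstance(value, dict):
            # Drop fields that start with an underscore and fields with null values.
            return {k: clean(v) for k, v in value.items()
                    if not k.startswith("_") and v is not None}
        if isinstance(value, list):
            return [clean(v) for v in value]
        return value

    text = json.dumps(
        clean(obj),
        sort_keys=True,          # order keys by code point
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # emit UTF-8 rather than \uXXXX escapes
    )
    return text.encode("utf-8")


allele = {
    "_id": "ga4gh:VA.placeholder",  # hypothetical value; dropped by the filter
    "type": "Allele",
    "state": {"type": "SequenceState", "sequence": "T"},
    "location": "u5fspwVbQ79QkX6GHLF8tXPCAXFJqRPx",  # digest of the nested location
}
print(serialize_sketch(allele))
```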

Example

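from ga4gh.core import ga4gh_serialize
from ga4gh.vrs import models
# inter-residue interval covering the residue at 1-based position 44908822
simple_interval = models.SimpleInterval(start=44908821, end=44908822)
allele = models.Allele(location=models.SequenceLocation(
    sequence_id="ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
    interval=simple_interval),
    state=models.SequenceState(sequence="T"))
ga4gh_serialize(allele)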

Gives the following binary (UTF-8 encoded) data:

{"location":"u5fspwVbQ79QkX6GHLF8tXPCAXFJqRPx","state":{"sequence":"T","type":"SequenceState"},"type":"Allele"}

For comparison, here is one of many possible JSON serializations of the same object:

allele.for_json()
{
  "location": {
    "interval": {
      "end": 44908822,
      "start": 44908821,
      "type": "SimpleInterval"
    },
    "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
    "type": "SequenceLocation"
  },
  "state": {
    "sequence": "T",
    "type": "SequenceState"
  },
  "type": "Allele"
}

Truncated Digest (sha512t24u)

The sha512t24u truncated digest algorithm [Hart2020] computes an ASCII digest from binary data. The method uses two well-established standard algorithms: the SHA-512 hash function, which generates a binary digest from binary data, and a URL-safe variant of Base64 encoding, which encodes binary data using printable characters.

Computing the sha512t24u truncated digest for binary data consists of three steps:

  1. Compute the SHA-512 digest of the binary data.

  2. Truncate the digest to the left-most 24 bytes (192 bits). See Truncated Digest Timing and Collision Analysis for the rationale for 24 bytes.

  3. Encode the truncated digest as a base64url ASCII string.

>>> import base64, hashlib
>>> def sha512t24u(blob):
...     digest = hashlib.sha512(blob).digest()
...     tdigest = digest[:24]
...     tdigest_b64u = base64.urlsafe_b64encode(tdigest).decode("ASCII")
...     return tdigest_b64u
...
>>> sha512t24u(b"ACGT")
'aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2'

Identifier Construction

The final step of generating a computed identifier for a VRS object is to generate a W3C CURIE formatted identifier, which has the form:

prefix ":" reference

The GA4GH VRS constructs computed identifiers as follows:

"ga4gh" ":" type_prefix "." <digest>

Warning

Do not confuse the W3C CURIE prefix (“ga4gh”) with the type prefix.

Type prefixes used by VRS are:

type_prefix | VRS class name
SQ | Sequence
VA | Allele
VH | Haplotype
VAB | Abundance
VS | VariationSet
VSL | SequenceLocation
VCL | ChromosomeLocation
VT | Text

For example, the identifier for the allele example under Digest Serialization is:

ga4gh:VA.EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH_
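
The construction rule can be illustrated by combining the type prefix table with the sha512t24u function shown earlier. This is a sketch only: `identify_sketch` is a hypothetical helper that takes already-serialized bytes, so it does not perform normalization or digest serialization itself.

```python
import base64
import hashlib

# Type prefixes copied from the table above.
TYPE_PREFIXES = {
    "Sequence": "SQ",
    "Allele": "VA",
    "Haplotype": "VH",
    "Abundance": "VAB",
    "VariationSet": "VS",
    "SequenceLocation": "VSL",
    "ChromosomeLocation": "VCL",
    "Text": "VT",
}


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


def identify_sketch(class_name: str, serialized: bytes) -> str:
    """Build "ga4gh" ":" type_prefix "." digest from serialized bytes (hypothetical helper)."""
    return f"ga4gh:{TYPE_PREFIXES[class_name]}.{sha512t24u(serialized)}"


# A Sequence is serialized as the UTF-8 encoding of the sequence string.
print(identify_sketch("Sequence", "ACGT".encode("utf-8")))
```

Given the sha512t24u result shown earlier for b"ACGT", this should print ga4gh:SQ.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2.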

References

[Hart2020]

Hart RK, Prlić A. SeqRepo: A system for managing local collections of biological sequences. PLoS One. 2020;15: e0239883. doi:10.1371/journal.pone.0239883

Example

This section provides a complete, language-neutral example of essential features of VRS. In this example, we will translate an HGVS-formatted variant, NC_000019.10:g.44908822C>T, into its VRS format and assign a globally unique identifier.

Translate HGVS to VRS

The HGVS Variant Nomenclature string NC_000019.10:g.44908822C>T represents a single base substitution on the reference sequence NC_000019.10 (human chromosome 19, assembly GRCh38) at position 44908822 from the reference nucleotide C to T.

In VRS, a contiguous change is represented using an Allele object, which is composed of a Location and of the State at that location. Location and State are abstract concepts: VRS is designed to accommodate many kinds of Locations based on sequence position, gene names, cytogenetic bands, or other ways of describing locations. Similarly, State may refer to a specific sequence change, a contiguous repeated sequence, or a sequence derived from another source.

In this example, we will use a SequenceLocation, which is composed of a sequence identifier and a SequenceInterval.

In VRS, all identifiers are Compact URIs (CURIEs). Therefore, NC_000019.10 MUST be written as the string refseq:NC_000019.10 to make explicit that this sequence is from RefSeq. VRS does not restrict which data sources may be used, but does recommend using prefixes from identifiers.org.

VRS uses Inter-residue Coordinates. Inter-residue coordinates always use intervals to refer to sequence spans. For the purposes of this example, inter-residue coordinates look like the more familiar 0-based, right-open numbering system. (Please read about Inter-residue Coordinates if you are interested in the significant advantages of this design choice over other coordinate systems.)

The SequenceInterval for the position 44908822 is

{
  "end": {
    "type": "Number",
    "value": 44908822
  },
  "start": {
    "type": "Number",
    "value": 44908821
  },
  "type": "SequenceInterval"
}
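
The conversion from the 1-based HGVS position to the inter-residue interval above is a subtraction of one from the start coordinate. A minimal sketch, assuming a single-residue position; the function name `position_to_sequence_interval` is hypothetical:

```python
def position_to_sequence_interval(position: int) -> dict:
    """Convert a 1-based residue position to a SequenceInterval dict.

    The residue at 1-based position p lies between inter-residue
    coordinates p - 1 and p.
    """
    if position < 1:
        raise ValueError("1-based positions start at 1")
    return {
        "type": "SequenceInterval",
        "start": {"type": "Number", "value": position - 1},
        "end": {"type": "Number", "value": position},
    }


interval = position_to_sequence_interval(44908822)
print(interval["start"]["value"], interval["end"]["value"])
```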

The SequenceLocation is constructed from a sequence identifier and the above interval.

{
  "interval": {
    "end": {
      "type": "Number",
      "value": 44908822
    },
    "start": {
      "type": "Number",
      "value": 44908821
    },
    "type": "SequenceInterval"
  },
  "sequence_id": "refseq:NC_000019.10",
  "type": "SequenceLocation"
}

A LiteralSequenceExpression object consists simply of the replacement sequence, as follows:

{
  "sequence": "T",
  "type": "LiteralSequenceExpression"
}

The Allele object’s location and state attributes may then be constructed from the above SequenceLocation and LiteralSequenceExpression, respectively:

{
  "location": {
    "interval": {
      "end": {
        "type": "Number",
        "value": 44908822
      },
      "start": {
        "type": "Number",
        "value": 44908821
      },
      "type": "SequenceInterval"
    },
    "sequence_id": "refseq:NC_000019.10",
    "type": "SequenceLocation"
  },
  "state": {
    "sequence": "T",
    "type": "LiteralSequenceExpression"
  },
  "type": "Allele"
}

This Allele is a fully-compliant VRS object that is parsable using the VRS JSON Schema.
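The composition of these nested objects can be sketched with plain Python dictionaries. This is only an illustration of the structure shown above: the helper name make_allele is hypothetical, and the sketch performs no validation against the VRS JSON Schema.

```python
import json


def make_allele(sequence_id: str, start: int, end: int, sequence: str) -> dict:
    """Compose an Allele from a SequenceLocation and a LiteralSequenceExpression."""
    interval = {
        "type": "SequenceInterval",
        "start": {"type": "Number", "value": start},
        "end": {"type": "Number", "value": end},
    }
    location = {
        "type": "SequenceLocation",
        "sequence_id": sequence_id,
        "interval": interval,
    }
    state = {"type": "LiteralSequenceExpression", "sequence": sequence}
    return {"type": "Allele", "location": location, "state": state}


allele = make_allele("refseq:NC_000019.10", 44908821, 44908822, "T")
# sort_keys reproduces the key order used in the examples above.
print(json.dumps(allele, indent=2, sort_keys=True))
```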

Note

VRS is verbose! The goal of VRS is to provide an extensible framework for representation of sequence variation in computers. VRS objects are readily parsable and have precise meaning, but are often larger than other representations and are typically less readable by humans. This tradeoff is intentional!

Generate a computed identifier

A key feature of VRS is an easily-implemented algorithm to generate computed, digest-based identifiers for variation objects. This algorithm permits organizations to generate the same identifier for the same allele without prior coordination, which in turn facilitates sharing, obviates centralized registration services, and enables identifiers to be used in secure settings (such as diagnostic labs).

The VRS computed identifier procedure requires that all nested identifiable objects are expressed using computed identifiers. Using GA4GH sequence identifiers collapses differences between alleles that arise only from differences in reference naming. Without them, the same variation reported on NC_000019.10, CM000681.2, GRCh38:19, and GRCh38.p13:19 would appear to be distinct variation; using a digest identifier ensures that the variation is reported on a single sequence identifier. Furthermore, using digest-based sequence identifiers enables the use of custom reference sequences.

Important

VRS permits the use of conventional sequence accessions from RefSeq, Ensembl, or other sources. However, when generating computed identifiers, implementations MUST use GA4GH sequence accessions.

In this example, the sequence identifier refseq:NC_000019.10 MUST be transformed into the digest-based identifier ga4gh:GS.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl as described in Computed Identifiers. In practice, implementations should precompute sequence digests or should use an existing service that does so. (See Required External Data for a description of data that are needed to implement VRS.) Substituting the GA4GH sequence identifier into the Allele’s location.sequence_id attribute gives:

{
  "location": {
    "interval": {
      "end": {
        "type": "Number",
        "value": 44908822
      },
      "start": {
        "type": "Number",
        "value": 44908821
      },
      "type": "SequenceInterval"
    },
    "sequence_id": "ga4gh:GS.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
    "type": "SequenceLocation"
  },
  "state": {
    "sequence": "T",
    "type": "LiteralSequenceExpression"
  },
  "type": "Allele"
}
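The substitution step can be sketched as a simple lookup. The table SEQUENCE_DIGESTS below is a hypothetical stand-in for precomputed digests or an external service, and contains only the single mapping given in this example.

```python
import copy

# Hypothetical precomputed lookup from conventional accessions to GA4GH
# sequence identifiers. Only the mapping from this example is included.
SEQUENCE_DIGESTS = {
    "refseq:NC_000019.10": "ga4gh:GS.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
}


def with_ga4gh_sequence_id(allele: dict) -> dict:
    """Return a copy of the Allele whose location uses a GA4GH sequence id."""
    result = copy.deepcopy(allele)  # leave the caller's object untouched
    sequence_id = result["location"]["sequence_id"]
    if not sequence_id.startswith("ga4gh:"):
        # Raises KeyError when no digest is known for the accession.
        result["location"]["sequence_id"] = SEQUENCE_DIGESTS[sequence_id]
    return result


original = {
    "type": "Allele",
    "location": {"type": "SequenceLocation", "sequence_id": "refseq:NC_000019.10"},
    "state": {"type": "LiteralSequenceExpression", "sequence": "T"},
}
translated = with_ga4gh_sequence_id(original)
print(translated["location"]["sequence_id"])
```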

The first step in constructing a computed identifier is to create a binary digest serialization of the Allele. Details are provided in Computed Identifiers. For this example, the binary (ASCII encoded) object looks like:

{"location":"esDSArZQC-Sx-96ZZzHnzAVNOc439oE5","state":{"sequence":"T","type":"LiteralSequenceExpression"},"type":"Allele"}

Important

The GA4GH binary digest serialization process imposes constraints that guarantee that different implementations will generate the same binary “blob” for a given object. Do not confuse binary digest serialization with JSON serialization, which is used elsewhere with VRS schema.
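For the object in this example, a blob of the form shown above can be reproduced with Python's standard json module by sorting keys and omitting insignificant whitespace. This is a sketch of the final encoding step only: it assumes that nested identifiable objects have already been replaced by their digests, and it does not implement the other rules described in Computed Identifiers. The function name digest_serialize is illustrative.

```python
import json


def digest_serialize(obj: dict) -> bytes:
    """Encode an already-prepared object as a compact, key-sorted byte string."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


# The location has already been replaced by its digest, as in the example above.
prepared = {
    "type": "Allele",
    "state": {"type": "LiteralSequenceExpression", "sequence": "T"},
    "location": "esDSArZQC-Sx-96ZZzHnzAVNOc439oE5",
}
blob = digest_serialize(prepared)
print(blob.decode("utf-8"))
```

Because keys are sorted, the insertion order of the dictionary does not affect the resulting bytes.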

The GA4GH digest for the above blob is computed using the first 192 bits (24 bytes) of the SHA-512 digest, base64url encoded. Conceptually, the function is base64url( sha512( blob )[:24] ). In this example, the value returned is _YNe5V9kyydfkGU0NRyCMHDSKHL4YNvc.
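The conceptual function above translates directly into Python's standard library. The following is a minimal sketch; the name sha512t24u is borrowed from the release notes later in this document, and the implementation shown is an illustration rather than the reference code.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """Truncate the SHA-512 digest of blob to 24 bytes and base64url-encode it."""
    digest = hashlib.sha512(blob).digest()
    # 24 bytes encode to exactly 32 base64url characters, so no padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


print(sha512t24u(b""))
print(len(sha512t24u(b"ACGT")))
```

Truncating to a multiple of 3 bytes is what keeps the encoded digest free of "=" padding characters.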

A GA4GH Computed Identifier has the form:

"ga4gh" ":" <type_prefix> "." <digest>

The type_prefix for a VRS Allele is VA, which results in the following computed identifier for our example:

ga4gh:VA._YNe5V9kyydfkGU0NRyCMHDSKHL4YNvc
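Assembling and splitting an identifier of this form is simple string handling. The sketch below uses hypothetical helper names, and its regular expression encodes two assumptions not stated above: that type prefixes consist of uppercase letters and that digests are always 32 base64url characters.

```python
import re

# Assumed pattern: uppercase type prefix and a 32-character base64url digest.
_IDENTIFIER = re.compile(r"^ga4gh:(?P<prefix>[A-Z]+)\.(?P<digest>[0-9A-Za-z_-]{32})$")


def make_identifier(type_prefix: str, digest: str) -> str:
    """Join a type prefix and a digest into a GA4GH computed identifier."""
    return f"ga4gh:{type_prefix}.{digest}"


def split_identifier(identifier: str) -> tuple[str, str]:
    """Return (type_prefix, digest), or raise ValueError if the form is wrong."""
    match = _IDENTIFIER.match(identifier)
    if match is None:
        raise ValueError(f"not a GA4GH computed identifier: {identifier!r}")
    return match["prefix"], match["digest"]


allele_id = make_identifier("VA", "_YNe5V9kyydfkGU0NRyCMHDSKHL4YNvc")
print(allele_id)
print(split_identifier(allele_id))
```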

Importantly, GA4GH computed identifiers may be used literally (without escaping) in URIs.

Variation and Location objects contain an OPTIONAL _id attribute which implementations may use to store any CURIE-formatted identifier. If an implementation returns a computed identifier with objects, the object might look like the following:

{
  "_id": "ga4gh:VA._YNe5V9kyydfkGU0NRyCMHDSKHL4YNvc",
  "location": {
    "interval": {
      "end": {
        "type": "Number",
        "value": 44908822
      },
      "start": {
        "type": "Number",
        "value": 44908821
      },
      "type": "SequenceInterval"
    },
    "sequence_id": "refseq:NC_000019.10",
    "type": "SequenceLocation"
  },
  "state": {
    "sequence": "T",
    "type": "LiteralSequenceExpression"
  },
  "type": "Allele"
}

This example provides a full VRS-compliant Allele with a computed identifier.

Note

The _id attribute is optional. If it is used, the value MUST be a CURIE, but it does NOT need to be a GA4GH Computed Identifier. Applications MAY choose to implement their own identifier scheme for private or public use. For example, the above _id could be a serial number assigned by an application, such as acmecorp:v0000123.

What’s Next?

This walkthrough has shown a complete example for a relatively simple case. VRS provides a framework that will enable representation of much more complex variation. Please see Future Plans for a discussion of variation classes that are intended for support in the near future.

The Implementations section lists libraries and packages that implement VRS.

VRS objects are value objects. An important consequence of this design choice is that data should be associated with VRS objects via their identifiers rather than embedded within those objects. The appendix contains an example of associating annotations with variation.

Releases

Note

VRS follows Semantic Versioning 2.0. For a version number MAJOR.MINOR.PATCH:

  • MAJOR version is incremented for incompatible API changes.

  • MINOR version is incremented for new, backwards-compatible functionality. For VRS, this means changes that add support for new types of variation or extend existing types.

  • PATCH version is incremented for bug fixes. For VRS, examples are clarifications of documentation and bug fixes on property constraints. No changes to information models will occur in PATCH releases.

See the VRS Roadmap for upcoming developments. All currently planned work will be MINOR updates according to the guidelines above.

1.2

1.2.0

News
Important
Major Changes
  • New classification of variation types.

    • Molecular Variation refers to variation within or of a contiguous molecule

    • Systemic Variation refers to variation in the context of a system, such as a genome, sample, or homologous chromosomes

    • Utility Variation classes provide useful representations for certain technical operations

  • New SequenceExpressions subclasses replace SequenceState. Subtypes are:

  • CopyNumber, a form of SystemicVariation, represents the copies of a molecule within a genome, and can be used to express concepts such as amplification and copy loss.

  • Gene enables reference to an external definition of a gene, particularly for use as a subject of copy number expressions.

  • DefiniteRange and IndefiniteRange represent bounded and half-bounded ranges respectively. A new Number type wraps integers so that some attributes may assume values of any of these three types.

Minor Changes
  • Sequence strings are now formally defined by a Sequence type, which is fundamentally also a string. This change aids documentation but has no technical impact.

1.1

1.1.2

This patch version makes the following corrections and clarifications:

  • Adds the intended ChromosomeLocation prefix to the Computed Identifiers table.

  • Revises the Cytoband information model to align with ISCN conventions.

  • Updates the Cytoband regex to match the specified model.

1.1.1

This patch version makes the following corrections and clarifications:

  • Corrects styling / indexing of CytobandLocation in reStructuredText to match the current Schema and ER Diagram.

  • Removes erroneous bracket notation after CURIE from the locations attribute in the Allele information model.

  • Adds citation for sha512t24u and truncated digest collision analysis.

  • Revises Note in inter-residue design decision to acknowledge community terms.

1.1.0

1.1.0 is the second release of VRS.

New classes
  • ChromosomeLocation: A region of a chromosome specified by species and name using cytogenetic naming conventions.

  • CytobandInterval: A contiguous region specified by chromosomal band features.

  • Haplotype: A set of zero or more Alleles.

  • VariationSet: A set of Variation objects.

Other data model changes
  • Interval was renamed to SequenceInterval. Interval was an internal class that was never instantiated, so this change should not be visible to users.

Documentation changes

1.0

1.0.0

VRS 1.0.0 was the first public release of the Variation Representation Specification.

Appendices

Getting Involved

VRS is driven by community involvement. Here are a few ways that you can get involved:

Design Decisions

VRS contributors confronted numerous trade-offs in developing this specification. As these trade-offs may not be apparent to outside readers, this section highlights the most significant ones and the rationale for our design decisions, including:

Variation Rather than Variant

The abstract Variation class is intentionally not labeled “Variant”, despite this being the primary term used in other molecular variation exchange formats (e.g. Variant Call Format, HGVS Sequence Variant Nomenclature). This is because the term “Variant” as used in the Genetics community is intended to describe discrete changes in nucleotide / amino acid sequence. “Variation”, in contrast, captures other classes of molecular variation, including epigenetic alteration and transcript abundance. Capturing these other classes of variation is a future goal of VRS, as there are many annotations that will require these variation classes as the subject.

Allele Rather than Variant

The most primitive sequence assertion in VRS is the Allele entity. Colloquially, the words “allele” and “variant” have similar meanings and they are often used interchangeably. However, the VRS contributors assert that it is essential to distinguish the state of a reference sequence from a change from a reference sequence. It is imperative that precise terms are used when modelling data. Therefore, within VRS, “allele” refers to a state of a reference sequence and “variant” refers to a change from a reference sequence.

The word “variant”, which implies change, makes it awkward to refer to the (unchanged) reference allele. Some systems will use an HGVS-like syntax (e.g., NC_000019.10:g.44906586G>G or NC_000019.10:g.44906586=) when referring to an unchanged residue. In some cases, such “variants” are even associated with allele frequencies. Similarly, a predicted consequence is better associated with an allele than with a variant.

Alleles are Fully Justified

In order to standardize the representation of sequence variation, Alleles SHOULD be fully justified as described by the NCBI Variant Overprecision Correction Algorithm (VOCA). Furthermore, normalization rules are identical for all sequence types (DNA, RNA, and protein).

The choice of algorithm was relatively straightforward: VOCA is published, easily understood, easily implemented, and covers a wide range of cases.

The choice to fully justify is a departure from other common variation formats. The HGVS nomenclature recommendations, originally published in 1998, require that alleles be right normalized (3’ rule) on all sequence types. The Variant Call Format (VCF), released as a PDF specification in 2009, made the conflicting choice to write variants left (5’) normalized and anchored to the previous nucleotide.

Fully-justified alleles represent an alternate approach. A fully-justified representation does not make an arbitrary choice of where a variant truly occurs in a low-complexity region, but rather describes the final and unambiguous state of the resultant sequence.

Implementations should normalize Alleles

VRS STRONGLY RECOMMENDS that Alleles be normalized when generating computed identifiers unless there is compelling reason to do otherwise. Those reasons are the subject of this section.

Allele Normalization is the process of comparing a span of reference sequence to a sequence state (often the alternative sequence) and resolving that span to an unambiguous form. The fully-justified Allele normalization in VRS consists of two steps: trimming and shuffling. In the trimming step, common flanking prefix and suffix sequences are removed. For example, a CAG-to-CTG Allele would be trimmed to merely A-to-T, with the position adjusted accordingly. The trimmed sequences fall into one of four cases:

  1. Both trimmed sequences are empty: The Allele refers to reference state.

  2. Both trimmed sequences are non-empty: The Allele is a substitution (perhaps multi-residue).

  3. Only the reference sequence is empty: The Allele is a net insertion.

  4. Only the state sequence is empty: The Allele is a net deletion.
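The trimming step and the four resulting cases can be sketched in a few lines of Python. The function names are illustrative, and trimming the common suffix before the common prefix is an assumption of this sketch; the Normalization section remains the authoritative description.

```python
def trim(start: int, end: int, ref_seq: str, state_seq: str) -> tuple[int, int, str, str]:
    """Remove common flanking sequence and adjust inter-residue coordinates."""
    # Trim the common suffix, moving the end coordinate left.
    while ref_seq and state_seq and ref_seq[-1] == state_seq[-1]:
        ref_seq, state_seq = ref_seq[:-1], state_seq[:-1]
        end -= 1
    # Trim the common prefix, moving the start coordinate right.
    while ref_seq and state_seq and ref_seq[0] == state_seq[0]:
        ref_seq, state_seq = ref_seq[1:], state_seq[1:]
        start += 1
    return start, end, ref_seq, state_seq


def classify(ref_seq: str, state_seq: str) -> str:
    """Name the case that a pair of trimmed sequences falls into."""
    if not ref_seq and not state_seq:
        return "reference agreement"
    if ref_seq and state_seq:
        return "substitution"
    return "insertion" if not ref_seq else "deletion"


# The CAG-to-CTG example, placed arbitrarily at inter-residue interval [10, 13).
start, end, ref_seq, state_seq = trim(10, 13, "CAG", "CTG")
print(start, end, ref_seq, state_seq, classify(ref_seq, state_seq))
```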

When the Allele refers to a reference state (case 1), trimming would reduce the variant to a null change. However, reduction to a null state would make it impossible to refer to a specific span of reference sequence. In order to permit users to refer to spans of reference sequence, VRS does not require normalizing reference agreement Alleles.

The shuffling step applies only when the reference or the state sequence is empty (cases 3 and 4). When these occur in the context of repeating reference sequence that matches the inserted or deleted sequence, the Allele may be shuffled left and right to identify the fully-justified location of the variation. (See Normalization for details.)
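Shuffling an insertion or deletion left and right can be sketched as follows. This is a simplified illustration, not the VOCA reference algorithm: it assumes the Allele has already been trimmed, that the whole reference sequence is available as a Python string, and that coordinates are inter-residue. The function name fully_justify is hypothetical.

```python
def fully_justify(ref: str, start: int, end: int, state: str) -> tuple[int, int, str]:
    """Expand a trimmed insertion or deletion over the region it can occupy."""
    deleted = ref[start:end]
    if bool(deleted) == bool(state):
        # Substitution or reference agreement: there is nothing to shuffle.
        return start, end, state

    moving = deleted or state  # the inserted or deleted sequence

    # Roll left while the residue before the span matches the end of the unit.
    left, unit = start, moving
    while left > 0 and ref[left - 1] == unit[-1]:
        unit = unit[-1] + unit[:-1]  # rotate the unit one residue rightward
        left -= 1
    leftmost_unit = unit

    # Roll right while the residue after the span matches the start of the unit.
    right, unit = end, moving
    while right < len(ref) and ref[right] == unit[0]:
        unit = unit[1:] + unit[0]  # rotate the unit one residue leftward
        right += 1

    span = ref[left:right]
    if state:
        # Insertion: the leftmost placement followed by the unchanged span.
        return left, right, leftmost_unit + span
    # Deletion: the span shortened by the length of the deleted sequence.
    return left, right, span[: len(span) - len(deleted)]


# Deleting one G from a run of four gives GGGG-to-GGG over the whole run.
print(fully_justify("AGGGGT", 1, 2, ""))
# Inserting CAG inside a CAG repeat expands to cover the entire repeat.
print(fully_justify("TCAGCAGCAGT", 4, 4, "CAG"))
```

The result does not depend on which copy within the repeat was originally reported, which is the point of full justification.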

In rare cases, data originators might have reason to associate an annotation with a specific repeating unit in the context of repeated sequence. In order to support this case, normalization is not strictly required.

Most users will normalize most Alleles. Normalization should be skipped only when doing so would decrease the intended precision of an Allele.

Inter-residue Coordinates

Sequence ranges use an inter-residue coordinate system. VRS adopts inter-residue coordinate conventions because they provide conceptual consistency that is not possible with residue-based systems.

Important

The choice of what to count — residue or inter-residue positions — has significant semantic implications for the interpretation of coordinates. Although inter-residue coordinates and the “0-based” residue coordinates are often numerically identical, we favor “inter-residue” to emphasize the meaning of these coordinates.

When humans refer to a range of residues within a sequence, the most common convention is to use an interval of ordinal residue positions in the sequence. While natural for humans, this convention has several shortcomings when dealing with sequence variation.

For example, interval coordinates are interpreted as exclusive coordinates for insertions, but as inclusive coordinates for substitutions and deletions; in effect, the interpretation of coordinates depends on the variant type, which is an unfortunate coupling of distinct concepts.
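With inter-residue coordinates, by contrast, a single rule covers substitutions, insertions, and deletions. The sketch below applies an Allele-like change to a short reference string; the function name apply_change and the example sequence are illustrative only.

```python
def apply_change(reference: str, start: int, end: int, state: str) -> str:
    """Replace the inter-residue interval [start, end) of reference with state."""
    if not 0 <= start <= end <= len(reference):
        raise ValueError("interval must lie within the reference sequence")
    return reference[:start] + state + reference[end:]


reference = "ACGT"
print(apply_change(reference, 1, 2, "T"))   # substitution of one residue
print(apply_change(reference, 2, 2, "TT"))  # insertion: zero-width interval
print(apply_change(reference, 1, 3, ""))    # deletion: empty state
```

An insertion is simply an interval of length zero, so no variant-type-specific interpretation of the coordinates is needed.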

Modelling Language

The VRS collaborators investigated numerous options for modelling data, generating code, and writing the wire protocol. Required and desired selection criteria included:

  • language-neutral – or at least C/C++, Java, Python

  • high-quality tooling/libraries

  • high-quality code generation

  • documentation generation

  • supported constructs and data types
    • typedefs/aliases

    • enums

    • lists, maps, and maps of lists/maps

    • nested objects

  • protocol versioning (but not necessarily automatic adaptation)

Initial versions of the VRS logical model were implemented in UML, protobuf, swagger/OpenAPI, and JSON Schema. We have implemented our schema in JSON Schema. Nonetheless, it is anticipated that some adopters of the VRS logical model may implement the specification in other protocols.

Serialization Strategy

There are many packages and proposals that aspire to a canonical form for JSON in many languages. Despite this, there are no ratified or de facto winners. Many packages have similar names, which makes it difficult to discern whether they are related or not (often not). Although some packages look like good single-language candidates, none are ready for multi-language use. Many seem abandoned. The need for a canonical JSON form is evident, and there was at least one proposal for an ECMA standard.

Therefore, we implemented our own serialization format, which is very similar to Gibson Canonical JSON (not to be confused with OLPC Canonical JSON).

Not using External Chromosome Declarations

In ChromosomeLocation, the tuple <species,chromosome name> refers to an archetypal chromosome for the species. WikiData and MeSH provide such definitions (e.g., Human Chr 1 at WikiData and MeSH) and were considered, and rejected, for use in VRS. Both ontologies were anticipated to increase complexity that was not justified by the benefit to VRS. In addition, data in WikiData are crowd-sourced and therefore potentially unstable, and the species coverage in MeSH was insufficient for anticipated VRS uses.

Development Process

Release Cycle

Figure: The VRS development process.

The release cycle is implemented in the VR project board, which is the authoritative source of information about development status.

Planned Features

Feature requests from the community are made through the generation of GitHub issues on the VRS repository, which are open for public review and discussion.

Project Leadership Review

Open issues are reviewed and triaged by the Project Leadership. Feature requests identified to support an unmet need are added to the Backburner project column and scheduled for discussion in our weekly VR calls. These discussions are used to inform whether or not a feature will be planned for development. The Project Maintainers are responsible for making the final determination on whether or not a feature should be added to VRS.

Requirements Gathering

Once a planned feature is introduced on a call, the issue moves to the Planning project column. During this phase, community feedback on use cases and technical requirements will be collected (see example requirement issues). Deadlines for submitting cases will be set by the Project Maintainers.

Requirements Discussion

Once the requirements gathering phase has been completed, the issue moves to the Backlog/Ready for Dev project column. In this phase, the requirements undergo review and discussion by the community on VR calls.

Feature Development

After community review of requirements, the issue moves on to the In Progress project column. In this stage, the draft features will be developed as a draft Pull Request (PR). The draft author will indicate that a feature is ready for community review by marking the PR as “Ready for review” (at which point the PR loses “draft” status).

Feature Review

Once a PR is ready for review, the Project Maintainers will move the corresponding issue to the QA/Feedback project column. Pull requests ready for public review MAY be merged into the main (stable release) branch through review and approval by at least one (non-authoring) Project Maintainer. Merged commits MAY be tagged as alpha releases when needed. After merging, corresponding issues are moved to the Done project column and are closed.

Version Review and Release

After completion of all planned features for a new minor or major version, a request for community review will be indicated by a beta release of the new version. Community stakeholders involved in the feature requests and requirements gathering for the included features are notified by Project Maintainers for review and approval of the release. After a community review period of at least one week, the Project Leadership will review and address any raised concerns for the reviewed version.

After passing review, new minor versions are released to production. If any features in the reviewed version are deemed to be significant additions to the specification by the Project Leadership, or if it is a major version change, instead a release candidate version will be released and submitted for GA4GH product approval. After approval, the new version is released to production.

VRS follows GA4GH project versioning recommendations, based on Semantic Versioning 2.0.

Leadership

Project Leadership

As a product of the Genomic Knowledge Standards (GKS) Work Stream, project leadership is comprised of the Work Stream leadership:

Project Maintainers

Project maintainers are the leads of the GKS Variation Representation working group:

Future Plans

Overview

VRS covers a fundamental subset of data types to represent variation, thus far predominantly related to the replacement of a subsequence in a reference sequence. Increasing its applicability will require supporting more complex types of variation, including:

  • genotypes

  • structural variation

  • mosaicism and chimerism

  • categorical variation

Figure: Planned Variation Representation Specification Schema

See Current schema diagram for legend.

Existing classes are colored green. Components that are undergoing testing and evaluation and are candidates for the next release cycle are yellow. Components that are planned but still undergoing requirement gathering and initial development are colored red.

The following sections provide a preview of planned concepts under way to address a broader representation of variation.

Intervals and Locations

VRS uses Location subclasses to define where variation occurs. The schema is designed to be extensible to new kinds of Intervals and Locations in order to support, for example, fuzzy coordinates or feature-based locations.

ComplexInterval

Representation of complex coordinates based on relative locations or offsets from a known location. Examples include “left of” a given position and intronic positions measured from intron-exon junctions.

Computational definition

Under development.

Information model

Under development.

Variation Classes

Additional Variation concepts are being planned for future consideration in the specification. See Variation for more information.

Structural Variation

Note

This concept is being refined. Please comment at https://github.com/ga4gh/vrs/issues/103

The aberrant joining of two segments of DNA that are not typically contiguous. In the context of joining two distinct coding sequences, translocations result in a gene fusion, which is also covered by this VRS definition.

Computational definition

A joining of two sequences is defined by two Location objects and an indication of the join “pattern” (advice needed on conventional terminology, if any).

Information model

Under consideration. See https://github.com/ga4gh/vrs/issues/28.

Examples

t(9;22)(q34;q11) in BCR-ABL

Categorical Variation

Some variations are defined by categorical concepts, rather than specific locations and states. These variations go by many terms, including categorical variants, bucket variants, container variants, or variant classes. These forms of variation are not described by any broadly-recognized variation format, but modeling them is a key requirement for the representation of aggregate variation descriptions as commonly found in biomedical literature. Our future work will focus on the formal specification for representing these variations with sets of rules, which we currently call Categorical Variation.

Implementations

The libraries and applications listed below have implemented the GA4GH Variation Representation Specification to store and exchange variation data. They are listed here to demonstrate utility and as a resource for those considering implementing VRS. These packages are not supported by GA4GH.

Libraries

Libraries facilitate the use of the VRS, but do not implement a particular use or application. Although there is only one library currently, it is expected that others will eventually appear as VRS is adopted.

vrs-python: GA4GH VRS Python Implementation

The GA4GH VRS Python Implementation is an implementation for the GA4GH VRS. It supports all types covered by the VRS, implements Allele normalization and computed identifier generation, and provides “extra” features such as translation from HGVS, SPDI, and VCF formats.

VRS MAY be used without using the Python implementation.

Applications and Web Services

Applications implement VRS to support specific use cases. Projects known to implement VRS are listed below. Descriptions are provided by the application authors.

ClinGen Allele Registry

ClinGen Allele Registry [1] provides identifiers for more than 900 million variants. Each identifier (canonical allele identifiers: CAIds) is an abstract concept which represents a group of identical variants based on alignment. Identifiers are retrievable irrespective of the reference sequence and normalization status.

As a Driver Project for GA4GH, ClinGen Allele Registry implements two standards: RefGet and VRS in the first implementation.

The API endpoints that support data retrieval in these two key standards are summarized in the following table.

HOST: https://reg.clinicalgenome.org/

API Path                       Parameters                                    Response Format  Example

RefGet
[GET] /sequence/service-info   -                                             Refget v1.0.0    /sequence/service-info
[GET] /sequence/{id}           id => TRUNC512 digest for reference sequence  Refget v1.0.0    /sequence/vYfm5TA_F-_BtIGjfzjGOj8b6IK5hCTx
[GET] /sequence/{id}/metadata  id => TRUNC512 digest for reference sequence  Refget v1.0.0    /sequence/vYfm5TA_F-_BtIGjfzjGOj8b6IK5hCTx/metadata

VRS
[GET] /vrAllele?hgvs={hgvs}    hgvs => HGVS expression                       VRS v1.0         /vrAllele?hgvs=NC_000007.14:g.55181320A>T
                                                                                              /vrAllele?hgvs=NC_000007.14:g.55181220del
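A request URL for the vrAllele endpoint can be assembled with Python's standard library, which also takes care of percent-encoding characters such as ":" and ">" in the HGVS expression. This sketch only builds the URL and makes no network request; the helper name vr_allele_url is hypothetical, and the host is the one listed above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

HOST = "https://reg.clinicalgenome.org"  # host as listed in the table above


def vr_allele_url(hgvs: str) -> str:
    """Build the vrAllele query URL for an HGVS expression."""
    return f"{HOST}/vrAllele?{urlencode({'hgvs': hgvs})}"


url = vr_allele_url("NC_000007.14:g.55181320A>T")
print(url)

# Decoding the query string recovers the original HGVS expression.
print(parse_qs(urlsplit(url).query)["hgvs"][0])
```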

Support for GA4GH refget and VRS provided in ClinGen Allele Registry is independent from VRS-Python. Support for these community standards is implemented in ClinGen Allele Registry through extension of code written in C++.

BRCA Exchange

The goal of BRCA Exchange (https://brcaexchange.org/) is to expand approaches to integrate and disseminate information on BRCA variants in Hereditary Breast and Ovarian Cancer (HBOC), as an exemplar for additional genes and additional heritable disorders [2]. The BRCA Exchange web portal provides information on the annotation and clinical interpretation of 40,000 variants to date. As a GA4GH Driver Project, BRCA Exchange is contributing to and adopting the Variant Annotation (VA), Pedigree (Ped) and Variation Representation (VRS) standards. BRCA Exchange displays the VRS identifiers of all variants, and provides an API endpoint for querying variants by VRS identifier. With this endpoint, if BRCA Exchange contains a variant that matches the VRS identifier, it returns data on that variant. Otherwise, it returns a Server 500 error.

Example query:

VICC Meta-knowledgebase

The Variant Interpretation for Cancer Consortium (VICC; https://cancervariants.org) has a collection of ~20K clinical interpretations associated with ~3,500 somatic variations and variation classes in a harmonized meta-knowledgebase [3] (see documentation at http://docs.cancervariants.org). Each interpretation is linked to one or more variations or a variation class.

As a Driver Project for GA4GH, VICC is contributing to and/or adopting several GA4GH standards, including VRS, Variant Annotation (VA), and service_info. VICC supports queries on all VRS computed identifiers at the searchAssociations endpoint (vicc-docs). Features associated with each interpretation are represented as VRS objects.

Example queries:

References:

Relationship of VRS to existing standards

Because a primary objective of the GA4GH Variation Representation Specification (VRS) effort is to unify disparate efforts to represent biological sequence variation, it is important to describe how this document relates to previous work in order to avoid “reinventing the wheel”.

The Variant Call Format (VCF) is the de facto standard for representing alleles, particularly for use during primary analysis in high-throughput sequencing pipelines. VCF permits a wide range of annotations on alleles, such as quality and likelihood scores. VCF is a file-based format and is exclusively for genomic alleles. In contrast, the VRS data model abstractly represents Alleles, Haplotypes, and Genotypes on all sequence types, is independent of medium, and is well-suited to secondary analyses, allele interpretation, aggregation, and system-level interoperability.

The HGVS nomenclature recommendations describe how sequence variation should be presented to human beings. In addition to representing a wide variety of sequence changes from single residue variation through large cytogenetic events, HGVS attempts to also encode in strings notions of biological mechanism (e.g., inversion as a kind of deletion-insertion event), predicted events (e.g., parentheses for computing protein sequence), and complex states (e.g., mosaicism). In practice, HGVS recommendations are difficult to implement fully and consistently, leading to ambiguity in presentation. In contrast, the VRS is a formal specification that improves consistency of representation among computer systems. VRS is currently less expressive than HGVS for rarer cases of variation, such as cytogenetic variation or context-based allele representations (e.g., insT written as dupT when the insertion follows a T). Future versions of the specification will seek to address limitations while preserving principles of conceptual clarity and precision.

The Sequence Ontology (SO) is a set of terms and relationships used to describe the features and attributes of biological sequence. The focus of the SO has been the annotation of, or placement of ‘meaning’, onto genomic sequence regions. The VRS effort seeks to use the same descriptive definitions where possible, and to inform the refinement of SO.

The Genotype Ontology (GENO) builds on the SO to include richer modeling of genetic variation at different levels of granularity that are captured in genotype representations. Unlike the SO which is used primarily for annotation of genomic features, GENO was developed by the Monarch Initiative to support semantic data models for integrated representation of genotypes and genetic variants described in human and model organism databases. The core of the GENO model decomposes a genotype specifying sequence variation across an entire genome into smaller components of variation (e.g. allelic composition at a particular locus, haplotypes, gene alleles, and specific sequence alterations). GENO also enables description of biological attributes of these genetic entities (e.g. zygosity, phase, copy number, parental origin, genomic position), and their causal relationships with phenotypes and diseases.

ClinVar is an archive of clinically reported relationships between variation and phenotypes along with interpretations and supporting evidence. Data in ClinVar are submitted primarily by diagnostic labs. ClinVar includes expert reviews and data links to other clinically-relevant resources at NCBI. VRS is expected to facilitate data submissions by providing unified guidelines for data structure and allele normalization.

ClinGen provides a centralized database of genomic and phenotypic data provided by clinicians, researchers, and patients. It standardizes clinical annotation and interpretation of genomic variants and provides evidence-based expert consensus for curated genes and variants. ClinGen has informed the VRS effort and is committed to harmonizing and collaborating on the evolution of the VRS specification to achieve improved data sharing.

HL7 FHIR Genomics, Version 2 Clinical Genomics Implementation Guide, CDA Genetic Test Report: There are several standards developed under the HL7 umbrella that include a genomics component. The FHIR Genomics component was released as part of the overall FHIR specification (latest is Release 3) based on standardized use cases. The HL7 Clinical Genomics (CG) Work Group focuses on developing standards for clinical genomic data and related relevant information within EHRs. The specifications developed by the CG work group primarily utilize the HL7 v2 messaging standard and the newer HL7 FHIR (Fast Healthcare Interoperability Resources) framework.

The SPDI format created to represent alleles in NCBI’s Variation Services has four components: the sequence identifier, which is specified with a sequence accession and version; the 0-based inter-residue coordinate where the deletion starts; the deleted sequence (or its length) and the inserted sequence. The Variation Services return the minimum deleted sequence required to avoid over precision. For example, a deletion of one G in a run of 4 is specified with deleted and inserted sequences of GGGG and GGG respectively, avoiding the need to left or right shift the minimal representation. This reduces ambiguity, but can lead to long allele descriptions.
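
The trade-off described above can be sketched in a few lines of Python. This is an illustration only, not NCBI's implementation: the accession `SEQ.1`, the toy reference string, and the helper names `parse_spdi` and `apply_spdi` are assumptions made for the example, and only the spelled-out form of the deleted sequence (not its length) is handled.

```python
from typing import NamedTuple


class Spdi(NamedTuple):
    """The four SPDI components described in the text."""
    sequence: str   # sequence accession and version
    position: int   # 0-based inter-residue coordinate where the deletion starts
    deletion: str   # deleted sequence
    insertion: str  # inserted sequence


def parse_spdi(expr: str) -> Spdi:
    """Split a colon-delimited SPDI string into its four components."""
    sequence, position, deletion, insertion = expr.split(":")
    return Spdi(sequence, int(position), deletion, insertion)


def apply_spdi(reference: str, spdi: Spdi) -> str:
    """Return the reference with the deleted span replaced by the insertion."""
    end = spdi.position + len(spdi.deletion)
    if reference[spdi.position:end] != spdi.deletion:
        raise ValueError("deleted sequence does not match the reference")
    return reference[:spdi.position] + spdi.insertion + reference[end:]


# Hypothetical reference: a run of four G residues flanked by other bases.
reference = "ACGGGGTA"

# Minimal forms: deleting a single G at any of four positions is equivalent.
minimal = [parse_spdi(f"SEQ.1:{pos}:G:") for pos in range(2, 6)]

# Expanded form covering the whole run, as in the GGGG / GGG example.
expanded = parse_spdi("SEQ.1:2:GGGG:GGG")

# Every form yields the same sequence; only the expanded form is unambiguous.
results = {apply_spdi(reference, s) for s in minimal + [expanded]}
print(results)
```

All five expressions produce the same resulting sequence, which is why the expanded form removes the need to choose between left and right shifting.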

From https://github.com/ga4gh/vrs/issues/305:

VRS is designed as an information model made of atomic building blocks that can be composed into higher-order variant representations. Its primary function is precise computational data exchange.

VRS is also extensible. It is not limited to simple SNVs, DelIns, or any other subset of variation, and so it can serve as a standard that grows to cover types of variation that other methods, nomenclatures, and authorizing registries (SPDI, VCF, and HGVS) often cannot represent.

VRS is not limited to genomic sequence; it supports any type of sequence (genomic, transcript, protein).

VRS is not limited to sequence-based variation; it can also represent concepts such as cytobands, systemic expression, and genetic features.

SPDI is only about alleles and precise genomic variation. SPDI’s nomenclature is built on VOCA (Variant Overprecision Correction Algorithm) as specified by NCBI. VRS is built on VOCA as well for the types of variation that fall within its domain.

VCF is genomic only. VCF is a file format. VCF is primarily designed for high-volume, compact variant calls. VCF is not designed to be extensible in the same way as VRS to support much broader representations of variation independent of samples or cohorts. VCF does not normalize small, precise SNVs and DelIns using the same VOCA-based normalization.

HGVS is a nomenclature. HGVS is designed primarily for human readability, not computational identification. HGVS is not applied consistently in reporting, literature, and databases, even though there have been great strides in providing tooling to validate HGVS syntax. HGVS does not normalize variation using VOCA. Several HGVS expressions can represent the same variant. VRS is not designed to be human-readable. (We have started designing implementation guidance for wrapping VRS representations in Value Object Descriptors, which allow exchange systems to add human-readable and useful attributes that improve the productivity of data exchange contracts involving variation; see VRSATILE.)

Associating Annotations with VRS Objects

Information is never embedded within VRS objects. Instead, it is associated with those objects by means of their ids. This approach to annotations scales better in size and distributes better across multiple data sources.

The Genomic Knowledge Standards Work Stream is currently developing a Value Object Descriptors policy to provide a standardized way to associate common annotations with VRS objects as part of the VRSATILE framework. This approach enables standard and verbose exchange while maintaining the advantages of the VRS value object design philosophy.

This example demonstrates how to associate information with VRS objects. Although the examples use the GA4GH VRS Python Implementation library, the principles apply regardless of implementation.

import collections
from ga4gh.vrs import ga4gh_identify, models
from ga4gh.vrs.dataproxy import SeqRepoRESTDataProxy
from ga4gh.vrs.extras.translator import Translator

# Requires seqrepo REST interface is running on this URL (e.g., using docker image)
seqrepo_rest_service_url = "http://localhost:5000/seqrepo"
dp = SeqRepoRESTDataProxy(base_url=seqrepo_rest_service_url)

tlr = Translator(data_proxy=dp)
# Declare some data as human-readable RS id labels with HGVS expressions
data = (
    ("rs7412C",   "NC_000019.10:g.44908822="),
    ("rs7412T",   "NC_000019.10:g.44908822C>T"),
    ("rs429358C", "NC_000019.10:g.44908684="),
    ("rs429358T", "NC_000019.10:g.44908684T>C")
)
# Parse the HGVS expressions and generate three dicts:
# alleles[allele_id] ⇒ allele object
# rs_names[allele_id] ⇒ rs label
# hgvs_names[allele_id] ⇒ original hgvs expression

# For convenience, also build
# rs_to_id[rs_name] ⇒ allele_id

alleles = {}
rs_names = {}
# Each allele id maps to a single HGVS string, so a plain dict is sufficient
hgvs_names = {}
for rs, hgvs_expr in data:
    allele = tlr.from_hgvs(hgvs_expr)
    allele_id = ga4gh_identify(allele)
    alleles[allele_id] = allele
    hgvs_names[allele_id] = hgvs_expr
    rs_names[allele_id] = rs

rs_to_id = {r: i for i, r in rs_names.items()}
# Now, build a new set of annotations: allele frequencies
# This is more complicated because it maps to a map of frequencies
# It should be clear that other frequencies could be easily added here
# or as a separate data source
freqs = {
    "gnomad": {
        "global": {
            rs_to_id["rs7412C"]: 0.9385,
            rs_to_id["rs7412T"]: 0.0615,
            rs_to_id["rs429358C"]: 0.1385,
            rs_to_id["rs429358T"]: 0.8615,
        }
    }
}
# It might be convenient to save these data
# A saved document might have structure like this:
doc = {
    "alleles": alleles,
    "hgvs_names": hgvs_names,
    "rs_names": rs_names,
    "freqs": freqs
}
# For the benefit of pretty printing, let's replace the allele objects with their dict representations
doc["alleles"] = {i: a.as_dict() for i, a in doc["alleles"].items()}
import json
print(json.dumps(doc, indent=2))
{
  "alleles": {
    "ga4gh:VA.UUvQpMYU5x8XXBS-RhBhmipTWe2AALzj": {
      "location": {
        "interval": {
          "end": 44908822,
          "start": 44908821,
          "type": "SimpleInterval"
        },
        "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "C",
        "type": "SequenceState"
      },
      "type": "Allele"
    },
    "ga4gh:VA.EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH_": {
      "location": {
        "interval": {
          "end": 44908822,
          "start": 44908821,
          "type": "SimpleInterval"
        },
        "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "T",
        "type": "SequenceState"
      },
      "type": "Allele"
    },
    "ga4gh:VA.LQrGFIOAP8wEAybwNBo8pJ3yIG7tXWoh": {
      "location": {
        "interval": {
          "end": 44908684,
          "start": 44908683,
          "type": "SimpleInterval"
        },
        "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "T",
        "type": "SequenceState"
      },
      "type": "Allele"
    },
    "ga4gh:VA.iXjilHZiyCEoD3wVMPMXG3B8BtYfL88H": {
      "location": {
        "interval": {
          "end": 44908684,
          "start": 44908683,
          "type": "SimpleInterval"
        },
        "sequence_id": "ga4gh:SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "C",
        "type": "SequenceState"
      },
      "type": "Allele"
    }
  },
  "hgvs_names": {
    "ga4gh:VA.UUvQpMYU5x8XXBS-RhBhmipTWe2AALzj": "NC_000019.10:g.44908822=",
    "ga4gh:VA.EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH_": "NC_000019.10:g.44908822C>T",
    "ga4gh:VA.LQrGFIOAP8wEAybwNBo8pJ3yIG7tXWoh": "NC_000019.10:g.44908684=",
    "ga4gh:VA.iXjilHZiyCEoD3wVMPMXG3B8BtYfL88H": "NC_000019.10:g.44908684T>C"
  },
  "rs_names": {
    "ga4gh:VA.UUvQpMYU5x8XXBS-RhBhmipTWe2AALzj": "rs7412C",
    "ga4gh:VA.EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH_": "rs7412T",
    "ga4gh:VA.LQrGFIOAP8wEAybwNBo8pJ3yIG7tXWoh": "rs429358C",
    "ga4gh:VA.iXjilHZiyCEoD3wVMPMXG3B8BtYfL88H": "rs429358T"
  },
  "freqs": {
    "gnomad": {
      "global": {
        "ga4gh:VA.UUvQpMYU5x8XXBS-RhBhmipTWe2AALzj": 0.9385,
        "ga4gh:VA.EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH_": 0.0615,
        "ga4gh:VA.LQrGFIOAP8wEAybwNBo8pJ3yIG7tXWoh": 0.1385,
        "ga4gh:VA.iXjilHZiyCEoD3wVMPMXG3B8BtYfL88H": 0.8615
      }
    }
  }
}

Equivalence Between Concepts

VRS allows for the expressive representation of variation concepts. As a result, some forms can be reduced or translated to others, in some cases in one direction only and in others bi-directionally. Examples include the bi-directional translation of chromosomal bands to sequence coordinates via a sequence-band mapping, the uni-directional translation of a gene to one or more sequence locations, and the use of different Sequence Expression instances that would resolve to the same sequence. Similarly, authority-based concepts such as Gene are entirely dependent on the definition of the concept by that authority; we provide no guidance on how to translate or relate such concepts to one another.

We provide no guidance or mechanism to enforce “equivalence” between these concepts, because the semantics of each representation are distinct, even when functions exist that equate or translate between two concepts. Instead, we encourage communities to adopt policies about how and when to use the various concepts provided by VRS to represent different forms of variation. To assist in that effort, the GA4GH Genomic Knowledge Standards Work Stream is developing a specification for resource-defined Variation Concept Origination Policies (VCOPs). You can learn more about VCOPs in the VRSATILE framework.

Using Sequence Expressions

When using Sequence Expressions, our general recommendation is to use LiteralSequenceExpression when the precise sequence state is important to the Variation concept; this is the most common use case. When the precise state is not important, and the intent is instead to refer to the general sequence derived from a location on a reference sequence, we recommend using a DerivedSequenceExpression; this is typically used when describing large sequences that are approximately reference for use in some large-scale Molecular Variation or Systemic Variation concepts. RepeatedSequenceExpression is typically used when it is semantically important to describe a specific, repeated subsequence by count, such as CAG repeats in the ATXN7 gene, where the repeat count is a diagnostic biomarker for the severe neurodegenerative disorder spinocerebellar ataxia type 7 [1].
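
To make the distinction concrete, the sketch below models the three kinds of Sequence Expression as plain Python dicts and resolves each one to a literal string. It is an illustration under stated assumptions: the field names (`sequence`, `location`, `seq_expr`, `count`), the toy reference store, and the `resolve` and `fetch` helpers are hypothetical and are not taken from the VRS schema, which should be consulted for the authoritative definitions.

```python
from typing import Callable

# Hypothetical callback: (sequence_id, start, end) -> residues at that location.
Fetch = Callable[[str, int, int], str]


def resolve(expr: dict, fetch: Fetch) -> str:
    """Resolve an illustrative sequence expression to a literal sequence."""
    kind = expr["type"]
    if kind == "LiteralSequenceExpression":
        # The precise state is spelled out directly.
        return expr["sequence"]
    if kind == "DerivedSequenceExpression":
        # The state is whatever the reference holds at the given location.
        loc = expr["location"]
        return fetch(loc["sequence_id"], loc["start"], loc["end"])
    if kind == "RepeatedSequenceExpression":
        # A subsequence repeated a given number of times.
        return resolve(expr["seq_expr"], fetch) * expr["count"]
    raise ValueError(f"unsupported expression type: {kind}")


# Toy reference store standing in for a real sequence repository.
REFERENCE = {"example:seq1": "TTCAGCAGCAGTT"}


def fetch(sequence_id: str, start: int, end: int) -> str:
    return REFERENCE[sequence_id][start:end]


literal = {"type": "LiteralSequenceExpression", "sequence": "CAGCAGCAG"}
derived = {
    "type": "DerivedSequenceExpression",
    "location": {"sequence_id": "example:seq1", "start": 2, "end": 11},
}
repeated = {
    "type": "RepeatedSequenceExpression",
    "seq_expr": {"type": "LiteralSequenceExpression", "sequence": "CAG"},
    "count": 3,
}

for expr in (literal, derived, repeated):
    print(expr["type"], resolve(expr, fetch))
```

All three expressions resolve to the same residues here, yet each carries a different meaning: an exact state, a pointer into a reference, and a counted repeat.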

Proposal for GA4GH-wide Computed Identifier Standard

This appendix describes a proposal for creating a GA4GH-wide standard for serializing data, computing digests on serialized data, and constructing CURIE identifiers from the digests. Essentially, it is a generalization of the Computed Identifiers section.

This standard is proposed now because VRS needs a well-defined mechanism for generating identifiers. Changing the identifier mechanism later will create significant issues for VRS adopters.

Background

The GA4GH mission entails structuring, connecting, and sharing data reliably. A key component of this effort is to be able to identify entities, that is, to associate identifiers with entities. Ideally, there will be exactly one identifier for each entity, and one entity for each identifier. Traditionally, identifiers are assigned to entities, which means that disconnected groups must coordinate on identifier assignment.

The computed identifier scheme proposed in VRS computes identifiers from the data itself. Because identifiers depend on the data, groups that independently generate the same variation will generate the same computed identifier for that entity, thereby obviating centralized identifier systems and enabling identifiers to be used in isolated settings such as clinical labs.

The computed identifier mechanism is broadly applicable and useful to the entire GA4GH ecosystem. Adopting a common identifier scheme will make interoperability of GA4GH entities more obvious to consumers, will enable the entire organization to share common entity definitions (such as sequence identifiers), and will enable all GA4GH products to share tooling that manipulates identified data. In short, it provides an important consistency within the GA4GH ecosystem.

As a result, we are proposing that the computed identifier scheme described in VRS be considered for adoption as a GA4GH-wide standard. If the proposal is accepted by the GA4GH executive committee, the current VRS proposal will stand as-is; if the proposal is rejected, the VRS proposal will be modified to rescope the computed identifier mechanism to VRS and place it under the administration of the VR team.

Proposal

The following algorithmic processes, described in depth in the VRS Computed Identifiers proposal, are included in this proposal by reference:

  • GA4GH Digest Serialization is the process of converting an object to a canonical binary form based on JSON and inspired by similar (but unratified) JSON standards. This serialization form is used only for the purposes of computing a digest.

  • GA4GH Truncated Digest is a convention for using SHA-512, truncated to 24 bytes, and encoding using base64url.

  • GA4GH Identification is the CURIE-based syntax for constructing a namespaced and typed identifier for an object.
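
The first two processes can be sketched with the Python standard library. This is a simplified illustration, not the normative algorithm: the `serialize` function below only sorts keys and strips whitespace, whereas the full Digest Serialization is defined in the Computed Identifiers section, and the function names are chosen for this example.

```python
import base64
import hashlib
import json


def serialize(obj: dict) -> bytes:
    """Simplified canonical form: sorted keys, no extra whitespace, UTF-8."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, then base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


# Two groups build the same (hypothetical) object with different key order.
group_a = {"type": "Example", "start": 1, "end": 2}
group_b = {"end": 2, "start": 1, "type": "Example"}

digest_a = sha512t24u(serialize(group_a))
digest_b = sha512t24u(serialize(group_b))
print(digest_a == digest_b)  # → True
print(len(digest_a))         # → 32
print(sha512t24u(b""))       # → z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc
```

Because 24 bytes is a multiple of 3, the base64url encoding is always 32 characters long and needs no padding.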

Type Prefixes

A GA4GH identifier is proposed to be constructed according to this syntax:

"ga4gh" ":" type_prefix "." digest

The digest is computed as described above. The type_prefix is a short alphanumeric code that corresponds to the type of object being represented. If this proposal is accepted, this “type prefix map” would be administered by GA4GH. (Currently, this map is maintained in a YAML file within the VRS repository, but it would be relocated on approval of this proposal.)
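
As a rough illustration of this syntax, the sketch below assembles and parses identifiers with a regular expression. The helper names and the regular expression are assumptions made for this example, and the prefix map is only a small illustrative subset covering prefixes that appear in this document's examples, not the maintained map.

```python
import re
from typing import Tuple

# Illustrative subset of a type prefix map (not the maintained YAML file).
TYPE_PREFIXES = {"VA": "Allele", "SQ": "Sequence"}

# A 24-byte digest encodes to exactly 32 base64url characters.
GA4GH_ID_RE = re.compile(
    r"^ga4gh:(?P<type_prefix>[A-Za-z0-9]+)\.(?P<digest>[A-Za-z0-9_-]{32})$"
)


def make_identifier(type_prefix: str, digest: str) -> str:
    """Assemble an identifier from its type prefix and digest."""
    return f"ga4gh:{type_prefix}.{digest}"


def parse_identifier(identifier: str) -> Tuple[str, str]:
    """Split an identifier into (type_prefix, digest); raise if malformed."""
    match = GA4GH_ID_RE.match(identifier)
    if match is None:
        raise ValueError(f"not a GA4GH computed identifier: {identifier!r}")
    return match.group("type_prefix"), match.group("digest")


allele_id = "ga4gh:VA.UUvQpMYU5x8XXBS-RhBhmipTWe2AALzj"
prefix, digest = parse_identifier(allele_id)
print(prefix, TYPE_PREFIXES[prefix], digest)

# Round trip: rebuilding the identifier gives back the original string.
rebuilt = make_identifier(prefix, digest)
print(rebuilt == allele_id)
```

Keeping the type in the identifier lets a consumer decide how to handle an object before dereferencing it.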

We propose the following guidelines for type prefixes:

  • Prefixes SHOULD be short, approximately 2-4 characters.

  • Prefixes SHOULD be for concrete types, not polymorphic parent classes.

  • A prefix MUST map 1:1 with a schema type.

  • Variation Representation types SHOULD start with V.

  • Variation Annotation types SHOULD start with A.

Administration

If accepted, administration of these guidelines should be transferred to a technical steering committee. If not accepted, the VR team will assume administration of the existing prefixes.

Truncated Digest Timing and Collision Analysis

The GA4GH Digest uses a truncated SHA-512 digest in order to generate a unique identifier based on data that defines the object. This notebook discusses the choice of SHA-512 over other digest methods and the choice of truncation length.

Note

Please see this Jupyter notebook in Python SeqRepo library for code and updates. A fuller explanation is given in [Hart2020].

Conclusions

  • The computational time for SHA-512 is similar to that of other digest methods. Given that it is believed to distribute input bits more uniformly with no increased computational cost, it should be preferred for our use (and likely most uses).

  • 24 bytes (192 bits) of digest is ample for VRS uses. Arguably, we could choose much smaller without significant risk of collision.

import hashlib
import math
import timeit

from IPython.display import display, Markdown

from ga4gh.vrs.extras.utils import _format_time

algorithms = {'sha512', 'sha1', 'sha256', 'md5', 'sha224', 'sha384'}

Digest Timing

This section provides a rationale for the selection of SHA-512 as the basis for the Truncated Digest.

def blob(l):
    """return binary blob of length l (POSIX only)"""
    with open("/dev/urandom", "rb") as f:
        return f.read(l)

def digest(alg, blob):
    """return the raw digest of blob using the named hashlib algorithm"""
    md = hashlib.new(alg)
    md.update(blob)
    return md.digest()

def magic_run1(alg, blob):
    # %timeit is an IPython magic, so this helper works only inside a notebook
    t = %timeit -o digest(alg, blob)
    return t

def magic_tfmt(t):
    """format TimeitResult for table"""
    return "{a} ± {s} ([{b}, {w}])".format(
        a = _format_time(t.average),
        s = _format_time(t.stdev),
        b = _format_time(t.best),
        w = _format_time(t.worst),
    )

blob_lengths = [100, 1000, 10000, 100000, 1000000]
blobs = [blob(l) for l in blob_lengths]
table_rows = []
table_rows += [["algorithm"] + list(map(str,blob_lengths))]
table_rows += [["-"] * len(table_rows[0])]
for alg in sorted(algorithms):
    r = [alg]
    for data in blobs:
        # Total time for 1000 digests of this blob. Passing a callable avoids
        # importing from __main__ and avoids rebinding the name blob().
        t = timeit.timeit(stmt=lambda: digest(alg, data), number=1000)
        r += [_format_time(t)]
    table_rows += [r]
table = "\n".join(["|".join(map(str,row)) for row in table_rows])
display(Markdown(table))

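Total time for 1000 digest computations, by input length in bytes:

| algorithm | 100 | 1000 | 10000 | 100000 | 1000000 |
|---|---|---|---|---|---|
| md5 | 1.02 ms | 2.51 ms | 23.4 ms | 145 ms | 1.44 s |
| sha1 | 1.02 ms | 1.91 ms | 11.3 ms | 101 ms | 1 s |
| sha224 | 1.21 ms | 3.16 ms | 23.1 ms | 224 ms | 2.2 s |
| sha256 | 1.18 ms | 3.29 ms | 23.3 ms | 223 ms | 2.2 s |
| sha384 | 1.17 ms | 2.54 ms | 16 ms | 150 ms | 1.47 s |
| sha512 | 1.2 ms | 2.55 ms | 16.1 ms | 148 ms | 1.47 s |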

Conclusion: SHA-512 computational time is comparable to that of other digest methods.

This result was not expected initially. On further research, there is a clear explanation: within the SHA-2 series of digests (SHA-224, SHA-256, SHA-384, and SHA-512), SHA-384 and SHA-512 are defined using 64-bit operations, whereas SHA-224 and SHA-256 use 32-bit operations. When an implementation is optimized for 64-bit systems (as used for these timings), the number of cycles is essentially halved when compared to 32-bit systems and digests that use 32-bit operations. The 64-bit SHA-2 digests are indeed much slower than SHA-1 and MD5 on 32-bit systems, but such legacy platforms are not relevant to the Truncated Digest.

Collision Analysis

Our question: For a hash function that generates digests of length b (bits) and a corpus of m messages, what is the probability p that there exists at least one collision? This is the so-called Birthday Problem [6].

Because analyzing digest collision probabilities typically involves choices of mathematical approximations, multiple “answers” appear online. This section provides a quick review of prior work and extends these discussions by focusing on the choice of digest length for a desired collision probability and corpus size.

Throughout the following, we’ll use these variables:

  • \(P\) = Probability of collision

  • \(P'\) = Probability of no collision

  • \(b\) = digest size, in bits

  • \(s\) = digest space size, \(s = 2^b\)

  • \(m\) = number of messages in corpus

The length of individual messages is irrelevant.

Background: The Birthday Problem

Directly computing the probability of one or more collisions, \(P\), in a corpus is difficult. Instead, we first seek to solve for \(P'\), the probability that a collision does not exist (i.e., that the digests are unique). Because there are only two outcomes, \(P + P' = 1\) or, equivalently, \(P = 1 - P'\).

For a corpus of size \(m=1\), the probability that the digests of all \(m=1\) messages are unique is (trivially) 1:

\[P' = s/s = 1\]

because there are \(s\) ways to choose the first digest from among \(s\) possible values without a collision.

For a corpus of size \(m=2\), the probability that the digests of all \(m=2\) messages are unique is:

\[P' = 1 \times (\frac{s-1}{s})\]

because there are \(s-1\) ways to choose the second digest from among \(s\) possible values without a collision.

Continuing this logic, we have:

\[P' = \prod\nolimits_{i=0}^{m-1} \frac{(s-i)}{s}\]

or, equivalently,

\[P' = \frac{s!}{s^m \cdot (s-m)!}\]

When the size of the corpus becomes greater than the size of the digest space, the probability of uniqueness is zero by the pigeonhole principle. Formally, the above equation becomes:

\[\begin{split}P' = \left\{ \begin{array}{ll} 1 & \text{if }m = 0 \\ \prod\nolimits_{i=0}^{m-1} \frac{(s-i)}{s} & \text{if }1 \le m\le s\\ 0 & \text{if }m \gt s \end{array} \right.\end{split}\]

For the remainder of this section, we’ll focus on the case where \(1 \le m \ll s\). In addition, notice that the brute force computation is not feasible in practice because \(m\) and \(s\) will be very large (both \(\gg 2^9\)).
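
For small values the exact product can still be evaluated directly, which is a useful sanity check on the formula. The sketch below uses the classic birthday setting (\(s = 365\), \(m = 23\)) rather than realistic digest sizes; the function name is chosen for this example.

```python
def p_unique_exact(s: int, m: int) -> float:
    """Exact probability that m digests drawn from s values are all unique."""
    if m > s:
        return 0.0  # pigeonhole principle: a collision is certain
    p = 1.0
    for i in range(m):  # i = 0 .. m-1; an empty product (m = 0) gives 1
        p *= (s - i) / s
    return p


p_collision = 1.0 - p_unique_exact(365, 23)
print(f"{p_collision:.4f}")  # → 0.5073
```

The loop takes \(m\) steps, which is exactly why this direct approach does not scale to the corpus sizes considered below.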

Approximation #1: Taylor approximation of terms of P’

The Taylor series expansion of the exponential function is

\[e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + ...\]

For \(|x| \ll 1\), the expansion is dominated by the first terms and therefore \(e^x \approx 1 + x\).

In the above expression for \(P'\), note that the product term \((s-i)/s\) is equivalent to \(1-i/s\). Combining this with the Taylor expansion, where \(x = -i/s\) (⇒ \(m \ll s\)):

\[\begin{split}\begin{split} P' & \approx \prod\nolimits_{i=0}^{m-1} e^{-i/s} \\ & = e^{-m(m-1)/2s} \end{split}\end{split}\]

(The latter equivalence comes from converting the product of exponents to a single exponent of a summation of \(-i/s\) terms, factoring out \(1/s\), and using the series sum equivalence \(\sum\nolimits_{j=0}^{n} j = n(n+1)/2\) for \(n\ge0\).)

Approximation #2: Taylor approximation of P’

The above result for \(P'\) is also amenable to Taylor approximation. Setting \(x = -m(m-1)/2s\), we continue from the previous derivation:

\[\begin{split}\begin{split} P' & \approx e^{-m(m-1)/2s} \\ & \approx 1 + \frac{-m(m-1)}{2s} \end{split}\end{split}\]

Approximation #3: Square approximation

For large \(m\), we can approximate \(m(m-1)\) as \(m^2\) to yield

\[P' \approx 1-m^2/2s\]
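
A quick numerical comparison shows how close the three approximations are to the exact value when \(m \ll s\). The sizes below (\(s = 2^{32}\), \(m = 1000\)) are toy values chosen so that the exact loop runs quickly; the function names are assumptions for this sketch.

```python
import math


def p_collision_exact(s: int, m: int) -> float:
    """Exact: 1 - prod_{i=0}^{m-1} (s - i) / s."""
    p_unique = 1.0
    for i in range(m):
        p_unique *= (s - i) / s
    return 1.0 - p_unique


def p_collision_exp(s: int, m: int) -> float:
    """Approximation #1: P' ~ exp(-m(m-1)/2s), so P = 1 - P'."""
    return -math.expm1(-m * (m - 1) / (2 * s))


def p_collision_linear(s: int, m: int) -> float:
    """Approximation #2: P ~ m(m-1)/2s."""
    return m * (m - 1) / (2 * s)


def p_collision_square(s: int, m: int) -> float:
    """Approximation #3: P ~ m^2/2s."""
    return m * m / (2 * s)


s, m = 2**32, 1000
for fn in (p_collision_exact, p_collision_exp, p_collision_linear, p_collision_square):
    print(f"{fn.__name__:20s} {fn(s, m):.6e}")
```

With these sizes the square approximation differs from the others by a relative amount of roughly \(1/m\), which becomes negligible as \(m\) grows.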

Summary of equations

We may now summarize equations to approximate the probability of digest collisions.

Summary of Equations

| Method | Probability of uniqueness (\(P'\)) | Probability of collision (\(P=1-P'\)) | Assumptions | Source/Comparison |
|---|---|---|---|---|
| exact | \(\prod\nolimits_{i=0}^{m-1} \frac{(s-i)}{s}\) | \(1-P'\) | \(1 \le m\le s\) | [1] |
| Taylor approximation on #1 | \(e^{-m(m-1)/2s}\) | \(1-P'\) | \(m \ll s\) | [1] |
| Taylor approximation on #2 | \(1 - \frac{m(m-1)}{2s}\) | \(\frac{m(m-1)}{2s}\) | (same) | [1] |
| Large square approximation | \(1 - \frac{m^2}{2s}\) | \(\frac{m^2}{2s}\) | (same) | [2] (where \(s=2^n\)) |

Choosing a digest size

Now, we turn the problem around:

What digest length \(b\) is required to achieve a collision probability less than \(P\) for \(m\) messages?

From the above summary, we have \(P = m^2 / 2s\) for \(m \ll s\). Rewriting with \(s=2^b\), we have the probability of a collision using \(b\) bits with \(m\) messages (sequences) is:

\[P(b, m) = m^2 / 2^{b+1}\]

Note that the collision probability depends on the number of messages, but not their size.
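
As a worked example, the sketch below evaluates \(P(b, m)\) for the 24-byte (192-bit) truncation used by the GA4GH Truncated Digest across several corpus sizes. The function name and the choice of corpus sizes are assumptions for illustration.

```python
def collision_probability(b: int, m: float) -> float:
    """P(b, m) = m^2 / 2^(b+1); valid only when m << 2^b."""
    return m**2 / 2 ** (b + 1)


bits = 24 * 8  # 24-byte truncated digest
for m in (1e9, 1e12, 1e15, 1e18):
    print(f"m = {m:g}: P ~ {collision_probability(bits, m):.2e}")
```

Even for \(10^{18}\) messages the estimated collision probability stays below \(10^{-21}\), in line with the conclusion that 24 bytes is ample.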

Solving for the minimum number of bits \(b\) as a function of an expected number of sequences \(m\) and a desired tolerance for collisions of \(P\):

\[b(m, P) = \log_2{\left(\frac{m^2}{P}\right)} - 1\]

This equation is derived from equations that assume that \(m \ll s\), where \(s = 2^b\). When computing \(b(m,P)\), we’ll therefore require that \(m/s \le 2^{-5}\) (about \(3 \times 10^{-2}\)), which bounds \(b\) from below as follows:

\[m/2^b \le 2^{-5}\]
\[m \le 2^{b-5}\]
\[\log_2 m \le b-5\]
\[b \ge 5 + \log_2 m\]

For completeness:

Solving for the number of messages:

\[m(b, P) = \sqrt{P * 2^{b+1}}\]

This equation is not used further in this analysis.

def b2B3(b):
    """Convert bits b to bytes, rounded up to the next multiple of 3

    We round to a multiple of 3 because the intent will be to use Base64 encoding, which is
    most efficient when the input byte length is a multiple of 3. (Otherwise, the resulting
    string is padded with characters that provide no information.)

    """
    return math.ceil(b/8/3) * 3

def B(P, m):
    """return the number of bits needed to achieve a collision probability
    P for m messages

    Assumes m << 2^b.

    """
    b = math.log2(m**2 / P) - 1
    if b < 5 + math.log2(m):
        return "-"
    return b2B3(b)
m_bins = [1E6, 1E9, 1E12, 1E15, 1E18, 1E21, 1E24, 1E30]
P_bins = [1E-30, 1E-27, 1E-24, 1E-21, 1E-18, 1E-15, 1E-12, 1E-9, 1E-6, 1E-3, 0.5]
table_rows = []
table_rows += [["#m"] + ["P<={P}".format(P=P) for P in P_bins]]
table_rows += [["-"] * len(table_rows[0])]
for n_m in m_bins:
    table_rows += [["{:g}".format(n_m)] + [B(P, n_m) for P in P_bins]]
table = "\n".join(["|".join(map(str,row)) for row in table_rows])
table_header = "### digest length (bytes) required for expected collision probability $P$ over $m$ messages \n"
display(Markdown(table_header +  table))

digest length (bytes) required for expected collision probability \(P\) over \(m\) messages

| #m | P<=1e-30 | P<=1e-27 | P<=1e-24 | P<=1e-21 | P<=1e-18 | P<=1e-15 | P<=1e-12 | P<=1e-09 | P<=1e-06 | P<=0.001 | P<=0.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1e+06 | 18 | 18 | 15 | 15 | 15 | 12 | 12 | 9 | 9 | 9 | 6 |
| 1e+09 | 21 | 21 | 18 | 18 | 15 | 15 | 15 | 12 | 12 | 9 | 9 |
| 1e+12 | 24 | 24 | 21 | 21 | 18 | 18 | 15 | 15 | 15 | 12 | 12 |
| 1e+15 | 27 | 24 | 24 | 24 | 21 | 21 | 18 | 18 | 15 | 15 | 15 |
| 1e+18 | 30 | 27 | 27 | 24 | 24 | 24 | 21 | 21 | 18 | 18 | 15 |
| 1e+21 | 30 | 30 | 30 | 27 | 27 | 24 | 24 | 24 | 21 | 21 | 18 |
| 1e+24 | 33 | 33 | 30 | 30 | 30 | 27 | 27 | 24 | 24 | 24 | 21 |
| 1e+30 | 39 | 39 | 36 | 36 | 33 | 33 | 30 | 30 | 30 | 27 | 27 |

Frequently (Asked and) Answered Questions

How can I learn more about VRS? How can I get involved?

See Getting Involved.

Why does VRS …? Why did you use interresidue coordinates? Are they the same as 0-based coordinates? Why aren’t sequences typed?

The first stop for these questions is Design Decisions.

How does VRS handle strandedness?

It doesn’t. VRS presumes that all locations are with respect to the positive/forward/Watson strand.

How do you deal with variation that needs to hold large amounts of data?

VRS models are minimal, meaning that they contain only the minimum information required to represent the instance. They do not contain related information or annotations of any sort. If an instance entails the insertion of a large arbitrary sequence, then the object will be large. Computed identifiers are fixed length and independent of the size of an object.

How do you handle variant representations and annotation across multiple transcripts and reference builds?

VRS does not currently structure any of the many notions of variant equivalence, although prototypes have been written. As of mid-2021, readers are advised to consult VRSATILE.

How do you represent genotypes, especially for mosaicism and somatic variants (multi-ploidy)? What existing tools can help bridge single-location variants and genotypes with VRS?

VRS does not currently represent genotypes or mosaicism. Genotypes are expected in version 1.3 and will include support for mosaicism and chimerism. VRS may currently be used to represent somatic variation; no specialized support is required.

How do you represent different types of variation in a unified way (e.g. gene fusions)?

VRS does not currently represent structural variation such as fusions or translocations. Both are expected in version 1.3.

How do you communicate the uncertainty about variants meaningfully to other providers?

VRS represents variation only. All annotations about variation are left to other systems.

What makes it special/different/better than SPDI, VCF, and others?

See Relationship of VRS to existing standards.

Glossary

computed identifier

An identifier that is generated from the object’s data. Multiple groups that generate computed identifiers the same way will generate the same identifier for the same underlying data.

digest, ga4gh_digest

A digest is a digital fingerprint of a block of binary data. A digest is always the same size, regardless of the size of the input data. It is statistically extremely unlikely for two fingerprints to match when the underlying data are distinct.

identifiable object

An identifiable object in VRS is any data structure for which VRS defines a serialization method, which is the precursor to generating a computational digest. All Sequence, Location, and Variation types are identifiable.

serialization

The process of converting an object in memory into a stream of bytes that may be sent via the network, saved in a database, or written to a file.