About

This is a free service, hosted by the Department of Statistics, University of Oxford, for phasing high-coverage sequenced human samples. The approach uses large reference panels of haplotypes from the Haplotype Reference Consortium, together with novel statistical methods implemented in the SHAPEIT2 program, to carry out highly accurate phasing.

You can upload VCF files containing called genotypes. Phasing will proceed on a secure server, and once completed, you will be able to download the resulting haplotypes.


Access

This service is free for academic use by researchers and clinicians who have small numbers of research or clinical samples and wish to gain accurate phase information about these samples. At present we cannot provide unrestricted access to the service due to limited resources. Commercial users are free to enquire about using this service.

To gain access, please email Prof Jonathan Marchini, clearly stating the institution you work for and the following details about the samples you wish to phase:

  1. Number of samples
  2. Coverage of sequencing performed
  3. Method of calling genotypes used
  4. Any other relevant information

If approved, your email address will be activated on this site and you will be able to sign up for an account.


Data security

The security of your data is our number one concern. The phasing server is located behind a firewall at the Oxford Data Center and is not linked to any other server.

Once your data have been phased, you will be given a limited time window to download the results, after which all genotype and phased data will be permanently deleted.

All communications between you and the phasing server are encrypted over HTTPS. We maintain the server through a secure shell connection that can only be initiated from IP addresses belonging to the University of Oxford Statistics Department.


Reference panels

Below is a description of the reference panels that can be used to phase uploaded data.

Description

The aim of the Haplotype Reference Consortium (HRC) is to create a large reference panel of human haplotypes by combining together sequencing data from multiple cohorts. This release consists of the haplotypes of 32,488 humans of predominantly European ancestry. More information about the HRC can be found here.

Reference panel information

  Chromosomes          All autosomes
  VCF lines            39,235,157
  Bi-allelic SNPs      39,139,470
  Tri-allelic SNPs     95,423
  Quad-allelic SNPs    132
  Samples              32,488
  Ethnicity            Mostly pan-European + 1000 Genomes Phase 3

Downloads

  File         SHA-256 sum
  site list    56cd17105094983873e8fea1a9a8b1bb7c862b648c0b3f778c987c801e306d09

Description

The aim of the Haplotype Reference Consortium (HRC) is to create a large reference panel of human haplotypes by combining together sequencing data from multiple cohorts. This release consists of the haplotypes of 32,469 humans of predominantly European ancestry. More information about the HRC can be found here.

Reference panel information

  Chromosomes          All autosomes
  VCF lines            40,405,505
  Bi-allelic SNPs      40,176,563
  Tri-allelic SNPs     114,183
  Quad-allelic SNPs    192
  Samples              32,469
  Ethnicity            Mostly pan-European + 1000 Genomes Phase 3

Downloads

  File         SHA-256 sum
  site list    94a8960bd9f661551d62ca12e0f947818dc316cda11665d103d5c703a7290b4c

Description

This is the November 2013 version of the UK 10k reference panel. The UK 10k haplotypes are based on the sequencing data of almost 4000 humans of predominantly British ancestry. This panel consists of the phased haplotypes of 3,755 unrelated individuals. More information about the UK 10k project can be found here.

Reference panel information

  Chromosomes          All autosomes
  VCF lines            42,795,273
  Bi-allelic SNPs      40,057,860
  Bi-allelic Indels    2,737,413
  Samples              3,755
  Ethnicity            Mostly UK

Downloads

  File         SHA-256 sum
  site list    74bea0434c26155e8abc58b53407ca8813a42db48ad603a8b56411d4e7a6e2d1
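
The SHA-256 sums listed for each site list can be used to confirm that a download arrived intact. The following is a minimal sketch, assuming the GNU coreutils program sha256sum is installed; the function name verify_sha256 is ours and is not part of the service.

```bash
# Succeed only if the SHA-256 sum of a file matches the expected value.
#   $1: file to check
#   $2: expected SHA-256 sum in lowercase hexadecimal
verify_sha256() {
    # sha256sum prints "<sum>  <filename>"; keep only the sum.
    actual=$(sha256sum "$1" | cut -d ' ' -f 1) || return 1
    [ "$actual" = "$2" ]
}

# Example call (assumes the site list was saved as var_list.vcf.gz):
#   verify_sha256 var_list.vcf.gz <sum from the Downloads table> \
#       && echo "checksum OK" || echo "checksum MISMATCH"
```

A mismatch usually means the download was truncated or corrupted, in which case the site list should be downloaded again before it is used in any of the checks.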

Using the server

VCF Data preparation

You will need to store your data in a single VCF file before uploading it to our server.

We require the VCF file to:

  1. be a valid VCF
  2. be a block compressed gzip file
  3. be indexable using bcftools
  4. only contain biallelic SNPs or Indels
  5. only contain genotypes
  6. not contain missing genotypes
  7. contain chromosomes that have the same name as the chromosomes found in the reference panel site list
  8. only contain variants that have the same coordinate, REF, and ALT allele as a variant in the reference panel site list

Below, we will explain how to make sure your data conforms to all of these requirements with the help of a tool called bcftools.

Preparing your data using bcftools

Bcftools is an open source tool for working with VCF and BCF files that is maintained primarily by researchers at the Sanger Institute. We will demonstrate how to prepare your data using this tool because it is fast, and flexible enough to perform all the operations that we need. Bcftools depends on htslib. Instructions for installing bcftools can be found on the bcftools website.

If you need a script to check your VCF that "just works", then this bash script might work for you. Your VCF needs to be called geno.vcf.gz and the site list of your target reference panel needs to be called var_list.vcf.gz. Copy those two files into a new directory together with check_vcf.sh and run the command:

  bash ./check_vcf.sh

If all checks pass, then the last line of your output should read:

  ALL CHECKS OK

Breaking down the VCF checks

For all examples we will assume that you are on a Unix-like operating system, are running in a bash environment, have installed bcftools successfully, and have your genotypes stored in the current directory under the name geno.vcf.gz. For some examples you will need the site list of the reference panel you intend to use for phasing. We will assume that you have downloaded that site list and stored it in a file called var_list.vcf.gz.

We recommend you work through the examples in order, as later examples will make assumptions about your file that are checked by earlier examples.

Checking your VCF is valid

The first thing to try is to parse your VCF using bcftools and see if it throws any errors:

  bcftools view geno.vcf.gz >/dev/null

This reads the VCF and redirects the standard output to /dev/null. Any output you still see comes from standard error and therefore indicates a warning or an error.
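
If you want to run this check from a script rather than by eye, the same idea can be wrapped in a small helper. This is only a sketch: the function name quiet_ok is hypothetical, and it treats any text on standard error, including warnings, as a failure.

```bash
# Run a command, discard its standard output, and succeed only if the
# command exits with status 0 and writes nothing to standard error.
quiet_ok() {
    # 2>&1 sends standard error into the capture; the later >/dev/null
    # then discards standard output only.
    if err=$("$@" 2>&1 >/dev/null); then
        status=0
    else
        status=$?
    fi
    if [ "$status" -ne 0 ] || [ -n "$err" ]; then
        printf '%s\n' "$err" >&2   # show what went wrong
        return 1
    fi
    return 0
}

# Example call:
#   quiet_ok bcftools view geno.vcf.gz && echo "VCF parsed cleanly"
```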

Block compressing your VCF

The following commands use bcftools to block compress your VCF, write the result to out.vcf.gz, and then replace the original file with it:

  bcftools view geno.vcf.gz -Oz -o out.vcf.gz
  mv out.vcf.gz geno.vcf.gz
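
Plain gzip files and block compressed (BGZF) files share the .gz extension, so the file name alone does not tell you which one you have. The sketch below inspects the first 16 bytes of a file. It assumes the header layout written by bgzip, in which the gzip FEXTRA flag is set and the extra field starts with the "BC" subfield identifier; the function name is_bgzf is ours.

```bash
# Succeed if a file starts with a BGZF (block compressed gzip) header.
#   $1: file to inspect
is_bgzf() {
    # Dump the first 16 bytes as one string of hexadecimal digits.
    hex=$(od -An -tx1 -N16 "$1" | tr -d ' \n') || return 1
    magic=$(printf '%s' "$hex" | cut -c1-8)     # bytes 0-3
    subid=$(printf '%s' "$hex" | cut -c25-28)   # bytes 12-13
    # 1f 8b = gzip, 08 = deflate, 04 = extra field present,
    # 42 43 = "BC", the BGZF subfield identifier.
    [ "$magic" = "1f8b0804" ] && [ "$subid" = "4243" ]
}

# Example call:
#   is_bgzf geno.vcf.gz && echo "block compressed" || echo "not BGZF"
```

This is a quick header check only. Running bcftools index, as described in the next step, remains the definitive test.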

Making sure your VCF is indexable by bcftools

Indexing your VCF allows bcftools to quickly extract chromosomes and regions of your VCF. Indexing is only possible if you have previously block compressed your VCF. The following command should not throw any errors:

  bcftools index geno.vcf.gz

Keeping only biallelic SNPs and Indels

This command only keeps variants that contain exactly two alleles (one REF and one ALT allele), and are either SNPs or Indels:

  bcftools view -m2 -M2 -v snps,indels geno.vcf.gz -Oz -o out.vcf.gz
  mv out.vcf.gz geno.vcf.gz

Making sure your VCF contains only genotypes

What we mean is that your VCF should only contain the GT FORMAT tag. This command drops all other FORMAT tags:

  bcftools annotate -x FORMAT geno.vcf.gz -Oz -o out.vcf.gz
  mv out.vcf.gz geno.vcf.gz

Making sure your VCF does not contain any missing genotypes

The command below checks every line of your VCF to look for any missing genotypes (something like ./. or .). All lines that have at least one missing genotype are removed. You may want to first remove individuals with high rates of missing genotypes, but that is beyond the scope of these examples.

  bcftools view -g ^miss geno.vcf.gz -Oz -o out.vcf.gz
  mv out.vcf.gz geno.vcf.gz
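
To help decide which individuals to drop before applying the filter above, the per-sample rate of missing genotypes can be tabulated first. The sketch below is one possible way to do this and is not part of the server's own checks. It assumes input lines of the form "sample<TAB>genotype", as produced by bcftools query with the format string shown in the comment; the function name missing_rate_per_sample is ours.

```bash
# Read "sample<TAB>genotype" lines on standard input and print, for each
# sample, the fraction of genotypes that contain a missing allele (".").
missing_rate_per_sample() {
    awk -F'\t' '
        { total[$1]++ }                  # every genotype seen for a sample
        $2 ~ /\./ { miss[$1]++ }         # ./. or . or 0/. and similar
        END {
            for (s in total)
                printf "%s\t%.4f\n", s, miss[s] / total[s]
        }
    ' | sort
}

# Example call (run on the VCF before missing genotypes are removed):
#   bcftools query -f '[%SAMPLE\t%GT\n]' geno.vcf.gz | missing_rate_per_sample
```

Samples with a high rate in the second column are candidates for removal. The threshold to use is a judgment call that depends on your data.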

Making sure your VCF contains the chromosomes in the variant list

  bcftools index geno.vcf.gz
  for i in $(seq 1 22); do
    bcftools view -G -r "$i" geno.vcf.gz >/dev/null
  done

If the code above throws any errors regarding chromosomes, then you might need to rename the chromosomes in your VCF. This can be done by creating a map from your current chromosome names to the numbers 1 through 22 and then using bcftools annotate --rename-chrs.
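
As an illustration of the renaming step, the sketch below writes such a map for the common case where the current names carry a prefix, for example chr1 through chr22. The prefix "chr" and the file name chr_map.txt are assumptions for this example; check the names actually used in your VCF first. Each line of the map holds the old name and the new name separated by whitespace, which is the layout bcftools annotate --rename-chrs reads.

```bash
# Print a chromosome rename map for autosomes 1 to 22.
#   $1: prefix of the current chromosome names, e.g. "chr"
make_chr_map() {
    i=1
    while [ "$i" -le 22 ]; do
        # One "old_name new_name" pair per line.
        printf '%s%s %s\n' "$1" "$i" "$i"
        i=$((i + 1))
    done
}

# Example calls:
#   make_chr_map chr > chr_map.txt
#   bcftools annotate --rename-chrs chr_map.txt geno.vcf.gz -Oz -o out.vcf.gz
#   mv out.vcf.gz geno.vcf.gz
```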

Subsetting your VCF down to the reference panel site list

First we index both files and then use bcftools to intersect the files.

  bcftools index geno.vcf.gz
  bcftools index var_list.vcf.gz
  bcftools isec -n=2 -w1 geno.vcf.gz var_list.vcf.gz -Oz -o out.vcf.gz
  mv out.vcf.gz geno.vcf.gz

Phase informative read data preparation

This server supports the use of phase informative reads (PIRs) for phasing. The extension to the SHAPEIT model used by the server is described in

O. Delaneau, B. Howie, A. Cox, J-F. Zagury, J. Marchini (2013). Haplotype estimation using sequence reads. American Journal of Human Genetics 93 (4) 687-696.

The paper is available at http://dx.doi.org/10.1016/j.ajhg.2013.09.002. Delaneau et al. show that this extended model leads to improvements in the quality of the resulting haplotypes, especially at rare variants. If used, PIRs need to be uploaded as a file along with a VCF. Instructions for creating a PIR file from BAM and VCF files can be found at the SHAPEIT website.

Logging in

To log in, click the "Log in" link at the top right of your screen.

You will need to type in your email address and password.

VCF uploading

After logging in, a link called "Upload file" will appear in the navigation panel at the top right of your screen. Clicking on it will allow you to upload your VCF to the server.

You will need to specify a project name for the VCF, a reference panel to use for phasing, which chromosomes are contained in your VCF, and the location on your computer where the VCF is located. Multiple chromosomes can be selected using the shift key. Optionally, you may upload a PIR file as well. If you do, then your VCF may only contain one chromosome.

After clicking the "Upload file" button, you will be redirected to your home page that now displays a panel representing your uploaded VCF. The VCF is automatically checked for errors.

At any time, you may click on the "view" button to see more details on how the checks are going.

Clicking on the "delete" button deletes the VCF from the server.

Phasing

Once the VCF passes all checks, a panel will appear underneath your VCF that allows you to create a job to phase one of the chromosomes in your VCF. Click the "Phase Chromosome" button to start a job.

Only chromosomes that were selected in the upload screen will be available from the chromosome dropdown menu.

After the phasing job has started, the status of the job will automatically update itself.

Downloading phased haplotypes

Once the status of the job is "completed", you may download the phased haplotypes by first clicking on the "view" link, and then on the "download" button on the job page.

The download is a tar file containing the phased haplotypes in IMPUTE2 format.
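
A downloaded archive can be inspected and unpacked with the standard tar program. The sketch below is illustrative only: the archive name phased.tar and the function name unpack_phased are hypothetical, and the names of the files inside the archive are whatever the server produced.

```bash
# List the contents of a tar archive, then extract it into a directory.
#   $1: downloaded tar archive
#   $2: directory to extract into (created if it does not exist)
unpack_phased() {
    mkdir -p "$2" || return 1
    tar -tf "$1" || return 1    # list the files before extracting
    tar -xf "$1" -C "$2"
}

# Example call:
#   unpack_phased phased.tar phased_haplotypes
```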

Deleting data

Data files and jobs can be deleted by clicking the "delete" link.


Publications

The Oxford Statistics Phasing Server builds upon several methodological developments and reference panels described in the following papers:

  • Kevin Sharp, Olivier Delaneau, Warren Kretzschmar, Jonathan Marchini. Phasing for medical sequencing using rare variants and large haplotype reference panels. Bioinformatics doi:10.1093/bioinformatics/btw065 [Link]
  • The Haplotype Reference Consortium. A reference panel of 64,976 haplotypes for genotype imputation. Nature Genetics (under review) [bioRxiv]
  • The 1000 Genomes Project Consortium (2015) A global reference for human genetic variation. Nature [Link] [Editorial] [News and Views]
  • J. O’Connell, D. Gurdasani, O. Delaneau, et al. (2014) A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genetics doi: 10.1371/journal.pgen.1004234 [Link]
  • O. Delaneau, J. Marchini, The 1000 Genomes Project Consortium (2014) Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel. Nature Communications [Link]
  • O. Delaneau, B. Howie, A. Cox, J-F. Zagury, J. Marchini (2013) Haplotype estimation using sequence reads. American Journal of Human Genetics 93 (4) 687-696 [Link]
  • O. Delaneau, J-F. Zagury and J. Marchini (2013) Improved whole chromosome phasing for disease and population genetic studies. Nature Methods 10, 5-6. [Link] [Supplementary Material] [Software]
  • O. Delaneau, J. Marchini, JF. Zagury (2012) A linear complexity phasing method for thousands of genomes. Nature Methods doi:10.1038/nmeth.1785 [Link] [Software]

Contact

For any questions about this service please email Prof Jonathan Marchini.