Taiwan Biobank Imputation Server

Introduction

Taiwan Biobank Imputation Server provides a free genotype imputation service using Minimac4. You can upload phased or unphased GWAS genotypes and receive phased and imputed genomes in return. The server offers imputation from the HapMap Phase 2 and 1000 Genomes Phase 3 (Version 5) reference panels for all registered users, and from the Taiwan Biobank 1.5k reference panel for users who have completed the Taiwan Biobank application process (detail). An extensive QC is performed on all uploaded datasets.
The complete source code is hosted on GitHub using Travis CI for continuous integration.

Please cite this paper if you use Taiwan Biobank Imputation Server in your publication:

Das S, Forer L, Schönherr S, Sidore C, Locke AE, Kwong A, Vrieze S, Chew EY, Levy S, McGue M, Schlessinger D, Stambolian D, Loh PR, Iacono WG, Swaroop A, Scott LJ, Cucca F, Kronenberg F, Boehnke M, Abecasis GR, Fuchsberger C. Next-generation genotype imputation service and methods. Nature Genetics 48, 1284–1287 (2016).

Getting Started

To use Taiwan Biobank Imputation Server, registration is required. We send an activation email to the provided address. Please follow the instructions in the email to activate your account. If it doesn't arrive, ensure you have entered the correct email address and check your spam folder.

After the email address has been verified, the service can be used free of charge.

Set up your first imputation job

Please login with your credentials and click on the Run tab to start a new imputation job. The submission dialog allows you to specify the properties of your imputation job.

The following options are available:

Reference panel

Our server offers genotype imputation from different reference panels. Please select one that fulfills your needs and supports the population of your input data:

  • 1000 Genomes Phase 3 (1000G Phase 3 v5, detail)
  • HapMap Phase 2 (HapMap 2, detail)
  • Taiwan Biobank 1.5k (TWB hg38 1.5k, for users who have completed the Taiwan Biobank application process (detail))

Upload VCF files from your computer

When using the file upload, data is uploaded from your local file system to Taiwan Biobank Imputation Server. By clicking on Select Files an open dialog appears where you can select your VCF files:

Multiple files can be selected using the ctrl, cmd or shift keys, depending on your operating system. After you have confirmed your choice, all selected files are listed in the submission dialog:

Please make sure that all files fulfill the requirements.

Important

Since version 1.7.2 URL-based uploads (sftp and http) are no longer supported. Please use direct file uploads instead.

Build

Please select the build of your data. Currently the options hg19 and hg38 are supported. Taiwan Biobank Imputation Server automatically updates the genome positions (liftOver) of your data.

rsq Filter

To minimize the file size, Taiwan Biobank Imputation Server includes an r2 filter option, excluding all imputed SNPs with an r2 value (= imputation quality) smaller than the specified value.
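
The same cutoff can also be applied locally after download. A minimal sketch with BCFtools, assuming the imputed files carry the Minimac4 R2 INFO field (file names are illustrative):

# keep only variants with imputation quality r2 > 0.3
bcftools view -i 'INFO/R2>0.3' chr20.dose.vcf.gz -Oz -o chr20.dose.r2filtered.vcf.gz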

Phasing

If your uploaded data is unphased, Eagle v2.4 will be used for phasing. In case your uploaded VCF file already contains phased genotypes, please select the “No phasing” option.

Algorithm: Eagle v2.4
Description: The Eagle algorithm estimates haplotype phase using the selected reference panel. This method is also suitable for single-sample imputation. After phasing or imputation you will receive phased genotypes in your VCF files.

Population

Please select the population of your uploaded samples. This information is used to compare the allele frequencies between your data and the reference panel. Please note that not every reference panel supports all sub-populations.

Population    Supported Reference Panels
AFR           1000G Phase 3 v5
AMR           1000G Phase 3 v5
EUR           1000G Phase 3 v5, HapMap 2
EAS           1000G Phase 3 v5, TWB hg38 1.5k
SAS           1000G Phase 3 v5
Other/Mixed   1000G Phase 3 v5

In case your population is not listed or your samples are from different populations, please select Other/Mixed to skip the allele frequency check. For Other/Mixed populations, no QC-Report will be created.

Mode

Please select if you want to run Quality Control & Imputation, Quality Control & Phasing Only or Quality Control Only.

AES 256 encryption

All Imputation Server results are returned as an encrypted .zip file by default. If you select this option, we will use stronger AES 256 encryption instead of the default encryption method. However, note that AES encryption does not work with standard unzip programs. If this option is selected, we recommend using 7-zip to open your results.
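
For example, an AES 256 encrypted archive can be extracted with 7-zip as follows (archive name and password are placeholders):

# p7zip (Debian: sudo apt-get install p7zip-full)
7z x -p"PASSWORD" chr_22.zip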

Start your imputation job

After confirming our Terms of Service, the imputation process can be started immediately by clicking on Submit Job. Input Validation and Quality Control are executed right away to give you feedback about the data format and its quality. If your data passes these steps, your job is added to our imputation queue and will be processed as soon as possible. You can check the position in the queue on the job summary page.

We notify you by email as soon as the job has finished or if your data does not pass the Quality Control steps.

Input validation

In a first step we check whether your uploaded files are valid, and we calculate basic statistics such as the number of samples, chromosomes, and SNPs.

After Input Validation has finished, basic statistics can be viewed directly in the web interface.
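
The same basic numbers can be checked locally before submission, for example with BCFtools (file name is illustrative):

# print the summary numbers (samples, records, SNPs, indels, ...)
bcftools stats study_chr20.vcf.gz | grep "number of"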

If you encounter problems with your data, please read this tutorial about Data Preparation to ensure your data is in the correct format.

Quality control

In this step we check each variant and exclude it in case of:

  1. invalid alleles
  2. duplicates
  3. indels
  4. monomorphic sites
  5. allele mismatch between reference panel and uploaded data
  6. SNP call rate < 90%

All filtered variants are listed in a file called statistics.txt, which can be downloaded by clicking on the provided link. More information about our QC pipeline can be found here.
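
Several of these rules can be applied locally before upload. A minimal sketch with PLINK 1.9; the thresholds mirror the rules above, and the file names are placeholders:

# drop indels and invalid alleles, monomorphic sites, and SNPs with call rate < 90%
plink --bfile <input> --snps-only just-acgt --mac 1 --geno 0.1 --make-bed --out <filtered>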

If you selected a population, we compare the allele frequencies of the uploaded data with those from the reference panel. The result of this check is available in the QC report and can be downloaded by clicking on qcreport.html.

Pre-phasing and imputation

Imputation is achieved with Minimac4. The progress of all uploaded chromosomes is updated in real time and visualized with different colors.

Data compression and encryption

If imputation was successful, we compress and encrypt your data and send you a random password via email.

This password is not stored on our server at any time. Therefore, if you lose the password, there is no way to resend it to you, and you will need to re-impute your results.

Download results

The user is notified by email, as soon as the imputation job has finished. A zip archive including the results can be downloaded directly from the server. To decrypt the results, a one-time password is generated by the server and included in the email. The QC report and filter statistics can be displayed and downloaded as well.

All data is deleted automatically after 30 days

Be sure to download all needed data within this time period. We send you a reminder 7 days before we delete your data. Once your job has the state retired, we are not able to recover your data!

Download via a web browser

All results can be downloaded directly via your browser by clicking on the filename.

In order to download results via the command line using wget or aria2, you need to click on the share symbol (located to the right of the file size) to get the needed private links.

A new dialog appears which provides you with the private link. Click on the tab wget command to get a copy-and-paste-ready command that can be used on Linux or macOS to download the file via the command line.
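
The command has this general shape; the private link below is a placeholder copied from the dialog:

wget <private-link>
# alternatively, aria2 with multiple connections per server:
aria2c -x 8 <private-link>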

Download all results at once

To download all files of a folder (for example folder Imputation Results) you can click on the share symbol of the folder:

A new dialog appears which provides you with all private links at once. Click on the tab wget commands to get copy-and-paste-ready commands that can be used on Linux or macOS to download all files.

Data Preparation

Taiwan Biobank Imputation Server accepts VCF files compressed with bgzip. Please make sure the following requirements are met:

  • Create a separate vcf.gz file for each chromosome (a BCFtools sketch follows the note below).
  • Variants must be sorted by genomic position.
  • GRCh37 or GRCh38 coordinates are required.
  • If your input data is GRCh37/hg19, please ensure chromosomes are encoded without prefix (e.g. 20).
  • If your input data is GRCh38/hg38, please ensure chromosomes are encoded with prefix ‘chr’ (e.g. chr20).
Note

Several *.vcf.gz files can be uploaded at once.
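
A sketch of how the per-chromosome and chromosome-naming requirements can be met with BCFtools (chromosome 20 shown; chr_map.txt is a two-column mapping file you create yourself, e.g. a line "20 chr20"):

# index, then extract one chromosome as a sorted, bgzipped file
bcftools index -t study.vcf.gz
bcftools view -r 20 study.vcf.gz -Oz -o study_chr20.vcf.gz
# add or remove the 'chr' prefix to match the expected encoding for your build
bcftools annotate --rename-chrs chr_map.txt study_chr20.vcf.gz -Oz -o study_chr20.renamed.vcf.gz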

Quality control for 1000G imputation

Will Rayner provides a great toolbox to prepare data: 1000G Pre-imputation Checks.

The main steps for 1000G are:

Download tool and sites

wget http://www.well.ox.ac.uk/~wrayner/tools/HRC-1000G-check-bim-v4.2.7.zip
wget ftp://ngs.sanger.ac.uk/production/hrc/HRC.r1-1/HRC.r1-1.GRCh37.wgs.mac5.sites.tab.gz

Convert ped/map to bed

plink --file <input-file> --make-bed --out <output-file>

Create a frequency file

plink --freq --bfile <input> --out <freq-file>

Execute script

perl HRC-1000G-check-bim.pl -b <bim file> -f <freq-file> -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h
sh Run-plink.sh

Create vcf using VcfCooker

vcfCooker --in-bfile <bim file> --ref <reference.fasta> --out <output-vcf> --write-vcf
bgzip <output-vcf>

Additional tools

Convert ped/map files to VCF files

Several tools are available: plink2, BCFtools or VcfCooker.

plink --ped study_chr1.ped --map study_chr1.map --recode vcf --out study_chr1

Create a sorted vcf.gz file using BCFtools:

bcftools sort study_chr1.vcf -Oz -o study_chr1.vcf.gz

CheckVCF

Use checkVCF to ensure that the VCF files are valid. checkVCF proposes “Action Items” (e.g. upload to sftp server), which can be ignored. Only the validity should be checked with this command.

checkVCF.py -r human_g1k_v37.fasta -o out mystudy_chr1.vcf.gz

Pipeline Overview

Our pipeline performs the following steps:

Quality control

  • Create chunks with a size of 20 Mb

  • For each 20Mb chunk we perform the following checks:

    On Chunk level:

    • Determine the number of valid variants: a variant is valid iff it is included in the reference panel. At least 3 variants must be included.
    • Determine the number of variants found in the reference panel: at least 50 % of the variants must be included in the reference panel.
    • Determine the sample call rate: at least 50 % of the variants must be called for each sample.

    Chunk exclusion: if (#variants < 3 || overlap < 50% || sampleCallRate < 50%)

    On Variant level:

    • Check alleles: Only A,C,G,T are allowed
    • Calculate the alternative allele frequency (AF): mark all variants with an AF > 0.5.
    • Calculate SNP call rate
    • Calculate chi square for each variant (reference panel vs. study data)
    • Determine allele switches: Compare ref and alt of the reference panel with the study data (A/T and C/G variants are ignored).
    • Determine strand flips: After eliminating possible allele switches, flip and compare ref/alt from the reference panel with the study data.
    • Determine allele switches in combination with strand flips: Combine the two rules from above (a local pre-cleaning sketch follows this list).

    Variant exclusion: Variants are excluded in case of:

    1. invalid alleles (!(A,C,G,T))
    2. duplicates (DUP filter, or the same position as the previous variant)
    3. indels
    4. monomorphic sites
    5. allele mismatch between reference panel and study
    6. SNP call rate < 90%

    On Sample level:

    • For chr1-22, a chunk is excluded if one sample has a call rate < 50 %. Only complete chunks are excluded, not samples (see "On Chunk level" above).
  • Perform a liftOver step if the build of the input data and the reference panel do not match (b37 vs b38).
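
The pipeline detects and handles allele switches and strand flips itself; if you want to pre-clean obvious strand flips locally before upload, one option is the BCFtools fixref plugin. A minimal sketch, assuming a reference FASTA matching your build (file names are placeholders):

# flip non-matching, non-ambiguous SNPs to the reference strand
bcftools +fixref study_chr20.vcf.gz -Oz -o study_chr20.fixref.vcf.gz -- -f human_g1k_v37.fasta -m flip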

Phasing

  • Execute one of the following phasing algorithms for each chunk (we use an overlap of 5 Mb). For example, for chr20:1-20000000 and reference population EUR:

Eagle2

./eagle --vcfRef HRC.r1-1.GRCh37.chr20.shapeit3.mac5.aa.genotypes.bcf \
  --vcfTarget chunk_20_0000000001_0020000000.vcf.gz \
  --geneticMapFile genetic_map_chr20_combined_b37.txt \
  --outPrefix chunk_20_0000000001_0020000000.phased \
  --bpStart 1 \
  --bpEnd 25000000 \
  --allowRefAltSwap \
  --vcfOutFormat z

Please note: Target-only sites for unphased data are not included in the final output.

Imputation

  • Execute minimac for each chunk in order to impute the phased data (we use a window of 500 kb):
./Minimac4 --refHaps HRC.r1-1.GRCh37.chr1.shapeit3.mac5.aa.genotypes.m3vcf.gz \
  --haps chunk_1_0000000001_0020000000.phased.vcf \
  --start 1 \
  --end 20000000 \
  --window 500000 \
  --prefix chunk_1_0000000001_0020000000 \
  --cpus 1 \
  --chr 1 \
  --noPhoneHome \
  --format GT,DS,GP \
  --allTypedSites \
  --meta \
  --minRatio 0.00001

Compression and encryption

  • Merge all chunks of one chromosome into a single vcf.gz (see the sketch below)
  • Encrypt data with one-time password
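
The merge step corresponds to concatenating the per-chunk files in genomic order, for example with BCFtools (chunk file names are illustrative):

# combine all imputed chunks of chromosome 20 into a single file
bcftools concat chunk_20_*.dose.vcf.gz -Oz -o chr20.dose.vcf.gz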

Chromosome X pipeline

In addition to the standard QC, the following per-sample checks are executed for chrX:

  • Ploidy Check: Verifies that all variants in the nonPAR region are either haploid or diploid.
  • Mixed Genotypes Check: Verifies that the amount of mixed genotypes (e.g. 1/.) is < 10 %.

For phasing and imputation, chrX is split into three independent chunks (PAR1, nonPAR, PAR2). These splits are then automatically merged by Taiwan Biobank Imputation Server and returned as one complete chromosome X file (see the region sketch after the coordinate tables below). Only Eagle is supported.

b37 coordinates

Independent chunk   Region
ChrX PAR1           chr X (< 2699520)
ChrX nonPAR         chr X (2699520 - 154931044)
ChrX PAR2           chr X (> 154931044)

b38 coordinates

Independent chunk   Region
ChrX PAR1           chr X (< 2781479)
ChrX nonPAR         chr X (2781479 - 155701383)
ChrX PAR2           chr X (> 155701383)
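
As an illustration only (the server performs this split internally), the b37 boundaries above translate into region queries like the following; file names are placeholders:

bcftools index -t study_chrX.vcf.gz
bcftools view -r X:1-2699519 study_chrX.vcf.gz -Oz -o chrX_PAR1.vcf.gz
bcftools view -r X:2699520-154931044 study_chrX.vcf.gz -Oz -o chrX_nonPAR.vcf.gz
bcftools view -r X:154931045- study_chrX.vcf.gz -Oz -o chrX_PAR2.vcf.gz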

Data Security

Since data is transferred to our server located at the National Center for High-performance Computing, a wide array of security measures is in force:

  • The complete interaction with the server is secured with HTTPS.
  • Input data is deleted from our servers as soon as it is no longer needed.
  • We only store the number of samples and markers analyzed; we don't ever "look" at your data in any way.
  • All results are encrypted with a strong one-time password - thus, only you can read them.
  • After imputation is finished, the data uploader has 7 days to use an encrypted connection to get results back.
  • The complete source code is available via public GitHub repositories.

Who has access?

To upload and download data, users must register with a unique e-mail address and strong password. Each user can only download imputation results for samples that they have themselves uploaded; no other imputation server users will be able to access your data.

Cookies

We value your privacy and are committed to transparency regarding the use of cookies on our website. Below, we outline our cookie policy to provide you with clarity and assurance.

What are cookies?

Cookies are small text files that are placed on your device when you visit a website. They serve various purposes, including enhancing user experience, facilitating website functionality, and analyzing website traffic.

How do we use cookies?

We use cookies only for the purpose of facilitating login functionality. These cookies help us recognize your device and authenticate your access to our platform securely. We do not track any personal information or analyze user activities through cookies.

Why do we use cookies?

Cookies are essential for providing seamless login experiences to our users. By storing authentication information, cookies enable you to access your account efficiently without the need for repetitive login procedures. We respect your privacy and limit cookie usage exclusively to login purposes.

What security or firewalls protect access?

A wide array of security measures is in force on the imputation servers:

  • SSH login to the servers is restricted to only systems administrators.
  • Direct root login via SSH is not allowed from the public Internet.
  • The public-facing side of the servers sits behind the School of Public Health’s Checkpoint virtual firewall instance where a default-deny policy is used on inbound traffic; only explicitly allowed TCP ports are passed.
  • The School of Public Health also makes use of NIDS technologies such as Snort and Peakflow on its network links for traffic analysis and threat detection.
  • On the imputation server itself, updates are run regularly by systems administrators who follow several zero-day computer security announcement lists; the OSSEC HIDS is used for log analysis and anomaly detection; and Denyhosts is used to thwart brute-force SSH login attacks.

What encryption of the data is used while the data are present?

Imputation results are encrypted with a one-time password generated by the system. The password consists of lowercase characters, uppercase characters, special characters, and numbers, with at most 3 duplicates.

FAQ

I did not receive a password for my imputation job

Taiwan Biobank Imputation Server creates a random password for each imputation job. This password is not stored on the server side at any time. If you didn't receive a password, please check your email spam folder. Please note that we are not able to re-send you the password. If you lose it, you will need to re-run your imputation job.

Unzip command is not working

Please check the following points: (1) When selecting AES256 encryption, please use 7z to unzip your files (Debian: sudo apt-get install p7zip-full). For our default encryption, all common programs should work. (2) If your password includes special characters (e.g. \), please put single or double quotes around the password when extracting on the command line (e.g. 7z x -p"PASSWORD" chr_22.zip).

Can I download all results at once?

We provide wget commands for all results. Please open the results tab. The last column in each row includes direct links to all files.