Data preparation
Michigan Imputation Server 2 accepts VCF files compressed with bgzip. Please ensure that the following requirements are met:
- Create a separate vcf.gz file for each chromosome.
- Variants must be sorted by genomic position.
- GRCh37 or GRCh38 coordinates are required.
Note
Multiple *.vcf.gz files can be uploaded at once.
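The sort-order requirement can be checked locally before uploading. The following is a minimal sketch, not part of the server's tooling: `vcf_is_sorted` is a hypothetical helper name, and it relies only on `gzip` and `awk` (bgzip output can be read by `gzip -dc`).

```shell
# Hypothetical helper: succeeds only if positions never decrease within a
# chromosome in a gzip- or bgzip-compressed VCF file.
vcf_is_sorted() {
  gzip -dc "$1" | awk '
    /^#/ { next }                                  # skip header lines
    $1 == chrom && $2 + 0 < pos { bad = 1; exit }  # position went backwards
    { chrom = $1; pos = $2 + 0 }                   # remember the last record
    END { exit bad + 0 }                           # 0 = sorted, 1 = not sorted
  '
}
```

For example, `vcf_is_sorted study_chr1.vcf.gz && echo "sorted"` prints `sorted` only when the check passes; `study_chr1.vcf.gz` is an assumed file name.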
Quality Control for HRC, 1000G and CAAPA imputation
Will Rayner provides an excellent toolbox for preparing data: HRC or 1000G Pre-imputation Checks.
The main steps for using HRC are:
Download tool and sites
wget http://www.well.ox.ac.uk/~wrayner/tools/HRC-1000G-check-bim-v4.2.7.zip
wget ftp://ngs.sanger.ac.uk/production/hrc/HRC.r1-1/HRC.r1-1.GRCh37.wgs.mac5.sites.tab.gz
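Both downloads are compressed, while the script invocation further down refers to the uncompressed `HRC.r1-1.GRCh37.wgs.mac5.sites.tab`. A minimal sketch for unpacking them, assuming both files are in the current directory (`unpack_hrc_downloads` is a hypothetical helper name):

```shell
# Hypothetical helper: unpack both downloads in the current directory.
unpack_hrc_downloads() {
  # Extract the checking script from the tool archive (-o overwrites old copies).
  unzip -o HRC-1000G-check-bim-v4.2.7.zip || return 1
  # Decompress the site list so that the plain .tab file is available.
  gunzip -f HRC.r1-1.GRCh37.wgs.mac5.sites.tab.gz
}
```

Calling `unpack_hrc_downloads` once after the two `wget` commands should be enough.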
Convert ped/map to bed
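A minimal sketch of this step, assuming PLINK 1.9 is installed as `plink` and that the ped/map pair shares one file prefix. The wrapper name `ped_to_bed` and the prefixes in the example are hypothetical.

```shell
# Hypothetical wrapper around PLINK 1.9.
# $1: prefix of the .ped/.map pair, $2: prefix for the .bed/.bim/.fam output
ped_to_bed() {
  plink --file "$1" --make-bed --out "$2"
}
```

For example, `ped_to_bed study study` reads `study.ped` and `study.map` and writes `study.bed`, `study.bim` and `study.fam`.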
Create a frequency file
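A minimal sketch of this step, assuming PLINK 1.9 is installed as `plink` and a binary fileset (.bed/.bim/.fam) already exists. The wrapper name `make_freq_file` and the prefixes in the example are hypothetical.

```shell
# Hypothetical wrapper around PLINK 1.9.
# $1: prefix of the .bed/.bim/.fam fileset, $2: prefix for the frequency output
make_freq_file() {
  plink --bfile "$1" --freq --out "$2"
}
```

For example, `make_freq_file study study_freq` writes `study_freq.frq`, which is the file to pass to the checking script with `-f`.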
Execute script
perl HRC-1000G-check-bim.pl -b <bim file> -f <freq-file> -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h
sh Run-plink.sh
Create vcf using VcfCooker
vcfCooker --in-bfile <bim file> --ref <reference.fasta> --out <output-vcf> --write-vcf
bgzip <output-vcf>
Additional Tools
Convert ped/map files to VCF files
Several tools are available: plink2, BCFtools or VcfCooker.
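As one possible route, the conversion can be sketched with PLINK 1.9, assuming it is installed as `plink`. The wrapper name `ped_map_to_vcf` and the `study_chr1` prefix are hypothetical.

```shell
# Hypothetical wrapper around PLINK 1.9.
# $1: shared prefix of the .ped/.map pair; the VCF is written to $1.vcf
ped_map_to_vcf() {
  plink --ped "$1.ped" --map "$1.map" --recode vcf --out "$1"
}
```

For example, `ped_map_to_vcf study_chr1` reads `study_chr1.ped` and `study_chr1.map` and writes an uncompressed `study_chr1.vcf`.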
Create a sorted vcf.gz file using BCFtools:
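A minimal sketch of this step, assuming `bcftools` is on the `PATH`. The wrapper name `sort_and_compress_vcf` and the file names in the example are hypothetical.

```shell
# Hypothetical wrapper around bcftools.
# $1: input VCF, $2: output file name (should end in .vcf.gz)
sort_and_compress_vcf() {
  # -Oz writes bgzip-compressed VCF; -o sets the output file.
  bcftools sort -Oz -o "$2" "$1"
}
```

For example, `sort_and_compress_vcf study_chr1.vcf study_chr1.vcf.gz` sorts the records by position and writes a bgzip-compressed file in one pass.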
CheckVCF
Use checkVCF to ensure that the VCF files are valid. CheckVCF provides “Action Items” (e.g., uploading to an SFTP server) that can be ignored. Focus solely on verifying the validity of the files with this tool.
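A possible invocation is sketched below. It assumes `checkVCF.py` sits in the current directory, that a suitable Python interpreter is available as `python`, and that the `-r` (reference FASTA) and `-o` (output prefix) options behave as described in the checkVCF README; confirm them there before relying on this. The wrapper name `run_checkvcf` and the file names in the example are hypothetical.

```shell
# Hypothetical wrapper; the options are assumed from the checkVCF README.
# $1: reference FASTA, $2: prefix for the report files, $3: VCF file to check
run_checkvcf() {
  python checkVCF.py -r "$1" -o "$2" "$3"
}
```

For example, `run_checkvcf hs37d5.fa out study_chr1.vcf.gz` would check one per-chromosome file against a GRCh37 reference.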