Data

This tutorial uses simulated genotype and phenotype data modeled on the 1000 Genomes Project Human Omni 2.5 array.
The dataset represents a subset of approximately 2,500 samples and 2 million SNPs, focusing on chromosome 20. It is designed to illustrate GWAS, imputation, and polygenic score (PGS) workflows with inputs that are simulated but follow the structure of real genotype arrays.

Imputation Workshop Slides (ASHG 2025)

Filename: Imputation Workshop ASHG 2025 (19 MB)

Chip Array Data

Filename: gwas.array.hapmap.chr20.vcf.gz (20.4 MB)
Description: Simulated genotype array data for chromosome 20 (23,907 variants).
Format: VCF (Variant Call Format), compressed with gzip.
Use: Serves as the input array dataset for genotype imputation on the Michigan Imputation Server.
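
Before uploading the array file, it can be useful to confirm that it holds the expected number of variants. The following is a minimal Python sketch, assuming a standard VCF layout in which header lines start with "#". The function names and the small demo records are hypothetical and are not taken from the tutorial data.

```python
import gzip
from typing import Iterable


def count_variants(lines: Iterable[str]) -> int:
    """Count VCF data records: non-empty lines that do not start with '#'."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


def count_variants_in_file(path: str) -> int:
    """Count records in a gzip-compressed VCF file."""
    # "rt" opens the gzip stream in text mode so lines are str, not bytes
    with gzip.open(path, "rt") as handle:
        return count_variants(handle)


# Made-up records for illustration only
demo = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n",
    "20\t61098\trs1\tC\tT\t.\tPASS\t.\tGT\t0/1\n",
    "20\t61795\trs2\tG\tT\t.\tPASS\t.\tGT\t0/0\n",
]
print(count_variants(demo))  # 2
```

With the downloaded file in the working directory, count_variants_in_file("gwas.array.hapmap.chr20.vcf.gz") should report the variant count given in the description above.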

Imputed Data

Filename: gwas.imputed.chr20.dose.vcf.gz (774.5 MB)
Description: Genotype data for chromosome 20, imputed on the Michigan Imputation Server using the HapMap reference panel. Imputation yielded a total of 63,402 variants.
Format: VCF with dosage values (0–2) for each SNP, compressed with gzip.
Use: Used for downstream association tests and polygenic score calculations.
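
To show how dosage values can be read from such a file, here is a minimal Python sketch. It assumes the dosage is stored under the FORMAT key DS, which is a common convention for imputed VCFs but should be checked against the header of the actual file. The function name and the demo record are hypothetical.

```python
from typing import List


def parse_dosages(vcf_line: str) -> List[float]:
    """Return the DS (dosage) value of every sample in one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    # Column 9 (index 8) names the per-sample keys, e.g. "GT:DS"
    format_keys = fields[8].split(":")
    ds_index = format_keys.index("DS")  # raises ValueError if DS is absent
    # Sample columns start at index 9
    return [float(sample.split(":")[ds_index]) for sample in fields[9:]]


# Made-up record with three samples, for illustration only
demo_line = "20\t61098\trs1\tC\tT\t.\tPASS\t.\tGT:DS\t0|0:0.012\t0|1:1.004\t1|1:1.998\n"
print(parse_dosages(demo_line))  # [0.012, 1.004, 1.998]
```

Each value lies between 0 and 2 and estimates the number of alternate alleles a sample carries at that variant.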

Phenotype Data

Filename: phenotypes.txt (131.6 KB)
Description: Simulated phenotype file containing four traits.
Format: Plain text (tab-delimited).
Columns:
- sample_id — individual sample identifiers
- pheno_1, pheno_2, pheno_3, pheno_4 — four simulated phenotypes
Use: Provides phenotype data for GWAS and PGS evaluation.

Phenotype	Type	Highly Genetic?
pheno_1	Binary (e.g., case/control)	✅ Yes
pheno_2	Continuous (e.g., height)	✅ Yes
pheno_3	Binary or categorical	❌ No
pheno_4	Continuous	❌ No
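
The phenotype file can be loaded with the Python standard library. The sketch below reads a tab-delimited table and reports whether each trait has exactly two distinct values. It assumes missing values are written as an empty string or "NA", which is a guess about the file and not something stated above. The function names and demo rows are hypothetical.

```python
import csv
import io
from typing import Dict, List


def read_phenotypes(handle) -> List[Dict[str, str]]:
    """Read a tab-delimited phenotype table into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def trait_kind(rows: List[Dict[str, str]], column: str) -> str:
    """Label a column 'binary' if it has exactly two distinct non-missing values."""
    # Assumption: missing values appear as "" or "NA"
    distinct = {row[column] for row in rows if row[column] not in ("", "NA")}
    return "binary" if len(distinct) == 2 else "not binary"


# Made-up rows for illustration only
demo_text = "sample_id\tpheno_1\tpheno_2\nS1\t0\t1.52\nS2\t1\t-0.31\nS3\t1\t0.77\n"
rows = read_phenotypes(io.StringIO(demo_text))
for column in ("pheno_1", "pheno_2"):
    print(column, trait_kind(rows, column))
```

For the real file, replace the io.StringIO object with open("phenotypes.txt", newline="").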

Covariates (Principal Components)

Filename: covariates.txt (292.3 KB)
Description: Covariate file containing the first 10 genetic principal components (PCs) for each individual.
Format: Plain text (tab-delimited).
Columns:
- sample — individual identifiers matching the genotype data
- PC1–PC10 — first ten principal components representing population structure
Use: Used as covariates in association analyses to correct for population stratification.
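
Because the phenotype file names its ID column sample_id while the covariate file uses sample, the two tables have to be joined on differently named keys. The following Python sketch shows one way to do that, keeping only individuals present in both tables. The function names and demo rows are hypothetical.

```python
import csv
import io
from typing import Dict, List


def read_rows(handle, delimiter: str = "\t") -> List[Dict[str, str]]:
    """Read a delimited table into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def join_covariates(phenos, covars, pheno_key="sample_id", covar_key="sample"):
    """Attach covariate columns to phenotype rows that share a sample ID."""
    covar_by_id = {row[covar_key]: row for row in covars}
    merged = []
    for row in phenos:
        match = covar_by_id.get(row[pheno_key])
        if match is None:
            continue  # drop samples without covariates
        combined = dict(row)
        # Copy every covariate column except the duplicate ID column
        combined.update({k: v for k, v in match.items() if k != covar_key})
        merged.append(combined)
    return merged


# Made-up rows for illustration only
phenos = read_rows(io.StringIO("sample_id\tpheno_2\nS1\t1.52\nS2\t-0.31\nS3\t0.77\n"))
covars = read_rows(io.StringIO("sample\tPC1\tPC2\nS1\t0.01\t-0.02\nS3\t0.03\t0.04\n"))
merged = join_covariates(phenos, covars)
print([row["sample_id"] for row in merged])  # ['S1', 'S3']
```

The merged rows keep the order of the phenotype table, which makes it easier to line them up with other per-sample files later.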

Polygenic Scores

Filename: scores.txt (234.4 KB)
Description: Simulated polygenic scores (PGS) corresponding to the four phenotypes.
Format: Comma-delimited.
Columns:
- sample — individual identifiers matching the genotype data
- score_1, score_2, score_3, score_4 — simulated PGS values
Use: Demonstrates PGS evaluation and performance comparison across phenotypes.
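
One simple way to evaluate a score against a continuous phenotype is the squared Pearson correlation between the two. The sketch below is a minimal Python illustration of that idea and is not necessarily the evaluation method used in the tutorial. It reads a comma-delimited score table and a tab-delimited phenotype table, matches rows by sample ID, and computes the squared correlation. The function names and demo values are hypothetical.

```python
import csv
import io
import math
from typing import List


def pearson_r(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only
scores_text = "sample,score_2\nS1,1.0\nS2,2.0\nS3,3.0\nS4,4.0\n"
phenos_text = "sample_id\tpheno_2\nS1\t1.0\nS2\t3.0\nS3\t2.0\nS4\t4.0\n"

scores = {r["sample"]: float(r["score_2"])
          for r in csv.DictReader(io.StringIO(scores_text))}
phenos = {r["sample_id"]: float(r["pheno_2"])
          for r in csv.DictReader(io.StringIO(phenos_text), delimiter="\t")}

# Keep only samples present in both tables, in a fixed order
shared = sorted(set(scores) & set(phenos))
r = pearson_r([scores[s] for s in shared], [phenos[s] for s in shared])
print(round(r ** 2, 2))  # 0.64
```

For the binary phenotypes, a classification metric such as the area under the ROC curve is usually a better fit than a correlation.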

Note

All data are simulated for educational purposes. They do not contain any identifiable or real genetic or phenotype information from human participants.