Data
This tutorial uses simulated genotype and phenotype data based on the 1000 Genomes Project Human Omni 2.5 array.
The dataset represents a subset of ~2,500 samples and approximately 2 million SNPs, focusing on chromosome 20. The data are designed to illustrate GWAS, imputation, and polygenic score (PGS) analysis workflows using realistic, but simulated, inputs derived from real genotype array structures.
Imputation Workshop Slides (ASHG 2025)
- Filename: Imputation Workshop ASHG 2025 (19 MB)
Chip Array Data
- Filename: gwas.array.hapmap.chr20.vcf.gz (20.4 MB)
- Description: Simulated genotype array data for chromosome 20 (23,907 variants).
- Format: VCF (Variant Call Format), compressed with gzip.
- Use: Serves as the input array dataset for genotype imputation on the Michigan Imputation Server.
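As a quick sanity check after downloading, the variant count stated above can be compared against the file itself. The following is a minimal sketch that assumes a standard VCF layout (header lines starting with `#`, one variant per remaining line); the function name `count_vcf_variants` is illustrative and not part of the tutorial.

```python
import gzip


def count_vcf_variants(path):
    """Count data records (non-header lines) in a gzip-compressed VCF."""
    n_variants = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Header and metadata lines start with '#'; every other
            # non-empty line is assumed to describe exactly one variant.
            if line.strip() and not line.startswith("#"):
                n_variants += 1
    return n_variants
```

Calling `count_vcf_variants("gwas.array.hapmap.chr20.vcf.gz")` should give a number that matches the description if the download is complete.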
Imputed Data
- Filename: gwas.imputed.chr20.dose.vcf.gz (774.5 MB)
- Description: Imputed genotype data for chromosome 20, generated on the Michigan Imputation Server using the HapMap reference panel. A total of 63,402 variants were obtained after imputation.
- Format: VCF with dosage values (0–2) for each SNP, compressed with gzip.
- Use: Used for downstream association tests and polygenic score calculations.
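To make the dosage format concrete, here is a small sketch of pulling per-sample dosages out of one VCF record. It assumes the dosage is stored under a FORMAT key named `DS`, which is common for imputation-server output but is not stated in the description above; `parse_dosages` is a hypothetical helper name.

```python
def parse_dosages(record_line, key="DS"):
    """Return (variant_id, dosages) for one tab-separated VCF data line.

    Assumes the per-sample dosage lives under the FORMAT key `key`
    (an assumption; check the header of the actual file).
    """
    fields = record_line.rstrip("\n").split("\t")
    variant_id = fields[2]
    format_keys = fields[8].split(":")
    # Raises ValueError if the record carries no dosage field.
    ds_index = format_keys.index(key)

    dosages = []
    for sample_field in fields[9:]:
        values = sample_field.split(":")
        raw = values[ds_index] if ds_index < len(values) else "."
        # '.' is the VCF missing-value marker.
        dosages.append(None if raw == "." else float(raw))
    return variant_id, dosages
```

Each returned value is expected to lie between 0 and 2, the estimated number of copies of the alternate allele.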
Phenotype Data
- Filename: phenotypes.txt (131.6 KB)
- Description: Simulated phenotype file containing four traits.
- Format: Plain text (tab-delimited).
- Columns:
  - sample_id: individual sample identifiers
  - pheno_1, pheno_2, pheno_3, pheno_4: four simulated phenotypes
- Use: Provides phenotype data for GWAS and PGS evaluation.
| Phenotype | Type | Highly Genetic? |
|---|---|---|
| pheno_1 | Binary (e.g., case/control) | ✅ Yes |
| pheno_2 | Continuous (e.g., height) | ✅ Yes |
| pheno_3 | Binary or categorical | ❌ No |
| pheno_4 | Continuous | ❌ No |
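The phenotype file can be read with the standard library alone. The sketch below assumes the file has a header row with the column names listed above and that missing values are written as an empty string or `NA`; both points are assumptions, as is the helper name `load_phenotypes`. Values that are not numeric (possible for a categorical trait such as pheno_3) are kept as text.

```python
import csv

# Assumed missing-value codes; adjust after inspecting the real file.
MISSING = {"", "NA"}


def parse_value(text):
    """Convert a cell to float where possible, None if missing, else keep text."""
    text = text.strip()
    if text in MISSING:
        return None
    try:
        return float(text)
    except ValueError:
        return text


def load_phenotypes(handle, id_column="sample_id"):
    """Read a tab-delimited phenotype table into {sample_id: {trait: value}}."""
    reader = csv.DictReader(handle, delimiter="\t")
    phenotypes = {}
    for row in reader:
        sample_id = row.pop(id_column)
        phenotypes[sample_id] = {name: parse_value(v) for name, v in row.items()}
    return phenotypes
```

A typical call would be `with open("phenotypes.txt", newline="") as fh: phenotypes = load_phenotypes(fh)`.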
Covariates (Principal Components)
- Filename: covariates.txt (292.3 KB)
- Description: Covariate file containing the first 10 genetic principal components (PCs) for each individual.
- Format: Tab-delimited, compressed with gzip.
- Columns:
  - sample: individual identifiers matching the genotype data
  - PC1–PC10: first ten principal components representing population structure
- Use: Used as covariates in association analyses to correct for population stratification.
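Before running an association test, phenotypes and covariates need to be matched per individual. Note that the ID column is named sample_id in the phenotype file but sample in the covariate file. The sketch below joins the two on their shared IDs. Because the covariate filename ends in .txt while its format is described as gzip-compressed, the reader checks the gzip magic bytes rather than trusting the extension. All function names here are illustrative.

```python
import csv
import gzip


def open_text(path):
    """Open a text file, transparently handling gzip compression."""
    with open(path, "rb") as probe:
        # Every gzip stream starts with the two bytes 0x1f 0x8b.
        is_gzip = probe.read(2) == b"\x1f\x8b"
    if is_gzip:
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def read_keyed_table(path, key, delimiter="\t"):
    """Read a delimited table into {id: {column: text}} keyed by `key`."""
    with open_text(path) as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        return {
            row[key]: {k: v for k, v in row.items() if k != key}
            for row in reader
        }


def join_on_sample(phenotypes, covariates):
    """Keep only samples present in both tables and merge their columns."""
    shared = sorted(set(phenotypes) & set(covariates))
    return {s: {**phenotypes[s], **covariates[s]} for s in shared}
```

For example, `join_on_sample(read_keyed_table("phenotypes.txt", "sample_id"), read_keyed_table("covariates.txt", "sample"))` yields one merged record per individual found in both files; samples missing from either file are dropped silently, so it is worth comparing the counts before and after.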
Polygenic Scores
- Filename: scores.txt (234.4 KB)
- Description: Simulated polygenic scores (PGS) corresponding to the four phenotypes.
- Format: Comma-delimited.
- Columns:
  - sample: individual identifiers matching the genotype data
  - score_1, score_2, score_3, score_4: simulated PGS values
- Use: Demonstrates PGS evaluation and performance comparison across phenotypes.
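One simple way to compare PGS performance across phenotypes is the squared Pearson correlation between a score and its trait. The sketch below is one possible evaluation, not necessarily the one used in the tutorial. It assumes that score_k corresponds to pheno_k, that values have already been loaded as numbers into dictionaries keyed by sample, and that missing values are represented as `None`. For the binary traits a metric such as AUC may be more appropriate.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def score_phenotype_r2(scores, phenotypes, score_name, pheno_name):
    """Squared correlation between one score and one phenotype.

    `scores` and `phenotypes` map sample -> {column: number or None}.
    Only samples present in both, with non-missing values, are used.
    """
    xs, ys = [], []
    for sample in sorted(set(scores) & set(phenotypes)):
        x = scores[sample].get(score_name)
        y = phenotypes[sample].get(pheno_name)
        if x is not None and y is not None:
            xs.append(x)
            ys.append(y)
    return pearson_r(xs, ys) ** 2
```

Based on the phenotype table, one would expect a call such as `score_phenotype_r2(scores, phenotypes, "score_2", "pheno_2")` to return a clearly higher value than the same call for score_4 and pheno_4, since pheno_4 is described as not highly genetic.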
Note
All data are simulated for educational purposes. They do not contain any identifiable or real genetic or phenotype information from human participants.