Data

This tutorial uses simulated genotype and phenotype data based on the 1000 Genomes Project Human Omni 2.5 array.
The dataset represents a subset of ~2,500 samples and approximately 2 million SNPs, focusing on chromosome 20. The data are designed to illustrate GWAS, imputation, and polygenic score (PGS) analysis workflows using realistic, but simulated, inputs derived from real genotype array structures.


Imputation Workshop Slides (ASHG 2025)

Chip Array Data

  • Filename: gwas.array.hapmap.chr20.vcf.gz (20.4 MB)
  • Description: Simulated genotype array data for chromosome 20 (23,907 variants).
  • Format: VCF (Variant Call Format), compressed with gzip.
  • Use: Serves as the input array dataset for genotype imputation on the Michigan Imputation Server.
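
As a quick sanity check before uploading, the number of variant records in the array file can be counted with the Python standard library alone. This is a minimal sketch, not part of the tutorial's own tooling; the function name `count_vcf_variants` is made up for illustration.

```python
import gzip

def count_vcf_variants(path):
    """Count data records (non-header lines) in a gzip-compressed VCF."""
    n = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Header lines start with '#'; every other non-empty line is one variant.
            if line.strip() and not line.startswith("#"):
                n += 1
    return n
```

If the downloaded file matches the description above, `count_vcf_variants("gwas.array.hapmap.chr20.vcf.gz")` should report 23,907 records.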

Imputed Data

  • Filename: gwas.imputed.chr20.dose.vcf.gz (774.5 MB)
  • Description: Imputed genotype data for chromosome 20, generated with the HapMap reference panel on the Michigan Imputation Server. Imputation yielded a total of 63,402 variants.
  • Format: VCF with dosage values (0–2) for each SNP, compressed with gzip.
  • Use: Used for downstream association tests and polygenic score calculations.
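
To show what "dosage values (0–2)" look like in practice, here is a rough sketch that pulls the per-sample dosages out of a single VCF record. It assumes the dosage is stored under a `DS` key in the FORMAT column, which should be checked against the actual file header. The example line, its variant ID, and its values are invented for illustration.

```python
def extract_dosages(vcf_line):
    """Return (variant_id, [dosage per sample]) from one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    variant_id = fields[2]
    # Column 9 (index 8) lists the per-sample keys, e.g. "GT:DS".
    format_keys = fields[8].split(":")
    ds_index = format_keys.index("DS")  # raises ValueError if DS is absent
    dosages = [float(sample.split(":")[ds_index]) for sample in fields[9:]]
    return variant_id, dosages

# Invented record with three samples (assumed FORMAT layout GT:DS).
line = "20\t61098\tvar1\tC\tT\t.\tPASS\t.\tGT:DS\t0|0:0.012\t0|1:1.004\t1|1:1.987\n"
variant_id, dosages = extract_dosages(line)
```

Each dosage is the expected count of the alternate allele, so values near 0, 1, and 2 correspond to confidently imputed genotypes.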

Phenotype Data

  • Filename: phenotypes.txt (131.6 KB)
  • Description: Simulated phenotype file containing four traits.
  • Format: Plain text (tab-delimited).
  • Columns:
    • sample_id — individual sample identifiers
    • pheno_1, pheno_2, pheno_3, pheno_4 — four simulated phenotypes
  • Use: Provides phenotype data for GWAS and PGS evaluation.

Phenotype   Type                           Highly genetic?
pheno_1     Binary (e.g., case/control)    ✅ Yes
pheno_2     Continuous (e.g., height)      ✅ Yes
pheno_3     Binary or categorical          ❌ No
pheno_4     Continuous                     ❌ No
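
The phenotype file can be read with Python's `csv` module. The sketch below assumes the file has a header row with the column names listed above, that all trait values are numeric, and that missing values are coded as `NA` or left empty; none of these details are confirmed by the description, so adjust as needed. The two example rows are invented.

```python
import csv
import io

def load_phenotypes(handle):
    """Read a tab-delimited phenotype table into {sample_id: {trait: float or None}}."""
    reader = csv.DictReader(handle, delimiter="\t")
    table = {}
    for row in reader:
        sample = row.pop("sample_id")
        # Assumed missing-value coding: empty string or "NA".
        table[sample] = {
            trait: (float(value) if value not in ("", "NA") else None)
            for trait, value in row.items()
        }
    return table

# Invented stand-in for phenotypes.txt.
example = io.StringIO(
    "sample_id\tpheno_1\tpheno_2\tpheno_3\tpheno_4\n"
    "S1\t1\t0.52\t0\t-1.3\n"
    "S2\t0\tNA\t1\t0.7\n"
)
phenotypes = load_phenotypes(example)
```

For the real file, pass `open("phenotypes.txt", newline="")` instead of the `io.StringIO` object.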

Covariates (Principal Components)

  • Filename: covariates.txt (292.3 KB)
  • Description: Covariate file containing the first 10 genetic principal components (PCs) for each individual.
  • Format: Tab-delimited, compressed with gzip.
  • Columns:
    • sample — individual identifiers matching the genotype data
    • PC1–PC10 — first ten principal components representing population structure
  • Use: Used as covariates in association analyses to correct for population stratification.
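
Before running an association test, the covariates have to be matched to the phenotypes by sample identifier. Note that the identifier column is named `sample` here but `sample_id` in the phenotype file. The following is a minimal sketch of an inner join on in-memory dictionaries keyed by sample; the helper name `join_on_sample` and the example values are hypothetical.

```python
def join_on_sample(phenotypes, covariates):
    """Inner-join two {sample: {column: value}} dicts on their sample keys."""
    shared = sorted(set(phenotypes) & set(covariates))
    # Samples present in only one table are dropped.
    return {s: {**phenotypes[s], **covariates[s]} for s in shared}

# Invented values; S2 lacks covariates and S3 lacks phenotypes.
phenotypes = {"S1": {"pheno_2": 0.52}, "S2": {"pheno_2": -0.10}}
covariates = {"S1": {"PC1": 0.01, "PC2": -0.03}, "S3": {"PC1": 0.02, "PC2": 0.00}}
merged = join_on_sample(phenotypes, covariates)
```

Comparing `len(merged)` with the size of each input is a cheap way to spot identifier mismatches before they silently shrink the analysis sample.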

Polygenic Scores

  • Filename: scores.txt (234.4 KB)
  • Description: Simulated polygenic scores (PGS) corresponding to the four phenotypes.
  • Format: Comma-delimited.
  • Columns:
    • sample — individual identifiers matching the genotype data
    • score_1, score_2, score_3, score_4 — simulated PGS values
  • Use: Demonstrates PGS evaluation and performance comparison across phenotypes.
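
One common way to compare PGS performance across phenotypes is the squared Pearson correlation for continuous traits and the area under the ROC curve (AUC) for binary traits. The sketch below implements both with the standard library. It is an illustration rather than the tutorial's prescribed evaluation, it assumes binary phenotypes are coded 0/1, and the example scores and labels are invented.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    cases = [s for s, l in zip(scores, labels) if l == 1]
    controls = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))

# Invented example values.
scores = [0.1, 0.4, 0.5, 0.9]
continuous_trait = [1.0, 2.0, 2.5, 4.0]
binary_trait = [0, 1, 0, 1]  # assumed coding: 1 = case, 0 = control

r_squared = pearson_r(scores, continuous_trait) ** 2
case_control_auc = auc(scores, binary_trait)
```

The pairwise AUC loop is quadratic in the number of samples, which is fine for a few thousand individuals but would need a rank-based formula for much larger cohorts.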

Note

All data are simulated for educational purposes. They do not contain any identifiable or real genetic or phenotype information from human participants.