Code for the All-x-All AoU project.
Note: The scripts in this GitHub repository are not compatible with the All of Us Researcher Workbench. We are in the process of adapting the code and developing a public workspace to enable users to reproduce the analyses within the Workbench environment. Updates will be provided once the workspace becomes available.
GWAS and RVAS on the AoU data.
- Wenhan Lu (@wlu04)
Before running any QoB (Query-on-Batch) job, make sure Hail is configured correctly:
hailctl config set batch/billing_project all-by-aou
hailctl config set batch/remote_tmpdir gs://aou_tmp
# hailctl config set batch/tmp_dir gs://aou_tmp
hailctl config set query/backend batch
hailctl config list
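As a quick sanity check, the output of `hailctl config list` can be compared against the values set above. The sketch below is illustrative only: it assumes the command prints one `section/key=value` pair per line, and the helper names `missing_settings` and `check_hailctl` are hypothetical, not part of this repository.

```python
import subprocess

# Settings taken from the commands above.
EXPECTED = {
    "batch/billing_project": "all-by-aou",
    "batch/remote_tmpdir": "gs://aou_tmp",
    "query/backend": "batch",
}


def missing_settings(config_text, expected=EXPECTED):
    """Return {key: (expected, found)} for settings that are absent or differ.

    Assumption: one `section/key=value` pair per line; other lines are ignored.
    """
    found = {}
    for line in config_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            found[key] = value
    return {
        key: (want, found.get(key))
        for key, want in expected.items()
        if found.get(key) != want
    }


def check_hailctl():
    """Run `hailctl config list` (requires hailctl on PATH) and report problems."""
    out = subprocess.run(
        ["hailctl", "config", "list"], capture_output=True, text=True, check=True
    ).stdout
    problems = missing_settings(out)
    for key, (want, got) in problems.items():
        print(f"{key}: expected {want!r}, found {got!r}")
    return not problems
```

Calling `check_hailctl()` before submitting a job returns `True` only when all three settings match.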
- Description:
- Generate random phenotypes using GRM from SAIGE step 0
- Usage:
Rscript random_phenos.R \
  -g ~/Downloads/250k_data_utils_grm_aou_afr._relatednessCutoff_0.125_2000_randomMarkersUsed.sparseGRM.mtx \
  -s ~/Downloads/250k_data_utils_grm_aou_afr._relatednessCutoff_0.125_2000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt \
  -p 0.01 \
  -o data/random_pheno_afr
- For more information, run:
Rscript random_phenos.R -h
- Description:
- Read locus and p-value information from the input file (`-f`)
- Generate QQ-plot(s)
- (Optional) Generate Manhattan plot
- Required arguments: input file `-f`, with p-value column `-p`, and either chromosome column `-c` + base-pair column `-b`, OR comma-separated locus identifier column `-i`
- Usage:
- single phenotype:
Rscript manhattan_and_qq_plot.R -f ~/Downloads/amr_3446_both_sexes.txt.bgz -p Pvalue -c chr -b pos
- multiple phenotypes:
Rscript manhattan_and_qq_plot.R -f ~/Downloads/afr_variant_results_pilot_af.txt.bgz -p Pvalue -m phenoname
- Options:
- `-q TRUE`: generate QQ-plot(s) only and skip the Manhattan plot(s)
- For more information, run:
Rscript manhattan_and_qq_plot.R -h
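For readers unfamiliar with QQ-plots of association results, the sketch below shows one common way to compute the plotted points: observed `-log10(p)` against the value expected under a uniform null. It is not taken from `manhattan_and_qq_plot.R`; the function name `qq_points` and the `(rank - 0.5) / n` plotting position are assumptions chosen for illustration.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) -log10(p) pairs, ordered by increasing expected value.

    Assumption: under the null the k-th smallest of n p-values is expected
    near (k - 0.5) / n. Missing or out-of-range values are dropped.
    """
    pvals = sorted(p for p in pvalues if p is not None and 0.0 < p <= 1.0)
    n = len(pvals)
    points = []
    for rank, p in enumerate(pvals, start=1):
        expected = -math.log10((rank - 0.5) / n)
        observed = -math.log10(p)
        points.append((expected, observed))
    # The smallest p-value has the largest expected value, so reverse
    # to make the x-axis increase from left to right.
    points.reverse()
    return points
```

Points lying on the diagonal indicate p-values consistent with the null; systematic departure above it suggests inflation or true signal.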
- Usage:
python3 saige_aou.py --run_pipeline --phenos height,p_0.5_continuous_1 --pops eur --irnt --single_variant_only --skip_saige --skip_bgen --test
- Description:
- Load the original .tsv.gz VAT from the original bucket
- Parse the non-string fields to the correct datatype
- Sort the table and key it by `locus` and `alleles`
- Usage:
python3 reformat_vat.py
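The three steps above (load, cast non-string fields, sort by locus and alleles) can be illustrated in plain Python. This is only a conceptual sketch, not what `reformat_vat.py` does internally (the script works on Hail tables); the column names, the `NA` missing-value marker, and the helper names are all hypothetical.

```python
import csv
import io

# Hypothetical schema: column name -> parser for the non-string fields.
FIELD_TYPES = {"position": int, "gvs_all_af": float}

# Contig ordering used for sorting (chr1..chr22, chrX, chrY, chrM).
CONTIG_ORDER = {f"chr{c}": i for i, c in enumerate([*range(1, 23), "X", "Y", "M"])}


def parse_row(row):
    """Cast non-string fields; empty strings and 'NA' become None (assumed)."""
    parsed = dict(row)
    for name, cast in FIELD_TYPES.items():
        raw = row.get(name, "")
        parsed[name] = cast(raw) if raw not in ("", "NA") else None
    return parsed


def sort_key(row):
    """Order rows by locus (contig, position), then by alleles."""
    return (
        CONTIG_ORDER[row["contig"]],
        row["position"],
        row["ref_allele"],
        row["alt_allele"],
    )


def load_vat(handle):
    """Read a tab-separated table, cast its fields, and sort it."""
    reader = csv.DictReader(handle, delimiter="\t")
    return sorted((parse_row(r) for r in reader), key=sort_key)


# Small in-memory example with made-up rows.
SAMPLE = (
    "contig\tposition\tref_allele\talt_allele\tgvs_all_af\n"
    "chr2\t100\tA\tG\t0.25\n"
    "chr1\t200\tC\tT\tNA\n"
    "chr1\t50\tG\tA\t0.5\n"
)
rows = load_vat(io.StringIO(SAMPLE))
```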
- Description:
- Load original .csv phenotype files to Hail Tables (HTs) and parse the non-string fields to the correct datatype (`--update-raw-phenotypes`)
- Load the .tsv sample information files to HTs and parse the non-string fields to the correct datatype (`--update-sample-util-ht`)
- Merge the parsed phenotype HTs (`--update-phenotype-ht`) and annotate sample information (`--annotate-phenotype-ht`)
- Generate sample meta information table with standard covariates to be used for association tests (`--update-meta-ht`)
- Usage:
python3 process_phenotype.py --update-sample-util-ht --update-raw-phenotypes --update-phenotype-ht --annotate-phenotype-ht --update-meta-ht
- Options:
- `--batch`: run jobs with Hail Batch in case nothing else works
- Description: pre-process data to generate the GRM used for producing population-specific random phenotypes
- Usage:
- QoB:
python3 pre_process_random_pheno.py --create-plink-file --create-sparse-grm --pop amr --overwrite-variant-ht --overwrite-variant-mt --ld-prune --overwrite-ld-ht --overwrite-plink --overwrite-sample-file
- Dataproc:
hailctl dataproc submit clustername pre_process_random_pheno.py --pop eur --create-plink-file --ld-prune --overwrite-ld-ht --overwrite-plink --overwrite-sample-file --pyfiles ~/Dropbox\ \(Partners\ HealthCare\)/github_repo/aou_gwas/
- Note:
- LD pruning runs out of memory on QoB for EUR and should be run on Dataproc instead.
- To run the pipeline on Dataproc, use `from aou_gwas import *`
- To run the pipeline on QoB, use `from utils.utils import *` and `from utils.resources import *`
- Hail 0.2.124 causes a transient error
- Options:
- `--pop`: can be a comma-separated list of any combination of `['afr', 'amr', 'eas', 'eur', 'mid', 'sas', 'all']`
- `--ld-prune`: whether to run LD pruning before exporting to PLINK files (required for GRM)
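A comma-separated population argument like the one above could be parsed and validated along these lines. This is a minimal sketch, not the parser used in `pre_process_random_pheno.py`: the helper names are hypothetical, and `all` is treated as just another label rather than being expanded, which may differ from the script's behaviour.

```python
import argparse

# Labels listed in the options above.
POPS = ["afr", "amr", "eas", "eur", "mid", "sas", "all"]


def parse_pops(value):
    """Split a comma-separated --pop value and validate each label."""
    pops = [p.strip().lower() for p in value.split(",") if p.strip()]
    unknown = [p for p in pops if p not in POPS]
    if not pops or unknown:
        raise argparse.ArgumentTypeError(
            f"invalid --pop value {value!r}; choose from {POPS}"
        )
    return pops


def build_parser():
    """Build a parser exposing only the two options discussed here."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--pop", type=parse_pops, help="comma-separated population labels"
    )
    parser.add_argument(
        "--ld-prune",
        action="store_true",
        help="run LD pruning before exporting to PLINK files",
    )
    return parser
```

With this sketch, `build_parser().parse_args(["--pop", "afr,eur", "--ld-prune"])` yields a list of two labels and a boolean flag, and an unknown label is rejected with a usage error.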
- Description: chunk the VDS into BGEN files, each covering an interval of approximately `N_GENE_PER_GROUP` genes
- Usage:
python3 export_vds_to_bgen.py --test --mean_impute_missing --update-vds
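The chunking idea, grouping genes into intervals of roughly `N_GENE_PER_GROUP` genes each, can be sketched as follows. This is an illustration under stated assumptions rather than the logic in `export_vds_to_bgen.py`: the value of `N_GENE_PER_GROUP` below is a placeholder, and the `Gene`/`Interval` types and `chunk_genes` function are hypothetical.

```python
from typing import Iterable, List, NamedTuple

N_GENE_PER_GROUP = 100  # placeholder; the real value is defined in the repository


class Gene(NamedTuple):
    contig: str
    start: int
    end: int


class Interval(NamedTuple):
    contig: str
    start: int
    end: int
    n_genes: int


def chunk_genes(genes: Iterable[Gene], n_per_group: int = N_GENE_PER_GROUP) -> List[Interval]:
    """Group genes per contig, in start order, into intervals of up to n_per_group genes.

    Intervals never span contigs; the last group on a contig may be smaller.
    """
    by_contig = {}
    for gene in genes:
        by_contig.setdefault(gene.contig, []).append(gene)

    intervals = []
    for contig, items in by_contig.items():
        items.sort(key=lambda g: g.start)
        for i in range(0, len(items), n_per_group):
            group = items[i:i + n_per_group]
            intervals.append(
                Interval(contig, group[0].start, max(g.end for g in group), len(group))
            )
    return intervals
```

Because genes can overlap, neighbouring intervals produced this way may overlap slightly; a real pipeline would need to decide how to handle variants falling in both.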