Package 'stmgp'

Title: Rapid and Accurate Genetic Prediction Modeling for Genome-Wide Association or Whole-Genome Sequencing Study Data
Description: Rapidly build accurate genetic prediction models for genome-wide association or whole-genome sequencing study data by smooth-threshold multivariate genetic prediction (STMGP) method. Variable selection is performed using marginal association test p-values with an optimal p-value cutoff selected by Cp-type criterion. Quantitative and binary traits are modeled respectively via linear and logistic regression models. A function that works through PLINK software (Purcell et al. 2007 <DOI:10.1086/519795>, Chang et al. 2015 <DOI:10.1186/s13742-015-0047-8>) <https://www.cog-genomics.org/plink2> is provided. Covariates can be included in regression model.
Authors: Masao Ueki [aut, cre]
Maintainer: Masao Ueki <[email protected]>
License: GPL (>= 2)
Version: 1.0.4.1
Built: 2025-02-12 07:09:03 UTC
Source: https://github.com/cran/stmgp

Help Index


Rapid and Accurate Genetic Prediction Modeling for Genome-Wide Association or Whole-Genome Sequencing Study Data

Description

Rapidly build accurate genetic prediction models for genome-wide association or whole-genome sequencing study data by smooth-threshold multivariate genetic prediction (STMGP) method. Variable selection is performed using marginal association test p-values with an optimal p-value cutoff selected by Cp-type criterion. Quantitative and binary traits are modeled respectively via linear and logistic regression models. A function that works through PLINK software (Purcell et al. 2007 <DOI:10.1086/519795>, Chang et al. 2015 <DOI:10.1186/s13742-015-0047-8>) <https://www.cog-genomics.org/plink2> is provided. Covariates can be included in regression model.

Details

The DESCRIPTION file: Index of help topics:

lapprox                 Compute lapprox
stmgeplink              Smooth-threshold multivariate genetic
                        prediction (incorporating gene-environment
                        interactions) for genome-wide association or
                        whole-genome sequencing data in PLINK format
stmgp                   Smooth-threshold multivariate genetic
                        prediction
stmgp-package           Rapid and Accurate Genetic Prediction Modeling
                        for Genome-Wide Association or Whole-Genome
                        Sequencing Study Data
stmgplink               Smooth-threshold multivariate genetic
                        prediction for genome-wide association or
                        whole-genome sequencing data in PLINK format

Author(s)

Maintainer: Masao Ueki <[email protected]>

References

Ueki M, Tamiya G, for Alzheimer's Disease Neuroimaging Initiative (2016) Smooth-thresholdmultivariate genetic prediction with unbiased model selection. Genet Epidemiol 40:233-43. <https://doi.org/10.1002/gepi.21958>

Ueki M (2009) A note on automatic variable selection using smooth-threshold estimating equations. Biometrika 96:1005-11. <https://doi.org/10.1093/biomet/asp060>

Ueki M, Fujii M, Tamiya G, for Alzheimer's Disease Neuroimaging Initiative and the Alzheimer's Disease Metabolomics Consortium (2019) Quick assessment for systematic test statistic inflation/deflation due to null model misspecifications in genome-wide environment interaction studies. PLoS ONE 14:e0219825. <https://doi.org/10.1371/journal.pone.0219825>

Ueki M, Tamiya G, for Alzheimer's Disease Neuroimaging Initiative (2021) mooth-threshold multivariate genetic prediction incorporating gene-nvironment interactions. G3 Genes|Genomes|Genetics 11:jkab278. <https://doi.org/10.1093/g3journal/jkab278>


Compute lapprox

Description

Compute the lapprox statistic developed in Ueki et al. (2019) to assess inflation/deflation in gene-environment interaction test before genome-wide scan. No genotype data is used in computing the lapprox. The statistic far from 1 implies inflation/deflation in genome-wide gene-environment interaction test (Kraft joint test), suggesting that the null model can be misspecified, where the null model assumes unrelated samples without population stratification. Quantitative and binary phenotypes are modeled via linear and logistic regression, respectively.

Usage

lapprox(Z, X, y)

Arguments

Z

Covariates to be adjusted for in regression model. Intercept term (i.e. 1 vector) should be included.

X

Covariates to be multipled by genetic variants (G), which should be a subset of X. Environment variables can be included in X. The terms to be tested are X[,1]*G, X[,2]*G, ..., X[,ncol(X)]*G. See example below.

y

A response variable, either quantitative or binary (coded 0 or 1).

Details

See Ueki et al. (2019).

Value

Value of lapprox. Deviation from 1 implies inflation/deflation in genome-wide gene-environment interaction test.

References

Kraft P, Yen YC, Stram DO, Morrison J, Gauderman WJ (2007) Exploiting gene-environment interaction to detect genetic associations. Human Heredity 63:111-9.

Ueki M, Fujii M, Tamiya G (2019) Quick assessment for systematic test statistic inflation/deflation due to null model misspecifications in genome-wide environment interaction studies. PLoS ONE 14: e0219825.

Examples

## Not run: 


set.seed(3222)

# lapprox both for quantitative and binary phenotypes
# input:
# Z: covariate matrix including the intercept
# X: matrix for joint GxE interaction effects (marginal effect must be represented by the intercept)
# y: phenotype vector (0/1 code needed for binary phenotype)


n = 400  # sample size
q = 2  # number of covariates
Z = matrix(rnorm(n*q),n,q)  # covariates, the first component is used as the environment variable
mu = -1.5 + Z[,1]
y = mu + rnorm(n)  # quantitative phenotype
Y = rbinom(n,size=1,prob=1/(1+exp(-mu*5)))  # binary phenotype


#Example usage for quantitative trait

# lapprox for joint GxE interaction test (correctly specified)
# y ~ b0*1 + b1*Z[,1] + b2*Z[,2] + b3*G*1 + b4*G*Z[,1]
# H0: b3 = b4 = 0 is tested
lapprox(Z=cbind(1,Z),X=cbind(1,Z[,1]),y=y)
# [1] 1.012104

# lapprox for joint GxE interaction test (misspecified)
# y ~ b0*1 + b1*Z[,1]*Z[,1] + b2*Z[,2]*Z[,2] + b3*G*1 + b4*G*Z[,1]*Z[,1]
# H0: b3 = b4 = 0 is tested
lapprox(Z=cbind(1,Z*Z),X=cbind(1,Z[,1]*Z[,1]),y=y)
# [1] 1.969929

# lapprox for marginal association test
# y ~ b0*1 + b1*Z[,1] + b2*Z[,2] + b3*G*1
# H0: b3 = 0 is tested is tested
lapprox(Z=cbind(1,Z),X=matrix(1,nrow(Z),1),y=y)
# [1] 1.010101


#Example usage for binary trait

# lapprox for joint GxE interaction test (correctly specified)
# y ~ b0*1 + b1*Z[,1] + b2*Z[,2] + b3*G*1 + b4*G*Z[,1]
# H0: b3 = b4 = 0 is testedd
lapprox(Z=cbind(1,Z),X=cbind(1,Z[,1]),y=Y)
# [1] 0.9388417

# lapprox for joint GxE interaction test (misspecified)
# y ~ b0*1 + b1*Z[,1]*Z[,1] + b2*Z[,2]*Z[,2] + b3*G*1 + b4*G*Z[,1]*Z[,1]
# H0: b3 = b4 = 0 is tested
lapprox(Z=cbind(1,Z*Z),X=cbind(1,Z[,1]*Z[,1]),y=Y)
# [1] 1.150531

# lapprox for marginal association test
# y ~ b0*1 + b1*Z[,1] + b2*Z[,2] + b3*G*1
# H0: b3 = 0 is tested
lapprox(Z=cbind(1,Z),X=matrix(1,nrow(Z),1),y=Y)
# [1] 1.078894



## End(Not run)

Smooth-threshold multivariate genetic prediction

Description

Smooth-threshold multivariate genetic prediction (STMGP) method, which is based on the smooth-threshold estimating equations (Ueki 2009). Variable selection is performed based on marginal association test p- values (i.e. test of nonzero slope parameter in univariate regression for each predictor variable) with an optimal p-value cutoff selected by a Cp-type criterion. Quantitative and binary phenotypes are modeled via linear and logistic regression, respectively.

Usage

stmgp(y, X, Z = NULL, tau, qb, maxal, gamma = 1, ll = 50,
  lambda = 1, alc = NULL, pSum = NULL)

Arguments

y

A response variable, either quantitative or binary (coded 0 or 1); Response type is specified by qb.

X

Predictor variables subjected to variable selection.

Z

Covariates; Z=NULL means unspecified.

tau

tau parameter (allowed to be a vector object); NULL (default) specifies tau=n/log(n)^0.5 as suggested in Ueki and Tamiya (2016).

qb

Type of response variable, qb="q" and "b" specify quantitative and binary traits, respectively.

maxal

Maximum p-value cutoff for search.

gamma

gamma parameter; gamma=1 is default as suggested in Ueki and Tamiya (2016).

ll

Number of candidate p-value cutoffs for search (default=50) as determined by 10^seq( log10(maxal),log10(5e-8), length=ll).

lambda

lambda parameter (default=1).

alc

User-specified candidate p-value cutoffs for search; ll option is effective if alc=NULL.

pSum

User-specified p-values matrix from other studies that are independent of the study data (optional, default=NULL), a matrix object having rows with the same size of X and columns for each study (multiple studies are capable). Missing p-values must be coded as NA. Summary p-values are combined with p-values in the study data by the Fisher's method.

Details

See Ueki and Tamiya (2016).

Value

Muhat

Estimated phenotypic values from linear model evaluated at each candidate tuning parameters (al and tau) whose size is of (sample size) x (length of al) x (length of tau).

gdf

Generalized degrees of freedom (GDF, Ye 1998) whose size is of (length of al) x (length of tau).

sig2hat

Error variance estimates (=1 for binary traits) whose size is of (length of al) x (length of tau).

df

Number of nonzero regression coefficients whose size is of (length of al) x (length of tau).

al

Candidate p-value cutoffs for search.

lopt

An optimal tuning parameter indexes for al and tau selected by Cp-type criterion, CP

BA

Estimated regression coefficient matrix whose size is of (1 + number of columns of Z + number of columns of X) x (length of al)) x (length of tau)); the first element, the second block and third block correspond to intercept, Z and X, respectively.

Loss

Loss (sum of squared residuals or -2*loglikelihood) whose size is of (length of al) x (length of tau).

sig2hato

An error variance estimate (=1 for binary traits) used in computing the variance term of Cp-type criterion.

tau

Candidate tau parameters for search.

CP

Cp-type criterion whose size is of (length of al) x (length of tau).

References

Ye J (1988) On measuring and correcting the effects of data mining and model selection. J Am Stat Assoc 93:120-31.

Ueki M (2009) A note on automatic variable selection using smooth-threshold estimating equations. Biometrika 96:1005-11.

Examples

## Not run: 


set.seed(22200)

wd = system.file("extdata",package="stmgp")

D = read.table(unzip(paste(wd,"snps.raw.zip",sep="/"),exdir=tempdir()),header=TRUE)

X = D[,-(1:6)]
X = (X==1) + 2*(X==2)
p = ncol(X)
n = nrow(X)
ll = 30
p0 = 50; b0 = log(rep(1.2,p0))
iA0 = sample(1:p,p0)
Z = as.matrix(cbind(rnorm(n),runif(n)))  # covariates
eta = crossprod(t(X[,iA0]),b0) - 4 + crossprod(t(Z),c(0.5,0.5))


# quantitative trait
mu = eta
sig = 1.4
y = mu + rnorm(n)*sig
STq = stmgp(y,X,Z,tau=n*c(1),qb="q",maxal=0.1,gamma=1,ll=ll)
boptq = STq$BA[,STq$lopt[1],STq$lopt[2]]  # regression coefficient in selected model
nonzeroXq = which( boptq[(1+ncol(Z))+(1:p)]!=0 )  # nonzero regression coefficient
# check consistency
cor( STq$Muhat[,STq$lopt[1],STq$lopt[2]], crossprod(t(cbind(1,Z,X)),boptq) )
cor( STq$Muhat[,STq$lopt[1],STq$lopt[2]], eta)  # correlation with true function
# proportion of correctly identified true nonzero regression coefficients
length(intersect(which(boptq[-(1:(ncol(Z)+1))]!=0),iA0))/length(iA0)


# binary trait
mu = 1/(1+exp(-eta))
Y = rbinom(n,size=1,prob=mu)
STb = stmgp(Y,X,Z,tau=n*c(1),qb="b",maxal=0.1,gamma=1,ll=ll)
boptb = STb$BA[,STb$lopt[1],STb$lopt[2]]  # regression coefficient in selected model
nonzeroXb = which( boptb[(1+ncol(Z))+(1:p)]!=0 )  # nonzero regression coefficient
# check consistency
cor( STb$Muhat[,STb$lopt[1],STb$lopt[2]], crossprod(t(cbind(1,Z,X)),boptb) )
Prob = 1/(1+exp(-STb$Muhat[,STb$lopt[1],STb$lopt[2]]))  # Pr(Y=1) (logistic regression)
cor( STb$Muhat[,STb$lopt[1],STb$lopt[2]], eta)  # correlation with true function
# proportion of correctly identified true nonzero regression coefficients 
length(intersect(which(boptb[-(1:(ncol(Z)+1))]!=0),iA0))/length(iA0)



# simulated summary p-values
pSum = cbind(runif(ncol(X)),runif(ncol(X)));
pSum[iA0,1] = pchisq(rnorm(length(iA0),5,1)^2,df=1,low=F); # study 1 summary p-values
pSum[iA0,2] = pchisq(rnorm(length(iA0),6,1)^2,df=1,low=F); # study 2 summary p-values
pSum[sample(1:length(pSum),20)] = NA
head(pSum)


# quantitative trait using summary p-values
STqs = stmgp(y,X,Z,tau=n*c(1),qb="q",maxal=0.1,gamma=1,ll=ll,pSum=pSum)
boptqs = STqs$BA[,STqs$lopt[1],STqs$lopt[2]]  # regression coefficient in selected model
nonzeroXqs = which( boptqs[(1+ncol(Z))+(1:p)]!=0 )  # nonzero regression coefficient
# check consistency
cor( STqs$Muhat[,STqs$lopt[1],STqs$lopt[2]], crossprod(t(cbind(1,Z,X)),boptqs) )
cor( STqs$Muhat[,STqs$lopt[1],STqs$lopt[2]], eta)  # correlation with true function
# proportion of correctly identified true nonzero regression coefficients 
length(intersect(which(boptqs[-(1:(ncol(Z)+1))]!=0),iA0))/length(iA0)



# binary trait using summary p-values
STbs = stmgp(Y,X,Z,tau=n*c(1),qb="b",maxal=0.1,gamma=1,ll=ll,pSum=pSum)
boptbs = STbs$BA[,STbs$lopt[1],STbs$lopt[2]]  # regression coefficient in selected model
nonzeroXbs = which( boptbs[(1+ncol(Z))+(1:p)]!=0 )  # nonzero regression coefficient
# check consistency
cor( STbs$Muhat[,STbs$lopt[1],STbs$lopt[2]], crossprod(t(cbind(1,Z,X)),boptbs) )
Prob = 1/(1+exp(-STbs$Muhat[,STbs$lopt[1],STbs$lopt[2]]))  # Pr(Y=1) (logistic regression)
cor( STbs$Muhat[,STbs$lopt[1],STbs$lopt[2]], eta)  # correlation with true function
# proportion of correctly identified true nonzero regression coefficients 
length(intersect(which(boptbs[-(1:(ncol(Z)+1))]!=0),iA0))/length(iA0)






## End(Not run)