{"id":728,"date":"2015-02-08T15:14:11","date_gmt":"2015-02-08T22:14:11","guid":{"rendered":"http:\/\/homepages.uc.edu\/~yaozo\/wordpress\/?p=728"},"modified":"2015-02-08T15:14:11","modified_gmt":"2015-02-08T22:14:11","slug":"collinearity-and-stepwise-vif-selection","status":"publish","type":"post","link":"https:\/\/zhuoyao.net\/index.php\/2015\/02\/08\/collinearity-and-stepwise-vif-selection\/","title":{"rendered":"Collinearity and stepwise VIF selection"},"content":{"rendered":"<p>Collinearity, or excessive correlation among explanatory variables, can complicate or prevent the identification of an optimal set of explanatory variables for a statistical model. For example, forward or backward selection of variables could produce inconsistent results, variance partitioning analyses may be unable to identify unique sources of variation, or parameter estimates may include substantial amounts of uncertainty. The temptation to build an ecological model using all available information (i.e., all variables) is hard to resist. Considerable time and money are spent gathering data and supporting information. We also hope to identify every significant variable to more accurately characterize relationships with biological relevance. Analytical limitations related to collinearity require us to think carefully about the variables we choose to model, rather than adopting a naive approach where we blindly use all information to understand complexity. The purpose of this blog is to illustrate the use of some techniques to reduce collinearity among explanatory variables, using a simulated dataset with a known correlation structure.<\/p>\n<p>A simple approach to identifying collinearity among explanatory variables is the use of variance inflation factors (VIF). VIF calculations are straightforward and easily comprehensible; the higher the value, the higher the collinearity. 
A VIF for a single explanatory variable is obtained using the r-squared value of the regression of that variable against all other explanatory variables:<\/p>\n<p align=\"center\"><img decoding=\"async\" class=\"latex\" title=\"\\displaystyle VIF_j=\\frac{1}{1-R_j^2} \" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cdisplaystyle+VIF_j%3D%5Cfrac%7B1%7D%7B1-R_j%5E2%7D+&amp;bg=ffffff&amp;fg=000000&amp;s=0\" alt=\"\\displaystyle VIF_j=\\frac{1}{1-R_j^2} \" \/><\/p>\n<p>where the <img decoding=\"async\" class=\"latex\" title=\"VIF\" src=\"https:\/\/s0.wp.com\/latex.php?latex=VIF&amp;bg=ffffff&amp;fg=000000&amp;s=0\" alt=\"VIF\" \/> for variable <img decoding=\"async\" class=\"latex\" title=\"j\" src=\"https:\/\/s0.wp.com\/latex.php?latex=j&amp;bg=ffffff&amp;fg=000000&amp;s=0\" alt=\"j\" \/> is the reciprocal of one minus the <img decoding=\"async\" class=\"latex\" title=\"R^2\" src=\"https:\/\/s0.wp.com\/latex.php?latex=R%5E2&amp;bg=ffffff&amp;fg=000000&amp;s=0\" alt=\"R^2\" \/> from the regression. A VIF is calculated for each explanatory variable and those with high values are removed. The definition of \u2018high\u2019 is somewhat arbitrary but values in the range of 5-10 are commonly used.<\/p>\n<p>Several packages in R provide functions to calculate VIF: <a href=\"http:\/\/rss.acs.unt.edu\/Rdoc\/library\/HH\/html\/vif.html\">vif<\/a> in package HH, <a href=\"http:\/\/hosho.ees.hokudai.ac.jp\/~kubo\/Rdoc\/library\/car\/html\/vif.html\">vif<\/a> in package car, <a href=\"http:\/\/www.inside-r.org\/packages\/cran\/fmsb\/docs\/VIF\">VIF<\/a> in package fmsb, <a href=\"http:\/\/rss.acs.unt.edu\/Rdoc\/library\/faraway\/html\/vif.html\">vif<\/a> in package faraway, and <a href=\"http:\/\/cran.open-source-solution.org\/web\/packages\/VIF\/index.html\">vif<\/a> in package VIF. The number of packages that provide VIF functions is surprising given that they all seem to accomplish the same thing. 
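These functions all implement the formula above, which is also easy to verify by hand. The sketch below (made-up variables `x1`-`x3`, not part of the original post) computes the same VIF two equivalent ways: from the regression R-squared, and from the diagonal of the inverse of the predictors' correlation matrix.

```r
# Hand-rolled VIF check (illustrative, made-up variables).
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)  # deliberately correlated with x1
x3 <- rnorm(n)                       # independent of the others
X  <- data.frame(x1, x2, x3)

# Route 1: regress x1 on the other predictors and apply 1 / (1 - R^2)
r2     <- summary(lm(x1 ~ x2 + x3, data = X))$r.squared
vif_x1 <- 1 / (1 - r2)

# Route 2: VIFs are the diagonal of the inverse correlation matrix
vif_all <- diag(solve(cor(X)))

round(c(vif_x1, vif_all["x1"]), 6)  # the two routes agree
```

The equivalence of the two routes is why the many package implementations are interchangeable for ordinary linear models.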
One exception is the function in the VIF package, which can be used to create linear models using VIF-regression. The nuts and bolts of this function are a little unclear since <a href=\"http:\/\/cran.open-source-solution.org\/web\/packages\/VIF\/VIF.pdf\">the documentation<\/a> for the package is sparse. However, what this function does accomplish is something that the others do not: stepwise selection of variables using VIF. Simply removing every variable with a high VIF after a single comparison using the full set of explanatory variables is not sufficient, because the VIF values change each time a variable is removed. Accordingly, a more thorough implementation of the VIF function is to use a stepwise approach until all VIF values are below a desired threshold. For example, using the full set of explanatory variables, calculate a VIF for each variable, remove the variable with the single highest value, recalculate all VIF values with the new set of variables, remove the variable with the next highest value, and so on, until all values are below the threshold.<\/p>\n<p>In this blog we\u2019ll use a custom function for stepwise variable selection. I\u2019ve created this function because I think it provides a useful example for exploring stepwise VIF analysis. The function is a wrapper for the VIF function in fmsb. We\u2019ll start by simulating a dataset with a known correlation structure.<\/p>\n<pre><code>require(MASS)\nrequire(clusterGeneration)\n\nset.seed(2)\nnum.vars&lt;-15\nnum.obs&lt;-200\ncov.mat&lt;-genPositiveDefMat(num.vars,covMethod=\"unifcorrmat\")$Sigma\nrand.vars&lt;-mvrnorm(num.obs,rep(0,num.vars),Sigma=cov.mat)\n<\/code><\/pre>\n<p>We\u2019ve created fifteen \u2018explanatory\u2019 variables with 200 observations each. 
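As a quick sanity check on this kind of simulation (a self-contained sketch, not part of the original post, using a hand-built AR(1)-style covariance matrix in place of genPositiveDefMat so that only the MASS package is needed), the sample correlation matrix of a large draw sits close to the matrix that generated it:

```r
# Sanity check (illustrative): with unit variances the covariance matrix is
# also the correlation matrix, so cor() of a large sample should be close to it.
library(MASS)                              # for mvrnorm
set.seed(10)
Sigma <- 0.6 ^ abs(outer(1:4, 1:4, "-"))   # simple AR(1)-style correlation matrix
X <- mvrnorm(n = 5000, mu = rep(0, 4), Sigma = Sigma)
max(abs(cor(X) - Sigma))                   # small, and shrinks as n grows
```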
The <a href=\"http:\/\/stat.ethz.ch\/R-manual\/R-patched\/library\/MASS\/html\/mvrnorm.html\">mvrnorm<\/a> function (MASS package) was used to create the data using a covariance matrix from the <a href=\"http:\/\/rss.acs.unt.edu\/Rdoc\/library\/clusterGeneration\/html\/genPositiveDefMat.html\">genPositiveDefMat<\/a> function (clusterGeneration package). These functions provide a really simple approach to creating data matrices with arbitrary correlation structures. The covariance matrix was chosen from a uniform distribution such that some variables are correlated while some are not. A more thorough explanation about creating correlated data matrices can be found <a href=\"http:\/\/www.quantumforest.com\/2011\/10\/simulating-data-following-a-given-covariance-structure\/\">here<\/a>. The correlation matrix for the random variables should look very similar to the correlation matrix from the actual values (as sample size increases, the correlation matrix approaches cov.mat).<\/p>\n<p><a href=\"https:\/\/beckmw.files.wordpress.com\/2013\/02\/cor_tab1.pdf\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-363\" src=\"https:\/\/beckmw.files.wordpress.com\/2013\/02\/cor_tab1.jpg?w=525&amp;h=196\" alt=\"cor_tab\" width=\"525\" height=\"196\" \/><\/a><\/p>\n<p>Now we create our response variable as a linear combination of the explanatory variables. First, we create a vector for the parameters describing the relationship of the response variable with the explanatory variables. Then, we use some matrix algebra and a randomly distributed error term to create the response variable. 
This is the standard form for a linear regression model.<\/p>\n<pre><code>parms&lt;-runif(num.vars,-10,10)\ny&lt;-rand.vars %*% matrix(parms) + rnorm(num.obs,sd=20)\n<\/code><\/pre>\n<p>We would expect a regression model to indicate that each of the fifteen explanatory variables is significantly related to the response variable, since we know the true relationship of y with each of the variables. However, our explanatory variables are correlated. What happens when we create the model?<\/p>\n<pre><code>lm.dat&lt;-data.frame(y,rand.vars)\nform.in&lt;-paste('y ~',paste(names(lm.dat)[-1],collapse='+'))\nmod1&lt;-lm(form.in,data=lm.dat)\nsummary(mod1)\n<\/code><\/pre>\n<p align=\"center\"><a href=\"https:\/\/beckmw.files.wordpress.com\/2013\/02\/mod1.pdf\"><img decoding=\"async\" class=\"alignnone size-large wp-image-369\" src=\"https:\/\/beckmw.files.wordpress.com\/2013\/02\/mod1.jpg?w=525&amp;h=804\" alt=\"mod1\" \/><\/a><\/p>\n<p>The model shows that only four of the fifteen explanatory variables are significantly related to the response variable (at <img decoding=\"async\" class=\"latex\" title=\"\\alpha = 0.05\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Calpha+%3D+0.05&amp;bg=ffffff&amp;fg=000000&amp;s=0\" alt=\"\\alpha = 0.05\" \/>), yet we know that every one of the variables is related to y. As we\u2019ll see later, the standard errors are also quite large.<\/p>\n<p>We can try an alternative approach to building the model that accounts for collinearity among the explanatory variables. 
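This masking effect is easy to reproduce on a smaller scale. In the sketch below (made-up variables, not the post's simulated data), the same true coefficients are estimated twice, once with independent predictors and once with nearly collinear ones; only the collinear fit shows badly inflated standard errors:

```r
# Illustration with made-up data: identical true coefficients, but the
# collinear design yields a much larger standard error for x1.
set.seed(123)
n <- 200
x1     <- rnorm(n)
x2_ind <- rnorm(n)                 # independent of x1
x2_col <- x1 + rnorm(n, sd = 0.1)  # nearly collinear with x1
y_ind  <- 2 * x1 + 2 * x2_ind + rnorm(n)
y_col  <- 2 * x1 + 2 * x2_col + rnorm(n)

se_ind <- summary(lm(y_ind ~ x1 + x2_ind))$coefficients["x1", "Std. Error"]
se_col <- summary(lm(y_col ~ x1 + x2_col))$coefficients["x1", "Std. Error"]
c(independent = se_ind, collinear = se_col)
```

With a correlation near one between the two predictors, the collinear fit's standard error is several times larger, so the individual coefficients can easily appear non-significant even though the true effects are identical.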
We can implement the custom VIF function as follows.<\/p>\n<pre><code>vif_func(in_frame=rand.vars,thresh=5,trace=T)\n\n  var vif             \n X1  26.7776302460193\n X2  35.7654696801389\n X3  14.8902623488606\n X4  50.6259723278776\n X5  10.599371257556 \n X6  108.343545737888\n X7  48.2508656429107\n X8  183.136179797657\n X9  51.6790552123906\n X10 63.8699838164383\n X11 38.9458133633031\n X12 44.3534264537944\n X13 9.35861427426385\n X14 63.1574276237521\n X15 30.0137537949494\n\nremoved:  X8 183.1362 \n\n var vif             \n X1  5.57731851381497\n X2  10.0195886727232\n X3  5.55663566788945\n X4  6.8064112804091 \n X5  9.7815324084451 \n X6  22.3236741700758\n X7  36.854990561001 \n X9  16.972399679086 \n X10 57.2665930293009\n X11 22.4854807367867\n X12 43.1006397357538\n X13 8.54661668063361\n X14 29.7536838039265\n X15 21.6340334562738\n\nremoved:  X10 57.26659 \n\n var vif             \n X1  5.55463656650283\n X2  8.43692519123461\n X3  4.20157496220101\n X4  4.30562228649632\n X5  1.85152657224351\n X6  9.78518916197122\n X7  3.59917695249808\n X9  5.62398393809027\n X11 4.32732961231283\n X12 8.92901049257853\n X13 2.22079922858869\n X14 9.73258301210856\n X15 8.69287102590565\n\nremoved:  X6 9.785189 \n\n var vif             \n X1  4.88431271981048\n X2  3.0066710371039 \n X3  3.92223104412672\n X4  4.03552281755132\n X5  1.85130973105683\n X7  3.56687077767566\n X9  5.55536287729148\n X11 2.11226533056043\n X12 5.58689916270725\n X13 1.86868960383407\n X14 9.39686287473867\n X15 7.80042398111767\n\nremoved:  X14 9.396863 \n\n[1] \"X1\"  \"X2\"  \"X3\"  \"X4\"  \"X5\"  \"X7\"  \"X9\"  \"X11\"\n [9] \"X12\" \"X13\" \"X15\"\n<\/code><\/pre>\n<p>The function uses three arguments. The first is a matrix or data frame of the explanatory variables, the second is the threshold value to use for retaining variables, and the third is a logical argument indicating if text output is returned as the stepwise selection progresses. 
The output indicates the VIF values for each variable after each stepwise comparison. The function calculates the VIF values for all explanatory variables, removes the variable with the highest value, and repeats until all VIF values are below the threshold. The final output is a list of variable names with VIF values that fall below the threshold. Now we can create a linear model using explanatory variables with less collinearity.<\/p>\n<pre><code>keep.dat&lt;-vif_func(in_frame=rand.vars,thresh=5,trace=F)\nform.in&lt;-paste('y ~',paste(keep.dat,collapse='+'))\nmod2&lt;-lm(form.in,data=lm.dat)\nsummary(mod2)\n<\/code><\/pre>\n<p align=\"center\"><a href=\"https:\/\/beckmw.files.wordpress.com\/2013\/02\/mod2.pdf\"><img decoding=\"async\" class=\"alignnone size-large wp-image-369\" src=\"https:\/\/beckmw.files.wordpress.com\/2013\/02\/mod2.jpg?w=525&amp;h=804\" alt=\"mod1\" \/><\/a><\/p>\n<p>The updated regression model is much improved over the original. We see an increase in the number of variables that are significantly related to the response variable. This increase is directly related to the standard error estimates for the parameters, which are at least 50% smaller than those in the first model. The take-home message is that true relationships among variables will be masked if explanatory variables are collinear. This creates problems during model creation that lead to complications in model inference. 
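The link between collinearity and inflated standard errors can be made precise: in a model with an intercept, the standard error of each coefficient scales with the square root of that variable's VIF. A self-contained numerical check (made-up variables, not the post's simulated data) of the identity SE(b_j) = sigma_hat * sqrt(VIF_j \/ ((n - 1) * var(x_j))):

```r
# Numerical check (illustrative) of the identity linking VIF to standard
# errors in a linear model with an intercept:
#   SE(b_j) = sigma_hat * sqrt( VIF_j / ((n - 1) * var(x_j)) )
set.seed(4)
n  <- 150
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # correlated with x1
x3 <- rnorm(n)
y  <- 1 + x1 + x2 + x3 + rnorm(n)

fit   <- lm(y ~ x1 + x2 + x3)
sig   <- summary(fit)$sigma                                # residual std. error
se_x1 <- summary(fit)$coefficients["x1", "Std. Error"]     # reported SE for x1

r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared              # R^2 of x1 on the rest
vif_x1 <- 1 / (1 - r2_x1)

all.equal(se_x1, sig * sqrt(vif_x1 / ((n - 1) * var(x1))))
```

Because the VIF enters under a square root, cutting a VIF of 100 down to below 5 shrinks the coefficient's standard error by a factor of four or more, which is exactly the improvement seen between the two models above.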
Taking the extra time to evaluate collinearity is a critical first step to creating more robust ecological models.<\/p>\n<p>Function and example code:<\/p>\n<div id=\"gist4717702\" class=\"gist\">\n<div class=\"gist-file\">\n<div class=\"gist-data gist-syntax\">\n<div class=\"file-data\">\n<pre class=\"\">#stepwise VIF function used below\nvif_func&lt;-function(in_frame,thresh=10,trace=T,...){\n\n  require(fmsb)\n  \n  if(!is.data.frame(in_frame)) in_frame&lt;-data.frame(in_frame)\n  \n  #get initial vif value for all comparisons of variables\n  vif_init&lt;-NULL\n  for(val in names(in_frame)){\n      form_in&lt;-formula(paste(val,' ~ .'))\n      vif_init&lt;-rbind(vif_init,c(val,VIF(lm(form_in,data=in_frame,...))))\n      }\n  vif_max&lt;-max(as.numeric(vif_init[,2]))\n\n  if(vif_max &lt; thresh){\n    if(trace==T){ #print output of each iteration\n        prmatrix(vif_init,collab=c('var','vif'),rowlab=rep('',nrow(vif_init)),quote=F)\n        cat('\\n')\n        cat(paste('All variables have VIF &lt; ', thresh,', max VIF ',round(vif_max,2), sep=''),'\\n\\n')\n        }\n    return(names(in_frame))\n    }\n  else{\n\n    in_dat&lt;-in_frame\n\n    #backwards selection of explanatory variables, stops when all VIF values are below 'thresh'\n    while(vif_max &gt;= thresh){\n      \n      vif_vals&lt;-NULL\n\n      for(val in names(in_dat)){\n        form_in&lt;-formula(paste(val,' ~ .'))\n        vif_add&lt;-VIF(lm(form_in,data=in_dat,...))\n        vif_vals&lt;-rbind(vif_vals,c(val,vif_add))\n        }\n      #index of the variable with the largest VIF in this pass\n      max_row&lt;-which.max(as.numeric(vif_vals[,2]))\n\n      vif_max&lt;-as.numeric(vif_vals[max_row,2])\n\n      if(vif_max&lt;thresh) break\n      \n      if(trace==T){ #print output of each iteration\n        prmatrix(vif_vals,collab=c('var','vif'),rowlab=rep('',nrow(vif_vals)),quote=F)\n        cat('\\n')\n        cat('removed: ',vif_vals[max_row,1],vif_max,'\\n\\n')\n        flush.console()\n        }\n\n      
in_dat&lt;-in_dat[,!names(in_dat) %in% vif_vals[max_row,1]]\n\n      }\n\n    return(names(in_dat))\n    \n    }\n  \n  }<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"wpcnt\"><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Collinearity, or excessive correlation among explanatory variables, can complicate or prevent the identification of an optimal set of explanatory variables for a statistical model. For&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-728","post","type-post","status-publish","format-standard","hentry","category-r"],"_links":{"self":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/728","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/comments?post=728"}],"version-history":[{"count":0,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/728\/revisions"}],"wp:attachment":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/media?parent=728"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/categories?post=728"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/tags?post=728"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}