Skip to contents

Redundancy analysis (RDA)

#Install required packages
rda_packages()

If using RDA, please cite the following: Capblancq T., Forester B.R. (2021) Redundancy analysis: A Swiss Army Knife for landscape genomics. Methods in Ecology and Evolution 12:2298-2309. DOI:10.1111/2041-210X.13722

Redundancy analysis (RDA) is a genotype-environment association (GEA) method that uses constrained ordination to detect outlier loci that are significantly associated with environmental variables. It does so by combining regression (in which response variables are genetic data, in our case, while explanatory variables are environmental data) with ordination (a PCA). Importantly, RDA is multivariate (i.e., multiple loci can be considered at a time), which is appealing in cases where multilocus selection may be occurring.

Model selection is performed by starting with a null model wherein the response is only explained by an intercept, and environmental variables are added until the amount of variance explained by a model that includes all variables is reached. By doing so, this method minimizes the redundancy among explanatory variables. Importantly, we can also perform an RDA that accounts for covariables (or “conditioning variables”) such as those that account for neutral population structure. When covariables are included, this is called a partial RDA. In algatr, population structure is quantified by running a PCA on the genetic data, and selecting a number of PC axes to best represent this type of neutral variation.

Some of the earliest examples of uses of RDA for identifying environment-associated loci are Lasky et al. 2012 and Forester et al. 2016. Since then, there have been several reviews and walkthroughs of the method that provide additional information, including workflows and comparisons with other GEA methods (e.g., Forester et al. 2018). Much of algatr’s code is adapted from one such paper, Capblancq & Forester 2021, the code of which is available here. algatr’s general workflow roughly follows the framework of Capblancq & Forester (2021).

RDA cannot take in missing values. Imputation based on the per-site median is commonly performed, but there are several other ways researchers can deal with missing values. For example, algatr contains the str_impute() function to impute missing values based on population structure using the LEA::impute() function. However, here, we’ll opt to impute to the median, but strongly urge researchers to use extreme caution when using this form of simplistic imputation. We mainly provide code to impute on the median for testing datasets and highly discourage its use in further analyses (please use str_impute() instead!).

The general workflow to perform an RDA with algatr is as follows:

  • Simple vs. partial RDAs: We can run a simple RDA (no covariables) or partial RDA (conditioning covariables included) using rda_run()

  • Variable selection: Both simple and partial RDAs can be performed considering all variables ("full" model) or by performing variable selection to determine variables that contribute most to genetic variance ("best" model)

  • Variance partitioning: We can perform variance partitioning considering only variables that contribute significantly to genetic variance using rda_varpart()

  • Detecting outlier loci: Based on the results from above, we can detect outlier loci with rda_getoutliers()

Read in and process genetic data

Running an RDA requires two data files for input: a genotype dosage matrix (the gen argument) and the environmental values extracted at sampling coordinates (the env argument). Let’s first convert our vcf to a dosage matrix using the vcf_to_dosage() function. N.B.: our code assumes that sample IDs from our genetic data and our coords are in the same order; be sure to check this before moving forward!

load_algatr_example()
#> 
#> ---------------- example dataset ----------------
#>  
#> Objects loaded: 
#> *liz_vcf* vcfR object (1000 loci x 53 samples) 
#> *liz_gendist* genetic distance matrix (Plink Distance) 
#> *liz_coords* dataframe with x and y coordinates 
#> *CA_env* RasterStack with example environmental layers 
#> 
#> -------------------------------------------------
#> 
#> 
# Convert from vcf to dosage matrix:
gen <- vcf_to_dosage(liz_vcf)
#> Loading required package: vcfR
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.15.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
#> Loading required namespace: adegenet

As mentioned above, running an RDA requires that your genotype matrix contains no missing values. Let’s impute missing values based on the per-site median. N.B.: this type of simplistic imputation is strongly not recommended for downstream analyses and is used here for example’s sake!

# Are there NAs in the data?
gen[1:5, 1:5]
#>        Locus_10 Locus_15 Locus_22 Locus_28 Locus_32
#> ALT3          0        0        0        0        0
#> BAR360        0        0        0       NA        0
#> BLL5          0       NA        0        0       NA
#> BNT5          0        0       NA       NA        0
#> BOF1          0       NA       NA       NA       NA
gen <- simple_impute(gen)
# Check that NAs are gone
gen[1:5, 1:5]
#>        Locus_10 Locus_15 Locus_22 Locus_28 Locus_32
#> ALT3          0        0        0        0        0
#> BAR360        0        0        0        0        0
#> BLL5          0        0        0        0        0
#> BNT5          0        0        0        0        0
#> BOF1          0        0        0        0        0

Process environmental data

Let’s extract environmental variables using the extract() function from the raster package. We also need to standardize environmental variables. This is particularly important if we are using (for example) bioclimatic variables as input, as units of measurement are completely different (e.g., mm for precipitation vs. degrees Celsius for temperature). To do so, we’ll use the scale() function within the raster package.

# Extract environmental vars
env <- raster::extract(CA_env, liz_coords)

# Standardize environmental variables and make into dataframe
env <- scale(env, center = TRUE, scale = TRUE)
env <- data.frame(env)

Running simple and partial RDAs using rda_run(), with and without variable selection


The main function within algatr to perform an RDA is rda_run(), which uses the rda() function within the vegan package.

Run a simple RDA with no variable selection

A simple RDA is one in which our model will not account for covariables in the model, and is the default for rda_run(). Let’s first run a simple RDA on all environmental variables (i.e., no variable selection). This is specified using the model = "full" argument.

mod_full <- rda_run(gen, env, model = "full")

The resulting object is large, containing 12 elements relevant to the RDA. Let’s take a look at what function was called. We can see that all environmental variables (CA_rPCA1, 2, and 3) were included in the model, and that geography or structure were not included in the model (and there were no conditioning variables).

mod_full$call
#> rda(formula = gen ~ CA_rPCA1 + CA_rPCA2 + CA_rPCA3, data = moddf)

Now, let’s take a look at the summary of this model. One of the most important parts of this object is the partitioning of variance. Within an RDA, the term “inertia” can be interpreted as variance, and our results show us that the amount of variance explained by explanatory variables alone (constrained inertia) is only 9.768%. Unconstrained inertia is the residual variance, which is very high (90.232%). The summary also provides site scores, site constraints, and biplot scores.

head(summary(mod_full))
#> 
#> Call:
#> rda(formula = gen ~ CA_rPCA1 + CA_rPCA2 + CA_rPCA3, data = moddf) 
#> 
#> Partitioning of variance:
#>               Inertia Proportion
#> Total          143.39    1.00000
#> Constrained     14.01    0.09768
#> Unconstrained  129.38    0.90232
#> 
#> Eigenvalues, and their contribution to the variance 
#> 
#> Importance of components:
#>                          RDA1    RDA2    RDA3     PC1     PC2     PC3     PC4
#> Eigenvalue            7.28806 4.98205 1.73568 18.6608 9.31326 7.51572 5.53312
#> Proportion Explained  0.05083 0.03474 0.01210  0.1301 0.06495 0.05241 0.03859
#> Cumulative Proportion 0.05083 0.08557 0.09768  0.2278 0.29277 0.34518 0.38377
#>                           PC5    PC6     PC7     PC8     PC9    PC10    PC11
#> Eigenvalue            4.51893 4.0443 3.69111 3.30413 3.09940 3.04539 3.00412
#> Proportion Explained  0.03152 0.0282 0.02574 0.02304 0.02162 0.02124 0.02095
#> Cumulative Proportion 0.41529 0.4435 0.46923 0.49228 0.51389 0.53513 0.55608
#>                          PC12    PC13    PC14    PC15    PC16   PC17    PC18
#> Eigenvalue            2.83164 2.80969 2.74342 2.58682 2.47393 2.4664 2.32471
#> Proportion Explained  0.01975 0.01959 0.01913 0.01804 0.01725 0.0172 0.01621
#> Cumulative Proportion 0.57583 0.59542 0.61456 0.63260 0.64985 0.6671 0.68326
#>                          PC19    PC20    PC21    PC22    PC23    PC24    PC25
#> Eigenvalue            2.29159 2.22955 2.15249 2.10481 2.01172 1.94388 1.90538
#> Proportion Explained  0.01598 0.01555 0.01501 0.01468 0.01403 0.01356 0.01329
#> Cumulative Proportion 0.69925 0.71480 0.72981 0.74449 0.75852 0.77207 0.78536
#>                          PC26    PC27    PC28    PC29    PC30    PC31    PC32
#> Eigenvalue            1.82439 1.79687 1.68505 1.64293 1.59963 1.56602 1.50690
#> Proportion Explained  0.01272 0.01253 0.01175 0.01146 0.01116 0.01092 0.01051
#> Cumulative Proportion 0.79808 0.81062 0.82237 0.83382 0.84498 0.85590 0.86641
#>                          PC33     PC34    PC35     PC36    PC37     PC38
#> Eigenvalue            1.46174 1.433419 1.34638 1.331355 1.28477 1.253763
#> Proportion Explained  0.01019 0.009997 0.00939 0.009285 0.00896 0.008744
#> Cumulative Proportion 0.87661 0.886602 0.89599 0.905277 0.91424 0.922981
#>                          PC39     PC40    PC41     PC42     PC43     PC44
#> Eigenvalue            1.23738 1.175289 1.13416 1.072082 1.067608 1.017457
#> Proportion Explained  0.00863 0.008197 0.00791 0.007477 0.007446 0.007096
#> Cumulative Proportion 0.93161 0.939807 0.94772 0.955193 0.962639 0.969734
#>                           PC45     PC46     PC47     PC48     PC49
#> Eigenvalue            0.998927 0.910062 0.866439 0.836444 0.727872
#> Proportion Explained  0.006967 0.006347 0.006043 0.005833 0.005076
#> Cumulative Proportion 0.976701 0.983048 0.989090 0.994924 1.000000
#> 
#> Accumulated constrained eigenvalues
#> Importance of components:
#>                         RDA1   RDA2   RDA3
#> Eigenvalue            7.2881 4.9820 1.7357
#> Proportion Explained  0.5204 0.3557 0.1239
#> Cumulative Proportion 0.5204 0.8761 1.0000
#> 
#> Scaling 2 for species and site scores
#> * Species are scaled proportional to eigenvalues
#> * Sites are unscaled: weighted dispersion equal on all dimensions
#> * General scaling constant of scores:  9.29244 
#> 
#> 
#> Species scores
#> 
#>                RDA1       RDA2       RDA3        PC1        PC2        PC3
#> Locus_10  3.073e-02 -1.130e-02  2.569e-02 -1.786e-02  1.512e-02  3.392e-02
#> Locus_15  8.232e-02 -5.560e-02 -9.949e-02 -9.373e-02  9.032e-03  1.796e-02
#> Locus_22  9.598e-03  1.021e-03  2.333e-02  1.249e-01 -2.411e-02 -8.721e-03
#> Locus_28 -7.716e-02 -8.271e-02 -6.968e-02 -2.229e-02  3.464e-02  4.338e-02
#> Locus_32  1.095e-02 -1.692e-04  3.179e-02 -1.580e-02  7.483e-03  2.773e-02
#> Locus_35 -4.632e-32 -1.047e-31 -8.290e-32 -6.887e-17 -1.174e-16  4.446e-17
#> ....                                                                      
#> 
#> 
#> Site scores (weighted sums of species scores)
#> 
#>           RDA1    RDA2    RDA3     PC1      PC2     PC3
#> ALT3    0.1226 -2.0313 -1.2154 -1.4589 -5.19649 -0.2248
#> BAR360 -0.1057  0.4007  1.8064  3.0848 -0.65319 -0.1428
#> BLL5    1.0500  0.2750 -1.8505 -0.3736  0.06138 -0.9738
#> BNT5   -3.0872 -1.0264 -1.1162 -0.3220  0.95783 -0.5125
#> BOF1    1.4722 -1.1017  0.2533 -0.2387  0.85134  0.9245
#> BOT3    0.8133  1.4762 -3.2847 -0.9511 -0.83835 -0.9639
#> ....                                                   
#> 
#> 
#> Site constraints (linear combinations of constraining variables)
#> 
#>           RDA1    RDA2    RDA3     PC1      PC2     PC3
#> ALT3   -1.8641 -1.5623 -0.2679 -1.4589 -5.19649 -0.2248
#> BAR360  0.1641 -0.9647 -0.5288  3.0848 -0.65319 -0.1428
#> BLL5    1.8430  0.6239 -2.8780 -0.3736  0.06138 -0.9738
#> BNT5   -2.2991 -1.2711 -0.6227 -0.3220  0.95783 -0.5125
#> BOF1    1.4722 -1.0680  0.1644 -0.2387  0.85134  0.9245
#> BOT3    0.9142  1.3548 -2.0063 -0.9511 -0.83835 -0.9639
#> ....                                                   
#> 
#> 
#> Biplot scores for constraining variables
#> 
#>             RDA1    RDA2    RDA3 PC1 PC2 PC3
#> CA_rPCA1 -0.2047  0.5350  0.8197   0   0   0
#> CA_rPCA2  0.5188  0.8393 -0.1625   0   0   0
#> CA_rPCA3  0.5561 -0.6174  0.5564   0   0   0

One of the most relevant statistics to present from an RDA model is the adjusted R2 value. The R2 value must be adjusted; if not, it is biased because it will always increase if independent variables are added (i.e., there is no penalization for adding independent variables that aren’t significantly affecting dependent variables within the model). The adjusted R2 value from the full (global) model can also help us determine the stopping point when we go on to do variable selection. As we can see, the unadjusted R2 is 0.098, while the adjusted value is 0.042. Be sure to always report adjusted R2 values.

RsquareAdj(mod_full)
#> $r.squared
#> [1] 0.0976769
#> 
#> $adj.r.squared
#> [1] 0.04243263

Run a simple RDA with variable selection

Now, let’s run an RDA model with variable selection by specifying the "best" model within rda_run(). Variable selection occurs using the ordiR2step() function in the vegan package, and is a forward selection method that begins with a null model and adds explanatory variables one at a time until genetic variance is maximized. To perform forward selection, we need to specify stopping criteria (i.e., when to stop adding more explanatory variables). There are two primary ways to specify stopping criteria: parameters involved in a permutation-based significance test (two parameters: the limits of permutation p-values for adding (Pin) or dropping (Pout) a term to the model, and the number of permutations used in the estimation of adjusted R2 (R2permutations)), and whether to use the adjusted R2 value as the stopping criterion (R2scope; only models with adjusted R2 values lower than specified are accepted).

mod_best <- rda_run(gen, env,
  model = "best",
  Pin = 0.05,
  R2permutations = 1000,
  R2scope = T
)
#> Step: R2.adj= 0 
#> Call: gen ~ 1 
#>  
#>                  R2.adjusted
#> <All variables> 0.0424326339
#> + CA_rPCA2      0.0196214014
#> + CA_rPCA3      0.0137427906
#> + CA_rPCA1      0.0009953656
#> <none>          0.0000000000
#> 
#>            Df    AIC      F Pr(>F)   
#> + CA_rPCA2  1 264.09 2.0407  0.002 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Step: R2.adj= 0.0196214 
#> Call: gen ~ CA_rPCA2 
#>  
#>                 R2.adjusted
#> <All variables>  0.04243263
#> + CA_rPCA3       0.03919420
#> + CA_rPCA1       0.01991617
#> <none>           0.01962140
#> 
#>            Df    AIC      F Pr(>F)   
#> + CA_rPCA3  1 263.97 2.0389  0.006 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Step: R2.adj= 0.0391942 
#> Call: gen ~ CA_rPCA2 + CA_rPCA3 
#>  
#>                 R2.adjusted
#> <All variables>  0.04243263
#> + CA_rPCA1       0.04243263
#> <none>           0.03919420
#> 
#>            Df    AIC      F Pr(>F)
#> + CA_rPCA1  1 264.72 1.1691  0.158

Let’s look at our best model, and see how it compares to our full model. There is also an additional element in the mod_best object, which are the results from an ANOVA-like permutation test. When we look at the call, we can now see that only environmental PCs 2 and 3 are considered in the model, meaning that PC1 is not considered to be significantly associated with genetic variance. Importantly, our adjusted R2 value is 0.039 (compared to 0.042 for the full model), which tells us that the model with only two of the environmental variables explains nearly as much as the RDA with all three.

mod_best$call
#> rda(formula = gen ~ CA_rPCA2 + CA_rPCA3, data = moddf)
mod_best$anova
#>                   R2.adj Df    AIC      F Pr(>F)   
#> + CA_rPCA2      0.019621  1 264.09 2.0407  0.002 **
#> + CA_rPCA3      0.039194  1 263.97 2.0389  0.006 **
#> <All variables> 0.042433                           
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RsquareAdj(mod_best)
#> $r.squared
#> [1] 0.07614827
#> 
#> $adj.r.squared
#> [1] 0.0391942

Following variable selection, we want to retain two environmental variables (CA_rPCA2 and CA_rPCA3) for all subsequent analyses.

Run a partial RDA with geography as a covariable

A partial RDA, as described at the beginning of the vignette, is when covariables (or conditioning variables) are incorporated into the RDA. Within rda_run(), to correct for geography, users can set correctGEO = TRUE, in which case sampling coordinates (coords) must also be provided. Let’s now run a partial RDA, without any variable selection (the full model) and view the results.

mod_pRDA_geo <- rda_run(gen, env, liz_coords,
  model = "full",
  correctGEO = TRUE,
  correctPC = FALSE
)

Let’s take a look at our partial RDA results. Note that now, the call includes conditioning variables x and y, which correspond to our latitude and longitude values. Conditioning variables are specified within the rda() function using the Condition() function. Finally, when we look at the summary of the resulting object, we can see that along with total, constrained, and unconstrained inertia (or variance) as with our simple RDA, there is an additional row for conditioned inertia because our partial RDA includes conditioning variables. Interestingly, the conditioned inertia is higher than constrained inertia, suggesting that geography is more associated with genetic variance than our environmental variables are.

anova(mod_pRDA_geo)
#> Permutation test for rda under reduced model
#> Permutation: free
#> Number of permutations: 999
#> 
#> Model: rda(formula = gen ~ CA_rPCA1 + CA_rPCA2 + CA_rPCA3 + Condition(x + y), data = moddf)
#>          Df Variance      F Pr(>F)    
#> Model     3   11.306 1.5948  0.001 ***
#> Residual 47  111.068                  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RsquareAdj(mod_pRDA_geo) # 0.0305
#> $r.squared
#> [1] 0.07884828
#> 
#> $adj.r.squared
#> [1] 0.03058243
head(summary(mod_pRDA_geo))
#> 
#> Call:
#> rda(formula = gen ~ CA_rPCA1 + CA_rPCA2 + CA_rPCA3 + Condition(x +      y), data = moddf) 
#> 
#> Partitioning of variance:
#>               Inertia Proportion
#> Total          143.39    1.00000
#> Conditioned     21.01    0.14656
#> Constrained     11.31    0.07885
#> Unconstrained  111.07    0.77459
#> 
#> Eigenvalues, and their contribution to the variance 
#> after removing the contribution of conditiniong variables
#> 
#> Importance of components:
#>                          RDA1    RDA2    RDA3      PC1     PC2     PC3     PC4
#> Eigenvalue            6.87372 2.80216 1.63009 11.00385 7.49907 5.16346 4.68171
#> Proportion Explained  0.05617 0.02290 0.01332  0.08992 0.06128 0.04219 0.03826
#> Cumulative Proportion 0.05617 0.07907 0.09239  0.18231 0.24359 0.28578 0.32404
#>                           PC5     PC6     PC7     PC8     PC9    PC10    PC11
#> Eigenvalue            4.21148 3.72312 3.45801 3.19495 3.01860 2.98188 2.93280
#> Proportion Explained  0.03441 0.03042 0.02826 0.02611 0.02467 0.02437 0.02397
#> Cumulative Proportion 0.35845 0.38888 0.41714 0.44324 0.46791 0.49228 0.51624
#>                          PC12    PC13   PC14    PC15    PC16    PC17    PC18
#> Eigenvalue            2.79673 2.59175 2.5826 2.49887 2.45425 2.32321 2.29777
#> Proportion Explained  0.02285 0.02118 0.0211 0.02042 0.02006 0.01898 0.01878
#> Cumulative Proportion 0.53910 0.56028 0.5814 0.60180 0.62186 0.64084 0.65962
#>                          PC19    PC20    PC21    PC22    PC23   PC24    PC25
#> Eigenvalue            2.15916 2.14664 2.03834 2.00771 1.91618 1.8479 1.80441
#> Proportion Explained  0.01764 0.01754 0.01666 0.01641 0.01566 0.0151 0.01475
#> Cumulative Proportion 0.67726 0.69480 0.71146 0.72787 0.74352 0.7586 0.77337
#>                         PC26    PC27    PC28    PC29    PC30    PC31    PC32
#> Eigenvalue            1.7740 1.68265 1.64600 1.58626 1.52630 1.47782 1.45805
#> Proportion Explained  0.0145 0.01375 0.01345 0.01296 0.01247 0.01208 0.01191
#> Cumulative Proportion 0.7879 0.80162 0.81507 0.82803 0.84050 0.85258 0.86449
#>                          PC33    PC34    PC35    PC36    PC37     PC38     PC39
#> Eigenvalue            1.42866 1.37519 1.30318 1.27465 1.24443 1.221578 1.169694
#> Proportion Explained  0.01167 0.01124 0.01065 0.01042 0.01017 0.009982 0.009558
#> Cumulative Proportion 0.87617 0.88741 0.89805 0.90847 0.91864 0.928622 0.938181
#>                           PC40     PC41     PC42     PC43     PC44     PC45
#> Eigenvalue            1.073702 1.068009 1.045150 1.006285 0.935246 0.866846
#> Proportion Explained  0.008774 0.008727 0.008541 0.008223 0.007643 0.007084
#> Cumulative Proportion 0.946955 0.955682 0.964223 0.972446 0.980088 0.987172
#>                           PC46     PC47
#> Eigenvalue            0.841414 0.728431
#> Proportion Explained  0.006876 0.005952
#> Cumulative Proportion 0.994048 1.000000
#> 
#> Accumulated constrained eigenvalues
#> Importance of components:
#>                        RDA1   RDA2   RDA3
#> Eigenvalue            6.874 2.8022 1.6301
#> Proportion Explained  0.608 0.2478 0.1442
#> Cumulative Proportion 0.608 0.8558 1.0000
#> 
#> Scaling 2 for species and site scores
#> * Species are scaled proportional to eigenvalues
#> * Sites are unscaled: weighted dispersion equal on all dimensions
#> * General scaling constant of scores:  9.29244 
#> 
#> 
#> Species scores
#> 
#>                RDA1       RDA2       RDA3        PC1        PC2        PC3
#> Locus_10  3.455e-02  8.516e-03  2.220e-02  2.631e-02  3.045e-02 -5.347e-03
#> Locus_15  3.361e-02 -8.041e-02 -7.979e-02  5.156e-02  3.933e-02 -1.483e-01
#> Locus_22  4.500e-02  3.760e-03  1.079e-02 -8.323e-02 -1.622e-02 -5.073e-02
#> Locus_28 -6.112e-02 -6.035e-02 -6.888e-02  3.955e-02  3.835e-02  1.034e-02
#> Locus_32  1.352e-02  1.105e-02  2.971e-02  1.883e-02  2.562e-02  1.820e-02
#> Locus_35 -5.728e-32 -2.203e-31 -4.550e-32  5.385e-17 -1.621e-16  6.186e-17
#> ....                                                                      
#> 
#> 
#> Site scores (weighted sums of species scores)
#> 
#>           RDA1    RDA2    RDA3     PC1     PC2      PC3
#> ALT3   -0.6074  1.5856  0.3496 -3.5502  0.5751  1.36394
#> BAR360  1.9028 -0.7660 -0.6439 -2.1245 -0.2723  0.32928
#> BLL5    0.6935  1.9687 -2.1865 -0.8111 -1.6626  0.32307
#> BNT5   -2.6158 -0.2921 -1.0298  0.5356 -0.7971  0.92231
#> BOF1    1.5909 -0.2182  0.1445  0.9682  0.7195  0.00126
#> BOT3   -1.1366 -1.4518 -1.2000 -0.1237 -0.2652 -1.77037
#> ....                                                   
#> 
#> 
#> Site constraints (linear combinations of constraining variables)
#> 
#>           RDA1     RDA2     RDA3     PC1     PC2      PC3
#> ALT3   -1.4796  1.42372 -0.56620 -3.5502  0.5751  1.36394
#> BAR360  0.9953 -1.10207 -0.72739 -2.1245 -0.2723  0.32928
#> BLL5    0.7741  1.44224 -2.79896 -0.8111 -1.6626  0.32307
#> BNT5   -2.0626 -0.81597 -0.56826  0.5356 -0.7971  0.92231
#> BOF1    1.8148 -0.09947 -0.03338  0.9682  0.7195  0.00126
#> BOT3   -0.4265 -0.95308 -1.45302 -0.1237 -0.2652 -1.77037
#> ....                                                     
#> 
#> 
#> Biplot scores for constraining variables
#> 
#>             RDA1    RDA2     RDA3 PC1 PC2 PC3
#> CA_rPCA1 -0.2345  0.4009  0.82275   0   0   0
#> CA_rPCA2  0.2021  0.3349 -0.08967   0   0   0
#> CA_rPCA3  0.6824 -0.2889  0.51661   0   0   0

Run a partial RDA with population genetic structure and geography as covariables

Lastly, let’s run a partial RDA with both geography (correctGEO = TRUE) and population structure (correctPC = TRUE) as conditioning variables. Within algatr, population structure is quantified using a PCA, and a user can specify the number of axes to retain in the RDA model. In this case, we’ll retain three axes, which we can specify using the nPC argument. If this argument is set to "manual", a screeplot will be printed and a user can manually enter the number of PCs that they feel best describe the data.

mod_pRDA_gs <- rda_run(gen, env, liz_coords,
  model = "full",
  correctGEO = TRUE,
  correctPC = TRUE,
  nPC = 3
)

Now, we see that even more inertia/variance is explained by the conditioning variables.

head(summary(mod_pRDA_gs))
#> 
#> Call:
#> rda(formula = gen ~ CA_rPCA1 + CA_rPCA2 + CA_rPCA3 + Condition(PC1 +      PC2 + PC3 + x + y), data = moddf) 
#> 
#> Partitioning of variance:
#>               Inertia Proportion
#> Total         143.389    1.00000
#> Conditioned    48.063    0.33519
#> Constrained     6.672    0.04653
#> Unconstrained  88.654    0.61828
#> 
#> Eigenvalues, and their contribution to the variance 
#> after removing the contribution of conditiniong variables
#> 
#> Importance of components:
#>                          RDA1    RDA2    RDA3     PC1     PC2     PC3     PC4
#> Eigenvalue            2.72234 2.30006 1.64942 5.15035 4.44551 3.82383 3.55107
#> Proportion Explained  0.02856 0.02413 0.01730 0.05403 0.04663 0.04011 0.03725
#> Cumulative Proportion 0.02856 0.05269 0.06999 0.12402 0.17065 0.21077 0.24802
#>                           PC5     PC6     PC7     PC8     PC9    PC10    PC11
#> Eigenvalue            3.24764 3.02810 2.99645 2.95013 2.79913 2.65085 2.59129
#> Proportion Explained  0.03407 0.03177 0.03143 0.03095 0.02936 0.02781 0.02718
#> Cumulative Proportion 0.28209 0.31385 0.34529 0.37623 0.40560 0.43341 0.46059
#>                          PC12    PC13    PC14    PC15    PC16    PC17    PC18
#> Eigenvalue            2.52112 2.46520 2.34331 2.30029 2.16258 2.15304 2.05042
#> Proportion Explained  0.02645 0.02586 0.02458 0.02413 0.02269 0.02259 0.02151
#> Cumulative Proportion 0.48704 0.51290 0.53748 0.56161 0.58430 0.60688 0.62839
#>                          PC19    PC20    PC21   PC22    PC23    PC24    PC25
#> Eigenvalue            2.01366 1.92134 1.85435 1.8112 1.79300 1.68266 1.65184
#> Proportion Explained  0.02112 0.02016 0.01945 0.0190 0.01881 0.01765 0.01733
#> Cumulative Proportion 0.64952 0.66967 0.68912 0.7081 0.72693 0.74458 0.76191
#>                          PC26    PC27    PC28    PC29    PC30    PC31    PC32
#> Eigenvalue            1.58977 1.52980 1.47897 1.46048 1.44680 1.38037 1.30440
#> Proportion Explained  0.01668 0.01605 0.01551 0.01532 0.01518 0.01448 0.01368
#> Cumulative Proportion 0.77859 0.79464 0.81015 0.82547 0.84065 0.85513 0.86882
#>                          PC33    PC34    PC35    PC36    PC37    PC38    PC39
#> Eigenvalue            1.27639 1.24668 1.22365 1.17070 1.07515 1.06819 1.05602
#> Proportion Explained  0.01339 0.01308 0.01284 0.01228 0.01128 0.01121 0.01108
#> Cumulative Proportion 0.88221 0.89528 0.90812 0.92040 0.93168 0.94289 0.95396
#>                          PC40     PC41    PC42     PC43     PC44
#> Eigenvalue            1.00764 0.940164 0.86839 0.842461 0.729857
#> Proportion Explained  0.01057 0.009863 0.00911 0.008838 0.007656
#> Cumulative Proportion 0.96453 0.974396 0.98351 0.992344 1.000000
#> 
#> Accumulated constrained eigenvalues
#> Importance of components:
#>                        RDA1   RDA2   RDA3
#> Eigenvalue            2.722 2.3001 1.6494
#> Proportion Explained  0.408 0.3447 0.2472
#> Cumulative Proportion 0.408 0.7528 1.0000
#> 
#> Scaling 2 for species and site scores
#> * Species are scaled proportional to eigenvalues
#> * Sites are unscaled: weighted dispersion equal on all dimensions
#> * General scaling constant of scores:  9.29244 
#> 
#> 
#> Species scores
#> 
#>                RDA1       RDA2       RDA3        PC1        PC2        PC3
#> Locus_10 -1.265e-02 -1.062e-02  2.431e-02  9.693e-03 -2.614e-03 -1.546e-03
#> Locus_15  7.156e-02 -4.414e-02 -7.566e-02  1.528e-01 -1.710e-02  9.218e-02
#> Locus_22  6.219e-03  1.821e-04  1.421e-02  3.812e-02 -1.478e-02 -2.112e-02
#> Locus_28  7.276e-02  6.051e-02 -7.430e-02 -2.306e-02 -2.088e-02  9.904e-03
#> Locus_32 -1.466e-02  1.498e-03  2.902e-02 -1.417e-02 -3.597e-03  4.626e-02
#> Locus_35  5.557e-32 -1.811e-32 -4.247e-32  6.817e-17 -7.435e-17  1.546e-18
#> ....                                                                      
#> 
#> 
#> Site scores (weighted sums of species scores)
#> 
#>            RDA1     RDA2    RDA3      PC1      PC2      PC3
#> ALT3   -3.04653  0.18212 -1.0379 -0.18125  1.19052 -0.81292
#> BAR360  1.30674  0.16343 -0.8067 -0.88740 -0.67411  0.08499
#> BLL5   -0.49562  0.28265 -1.8134 -1.24385  0.01825  0.64437
#> BNT5    1.40006  2.26681 -0.9036 -1.60174  0.47698  0.20989
#> BOF1    0.02238 -0.58659  0.3529  0.09616  0.13024 -0.17068
#> BOT3    1.17354 -0.09847 -1.3900  1.99555 -0.56848  0.21693
#> ....                                                       
#> 
#> 
#> Site constraints (linear combinations of constraining variables)
#> 
#>            RDA1     RDA2    RDA3      PC1      PC2      PC3
#> ALT3   -2.30233  0.17848 -1.2731 -0.18125  1.19052 -0.81292
#> BAR360  1.37941 -0.06513 -0.6461 -0.88740 -0.67411  0.08499
#> BLL5   -0.43214  0.25009 -2.2783 -1.24385  0.01825  0.64437
#> BNT5    1.44455  1.47169 -0.5322 -1.60174  0.47698  0.20989
#> BOF1   -0.06742 -1.15996  0.1743  0.09616  0.13024 -0.17068
#> BOT3    0.55754 -0.97519 -1.5380  1.99555 -0.56848  0.21693
#> ....                                                       
#> 
#> 
#> Biplot scores for constraining variables
#> 
#>             RDA1     RDA2     RDA3 PC1 PC2 PC3
#> CA_rPCA1 -0.3709  0.31874  0.78617   0   0   0
#> CA_rPCA2 -0.3373 -0.09457 -0.06287   0   0   0
#> CA_rPCA3  0.2301 -0.44499  0.58235   0   0   0

Variance partitioning with partial RDA


In many cases, we want to understand the independent contributions of each explanatory variable in combination with our covariables. Also, we can understand how much confounded variance there is (i.e., variation that is explained by the combination of multiple explanatory variables). The best way to do this is to run partial RDAs on each set of explanatory variables and examine the inertia of each pRDA result as it compares to the full model. In this case, the full model is an RDA that considers neutral processes as explanatory variables rather than conditioning variables (covariables).

The call for this full model looks like this: rda(gen ~ env + covariable, data), rather than a true partial RDA which would take the form: rda(gen ~ env + Condition(covariable), data).

In algatr, we can run variance partitioning using the function rda_varpart(), which will run the full model (as above) and then will run successive pRDAs which are conditioned on three successive sets of variables. These variable sets are significant environmental variables (in our case, environmental PCs 2 and 3), population genetic structure (once again determined using a PC with a user-specified number of PC axes), and geography (sampling coordinates). This function will then provide us with information on how variance is partitioned in the form of a table containing relevant statistics; this table is nearly identical to Table 2 from Capblancq & Forester 2021. The rda_varpart() parameters are identical to those from rda_run(). As with rda_run(), we can provide a number for the number of PCs to consider for population structure (nPC), or this argument can be set to "manual" if a user would like to manually enter a number into the console after viewing the screeplot. We can then use rda_varpart_table() to visualize these results in an aesthetically pleasing way; users can specify whether to display the function call in the table (call_col; defaults to FALSE) and the number of digits to round to digits (the default is 2).

varpart <- rda_varpart(gen, env, liz_coords,
  Pin = 0.05, R2permutations = 1000,
  R2scope = T, nPC = 3
)
#> Step: R2.adj= 0 
#> Call: gen ~ 1 
#>  
#>                  R2.adjusted
#> <All variables> 0.0424326339
#> + CA_rPCA2      0.0196214014
#> + CA_rPCA3      0.0137427906
#> + CA_rPCA1      0.0009953656
#> <none>          0.0000000000
#> 
#>            Df    AIC      F Pr(>F)   
#> + CA_rPCA2  1 264.09 2.0407  0.004 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Step: R2.adj= 0.0196214 
#> Call: gen ~ CA_rPCA2 
#>  
#>                 R2.adjusted
#> <All variables>  0.04243263
#> + CA_rPCA3       0.03919420
#> + CA_rPCA1       0.01991617
#> <none>           0.01962140
#> 
#>            Df    AIC      F Pr(>F)   
#> + CA_rPCA3  1 263.97 2.0389  0.006 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Step: R2.adj= 0.0391942 
#> Call: gen ~ CA_rPCA2 + CA_rPCA3 
#>  
#>                 R2.adjusted
#> <All variables>  0.04243263
#> + CA_rPCA1       0.04243263
#> <none>           0.03919420
#> 
#>            Df    AIC      F Pr(>F)
#> + CA_rPCA1  1 264.72 1.1691   0.19

rda_varpart_table(varpart, call_col = TRUE)
Variance partitioning
Model pRDA model call Inertia R2 Adjusted R2 p (>F) Prop. of explainable variance Prop. of total variance
Full model gen ~ CA_rPCA2 + CA_rPCA3 + PC1 + PC2 + PC3 + x + y 52.67 0.37 0.27 0.00 1.00 0.37
Pure enviro. model gen ~ CA_rPCA2 + CA_rPCA3 + Condition(PC1 + PC2 + PC3 + x + y) 4.61 0.03 0.00 0.06 0.09 0.03
Pure pop. structure model gen ~ PC1 + PC2 + PC3 + Condition(CA_rPCA2 + CA_rPCA3 + x + y) 23.09 0.16 0.13 0.00 0.44 0.16
Pure geography model gen ~ x + y + Condition(PC1 + PC2 + PC3 + CA_rPCA2 + CA_rPCA3) 5.32 0.04 0.01 0.00 0.10 0.04
Confounded variance
19.66


0.37 0.14
Total unexplained variance
90.72



0.63
Total inertia
143.39



1.00

From the above table, we can see that the full model only explains 27% of the total variance in our genetic data. Furthermore, our pure environmental model explains only a small proportion of the variance (“inertia”) contained within the full model (9%), and that population structure explains substantially more (44%). There is also a large amount of variance (37%) within the full model that is confounded, meaning that variables within the three variable sets explain a substantial amount of the full model’s variance when in combination with one another. This is a relatively common occurrence (see Capblancq & Forester 2021), and users should always keep in mind how collinearity may affect their results (and account for this, if possible; see algatr’s environmental data vignette for more information on how to do so). However, accounting for collinearity may also elevate false positive rates, while failing to account for it at all can lead to elevated false negative rates.

In our case, it is clear from our variance partitioning analysis that much of the variation in our genetic data can explained by population structure, and thus users may want to further explore more sophisticated approaches for quantifying demographic history or population structure before moving forward with outlier loci detection. As with most analyses, careful consideration of the input data and the biology of the system are critical to draw robust and reliable conclusions.

Identifying candidate SNPs with rda_getoutliers()


Having run variance partitioning on our data, we can now move forward with detecting outlier loci that are significantly associated with environmental variables. To do so, we’ll first run a partial RDA on only those significant environmental variables, and we’ll also want to add population structure as a covariable (i.e., a conditioning variable). We’ll only retain two PC axes for our measurement of population structure.

mod_pRDA <- rda_run(gen, env, model = "best", correctPC = TRUE, nPC = 2)

#> Step: R2.adj= 0 
#> Call: gen ~ 1 
#>  
#>                         R2.adjusted
#> <All variables>        0.0202527037
#> + CA_rPCA2             0.0196214014
#> + CA_rPCA3             0.0137427906
#> + CA_rPCA1             0.0009953656
#> <none>                 0.0000000000
#> + Condition(PC1 + PC2) 0.0000000000
#> 
#>            Df    AIC      F Pr(>F)   
#> + CA_rPCA2  1 264.09 2.0407  0.004 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Step: R2.adj= 0.0196214 
#> Call: gen ~ CA_rPCA2 
#>  
#>                        R2.adjusted
#> + CA_rPCA3              0.03919420
#> <All variables>         0.02025270
#> + CA_rPCA1              0.01991617
#> <none>                  0.01962140
#> + Condition(PC1 + PC2)  0.01834925

Given this partial RDA, only environmental PC2 is considered significant.

mod_pRDA$anova
#>                   R2.adj Df    AIC      F Pr(>F)   
#> + CA_rPCA2      0.019621  1 264.09 2.0407  0.004 **
#> <All variables> 0.020253                           
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now that we have the RDA model, we can discover SNPs that are outliers. In algatr, this can be done using two different methods to detect outlier loci, specified by the outlier_method argument. The first uses Z-scores for outlier detection, and is from Forester et al. 2018; their code can be found here. The other method transforms RDA loadings into p-values, and was developed by Capblancq et al. 2018 and is further described in Capblancq & Forester 2021; code developed to run this is associated with Capblancq and Forester 2021 and can be found here. We provide brief explanations for each of these methods below.

N.B.: Some researchers may prefer to do MAF filtering and linkage disequilibrium pruning during outlier detection (e.g., Capblancq & Forester 2021). In this case, only one SNP per contig is selected based on the highest loading. The algatr pipeline does this filtering initially when processing genetic data; see data processing vignette for more information.

Identifying outliers using the Z-scores method

To understand how the Z-scores method works to detect outliers, let’s first take a look at the loadings of SNPs on the first two RDA axes using the rda_plot() function. This will show us histograms for each of the selected RDA axes. SNPs that are not associated with environmental variables will fall in the center of these distributions, with outliers falling at the edges. The Z-score method works by setting a Z-score (number of standard deviations) threshold beyond which SNPs will be considered outliers (see here for further details).

If we set this value too stringently, we minimize our false positive rate but risk detecting false negatives; if it’s set too high, we increase our false positive rate. Researchers will want to adjust this parameter based on their tolerance for detecting false positives, but also the strength of selection under which outliers are subject to (assuming that SNPs that are more significantly associated with environmental variables are also under stronger selection).

Because only one environmental variable was associated with our genetic data, we only have one resulting RDA axis.

rda_plot(mod_pRDA, axes = "all", binwidth = 20)

Now, let’s discover outliers with the rda_getoutliers() function. The method is specified using the outlier_method argument, and the number of standard deviations to detect outliers is specified by the z argument; it defaults to 3. The resulting object is a dataframe containing the SNP name, the RDA axis it’s associated with, and the loading along this axis. A screeplot for the RDA axes is automatically generated (but can be switched off using the plot argument).

rda_sig_z <- rda_getoutliers(mod_pRDA, naxes = "all", outlier_method = "z", z = 3, plot = FALSE)

# How many outlier SNPs were detected?
length(rda_sig_z$rda_snps)
#> [1] 19

Identifying outliers using the p-value method

Another method to detect outliers is a p-value method wherein rather than using the loadings along RDA axes to identify extreme outliers (as with the Z-scores method above), the loadings are transformed into p-values. To do so, Mahalanobis distances are calculated for each locus, scaled based on a genomic inflation factor, from which p-values and q-values can be computed. We then adjust our p-values (based on a false discovery rate, for example; this adjustment is done using the p.adjust() function) and subsequently identify outliers based on a user-specified significance threshold. Rather than detecting outliers based on a p-value significance threshold, one could also detect outliers based on q-values (which are also provided as output in algatr), as is done in Capblancq et al. 2018 and Capblancq & Forester 2021. To do so, we would select loci that fall below a user-specified threshold based on the proportion of false positives we allow. For example, if we allow 10% or fewer outliers to be false positives, outliers could be identified based on loci that have q-values =< 0.1.

Now, let’s detect outliers using this method. To do so, along with outlier_method, we need to specify p_adj, which is the method by which p-values are adjusted, as well as sig, which is our significance threshold. We will also tell the function not to produce a screeplot.

One limitation of the p-value outlier method is that you cannot calculate outliers given fewer than two RDA axes (as is the case with our partial RDA results). So, let’s move forward with this outlier detection method but using the results from our simple RDA model with variable selection, mod_best.

rda_sig_p <- rda_getoutliers(mod_best, naxes = "all", outlier_method = "p", p_adj = "fdr", sig = 0.01, plot = FALSE)

# How many outlier SNPs were detected?
length(rda_sig_p$rda_snps)
#> [1] 101

As mentioned above, we could also identify outliers based on q-values. To do so, we can extract the q-values from the resulting object from rda_getoutliers().

# Identify outliers that have q-values < 0.1
q_sig <-
  rda_sig_p$rdadapt %>%
  # Make SNP names column from row names
  mutate(snp_names = row.names(.)) %>%
  filter(q.values <= 0.1)

# How many outlier SNPs were detected?
nrow(q_sig)
#> [1] 170

Note the differences between the two methods: using the p-value method, we found many more outlier SNPs than with the Z-scores method (101 vs. 19). Even more outliers were found when only using q-values as a significant threshold (170). We can also see that 19 SNPs were detected using all three methods.

Reduce(intersect, list(
  q_sig$snp_names,
  rda_sig_p$rda_snps,
  rda_sig_z$rda_snps
))
#>  [1] "Locus_517"  "Locus_771"  "Locus_786"  "Locus_945"  "Locus_1247"
#>  [6] "Locus_1263" "Locus_1278" "Locus_1318" "Locus_1374" "Locus_1439"
#> [11] "Locus_1524" "Locus_1801" "Locus_1949" "Locus_2045" "Locus_2338"
#> [16] "Locus_2431" "Locus_2559" "Locus_2665" "Locus_2947"

Visualizing RDA results with rda_plot()


algatr provides options for visualizing RDA results in two ways with the rda_plot() function: a biplot and a Manhattan plot.

RDA biplot

Let’s build a biplot (or some call it a triplot, since it also shows loadings of environmental variables) with RDA axes 1 and 2, specified using the rdaplot argument. In this case, we also need to specify the number of axes to display using the biplot_axes argument. The default of rda_plot() is to plot all variables, regardless of whether they were considered significant or not.

rda_plot(mod_best, rda_sig_p$rda_snps, biplot_axes = c(1, 2), rdaplot = TRUE, manhattan = FALSE)


# In the case of our partial RDA, there was only one RDA axis, so a histogram is generated
rda_plot(mod_pRDA, rda_sig_z$rda_snps, rdaplot = TRUE, manhattan = FALSE, binwidth = 0.01)

Manhattan plot

Let’s now build a Manhattan plot with a significance level of 0.05 using the manhattan argument; if this is set to TRUE, we must also provide a significance threshold value using the sig argument (the default is 0.05). Be aware that outliers were detected with a different p-value threshold which is why there are grey points above our threshold line (i.e., data points are colorized according to the threshold of the model, not the visualization).

rda_plot(mod_best, rda_sig_p$rda_snps, rda_sig_p$pvalues, rdaplot = FALSE, manhattan = TRUE)

Interpreting RDA results


As discussed in Capblancq & Forester 2021, users should be aware that “statistical selection of variables optimizes variance explained but cannot, by itself, identify the ecological or mechanistic drivers of genetic variation.”

Identifying environmental associations with rda_cor() and rda_table()

Now that we’ve identified outlier loci, it is useful to be able to identify how much they correlate to each environmental variable. To do so, we can run a correlation test of each SNP using rda_cor() and summarize results in a customizable table using rda_table().

# Extract genotypes for outlier SNPs
rda_snps <- rda_sig_p$rda_snps
rda_gen <- gen[, rda_snps]

# Run correlation test
cor_df <- rda_cor(rda_gen, env)

# Make a table from these results (displaying only the first 5 rows):
rda_table(cor_df, nrow = 5)
r p snp var
0.27 0.02 Locus_125 CA_rPCA3
0.39 0.00 Locus_166 CA_rPCA3
0.29 0.01 Locus_249 CA_rPCA2
-0.24 0.03 Locus_263 CA_rPCA1
0.27 0.02 Locus_263 CA_rPCA3

# Order by the strength of the correlation
rda_table(cor_df, order = TRUE, nrow = 5)
r p snp var
-0.49 0 Locus_2947 CA_rPCA2
-0.43 0 Locus_1278 CA_rPCA2
-0.42 0 Locus_2338 CA_rPCA3
-0.42 0 Locus_786 CA_rPCA2
-0.42 0 Locus_1318 CA_rPCA2

# Only retain the top variable for each SNP based on the strength of the correlation
rda_table(cor_df, top = TRUE, nrow = 5)
r p snp var
0.27 0.02 Locus_125 CA_rPCA3
0.39 0.00 Locus_166 CA_rPCA3
0.29 0.01 Locus_249 CA_rPCA2
0.27 0.02 Locus_263 CA_rPCA3
0.28 0.01 Locus_312 CA_rPCA2

# Display results for only one environmental variable
rda_table(cor_df, var = "CA_rPCA2", nrow = 5)
r p snp var
0.29 0.01 Locus_249 CA_rPCA2
0.28 0.01 Locus_312 CA_rPCA2
0.27 0.02 Locus_404 CA_rPCA2
0.34 0.00 Locus_517 CA_rPCA2
0.23 0.04 Locus_612 CA_rPCA2

Running RDA with rda_do_everything()


The algatr package also has an option to run all of the above functionality in a single function, rda_do_everything(). This function is similar in structure to rda_run(), but also performs imputation (impute option), can generate figures, the correlation test table (cortest option; default is TRUE), and variance partitioning (varpart option; default is FALSE). Please be aware that the do_everything() functions are meant to be exploratory. We do not recommend their use for final analyses unless certain they are properly parameterized.

The resulting object from rda_do_everything() contains:

  • Candidate SNPs: rda_snps

  • Data frame of environmental associations with outlier SNPs: cor_df

  • RDA model specifics: rda_mod

  • Outlier test results: rda_outlier_test

  • Relevant R-squared values: rsq

  • ANOVA results: anova

  • List of all p-values: pvalues

  • Variance partitioning results: varpart

Let’s first run a simple RDA, imputing to the median, with variable selection and the p-value outlier method:

results <- rda_do_everything(liz_vcf, CA_env, liz_coords,
                             impute = "simple",
                             correctGEO = FALSE,
                             correctPC = FALSE,
                             outlier_method = "p",
                             sig = 0.05,
                             p_adj = "fdr",
                             cortest = TRUE,
                             varpart = FALSE
)
r p snp var
-0.49 0 Locus_2947 CA_rPCA2
-0.43 0 Locus_1278 CA_rPCA2
-0.42 0 Locus_2338 CA_rPCA3
-0.42 0 Locus_786 CA_rPCA2
-0.42 0 Locus_1318 CA_rPCA2
-0.42 0 Locus_771 CA_rPCA2
-0.41 0 Locus_1374 CA_rPCA2
0.40 0 Locus_636 CA_rPCA3
0.39 0 Locus_1263 CA_rPCA2
0.39 0 Locus_166 CA_rPCA3

Now, let’s run a partial RDA, imputing based on 3 sNMF clusters, correcting for both geography and structure, and instead using the Z-score method to detect outlier loci, including variance partitioning. We’ll keep the number of PCs for population structure set to the default, which is three axes (nPC):

results <- rda_do_everything(liz_vcf, CA_env, liz_coords,
                             impute = "structure",
                             K_impute = 3,
                             model = "full",
                             correctGEO = TRUE,
                             correctPC = TRUE,
                             nPC = 3,
                             varpart = TRUE,
                             outlier_method = "z",
                             z = 3
)
Variance partitioning
Model Inertia R2 Adjusted R2 p (>F) Prop. of explainable variance Prop. of total variance
Full model 70.50 0.44 0.35 0.00 1.00 0.44
Pure enviro. model 4.80 0.03 0.01 0.02 0.07 0.03
Pure pop. structure model 29.48 0.18 0.16 0.00 0.42 0.18
Pure geography model 5.43 0.03 0.01 0.00 0.08 0.03
Confounded variance 30.80


0.44 0.19
Total unexplained variance 89.95



0.56
Total inertia 160.45



1.00
r p snp var
-0.35 0.00 Locus_2767 CA_rPCA1
0.29 0.01 Locus_2053 CA_rPCA1
-0.29 0.01 Locus_1005 CA_rPCA1
-0.27 0.02 Locus_318 CA_rPCA2
-0.25 0.03 Locus_2321 CA_rPCA1

Additional documentation and citations


Citation/URL Details
Associated literature Capblancq & Forester 2021; Github repository for code Paper describing RDA methodology; also provides walkthrough and description of rdadapt() function
Associated literature Capblancq et al. 2018 Description of p-value method of detecting outlier loci
R package documentation Oksanen et al. 2022. algatr uses the rda() and ordiR2step() functions in the vegan package
Associated literature Forester et al. 2018 Review of genotype-environment association methods
Associated tutorial Forester et al. Walkthrough of performing an RDA with the Z-scores outlier detection method
Associated vignette vignette("intro-vegan") Introduction to ordination using the vegan package (including using the rda() function
Associated vignette vignette("partitioning") Introduction to variance partitioning in vegan