Andrew Kerr, Kyle Bistrain, Rachel Roggenkemper, Liam Quach
Introduction and Data Preparation
Here, we are loading the data and preparing it for analysis. This includes cleaning the column names, converting data types, and creating new variables that will be useful for modeling.
Data Read-in
This section reads in the data from the master Excel file. We’ll be working with the combined dataset for the analyses.
Code
# Read in the data from the Excel filedf_combined <-read_xlsx(here::here("data", "Master.xlsx"), sheet ="combined")# Convert relevant columns to factors for modelingdf_combined <- df_combined %>%mutate(sample_id =as.factor(sample_id),treatment =as.factor(treatment),site =as.factor(site),block =as.factor(block),study_year =as.factor(study_year) )# Define the response variables we'll be analyzingresponse_vars <-c("nh4", "no3", "mg", "p", "ec", "p_h")
Handling Values Below the Limit of Detection (LOD)
A common challenge in environmental data is dealing with measurements that are below the limit of detection (LOD). These are recorded as “LOD” in the dataset. We will explore two common approaches to handle these values:
Treating LOD as Censored Data: This is a more statistically robust approach. It acknowledges that the true value is somewhere between 0 and the LOD, rather than assuming it is exactly 0.
Replacing LOD with 0: This is a simple approach, but it can bias the results, especially if there are many LOD values.
Here, we create two datasets to reflect these two approaches.
Code
# Create a dataset where LOD is replaced with 0df_combined_LOD_0 <- df_combined %>%mutate(across(all_of(response_vars),~as.numeric(ifelse(.x =="LOD", 0, .x)) ),date =as.factor(date),# Create combined grouping factorssite_block_id =interaction(site, block, drop =TRUE),trt_site_block_id =interaction(treatment, site, block, drop =TRUE) )# Create a dataset for censored models# A 'censored' column is added for each response variable.# The actual response variable is set to a placeholder (e.g., the detection limit) when censored.lc_model_data <- df_combined %>%mutate(nh4_cens =ifelse(nh4 =="LOD", -1, 0),no3_cens =ifelse(no3 =="LOD", -1, 0),mg_cens =ifelse(mg =="LOD", -1, 0),p_cens =ifelse(p =="LOD", -1, 0),nh4 =ifelse(nh4 =="LOD", nh4_lod, as.numeric(nh4)),no3 =ifelse(no3 =="LOD", n03_lod, as.numeric(no3)),mg =ifelse(mg =="LOD", mg_lod, as.numeric(mg)),p =ifelse(p =="LOD", p_lod, as.numeric(p)),date =as.factor(date),# Create combined grouping factorssite_block_id =interaction(site, block, drop =TRUE),trt_site_block_id =interaction(treatment, site, block, drop =TRUE) )
Depending on which method we use, the models answer slightly different research questions.
Treating LOD as 0
Assumption: You are assuming that any measurement below the Limit of Detection (LOD) is a true zero. However, since your data does not have any true zeros this allows us to use a zero-inflated model to model (1) probability below LOD and (2) mean if above LOD.
Research Question(s) it Answers: This approach splits your research question into two parts:
Presence/Absence: Does the treatment affect the probability that the response is LOD?
Concentration when Present: If the response is present (i.e., not LOD), how does the treatment affect its concentration?
Downside: In most environmental systems, a value below the LOD doesn’t mean the response is completely absent, just that its concentration is too low for your equipment to measure accurately. By substituting zero, you are artificially deflating the mean and variance of your data, which can bias your results and lead you to underestimate the true average concentration.
Treating LOD as Left-Censored
Assumption: You assume that a value below the LOD is not a known value, only that its true concentration is somewhere between 0 and the detection limit. The model doesn’t force it to be zero; it incorporates this uncertainty into the analysis.
Research Question(s) it Answers: This approach answers a single, more direct question: How does the treatment affect the true underlying concentration of the response, while properly accounting for the limitations and uncertainty of the measurement instrument?
Advantage: It doesn’t “invent” data by substituting zero. Instead, it uses all the information you have (i.e., “we know it’s less than X”) to provide a less biased and more accurate estimate of the treatment effects on the actual response concentrations.
Exploratory Data Analysis
Before we build our models, it’s important to explore the data to understand its structure and identify any patterns or issues.
Proportion of Zero/LOD Values
Here, we’ll look at the proportion of zeros (after converting LOD to 0) for each response variable. This will help us understand the extent of the “zero problem”.
Proportions in each Response
Response
LODs
Zeros
nh4
0.8220
0
no3
0.7165
0
p
0.2971
0
mg
0.0032
0
ec
0.0000
0
p_h
0.0000
0
As we can see, nh4 and no3 have a very high proportion of zero values (over 70%). p also has a substantial number of zeros. ec and p_h have no zeros. This confirms that standard linear models assuming normality will be inappropriate for the first three, but may be suitable for ec and p_h.
Visualizing the Data
Boxplots show the distribution of the non-zero data to reveal potential differences between treatments.
Code
# Function to create boxplots for each response variableplot_concentration <-function(r, data) { data %>%filter(.data[[r]] >0) %>%# We only plot non-zero values for clarityggplot(aes(x = treatment, y = .data[[r]], fill = treatment)) +geom_boxplot(alpha =0.7) +geom_point(alpha =0.3) +facet_wrap(vars(study_year), scales ="free") +labs(title =paste(tools::toTitleCase(r), "Concentration (Non-Zero) by Treatment"),x ="Treatment",y ="Value",fill ="Treatment" ) +theme_bw()}# Create and display the plotsplots_0 <-map(response_vars, plot_concentration, data = df_combined_LOD_0)wrap_plots(plots_0, ncol =2, guides ="collect")
Modeling Approach 1: Censored Models for Data with LODs (brms)
This is a statistically robust approach for nh4, no3, mg, and p. It accounts for the fact that LOD values are not true zeros but are values below a certain threshold. We use the brms package for Bayesian modeling.
Note: Bayesian models are computationally intensive and can take a long time to run. For this report, we are loading pre-fitted model objects.
Warning: Censored responses are not shown in 'pp_check'.
Warning: Removed 3787 rows containing non-finite outside the scale range
(`stat_density()`).
Warning: Removed 41 rows containing non-finite outside the scale range
(`stat_density()`).
The cens(no3_cens) term tells brms to treat the data as left-censored. The model summary shows the estimated effects. The pp_check plot shows the model fits the overall distribution reasonably well, though the many warnings about divergent transitions indicate potential convergence problems. The results should be interpreted with caution.
Warning: Censored responses are not shown in 'pp_check'.
Warning: Removed 728 rows containing non-finite outside the scale range
(`stat_density()`).
Similar to the nitrate model, the nh4 model shows many divergent transitions, which is a red flag for model reliability. The posterior predictive check (pp_check) suggests the model struggles to capture the shape of the data perfectly, which is common with highly zero-inflated datasets.
Warning: There were 149 divergent transitions after warmup. Increasing
adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Censored responses are not shown in 'pp_check'.
The mg model appears to have converged much better than the nh4 and no3 models, with few warnings. The pp_check shows a good fit between the model’s predictions and the observed data. This suggests the censored Gamma model is a good choice for mg.
Warning: Censored responses are not shown in 'pp_check'.
Warning: Removed 1714 rows containing non-finite outside the scale range
(`stat_density()`).
Warning: Removed 18 rows containing non-finite outside the scale range
(`stat_density()`).
The phosphate model also shows signs of convergence issues. This is likely due to the combination of censored data and the complex interaction structure. The model fit, as seen in the pp_check, is reasonable but not perfect.
This approach models the data in two parts: a logistic regression for the probability of a zero vs. a non-zero value, and a separate model (e.g., Gamma or Lognormal) for the non-zero values. This is an alternative for nh4, no3, and p.
Note: We are only modeling Study Year 1 for these models.
Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence
problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
Code
# summary(fit_no3_zi)
The warning about the non-positive-definite Hessian matrix indicates convergence problems, meaning the results may not be reliable. The DHARMa residual plots also show some some deviations, suggesting this model is not a great fit.
Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence
problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
Similar to the nitrate model, the zero-inflated model for nh4 fails to converge properly. The DHARMa diagnostics, however, look slightly better than for nitrate. Still, the convergence failure is a major issue.
Zero-Inflated Model for Phosphate (p)
Code
fit_p_zi <-glmmTMB( p ~ treatment * date + (1| site_block_id) + (1| trt_site_block_id),ziformula =~treatment,data = df_model,family = glmmTMB::lognormal(link ="log"))Anova(fit_p_zi)
NOTE: Results may be misleading due to involvement in interactions
Note: adjust = "tukey" was changed to "sidak"
because "tukey" is only appropriate for one set of pairwise comparisons
treatment response SE df asymp.LCL asymp.UCL .group
PC0 0.0580 0.00523 Inf 0.0457 0.0735 A
MC1 0.0625 0.00578 Inf 0.0490 0.0797 A
MC2 0.0632 0.00599 Inf 0.0493 0.0811 A
PC1 0.0693 0.00583 Inf 0.0555 0.0865 AB
MC0 0.0726 0.00636 Inf 0.0576 0.0914 AB
PC2 0.0878 0.00689 Inf 0.0714 0.1079 B
Results are averaged over the levels of: date
Confidence level used: 0.95
Conf-level adjustment: sidak method for 6 estimates
Intervals are back-transformed from the log scale
P value adjustment: tukey method for comparing a family of 6 estimates
Tests are performed on the log scale
significance level used: alpha = 0.05
NOTE: If two or more means share the same grouping symbol,
then we cannot show them to be different.
But we also did not show them to be the same.
You are calculating adjusted predictions on the population-level (i.e.
`type = "fixed"`) for a *generalized* linear mixed model.
This may produce biased estimates due to Jensen's inequality. Consider
setting `bias_correction = TRUE` to correct for this bias.
See also the documentation of the `bias_correction` argument.
Code
pred_data <-as.data.frame(pred_effects)ggplot() +geom_point(data = df_model %>%mutate(group = treatment),aes(x = date, y = p),alpha =0.5, color ="gray50", position =position_jitter(width =0.2, height =0) ) +geom_ribbon(data = pred_data,aes(x = x, ymin = conf.low, ymax = conf.high, group = group),alpha =0.2,fill ="cornflowerblue" ) +geom_line(data = pred_data,aes(x = x, y = predicted, group = group),color ="cornflowerblue",linewidth =1 ) +facet_wrap(~ group, scales ="free_y") +labs(x ="Date",y ="Phosphate (P) Concentration",title ="Predicted Values of P, Faceted by Treatment" ) +theme_bw() +theme(axis.text.x =element_text(angle =45, hjust =1))
Warning: Removed 31 rows containing missing values or values outside the scale range
(`geom_point()`).
Code
# print(pred_effects)
The zero-inflated model for p converges properly, and the DHARMa diagnostics show some deviations, suggesting this model may not be a good fit.
Models for Data with No LODs (p_h and ec)
For variables like pH and EC that do not have a zero-inflation problem, we can use more standard generalized linear mixed models.
Model for pH (p_h)
Code
fit_lme_ph <-lme(fixed = p_h ~as.factor(date) * treatment,random =list(site_block_id =~1, # Random intercept for site:block combinationstrt_site_block_id =~1# Random intercept for treatment:site:block combinations ),data = df_model,na.action = na.exclude)anova(fit_lme_ph)
NOTE: Results may be misleading due to involvement in interactions
Note: adjust = "tukey" was changed to "sidak"
because "tukey" is only appropriate for one set of pairwise comparisons
treatment emmean SE df lower.CL upper.CL .group
MC2 7.69 0.0928 7 7.35 8.02 A
PC0 7.75 0.0925 7 7.42 8.09 A
PC2 7.80 0.0933 7 7.47 8.14 A
PC1 7.81 0.0929 7 7.48 8.15 A
MC1 7.83 0.0929 7 7.49 8.16 AB
MC0 8.04 0.0932 7 7.70 8.38 B
Results are averaged over the levels of: date
Degrees-of-freedom method: containment
Confidence level used: 0.95
Conf-level adjustment: sidak method for 6 estimates
P value adjustment: tukey method for comparing a family of 6 estimates
significance level used: alpha = 0.05
NOTE: If two or more means share the same grouping symbol,
then we cannot show them to be different.
But we also did not show them to be the same.
Note: adjust = "tukey" was changed to "sidak"
because "tukey" is only appropriate for one set of pairwise comparisons
date = 2023-12-27:
treatment emmean SE df lower.CL upper.CL .group
MC1 7.60 0.133 7 7.12 8.08 A
PC1 7.67 0.127 7 7.20 8.13 A
MC2 7.67 0.133 7 7.19 8.15 A
PC2 7.77 0.152 7 7.22 8.32 A
PC0 7.90 0.127 7 7.44 8.36 A
MC0 8.07 0.141 7 7.56 8.58 A
date = 2024-01-12:
treatment emmean SE df lower.CL upper.CL .group
MC2 7.77 0.123 7 7.32 8.21 A
PC2 7.84 0.133 7 7.36 8.32 A
PC1 7.85 0.127 7 7.38 8.31 A
PC0 7.86 0.123 7 7.42 8.31 A
MC1 7.94 0.127 7 7.48 8.40 A
MC0 8.18 0.127 7 7.72 8.64 A
date = 2024-01-23:
treatment emmean SE df lower.CL upper.CL .group
MC2 7.78 0.133 7 7.30 8.26 A
PC0 7.84 0.123 7 7.39 8.28 A
PC1 7.92 0.127 7 7.46 8.38 A
PC2 8.01 0.123 7 7.56 8.45 A
MC1 8.04 0.133 7 7.55 8.52 A
MC0 8.11 0.123 7 7.67 8.55 A
date = 2024-02-07:
treatment emmean SE df lower.CL upper.CL .group
PC0 7.82 0.123 7 7.38 8.26 A
PC2 7.88 0.123 7 7.43 8.32 A
MC2 7.91 0.123 7 7.47 8.35 A
MC1 7.96 0.123 7 7.52 8.41 A
PC1 7.99 0.123 7 7.55 8.44 A
MC0 8.17 0.123 7 7.73 8.62 A
date = 2024-02-21:
treatment emmean SE df lower.CL upper.CL .group
MC2 7.63 0.123 7 7.18 8.07 A
PC0 7.70 0.123 7 7.26 8.15 A
MC1 7.75 0.123 7 7.30 8.19 A
PC2 7.76 0.123 7 7.32 8.21 A
PC1 7.78 0.123 7 7.34 8.22 A
MC0 7.96 0.127 7 7.50 8.42 A
date = 2024-03-06:
treatment emmean SE df lower.CL upper.CL .group
MC2 7.67 0.123 7 7.22 8.11 A
PC0 7.73 0.123 7 7.28 8.17 A
PC1 7.80 0.123 7 7.35 8.24 A
PC2 7.80 0.127 7 7.34 8.26 A
MC1 7.85 0.123 7 7.41 8.29 A
MC0 8.04 0.123 7 7.59 8.48 A
date = 2024-03-22:
treatment emmean SE df lower.CL upper.CL .group
MC2 7.55 0.123 7 7.10 7.99 A
PC0 7.64 0.123 7 7.20 8.09 A
PC2 7.75 0.123 7 7.30 8.19 A
MC1 7.79 0.123 7 7.35 8.23 A
PC1 7.85 0.123 7 7.41 8.30 A
MC0 7.91 0.127 7 7.45 8.37 A
date = 2024-04-04:
treatment emmean SE df lower.CL upper.CL .group
MC2 7.74 0.123 7 7.29 8.18 A
PC0 7.74 0.123 7 7.29 8.18 A
PC1 7.80 0.123 7 7.35 8.24 A
MC1 7.80 0.127 7 7.34 8.26 A
PC2 7.80 0.127 7 7.34 8.26 A
MC0 7.99 0.133 7 7.51 8.47 A
date = 2024-04-16:
treatment emmean SE df lower.CL upper.CL .group
MC2 7.49 0.127 7 7.03 7.95 A
PC0 7.54 0.123 7 7.09 7.98 AB
PC2 7.62 0.123 7 7.18 8.07 AB
PC1 7.66 0.133 7 7.18 8.14 AB
MC1 7.71 0.123 7 7.26 8.15 AB
MC0 7.93 0.127 7 7.47 8.39 B
Degrees-of-freedom method: containment
Confidence level used: 0.95
Conf-level adjustment: sidak method for 6 estimates
P value adjustment: tukey method for comparing a family of 6 estimates
significance level used: alpha = 0.05
NOTE: If two or more means share the same grouping symbol,
then we cannot show them to be different.
But we also did not show them to be the same.
Code
emmip(fit_lme_ph, ~ date | treatment, CIs = T) +geom_point(data = df_model,aes(x = date, y = p_h, color = treatment),alpha =0.2,position =position_jitterdodge(jitter.width =0.1, dodge.width =0.1)) +theme_bw()
Warning: Removed 34 rows containing missing values or values outside the scale range
(`geom_point()`).
The model summary shows significant effects of treatment. The diagnostic plots show randomly scattered residuals, suggesting the model assumptions are met and this is a reliable model.
Model for EC (ec)
The residuals for the GLMM for EC showed fanning, therefore a Gamma GLMM is more appropriate than a standard linear model.
Code
fit_ec_gamma <-glmer( ec ~ date * treatment + (1| site_block_id) + (1| trt_site_block_id),data = df_model,family =Gamma(link ="log"),control =glmerControl(optimizer ="bobyqa"), # default optimizer did not convergena.action = na.exclude)
Warning in commonArgs(par, fn, control, environment()): maxfun < 10 *
length(par)^2 is not recommended.
Note: adjust = "tukey" was changed to "sidak"
because "tukey" is only appropriate for one set of pairwise comparisons
date = 2023-12-27:
treatment response SE df asymp.LCL asymp.UCL .group
PC0 0.907 0.212 Inf 0.490 1.68 A
PC2 1.031 0.250 Inf 0.545 1.95 AB
PC1 1.283 0.297 Inf 0.697 2.36 AB
MC0 1.319 0.311 Inf 0.709 2.45 AB
MC2 1.862 0.439 Inf 1.002 3.46 B
MC1 1.874 0.436 Inf 1.016 3.46 B
date = 2024-01-12:
treatment response SE df asymp.LCL asymp.UCL .group
PC0 0.827 0.192 Inf 0.449 1.52 A
PC1 1.005 0.233 Inf 0.546 1.85 AB
PC2 1.054 0.246 Inf 0.570 1.95 AB
MC0 1.281 0.295 Inf 0.699 2.35 AB
MC1 1.674 0.390 Inf 0.907 3.09 B
MC2 1.781 0.411 Inf 0.969 3.27 B
date = 2024-01-23:
treatment response SE df asymp.LCL asymp.UCL .group
PC0 0.697 0.162 Inf 0.379 1.28 A
PC2 0.792 0.181 Inf 0.434 1.45 A
PC1 0.915 0.212 Inf 0.497 1.68 AB
MC0 1.129 0.257 Inf 0.620 2.06 ABC
MC1 1.614 0.373 Inf 0.879 2.96 BC
MC2 1.896 0.438 Inf 1.033 3.48 C
date = 2024-02-07:
treatment response SE df asymp.LCL asymp.UCL .group
PC1 0.841 0.193 Inf 0.460 1.54 A
PC2 0.865 0.198 Inf 0.474 1.58 A
PC0 0.883 0.205 Inf 0.480 1.63 A
MC0 1.166 0.266 Inf 0.640 2.12 A
MC1 1.306 0.301 Inf 0.711 2.40 A
MC2 1.519 0.351 Inf 0.827 2.79 A
date = 2024-02-21:
treatment response SE df asymp.LCL asymp.UCL .group
PC1 0.902 0.207 Inf 0.493 1.65 A
PC0 0.921 0.213 Inf 0.500 1.69 AB
PC2 0.921 0.211 Inf 0.504 1.68 AB
MC1 1.204 0.278 Inf 0.656 2.21 AB
MC0 1.220 0.281 Inf 0.667 2.23 AB
MC2 1.654 0.382 Inf 0.901 3.04 B
date = 2024-03-06:
treatment response SE df asymp.LCL asymp.UCL .group
PC0 0.733 0.170 Inf 0.399 1.35 A
PC2 0.764 0.175 Inf 0.418 1.39 A
PC1 0.778 0.179 Inf 0.426 1.42 A
MC1 1.050 0.242 Inf 0.572 1.93 AB
MC0 1.182 0.269 Inf 0.649 2.15 AB
MC2 1.703 0.393 Inf 0.928 3.13 B
date = 2024-03-22:
treatment response SE df asymp.LCL asymp.UCL .group
PC0 0.892 0.207 Inf 0.485 1.64 A
PC1 0.941 0.216 Inf 0.515 1.72 A
PC2 0.988 0.226 Inf 0.541 1.80 A
MC1 1.216 0.281 Inf 0.663 2.23 AB
MC0 1.343 0.309 Inf 0.733 2.46 AB
MC2 1.906 0.440 Inf 1.038 3.50 B
date = 2024-04-04:
treatment response SE df asymp.LCL asymp.UCL .group
PC0 0.695 0.161 Inf 0.378 1.28 A
PC1 0.746 0.171 Inf 0.408 1.36 A
PC2 0.770 0.178 Inf 0.420 1.41 A
MC1 1.054 0.245 Inf 0.571 1.94 AB
MC0 1.188 0.276 Inf 0.644 2.19 AB
MC2 1.688 0.390 Inf 0.919 3.10 B
date = 2024-04-16:
treatment response SE df asymp.LCL asymp.UCL .group
PC1 0.569 0.133 Inf 0.307 1.05 A
PC2 0.604 0.138 Inf 0.331 1.10 A
PC0 0.616 0.143 Inf 0.335 1.13 A
MC1 0.779 0.180 Inf 0.424 1.43 AB
MC0 1.008 0.232 Inf 0.551 1.85 AB
MC2 1.228 0.286 Inf 0.665 2.27 B
Confidence level used: 0.95
Conf-level adjustment: sidak method for 6 estimates
Intervals are back-transformed from the log scale
P value adjustment: tukey method for comparing a family of 6 estimates
Tests are performed on the log scale
significance level used: alpha = 0.05
NOTE: If two or more means share the same grouping symbol,
then we cannot show them to be different.
But we also did not show them to be the same.
Code
emmip(fit_ec_gamma, ~ date | treatment, type ="response", CIs = T) +geom_point(data = df_model,aes(x = date, y = ec, color = treatment),alpha =0.2,position =position_jitterdodge(jitter.width =0.1, dodge.width =0.1)) +theme_bw()
Warning: Removed 28 rows containing missing values or values outside the scale range
(`geom_point()`).
The model converged without warnings. The diagnostic plots show randomly scattered residuals, suggesting the model assumptions are met and this is a reliable model.
Summary and Recommendations
Models for p_h and ec: For pH (p_h) and EC (ec), the standard linear mixed model and Gamma GLMM, respectively, fit the data well, converged without issues, and passed diagnostic checks.
Censored Models are Promising: The Bayesian censored models (brms) are theoretically the most robust approach for the data with LODs. The mg model worked very well. However, the models for nh4, no3, and p struggled to converge.
Recommendation: For the brms models that had issues, I suggest increasing the number of iterations (e.g., iter = 6000, warmup = 2000) and perhaps trying a more informative prior distribution (currently brms is selecting this for us) if you have prior knowledge about the parameters. This may help the models converge.
Zero-Inflated Models are a Good Alternative: The zero-inflated models (glmmTMB) also faced convergence issues.
Recommendation: Try simplifying the zero-inflated models. For instance, you could remove the interaction term (treatment * date) and use additive effects (treatment + date) to see if that helps with convergence.
Since we did not find a good model fit for nh4, no3, and p, we suggest fitting a binomial GLMM where 1 = above LOD and 0 = below LOD since the convergence issues on our models may come from some combinations of treatment and date not having any non-LOD values.
Source Code
---title: "Statistical Analysis of Soil Data"author: "Andrew Kerr, Kyle Bistrain, Rachel Roggenkemper, Liam Quach"editor: sourceformat: html: toc: true toc-depth: 4 code-fold: true code-tools: true embed-resources: true---```{r}#| message: false#| warning: false#| include: falselibrary(tidyverse)library(janitor)library(readxl)library(writexl)library(patchwork)library(glmmTMB)library(car)library(emmeans)library(nlme)library(lmerTest)library(DHARMa)library(brms)library(ggeffects)library(RColorBrewer)library(gt)library(multcomp)# Formatting for knitted filetheme_set(theme_minimal())``````{r}#| message: false#| eval: false#| include: false# This code takes in your 3 csv files and combines them into a single excel file with 3 sheetsdf_combined <-read_csv(here::here("data", "Combined_Master.csv")) %>%clean_names()df_year_1 <-read_csv(here::here("data", "Year1_Master.csv")) %>%clean_names()df_year_2 <-read_csv(here::here("data", "Year2_Master.csv")) %>%clean_names()# Change date formatdf_combined <- df_combined %>%mutate(date =as.Date(date, format="%m_%d_%y")) %>%select(-starts_with("x"))df_year_1 <- df_year_1 %>%mutate(date =as.Date(date, format="%m_%d_%y")) %>%select(-starts_with("x"))df_year_2 <- df_year_2 %>%mutate(date =as.Date(date, format="%m_%d_%y")) %>%select(-starts_with("x"))# Save as single Excel Fileout <-list("combined"= df_combined, "year_1"= df_year_1, "year_2"= df_year_2)write_xlsx(out, "data/Master.xlsx")```## Introduction and Data PreparationHere, we are loading the data and preparing it for analysis. This includes cleaning the column names, converting data types, and creating new variables that will be useful for modeling.### Data Read-inThis section reads in the data from the master Excel file. We'll be working with the combined dataset for the analyses.```{r}# Read in the data from the Excel filedf_combined <-read_xlsx(here::here("data", "Master.xlsx"), sheet ="combined")# Convert relevant columns to factors for modelingdf_combined <- df_combined %>%mutate(sample_id =as.factor(sample_id),treatment =as.factor(treatment),site =as.factor(site),block =as.factor(block),study_year =as.factor(study_year) )# Define the response variables we'll be analyzingresponse_vars <-c("nh4", "no3", "mg", "p", "ec", "p_h")```### Handling Values Below the Limit of Detection (LOD)A common challenge in environmental data is dealing with measurements that are below the limit of detection (LOD). These are recorded as "LOD" in the dataset. We will explore two common approaches to handle these values:1. **Treating LOD as Censored Data:** This is a more statistically robust approach. It acknowledges that the true value is somewhere between 0 and the LOD, rather than assuming it is exactly 0.2. **Replacing LOD with 0:** This is a simple approach, but it can bias the results, especially if there are many LOD values.Here, we create two datasets to reflect these two approaches.```{r}#| warning: false# Create a dataset where LOD is replaced with 0df_combined_LOD_0 <- df_combined %>%mutate(across(all_of(response_vars),~as.numeric(ifelse(.x =="LOD", 0, .x)) ),date =as.factor(date),# Create combined grouping factorssite_block_id =interaction(site, block, drop =TRUE),trt_site_block_id =interaction(treatment, site, block, drop =TRUE) )# Create a dataset for censored models# A 'censored' column is added for each response variable.# The actual response variable is set to a placeholder (e.g., the detection limit) when censored.lc_model_data <- df_combined %>%mutate(nh4_cens =ifelse(nh4 =="LOD", -1, 0),no3_cens =ifelse(no3 =="LOD", -1, 0),mg_cens =ifelse(mg =="LOD", -1, 0),p_cens =ifelse(p =="LOD", -1, 0),nh4 =ifelse(nh4 =="LOD", nh4_lod, as.numeric(nh4)),no3 =ifelse(no3 =="LOD", n03_lod, as.numeric(no3)),mg =ifelse(mg =="LOD", mg_lod, as.numeric(mg)),p =ifelse(p =="LOD", p_lod, as.numeric(p)),date =as.factor(date),# Create combined grouping factorssite_block_id =interaction(site, block, drop =TRUE),trt_site_block_id =interaction(treatment, site, block, drop =TRUE) )```Depending on which method we use, the models answer slightly different research questions. #### Treating LOD as 0- **Assumption**: You are assuming that any measurement below the Limit of Detection (LOD) is a true zero. However, since your data does not have any true zeros this allows us to use a zero-inflated model to model (1) probability below LOD and (2) mean if above LOD.- **Research Question(s) it Answers**: This approach splits your research question into two parts: 1. Presence/Absence: Does the treatment affect the probability that the response is LOD? 2. Concentration when Present: If the response is present (i.e., not LOD), how does the treatment affect its concentration?- **Downside**: In most environmental systems, a value below the LOD doesn't mean the response is completely absent, just that its concentration is too low for your equipment to measure accurately. By substituting zero, you are artificially deflating the mean and variance of your data, which can bias your results and lead you to **underestimate the true average concentration**.#### Treating LOD as Left-Censored- **Assumption**: You assume that a value below the LOD is not a known value, only that its true concentration is somewhere between 0 and the detection limit. The model doesn't force it to be zero; it incorporates this uncertainty into the analysis.- **Research Question(s) it Answers**: This approach answers a single, more direct question: How does the treatment affect the true underlying concentration of the response, while properly accounting for the limitations and uncertainty of the measurement instrument?- **Advantage**: It doesn't "invent" data by substituting zero. Instead, it uses all the information you have (i.e., "we know it's less than X") to provide a less biased and more accurate estimate of the treatment effects on the actual response concentrations.### Exploratory Data AnalysisBefore we build our models, it's important to explore the data to understand its structure and identify any patterns or issues.#### Proportion of Zero/LOD ValuesHere, we'll look at the proportion of zeros (after converting LOD to 0) for each response variable. This will help us understand the extent of the "zero problem".```{r}#| echo: false# Calculate the proportion of LODs and zeros for each response variablelod <- df_combined %>%group_by(treatment) %>%summarise(across(all_of(response_vars), ~mean(.x =="LOD", na.rm = T)))zero <- df_combined %>%summarise(across(all_of(response_vars), ~mean(.x =="0", na.rm = T))) %>%pivot_longer(cols =everything(), names_to ="Response", values_to ="Zeros")df_combined %>%summarise(across(all_of(response_vars), ~mean(.x =="LOD", na.rm = T))) %>%pivot_longer(cols =everything(), names_to ="Response", values_to ="LODs") %>%left_join(zero, by =c("Response")) %>%arrange(-LODs) %>%gt() %>%tab_header(title ="Proportions in each Response") %>%fmt_number(columns =c(LODs), decimals =4)``````{r, fig.width=20, fig.height=14}#| echo: false#| message: falseresponse_vars_2 <-c("nh4", "no3", "mg", "p")df_combined %>%pivot_longer(cols =all_of(response_vars_2),names_to ="var",values_to ="value") %>%group_by(treatment, date, var) %>%summarise(proportion_lod =mean(value =="LOD", na.rm =TRUE)) %>%ggplot(aes(x = treatment, y = proportion_lod)) +geom_col(aes(fill = proportion_lod ==1), width =0.7) +scale_fill_manual(values =c("FALSE"="lightgreen", "TRUE"="firebrick"),name ="100% at LOD") +scale_y_continuous(labels = scales::percent) +facet_wrap(~ var + date, ncol =length(unique(df_combined$date))) +labs(title ="LOD Proportions by Treatment",x ="Treatment",y ="Percentage at LOD") +theme_bw() +theme(axis.text.x =element_text(angle =45, hjust =1),strip.text =element_text(size =14))```As we can see, `nh4` and `no3` have a very high proportion of zero values (over 70%). `p` also has a substantial number of zeros. `ec` and `p_h` have no zeros. This confirms that standard linear models assuming normality will be inappropriate for the first three, but may be suitable for `ec` and `p_h`.#### Visualizing the DataBoxplots show the distribution of the *non-zero* data to reveal potential differences between treatments.```{r, fig.width=10, fig.height=8}# Function to create boxplots for each response variableplot_concentration <-function(r, data) { data %>%filter(.data[[r]] >0) %>%# We only plot non-zero values for clarityggplot(aes(x = treatment, y = .data[[r]], fill = treatment)) +geom_boxplot(alpha =0.7) +geom_point(alpha =0.3) +facet_wrap(vars(study_year), scales ="free") +labs(title =paste(tools::toTitleCase(r), "Concentration (Non-Zero) by Treatment"),x ="Treatment",y ="Value",fill ="Treatment" ) +theme_bw()}# Create and display the plotsplots_0 <-map(response_vars, plot_concentration, data = df_combined_LOD_0)wrap_plots(plots_0, ncol =2, guides ="collect")```## Modeling Approach 1: Censored Models for Data with LODs (`brms`)This is a statistically robust approach for `nh4`, `no3`, `mg`, and `p`. It accounts for the fact that LOD values are not true zeros but are values below a certain threshold. We use the `brms` package for Bayesian modeling.*Note: Bayesian models are computationally intensive and can take a long time to run. For this report, we are loading pre-fitted model objects.*### Censored Model for Nitrate (`no3`)```{r}# fit_no3 <- brm(# bf(no3 | cens(no3_cens) ~ treatment * date + (1 | site_block_id) + (1 | trt_site_block_id)),# data = lc_model_data, family = Gamma(link = "log"),# iter = 4000, warmup = 1000, chains = 4, cores = 4, seed = 12345,# file = "models/fit_no3"# )fit_no3 <-read_rds(here::here("models", "fit_no3.rds"))``````{r}# summary(fit_no3)pp_check(fit_no3, ndraws =100) +xlim(0, 100)``````{r}#| warning: false#| include: false#| eval: falseconditional_effects(fit_no3, effects ="date:treatment")emmeans(fit_no3, ~ date | treatment, type ="response") %>%cld(Letters = LETTERS, adjust ="tukey")emmip(fit_no3, ~ date | treatment, type ="response", CIs = T) +geom_point(data = lc_model_data,aes(x = date, y = no3, color =as.factor(no3_cens)),alpha =0.7,position =position_jitterdodge(jitter.width =0.1, dodge.width =0.1)) +ylim(0, 1000) +theme_bw()```The `cens(no3_cens)` term tells `brms` to treat the data as left-censored. The model summary shows the estimated effects. The `pp_check` plot shows the model fits the overall distribution reasonably well, though the many warnings about divergent transitions indicate potential convergence problems. The results should be interpreted with caution.### Censored Model for Ammonium (`nh4`)```{r}# fit_nh4 <- brm(# bf(nh4 | cens(nh4_cens) ~ treatment * date + (1 | site_block_id) + (1 | trt_site_block_id)),# data = lc_model_data, family = Gamma(link = "log"),# iter = 4000, warmup = 1000, chains = 4, cores = 4, seed = 12345,# file = "models/fit_no3"# )fit_nh4 <-read_rds(here::here("models", "fit_nh4.rds"))``````{r}# summary(fit_nh4)pp_check(fit_nh4, ndraws =100) +xlim(0, 10)``````{r}#| include: false#| warning: false#| eval: falseconditional_effects(fit_nh4, effects ="date:treatment")emmeans(fit_nh4, ~ date | treatment, type ="response") %>%cld(Letters = LETTERS, adjust ="tukey")emmip(fit_nh4, ~ date | treatment, type ="response", CIs = T) +geom_point(data = lc_model_data,aes(x = date, y = nh4, color =as.factor(nh4_cens)),alpha =0.7,position =position_jitterdodge(jitter.width =0.1, dodge.width =0.1)) +ylim(0, 10) +theme_bw()```Similar to the nitrate model, the `nh4` model shows many divergent transitions, which is a red flag for model reliability. The posterior predictive check (`pp_check`) suggests the model struggles to capture the shape of the data perfectly, which is common with highly zero-inflated datasets.### Censored Model for Magnesium (`mg`)```{r}# fit_mg <- brm(# bf(mg | cens(mg_cens) ~ treatment * date + (1 | site_block_id)) + (1 | trt_site_block_id)),# data = lc_model_data, family = Gamma(link = "log"),# iter = 4000, warmup = 1000, chains = 4, cores = 4, seed = 12345,# file = "models/fit_mg"# )fit_mg <-read_rds(here::here("models", "fit_mg.rds"))``````{r}summary(fit_mg)pp_check(fit_mg, ndraws =100)``````{r}#| include: false#| warning: falseconditional_effects(fit_mg, effects ="date:treatment")emmeans(fit_mg, ~ date | treatment, type ="response") %>%cld(Letters = LETTERS, adjust ="tukey")emmip(fit_mg, ~ date | treatment, type ="response", CIs = T) +geom_point(data = lc_model_data,aes(x = date, y = mg, color =as.factor(mg_cens)),alpha =0.7,position =position_jitterdodge(jitter.width =0.1, dodge.width =0.1)) +theme_bw()```The `mg` model appears to have converged much better than the `nh4` and `no3` models, with few warnings. The `pp_check` shows a good fit between the model's predictions and the observed data. This suggests the censored Gamma model is a good choice for `mg`.### Censored Model for Phosphate (`p`)```{r}# fit_p <- brm(# bf(p | cens(p_cens) ~ treatment * date + (1 | site_block_id)) + (1 | trt_site_block_id)),# data = lc_model_data, family = Gamma(link = "log"),# iter = 4000, warmup = 1000, chains = 4, cores = 4, seed = 12345,# file = "models/fit_p"# )fit_p <-read_rds(here::here("models", "fit_p.rds"))``````{r}# summary(fit_p)pp_check(fit_p, ndraws =100) +xlim(0, 1)``````{r}#| warning: false#| include: false#| eval: falseconditional_effects(fit_p, effects ="date:treatment")emmeans(fit_p, ~ date | treatment, type ="response") %>%cld(Letters = LETTERS, adjust ="tukey")emmip(fit_p, ~ date | treatment, type ="response", CIs = T) +geom_point(data = lc_model_data,aes(x = date, y = p, color =as.factor(p_cens)),alpha =0.7,position =position_jitterdodge(jitter.width =0.1, dodge.width =0.1)) +ylim(0, 4) +theme_bw()```The phosphate model also shows signs of convergence issues. This is likely due to the combination of censored data and the complex interaction structure. The model fit, as seen in the `pp_check`, is reasonable but not perfect.## Modeling Approach 2: Zero-Inflated Models (`glmmTMB`)This approach models the data in two parts: a logistic regression for the probability of a zero vs. a non-zero value, and a separate model (e.g., Gamma or Lognormal) for the non-zero values. This is an alternative for `nh4`, `no3`, and `p`.```{r}#| include: falsedf_model <- df_combined_LOD_0 %>%filter(study_year ==1)```*Note: We are only modeling Study Year 1 for these models.*### Zero-Inflated Model for Nitrate (`no3`)```{r}fit_no3_zi <-glmmTMB( no3 ~ treatment * date + (1| site_block_id) + (1| trt_site_block_id),ziformula =~treatment,data = df_model,family = glmmTMB::lognormal(link ="log"))# summary(fit_no3_zi)``````{r, fig.width=8, fig.height=4}#| include: false#| eval: falsesimulationOutput <-simulateResiduals(fittedModel = fit_no3_zi, n =500)plot(simulationOutput)``````{r}#| include: false#| eval: falseemmeans(fit_no3_zi, specs =~ treatment | date, type ="response") %>%cld(Letters = LETTERS, adjust ="tukey")pred_effects <-ggpredict(fit_no3_zi, terms =c("date", "treatment"))pred_data <-as.data.frame(pred_effects)ggplot() +geom_point(data = df_model %>%mutate(group = treatment),aes(x = date, y = no3),alpha =0.5,color ="gray50",position =position_jitter(width =0.2, height =0) ) +geom_ribbon(data = pred_data,aes(x = x, ymin = conf.low, ymax = conf.high, group = group),alpha =0.2,fill ="cornflowerblue" ) +geom_line(data = pred_data,aes(x = x, y = predicted, group = group),color ="cornflowerblue",linewidth =1 ) +facet_wrap(~ group, scales ="free_y") +labs(x ="Date",y ="Nitrate (NO3) Concentration",title ="Predicted Values of NO3, Faceted by Treatment" ) +theme_bw() +theme(axis.text.x =element_text(angle =45, hjust =1))print(pred_effects)```The warning about the non-positive-definite Hessian matrix indicates convergence problems, meaning the results may not be reliable. The `DHARMa` residual plots also show some some deviations, suggesting this model is not a great fit.### Zero-Inflated Model for Ammonium (`nh4`)```{r}fit_nh4_zi <-glmmTMB( nh4 ~ treatment * date + (1| site_block_id) + (1| trt_site_block_id),ziformula =~treatment,data = df_model,family = glmmTMB::lognormal(link ="log"))# summary(fit_nh4_zi)``````{r, fig.width=8, fig.height=4}#| include: false#| eval: falsesimulationOutput <-simulateResiduals(fittedModel = fit_nh4_zi, n =500)plot(simulationOutput)``````{r}#| eval: falseemmeans(fit_nh4_zi, specs =~ treatment | date, type ="response") %>%cld(Letters = LETTERS, adjust ="tukey")pred_effects <-ggpredict(fit_nh4_zi, terms =c("date", "treatment"))pred_data <-as.data.frame(pred_effects)ggplot() +geom_point(data = df_model %>%mutate(group = treatment),aes(x = date, y = nh4),alpha =0.5, color ="gray50", position =position_jitter(width =0.2, height =0) ) +geom_ribbon(data = pred_data,aes(x = x, ymin = conf.low, ymax = conf.high, group = group),alpha =0.2,fill ="cornflowerblue" ) +geom_line(data = pred_data,aes(x = x, y = predicted, group = group),color ="cornflowerblue",linewidth =1 ) +facet_wrap(~ group, scales ="free_y") +labs(x ="Date",y ="Ammonium (NH4) Concentration",title ="Predicted Values of NH4, Faceted by Treatment" ) +theme_bw() +theme(axis.text.x =element_text(angle =45, hjust =1))# print(pred_effects)```Similar to the nitrate model, the zero-inflated model for `nh4` fails to converge properly. The DHARMa diagnostics, however, look slightly better than for nitrate. Still, the convergence failure is a major issue.### Zero-Inflated Model for Phosphate (`p`)```{r}fit_p_zi <-glmmTMB( p ~ treatment * date + (1| site_block_id) + (1| trt_site_block_id),ziformula =~treatment,data = df_model,family = glmmTMB::lognormal(link ="log"))Anova(fit_p_zi)``````{r, fig.width=8, fig.height=4}simulationOutput <-simulateResiduals(fittedModel = fit_p_zi, n =500)plot(simulationOutput)``````{r}emmeans(fit_p_zi, specs =~ treatment, type ="response") %>%cld(Letters = LETTERS, adjust ="tukey")pred_effects <-ggpredict(fit_p_zi, terms =c("date", "treatment"))pred_data <-as.data.frame(pred_effects)ggplot() +geom_point(data = df_model %>%mutate(group = treatment),aes(x = date, y = p),alpha =0.5, color ="gray50", position =position_jitter(width =0.2, height =0) ) +geom_ribbon(data = pred_data,aes(x = x, ymin = conf.low, ymax = conf.high, group = group),alpha =0.2,fill ="cornflowerblue" ) +geom_line(data = pred_data,aes(x = x, y = predicted, group = group),color ="cornflowerblue",linewidth =1 ) +facet_wrap(~ group, scales ="free_y") +labs(x ="Date",y ="Phosphate (P) Concentration",title ="Predicted Values of P, Faceted by Treatment" ) +theme_bw() +theme(axis.text.x =element_text(angle =45, hjust =1))# print(pred_effects)```The zero-inflated model for `p` converges properly, and the DHARMa diagnostics show some deviations, suggesting this model may not be a good fit.## Models for Data with No LODs (`p_h` and `ec`)For variables like pH and EC that do not have a zero-inflation problem, we can use more standard generalized linear mixed models.### Model for pH (`p_h`)```{r}fit_lme_ph <-lme(fixed = p_h ~as.factor(date) * treatment,random =list(site_block_id =~1, # Random intercept for site:block combinationstrt_site_block_id =~1# Random intercept for treatment:site:block combinations ),data = df_model,na.action = na.exclude)anova(fit_lme_ph)plot(fit_lme_ph)``````{r}emmeans(fit_lme_ph, specs =~ treatment) %>%cld(Letters = LETTERS, adjust ="tukey")emmeans(fit_lme_ph, specs =~ treatment | date) %>%cld(Letters = LETTERS, adjust ="tukey")emmip(fit_lme_ph, ~ date | treatment, CIs = T) +geom_point(data = df_model,aes(x = date, y = p_h, color = treatment),alpha =0.2,position =position_jitterdodge(jitter.width =0.1, dodge.width =0.1)) +theme_bw()```The model summary shows significant effects of treatment. The diagnostic plots show randomly scattered residuals, suggesting the model assumptions are met and this is a reliable model.### Model for EC (`ec`)The residuals for the GLMM for EC showed fanning, therefore a Gamma GLMM is more appropriate than a standard linear model.```{r}fit_ec_gamma <-glmer( ec ~ date * treatment + (1| site_block_id) + (1| trt_site_block_id),data = df_model,family =Gamma(link ="log"),control =glmerControl(optimizer ="bobyqa"), # default optimizer did not convergena.action = na.exclude)Anova(fit_ec_gamma)plot(fit_ec_gamma)``````{r}emmeans(fit_ec_gamma, specs =~ treatment | date, type ="response") %>%cld(Letters = LETTERS, adjust ="tukey")emmip(fit_ec_gamma, ~ date | treatment, type ="response", CIs = T) +geom_point(data = df_model,aes(x = date, y = ec, color = treatment),alpha =0.2,position =position_jitterdodge(jitter.width =0.1, dodge.width =0.1)) +theme_bw()```The model converged without warnings. The diagnostic plots show randomly scattered residuals, suggesting the model assumptions are met and this is a reliable model.## Summary and Recommendations- **Models for `p_h` and `ec`:** For pH (`p_h`) and EC (`ec`), the standard linear mixed model and Gamma GLMM, respectively, fit the data well, converged without issues, and passed diagnostic checks.- **Censored Models are Promising:** The Bayesian censored models (`brms`) are theoretically the most robust approach for the data with LODs. The `mg` model worked very well. However, the models for `nh4`, `no3`, and `p` struggled to converge. - **Recommendation:** For the `brms` models that had issues, I suggest increasing the number of iterations (e.g., `iter = 6000`, `warmup = 2000`) and perhaps trying a more informative prior distribution (currently brms is selecting this for us) if you have prior knowledge about the parameters. This may help the models converge.- **Zero-Inflated Models are a Good Alternative:** The zero-inflated models (`glmmTMB`) also faced convergence issues. - **Recommendation:** Try simplifying the zero-inflated models. For instance, you could remove the interaction term (`treatment * date`) and use additive effects (`treatment + date`) to see if that helps with convergence.- Since we did not find a good model fit for `nh4`, `no3`, and `p`, we suggest fitting a binomial GLMM where 1 = above LOD and 0 = below LOD since the convergence issues on our models may come from some combinations of treatment and date not having any non-LOD values.