# Ordinal regression in R: part 2

A theoretical and applied walkthrough of ordinal regression. Part 2: the Bayesian approach with brms.

Taylor Dunn
2020-03-17
R packages
knitr::opts_chunkset(echo = TRUE) library(tidyverse) library(dunnr) library(gt) library(broom) library(broom.mixed) library(patchwork) extrafont::loadfonts(device = "win", quiet = TRUE) theme_set(theme_td()) set_geom_fonts() set_palette() wine_red <- "#58181F" update_geom_defaults("point", list(color = wine_red)) update_geom_defaults("line", list(color = wine_red))  This is part 2 of me learning ordinal regression in R. Previously, I took the frequentist approach with the ordinal package. Here, I’ll use brms package to fit Bayesian mixed models via Stan. Though I won’t be reproducing their examples, give a great tutorial of using brms for ordinal regression models. It also frames the cumulative model in the terms of a latent (not observable) continuous variable $$\tilde{y}$$, which has been categorized into the observed ordinal variable $$y$$. I found this way of thinking very intuitive, and it helped make a lot of the concepts click. This post also serves as practice in Bayesian inference, so I’ll be comparing the results here to those from part 1, and explore different choices of prior distributions. ## Setup Load brms, tidybayes and the wine data from that was analyzed in part 1. library(brms) library(tidybayes) # Detect and set the number of cores for MCMC options(mc.cores = parallel::detectCores()) library(ordinal) data(wine) wine <- as_tibble(wine)  ## Fitting the models I will be fitting these models of wine bitterness ratings: \begin{align} \text{logit}(p(y_i \leq j)) &= \theta_j - u( \text{judge}_i) \\ \text{logit}(p(y_i \leq j)) &= \theta_j - \beta_1 \text{temp}_i - \beta_2 \text{contact}_i - u( \text{judge}_i) \\ \end{align} where $$p(y_i \leq j)$$ is the probability of a rating less than or equal to $$j$$, $$\theta_j$$ are the thresholds for the $$J-1 = 4$$ levels, $$u(\text{judge}_i)$$ are judge-specific random effects, and $$\beta_1$$ and $$\beta_2$$ are fixed effect coefficients for $$\text{temp}_i$$ and $$\text{contact}_i$$. (see part 1 for more details). f_rating_null <- rating ~ 1 + (1|judge) f_rating_contact_temp <- rating ~ 1 + contact + temp + (1|judge)  ### Null model We will start with the “null” model, with just thresholds and random effects. The default priors for this model are: get_prior(f_rating_null, data = wine, family = cumulative("logit"))   prior class coef group resp dpar nlpar bound student_t(3, 0, 2.5) Intercept student_t(3, 0, 2.5) Intercept 1 student_t(3, 0, 2.5) Intercept 2 student_t(3, 0, 2.5) Intercept 3 student_t(3, 0, 2.5) Intercept 4 student_t(3, 0, 2.5) sd student_t(3, 0, 2.5) sd judge student_t(3, 0, 2.5) sd Intercept judge source default (vectorized) (vectorized) (vectorized) (vectorized) default (vectorized) (vectorized) \begin{align} \text{logit}(p(y_i \leq j)) &= \theta_j - u( \text{judge}_i) \\ \theta_j &\sim \text{Student-}t(3, 0, 2.5) \\ u(\text{judge}_i) &\sim \text{Normal}(0, \sigma_u) \\ \sigma_u &\sim \text{Student-} t(3, 0, 2.5) \end{align} Fit the model: brm_rating_null <- brm( f_rating_null, data = wine, family = cumulative("logit"), sample_prior = TRUE, file = "brm-rating-null" )  Visualize the two priors (on thresholds/Intercepts, and on SD of judge effects) with brms:prior_samples: prior_draws(brm_rating_null) %>% pivot_longer(cols = everything(), names_to = "term", values_to = "value") %>% mutate(samples = "model prior samples") %>% ggplot(aes(x = value, y = samples, fill = samples)) + geom_violin() + # Also visualize some random samples from the t-distribution for comparison geom_violin( data = tibble( value = rstudent_t(n = 4000, df = 3, mu = 0, sigma = 2.5), samples = "student_t(3, 0, 2.5)", term = "Intercept" ) ) + geom_violin( data = tibble( value = abs(rstudent_t(n = 4000, df = 3, mu = 0, sigma = 2.5)), samples = "abs(student_t(3, 0, 2.5))", term = "sd_judge" ) ) + labs(y = NULL) + facet_wrap(~term, scales = "free_x") + coord_cartesian(xlim = c(-15, 15)) + theme(legend.position = "none") + add_facet_borders() Note that, since a standard deviation can’t be negative, brms automatically takes the absolute value of the default student_t(3, 0, 2.5) prior. These priors aren’t terrible. For instance, intercepts (the thresholds $$\theta_j$$) between -10 and +10 correspond to the following probabilities: tibble(theta = seq(-10, 10)) %>% mutate(p = inv_logit_scaled(theta)) %>% ggplot(aes(x = theta, y = p)) + geom_line(size = 1) Any prior values outside of this range would essentially zero cumulative probability for a level $$\leq j$$. Now that we’ve thought about our (default) prior assumptions, investigate the chains: plot(brm_rating_null, ask = FALSE, # Only plot specified variables variable = "^b_|^sd_", regex = TRUE) The intercept trace plots look good to me. There are some spikes in sd_judge__Intercept, but not enough to be concerning. Print the model estimates: brm_rating_null   Family: cumulative Links: mu = logit; disc = identity Formula: rating ~ 1 + (1 | judge) Data: wine (Number of observations: 72) Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; total post-warmup draws = 4000 Group-Level Effects: ~judge (Number of levels: 9) Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS sd(Intercept) 0.73 0.41 0.07 1.64 1.00 1339 Tail_ESS sd(Intercept) 1614 Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Intercept -2.79 0.54 -3.93 -1.81 1.00 2240 Intercept -0.56 0.36 -1.30 0.12 1.00 2777 Intercept 1.10 0.38 0.39 1.87 1.00 2986 Intercept 2.42 0.49 1.54 3.47 1.00 3761 Tail_ESS Intercept 2206 Intercept 2750 Intercept 2927 Intercept 2958 Family Specific Parameters: Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS disc 1.00 0.00 1.00 1.00 NA NA NA Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS and Tail_ESS are effective sample size measures, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1). The Rhat values are also a good sign of model convergence. Compare the null Bayesian model estimates to the frequentist estimates: # Can't figure out how to extract random effect SDs from a clmm model, use clmm2 clmm2_rating_null <- clmm2( rating ~ 1, random = judge, data = wine, link = "logistic", Hess = TRUE ) # Unfortunately, clmm2 doesn't have a broom::tidy() function summary(clmm2_rating_null) %>% coef() %>% as_tibble() %>% mutate(term = str_c("b_Intercept[", 1:4, "]")) %>% bind_rows( tibble( Estimate = as.numeric(clmm2_rating_nullstDev),
term = "sd_judge__Intercept"
)
) %>%
janitor::clean_names() %>%
left_join(
broom.mixed::tidyMCMC(brm_rating_null, conf.int = TRUE),
by = "term"
) %>%
relocate(term) %>%
mutate(
pr_z = scales::pvalue(pr_z),
across(where(is.numeric), ~round(., 2))
) %>%
gt() %>%
tab_spanner(
label = "ordinal::clmm",
columns = c(estimate.x, std_error, z_value, pr_z)
) %>%
tab_spanner(
label = "brms::brm",
columns = c(estimate.y, std.error, conf.low, conf.high)
) %>%
fmt_missing(columns = everything(), missing_text = "")

term ordinal::clmm brms::brm
estimate.x std_error z_value pr_z estimate.y std.error conf.low conf.high
b_Intercept -2.72 0.52 -5.25 <0.001 -2.76 0.54 -3.93 -1.81
b_Intercept -0.54 0.32 -1.72 0.085 -0.55 0.36 -1.30 0.12
b_Intercept 1.10 0.34 3.24 0.001 1.10 0.38 0.39 1.87
b_Intercept 2.35 0.46 5.13 <0.001 2.40 0.49 1.54 3.47
sd_judge__Intercept 0.57 0.69 0.41 0.07 1.64

Frequentist estimates are pretty close to the Bayesian estimates with naive priors.

#### Choice of priors

So what are reasonable priors for this data and model? My go-to resource for this kind of thing is this page from the stan wiki, but under the “Prior for cutpoints in ordered logit or probit regression,” they have a couple suggestions like “uniform priors typically should be ok,” but also say “Need to flesh out this section with examples,” so not a lot of help there.

I’ll consider the following priors on the the thresholds:

\begin{align} \theta_j &\sim \text{Student-}t(3, 0, 5.0) \\ \theta_j &\sim \text{Student-}t(3, 0, 2.5) \\ \theta_j &\sim \text{Student-}t(3, 0, 1.0) \\ \theta_j &\sim \text{Normal}(0, 5.0) \\ \theta_j &\sim \text{Normal}(0, 2.5) \\ \theta_j &\sim \text{Normal}(0, 1.0) \\ \end{align}

and simulate some corresponding cumulative probabilities:

bind_rows(
tibble(sigma = c(1, 2.5, 5), dist = "Normal") %>%
mutate(
prior = str_c("Normal(0, ", sigma, ")"),
samples = map(sigma, ~rnorm(2000, mean = 0, sd = .x))
),
tibble(sigma = c(1, 2.5, 5), dist = "Student-t") %>%
mutate(prior = str_c("Student-t(3, 0, ", sigma, ")"),
samples = map(sigma, ~rstudent_t(2000, df = 3, mu = 0, sigma = .x)))
) %>%
#mutate(prior = fct_reorder2(prior, sigma, dist)) %>%
unnest(samples) %>%
mutate(p = inv_logit_scaled(samples)) %>%
ggplot(aes(x = p)) +
geom_histogram(fill = wine_red, binwidth = 0.1) +
facet_wrap(~prior) +
scale_y_continuous(expand = c(0, 0)) + The $$\text{Normal}(0, 5)$$ and $$\text{Student-}t(3, 0, 5)$$ priors place most of the samples at the extremes (cumulative probabilities of 0 and 100%). The $$\text{Normal}(0, 2.5)$$ and $$\text{Student-}t(3, 0, 2.5)$$ are sensible default priors because they are fairly uniform in the probability space, although they do have slight peaks at the extremes. The $$\text{Normal}(0, 1)$$ and $$\text{Student-}t(3, 0, 1)$$ prior distributions are bell-shaped around $$p = 0.5$$, and might be appropriate if we have reason to believe that there will be no extreme probabilities.

As for the scale parameter describing the variance in judge-specific random effects, I’ve seen the half-Cauchy distribution recommended:

tibble(scale = c(1, 2.5, 5)) %>%
crossing(x = seq(0, 10, 0.1)) %>%
mutate(
prior = str_c("Half-Cauchy(0, ", scale, ")"),
dens = dcauchy(x, location = 0, scale = scale)
) %>%
ggplot(aes(x, y = dens)) +
geom_line(aes(color = prior), size = 1) +
theme(legend.position = c(0.6, 0.6), axis.ticks.y = element_blank(),
axis.text.y = element_blank()) This distribution is fairly conservative, with a long right tail that allows for large values.

I don’t think there is anything wrong with the default priors, but just to show how it is done, I’ll fit the null model with the following:

\begin{align} \text{logit}(p(y_i \leq j)) &= \theta_j - u( \text{judge}_i) \\ \theta_j &\sim \text{Normal}(0, 1.5) \\ u(\text{judge}_i) &\sim \text{Normal}(0, \sigma_u) \\ \sigma_u &\sim \text{Half-Cauchy}(0, 2.5) \end{align}

prior_rating_null <- c(
prior(normal(0, 1.5), class = Intercept),
prior(cauchy(0, 2.5), class = sd)
)
brm_rating_null_alt_prior <-
brm(
f_rating_null,
prior = prior_rating_null ,
data = wine,
family = cumulative("logit"),
file = "brm-rating-null-alt-prior"
)
plot(brm_rating_null_alt_prior, ask = FALSE,
variable = "^b_|^sd_", regex = TRUE) As expected, these priors don’t noticeably improve the model convergence (which was already good). Likewise, the model estimates changed only slightly:

tidyMCMC(brm_rating_null) %>%
mutate(priors = "default priors") %>%
bind_rows(
tidyMCMC(brm_rating_null_alt_prior) %>%
mutate(priors = "alternative priors")
) %>%
filter(!str_starts(term, "r_|prior_|disc")) %>%
transmute(
term,
estimate_se = str_c(round(estimate, 2), " (", round(std.error, 2), ")"),
priors
) %>%
pivot_wider(names_from = priors, values_from = estimate_se) %>%
gt()

term default priors alternative priors
b_Intercept -2.76 (0.54) -2.63 (0.48)
b_Intercept -0.55 (0.36) -0.53 (0.33)
b_Intercept 1.1 (0.38) 1.04 (0.33)
b_Intercept 2.4 (0.49) 2.28 (0.43)
sd_judge__Intercept 0.69 (0.41) 0.64 (0.39)

### Fixed effects

We now add the “treatment” effects of temp and contact:

\begin{align} \text{logit}(p(y_i \leq j)) &= \theta_j - \beta_1 \text{temp}_i - \beta_2 \text{contact}_i - u( \text{judge}_i) \\ \end{align}

This introduces two new priors we can specify:

get_prior(f_rating_contact_temp, data = wine, family = cumulative("logit")) %>%
as_tibble() %>%
select(prior, class, coef, group) %>%
print(n = 11)

# A tibble: 11 x 4
prior                  class     coef         group
<chr>                  <chr>     <chr>        <chr>
1 ""                     b         ""           ""
2 ""                     b         "contactyes" ""
3 ""                     b         "tempwarm"   ""
4 "student_t(3, 0, 2.5)" Intercept ""           ""
5 ""                     Intercept "1"          ""
6 ""                     Intercept "2"          ""
7 ""                     Intercept "3"          ""
8 ""                     Intercept "4"          ""
9 "student_t(3, 0, 2.5)" sd        ""           ""
10 ""                     sd        ""           "judge"
11 ""                     sd        "Intercept"  "judge"

We know from part 1 that contactyes ($$\beta_1$$) and tempwarm ($$\beta_2$$) are associated with higher ratings, but we shouldn’t be biasing our priors by using the same data we are modeling. Instead, use a weakly regularizing normal distributions centered at 0:

\begin{align} \text{logit}(p(y_i \leq j)) &= \theta_j - \beta_1 \text{temp}_i - \beta_2 \text{contact}_i - u( \text{judge}_i) \\ \beta_1 &\sim \text{Normal}(0, 5) \\ \beta_2 &\sim \text{Normal}(0, 5) \\ \theta_j &\sim \text{Normal}(0, 1.5) \\ u(\text{judge}_i) &\sim \text{Normal}(0, \sigma_u) \\ \sigma_u &\sim \text{Half-Cauchy}(0, 2.5) \end{align}

prior_rating_contact_temp <-
c(prior_rating_null,
prior(normal(0, 5), class = b))
brm_rating_contact_temp <-
brm(
f_rating_contact_temp,
prior = prior_rating_contact_temp,
data = wine,
family = cumulative("logit"),
file = "brm-rating-contact-temp-weak-prior"
)
# Also fit using the default priors
brm_rating_contact_temp_default_prior <-
brm(
f_rating_contact_temp,
data = wine,
family = cumulative("logit"),
sample_prior = TRUE,
file = "brm-rating-contact-temp-default-prior"
)
brm_rating_contact_temp

 Family: cumulative
Links: mu = logit; disc = identity
Formula: rating ~ 1 + contact + temp + (1 | judge)
Data: wine (Number of observations: 72)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000

Group-Level Effects:
~judge (Number of levels: 9)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept)     1.14      0.45     0.46     2.20 1.00     1553
Tail_ESS
sd(Intercept)     2226

Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept    -1.40      0.57    -2.55    -0.31 1.00     3148
Intercept     1.30      0.53     0.26     2.33 1.00     3604
Intercept     3.66      0.67     2.40     5.02 1.00     2992
Intercept     5.34      0.78     3.88     6.91 1.00     2974
contactyes       1.61      0.47     0.69     2.54 1.00     4020
tempwarm         2.69      0.51     1.72     3.71 1.00     3161
Tail_ESS
Intercept     3209
Intercept     3207
Intercept     2650
Intercept     3058
contactyes       2752
tempwarm         2589

Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc     1.00      0.00     1.00     1.00   NA       NA       NA

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Compare these estimates to those from clmm and with default priors:

clmm2_rating_contact_temp <-
clmm2(
rating ~ contact + temp, random = judge,
data = wine, link = "logistic", Hess = TRUE
)
tab_brm_clmm_rating_contact_temp <- tidyMCMC(brm_rating_contact_temp) %>%
mutate(model = "brm weak priors") %>%
bind_rows(
tidyMCMC(brm_rating_contact_temp_default_prior) %>%
mutate(model = "brm default priors")
) %>%
filter(!str_detect(term, "r_judge|lp__|prior|disc")) %>%
transmute(
term, model,
estimate_se = str_c(round(estimate, 2), " (", round(std.error, 2), ")")
) %>%
bind_rows(
summary(clmm2_rating_contact_temp) %>%
coef() %>%
as_tibble() %>%
mutate(term = c(str_c("b_Intercept[", 1:4, "]"),
"b_contactyes", "b_tempwarm")) %>%
bind_rows(
tibble(
Estimate = as.numeric(clmm2_rating_contact_temp$stDev), term = "sd_judge__Intercept" ) ) %>% janitor::clean_names() %>% transmute( model = "clmm", term, estimate_se = ifelse( !is.na(std_error), str_c(round(estimate, 2), " (", round(std_error, 2), ")"), round(estimate, 2) ) ) ) %>% pivot_wider(names_from = model, values_from = estimate_se) gt(tab_brm_clmm_rating_contact_temp)  term brm weak priors brm default priors clmm b_Intercept -1.39 (0.57) -1.7 (0.78) -1.62 (0.68) b_Intercept 1.3 (0.53) 1.45 (0.65) 1.51 (0.6) b_Intercept 3.65 (0.67) 4.17 (0.83) 4.23 (0.81) b_Intercept 5.32 (0.78) 6.06 (0.97) 6.09 (0.97) b_contactyes 1.6 (0.47) 1.85 (0.52) 1.83 (0.51) b_tempwarm 2.67 (0.51) 3.1 (0.6) 3.06 (0.6) sd_judge__Intercept 1.08 (0.45) 1.36 (0.56) 1.13 The estimates from the Bayesian regression with default priors are very close to the frequentist estimates. But is the model actually a better fit to the data? We can compare them with leave-one-out cross-validation (LOOCV) based on the posterior likelihoods: loo(brm_rating_contact_temp, brm_rating_contact_temp_default_prior)  Output of model 'brm_rating_contact_temp': Computed from 4000 by 72 log-likelihood matrix Estimate SE elpd_loo -85.2 4.3 p_loo 10.3 0.9 looic 170.4 8.5 ------ Monte Carlo SE of elpd_loo is 0.1. All Pareto k estimates are good (k < 0.5). See help('pareto-k-diagnostic') for details. Output of model 'brm_rating_contact_temp_default_prior': Computed from 4000 by 72 log-likelihood matrix Estimate SE elpd_loo -85.3 5.3 p_loo 12.7 1.3 looic 170.5 10.6 ------ Monte Carlo SE of elpd_loo is 0.1. All Pareto k estimates are good (k < 0.5). See help('pareto-k-diagnostic') for details. Model comparisons: elpd_diff se_diff brm_rating_contact_temp 0.0 0.0 brm_rating_contact_temp_default_prior -0.1 1.1  The output here is the expected log predicted density (elpd_loo), the estimated number of effective parameters (p_loo), and the LOOCV information criteria (looic). Lower values of looic indicate better model fit. These are essentially equal, so the weak priors have not made a large difference to the model fit. ## Aside: adjacent-category models Here are what had to say about the adjacent-category class of ordinal models. • Predict the decision between two adjacent categories $$k$$ and $$k+1$$. • Latent variables $$\tilde{Y}_k$$ with thresholds $$\tau_k$$ and cumulative distribution function $$F$$. • If $$\tilde{Y}_k < \tau_k$$, we choose category $$k$$; $$k+1$$ otherwise. • It is difficult to think of a natural process leading to them; chosen for its mathematical convenience rather than quality of interpretation. Mathematically: $\text{Pr}(Y = k | Y \in \{k, k+1\}) = F(\tau_k)$ Suppose the latent variable $$\tilde{Y}_2$$ is standard normally distributed with distribution function $$\Phi$$, and $$\tau_2$$ = 1. Then the probability of choosing $$Y$$ = 2 over $$Y$$ = 3 would be written as: $\text{Pr}(Y = 2 | Y \in \{2, 3\}) = \Phi(\tau_2) = \Phi(1) = 0.84$ Try fitting the null wine rating model with the acat family in brms: brm_rating_contact_temp_default_prior_acat <- brm( f_rating_contact_temp, family = acat(link = "logit"), data = wine, file = "brm-rating-contact-temp-default-prior-acat" )  How do these estimates compare to the cumulative model? tidyMCMC(brm_rating_contact_temp_default_prior) %>% mutate(model = "cumulative") %>% bind_rows( tidyMCMC(brm_rating_contact_temp_default_prior_acat) %>% mutate(model = "adjacent-category") ) %>% filter(!str_detect(term, "r_judge|lp__|prior|disc")) %>% transmute( term, model, estimate_se = str_c(round(estimate, 2), " (", round(std.error, 2), ")") ) %>% pivot_wider(names_from = model, values_from = estimate_se) %>% gt()  term cumulative adjacent-category b_Intercept -1.7 (0.78) -1.28 (0.67) b_Intercept 1.45 (0.65) 1.19 (0.57) b_Intercept 4.17 (0.83) 3.36 (0.8) b_Intercept 6.06 (0.97) 4.02 (1.03) b_contactyes 1.85 (0.52) 1.32 (0.4) b_tempwarm 3.1 (0.6) 2.3 (0.53) sd_judge__Intercept 1.36 (0.56) 1 (0.43) brms has a convenient function conditional_effects for quickly plotting estimates. For example, the effect of contact: ce_rating_contact_temp_default_prior_acat <- conditional_effects(brm_rating_contact_temp_default_prior_acat, categorical = TRUE, re_formula = NA, ask = FALSE) ce_rating_contact_temp_default_prior <- conditional_effects(brm_rating_contact_temp_default_prior, categorical = TRUE, re_formula = NA, ask = FALSE) ce_rating_contact_temp_default_prior_acat$contact:cats__ %>%
mutate(model = "adjacent-category") %>%
bind_rows(
ce_rating_contact_temp_default_priorcontact:cats__ %>% mutate(model = "cumulative") ) %>% ggplot(aes(x = contact, y = estimate__, color = effect2__)) + geom_point(position = position_dodge(1), size = 3) + geom_linerange(aes(ymin = lower__, ymax = upper__), position = position_dodge(1), size = 1) + facet_wrap(~model) + scale_color_viridis_d() + labs(y = "Estimated probabilities", color = "rating") ## Category-specific effects In all of the models specified so far, all fixed effects were presumed to affect all response categories equally. For example, the effect of temp = warm had a mean effect of $$\beta_1$$ = 2.67 on the thresholds $$\theta_j$$, for all $$j = 1, 2, 3, 4$$. This may not be an appropriate assumption. For example, temp warm might have little relation to the highest rating, but it may strongly predict ratings of 3 relative to 1 or 2. If this is a possibility, then we can model the predictor as having a category-specific effect by estimating $$K-1$$ coefficients for it. The reason we’ve introduced the adjacent-category model is that it is straightforward to incorporate these effects (sequential models work as well). Cumulative models, however, can lead to negative probabilities, and so should be avoided when using category-specific effects. Fit the adjacent-category model with category-specific effects on temp: \begin{align} \text{logit}(p(y_i \leq j)) &= \theta_j - \beta_{1j} \text{temp}_i - \beta_2 \text{contact}_i - u( \text{judge}_i) \\ \text{logit}(p(y_i \leq j)) &= \theta_j - \beta_{1j} \text{temp}_i - u( \text{judge}_i) \\ \end{align} f_rating_cs_temp <- rating ~ 1 + cs(temp) + (1|judge) brm_rating_cs_temp_default_prior_acat <- brm( f_rating_cs_temp, family = acat(link = "logit"), data = wine, file = "brm-rating-cs-temp-default-prior-acat" )  There were many divergent transitions, which is clear from the ugly trace plots: plot(brm_rating_cs_temp_default_prior_acat, ask = FALSE, variable = "^b_|^sd_", regex = TRUE) Most of the divergence is coming from estimating the lowest coefficient $$\beta_{11}$$ (bcs_tempwarm). I will try some regularizing priors (previously defined) and increasing the adapt_delta argument: brm_rating_cs_temp_weak_prior_acat <- brm( f_rating_cs_temp, prior = prior_rating_contact_temp, family = acat(link = "logit"), data = wine, file = "brm-rating-cs-temp-weak-prior-acat", control = list(adapt_delta = 0.9) ) plot(brm_rating_cs_temp_weak_prior_acat, ask = FALSE, variable = "^b_|^sd_", regex = TRUE) This makes a huge difference. Now compare this to a model without category-specific effects: ce_rating_cs_temp_weak_prior_acat <- conditional_effects(brm_rating_cs_temp_weak_prior_acat, categorical = TRUE, re_formula = NA, ask = FALSE) brm_rating_temp_weak_prior_acat <- brm( rating ~ 1 + temp + (1|judge), prior = prior_rating_contact_temp, family = acat(link = "logit"), data = wine, file = "brm-rating-temp-weak-prior-acat" ) ce_rating_temp_weak_prior_acat <- conditional_effects(brm_rating_temp_weak_prior_acat, categorical = TRUE, re_formula = NA, ask = FALSE) ce_rating_temp_weak_prior_acattemp:cats__ %>%
mutate(model = "constant effects") %>%
bind_rows(
ce_rating_cs_temp_weak_prior_acat$temp:cats__ %>% mutate(model = "category-specific effects") ) %>% ggplot(aes(x = temp, y = estimate__, color = effect2__)) + geom_point(position = position_dodge(1), size = 3) + geom_linerange(aes(ymin = lower__, ymax = upper__), position = position_dodge(1), size = 1) + facet_wrap(~model) + scale_color_viridis_d() + labs(y = "Estimated probabilities", color = "rating") Or, put the probabilities for each model side-by-side, along with the empirical probabilities: ce_rating_temp_weak_prior_acat$temp:cats__ %>%
mutate(model = "constant effects") %>%
bind_rows(
ce_rating_cs_temp_weak_prior_acat\$temp:cats__ %>%
mutate(model = "category-specific effects")
) %>%
ggplot(aes(x = effect2__, y = estimate__, color = model)) +
geom_point(position = position_dodge(0.5), size = 3) +
geom_linerange(aes(ymin = lower__, ymax = upper__),
position = position_dodge(0.5), size = 1) +
geom_point(
data = wine %>%
group_by(temp, rating) %>%
tally() %>%
group_by(temp) %>%
mutate(p = n / sum(n), model = "empirical"),
aes(x = rating, y = p, color = model), size = 3
) +
facet_wrap(~temp, ncol = 1) +
labs(y = "Estimated probabilities", x = "Rating", color = "model") The category-specific effects have not made a notable difference to the estimated probabilities. The constant effects model may even be better, meaning that it is a valid assumption for these data, and category-specific effects are excessive.

## Conclusion

The brms package is an awesome tool for fitting Bayesian models in Stan. Though it requires a more thoughtful (what are my priors?) and slow (Markov chain Monte Carlo can take a long time) approach, I find Bayesian inference far more intuitive than frequentest null hypothesis significance testing.

For instance, consider the cumulative link regression with contact and temp:

tab_brm_clmm_rating_contact_temp %>%
select(-brm default priors) %>%
gt()

term brm weak priors clmm
b_Intercept -1.39 (0.57) -1.62 (0.68)
b_Intercept 1.3 (0.53) 1.51 (0.6)
b_Intercept 3.65 (0.67) 4.23 (0.81)
b_Intercept 5.32 (0.78) 6.09 (0.97)
b_contactyes 1.6 (0.47) 1.83 (0.51)
b_tempwarm 2.67 (0.51) 3.06 (0.6)
sd_judge__Intercept 1.08 (0.45) 1.13

With the Bayesian model, not only do we have the full posterior to work with, we can make conclusions like:

Under weakly regularizing priors, temperature probably affects wine rating $$\beta_{\text{temp}}$$ = 2.67, 95% credible interval = 1.72, 3.71.

Versus the frequentist model:

There is evidence against the null hypothesis that $$\beta_{\text{temp}} = 0$$, $$p < 0.001$$. The point estimate from the model was 3.06, 95% confidence interval = 1.9, 4.23.

With frequentist inference, we find the probability of the data assuming the null hypothesis is true, $$P(\text{data}|H_0)$$. With Bayesian inference, we find the probability of a hypothesis given the data, $$P(H|\text{data})$$, which means we don’t even have to consider the “null world.” This makes interpretation much easier.

## Reproducibility

Session info
 setting  value
version  R version 4.1.2 (2021-11-01)
os       Windows 10 x64
system   x86_64, mingw32
ui       RTerm
language (EN)
tz       America/Curacao
date     2022-01-10                  
Git repository
Local:    main C:/Users/tdunn/Documents/tdunn
Remote:   main @ origin (https://github.com/taylordunn/tdunn)
Head:     [da88cbd] 2022-01-08: Build on cloud
Bürkner, Paul-Christian, and Matti Vuorre. 2019. Ordinal Regression Models in Psychology: A Tutorial.” Advances in Methods and Practices in Psychological Science 2 (1): 77–101. https://doi.org/10.1177/2515245918823199.
Randall, J. H. 1989. The Analysis of Sensory Data by Generalized Linear Model.” Biometrical Journal 31 (7): 781–93. https://doi.org/10.1002/bimj.4710310703.

### Citation

For attribution, please cite this work as

Dunn (2020, March 17). tdunn: Ordinal regression in R: part 2. Retrieved from https://tdunn.ca/posts/2020-03-17-ordinal-regression-in-r-part-2/

BibTeX citation

@misc{dunn2020ordinal,
author = {Dunn, Taylor},
title = {tdunn: Ordinal regression in R: part 2},
url = {https://tdunn.ca/posts/2020-03-17-ordinal-regression-in-r-part-2/},
year = {2020}
}