Ordinal regression in R: part 2

regression, ordinal, Bayesian statistics

A theoretical and applied walkthrough of ordinal regression. Part 2: the Bayesian approach with brms.

Taylor Dunn
2020-03-17
R packages
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(dunnr)
library(gt)
library(broom)
library(broom.mixed)
library(patchwork)

extrafont::loadfonts(device = "win", quiet = TRUE)
theme_set(theme_td())
set_geom_fonts()
set_palette()

wine_red <- "#58181F"
update_geom_defaults("point", list(color = wine_red))
update_geom_defaults("line", list(color = wine_red))

This is part 2 of me learning ordinal regression in R. Previously, I took the frequentist approach with the ordinal package. Here, I’ll use the brms package to fit Bayesian mixed models via Stan.

Though I won’t be reproducing their examples, Bürkner and Vuorre (2019) give a great tutorial on using brms for ordinal regression models. They also frame the cumulative model in terms of a latent (not observable) continuous variable \(\tilde{y}\), which is categorized into the observed ordinal variable \(y\). I found this way of thinking very intuitive, and it helped make a lot of the concepts click.
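To make that latent-variable intuition concrete, here is a minimal sketch: draws of a latent logistic variable \(\tilde{y}\) are cut at four thresholds (hypothetical values here, chosen near the estimates that come later) to produce a 5-level ordinal \(y\).

# Cutting a latent continuous variable at thresholds yields ordinal y
set.seed(42)
theta <- c(-2.8, -0.6, 1.1, 2.4) # hypothetical thresholds (logit scale)
y_tilde <- rlogis(1e4, location = 0, scale = 1)
y <- cut(y_tilde, breaks = c(-Inf, theta, Inf), labels = 1:5)
round(prop.table(table(y)), 2)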

This post also serves as practice in Bayesian inference, so I’ll compare the results here to those from part 1 and explore different choices of prior distributions.

Setup

Load brms, tidybayes and the wine data from Randall (1989) that was analyzed in part 1.

library(brms)
library(tidybayes)
# Detect and set the number of cores for MCMC
options(mc.cores = parallel::detectCores())

library(ordinal)
data(wine)
wine <- as_tibble(wine)

Fitting the models

I will be fitting these models of wine bitterness ratings:

\[ \begin{align} \text{logit}(p(y_i \leq j)) &= \theta_j - u( \text{judge}_i) \\ \text{logit}(p(y_i \leq j)) &= \theta_j - \beta_1 \text{temp}_i - \beta_2 \text{contact}_i - u( \text{judge}_i) \\ \end{align} \]

where \(p(y_i \leq j)\) is the probability of a rating less than or equal to \(j\), \(\theta_j\) are the \(J - 1 = 4\) thresholds for the \(J = 5\) levels, \(u(\text{judge}_i)\) are judge-specific random effects, and \(\beta_1\) and \(\beta_2\) are fixed effect coefficients for \(\text{temp}_i\) and \(\text{contact}_i\). (See part 1 for more details.)

f_rating_null <- rating ~ 1 + (1|judge)
f_rating_contact_temp <- rating ~ 1 + contact + temp + (1|judge)

Null model

We will start with the “null” model, with just thresholds and random effects. The default priors for this model are:

get_prior(f_rating_null, data = wine, family = cumulative("logit"))
                prior     class      coef group resp dpar nlpar bound       source
 student_t(3, 0, 2.5) Intercept                                            default
 student_t(3, 0, 2.5) Intercept         1                             (vectorized)
 student_t(3, 0, 2.5) Intercept         2                             (vectorized)
 student_t(3, 0, 2.5) Intercept         3                             (vectorized)
 student_t(3, 0, 2.5) Intercept         4                             (vectorized)
 student_t(3, 0, 2.5)        sd                                            default
 student_t(3, 0, 2.5)        sd           judge                       (vectorized)
 student_t(3, 0, 2.5)        sd Intercept judge                       (vectorized)

\[ \begin{align} \text{logit}(p(y_i \leq j)) &= \theta_j - u( \text{judge}_i) \\ \theta_j &\sim \text{Student-}t(3, 0, 2.5) \\ u(\text{judge}_i) &\sim \text{Normal}(0, \sigma_u) \\ \sigma_u &\sim \text{Student-} t(3, 0, 2.5) \end{align} \]

Fit the model:

brm_rating_null <-
  brm(
    f_rating_null,
    data = wine,
    family = cumulative("logit"),
    sample_prior = TRUE,
    file = "brm-rating-null"
  )

Visualize the two priors (on the thresholds/Intercepts, and on the SD of the judge effects) with brms::prior_draws():

prior_draws(brm_rating_null) %>%
  pivot_longer(cols = everything(), names_to = "term", values_to = "value") %>%
  mutate(samples = "model prior samples") %>%
  ggplot(aes(x = value, y = samples, fill = samples)) +
  geom_violin() +
  # Also visualize some random samples from the t-distribution for comparison
  geom_violin(
    data = tibble(
      value = rstudent_t(n = 4000, df = 3, mu = 0, sigma = 2.5),
      samples = "student_t(3, 0, 2.5)", term = "Intercept"
    )
  ) +
  geom_violin(
    data = tibble(
      value = abs(rstudent_t(n = 4000, df = 3, mu = 0, sigma = 2.5)),
      samples = "abs(student_t(3, 0, 2.5))", term = "sd_judge"
    )
  ) +
  labs(y = NULL) +
  facet_wrap(~term, scales = "free_x") +
  coord_cartesian(xlim = c(-15, 15)) +
  theme(legend.position = "none") +
  add_facet_borders()

Note that, since a standard deviation can’t be negative, brms automatically restricts the default student_t(3, 0, 2.5) prior to non-negative values (a half-t), which is why I took the absolute value for the comparison samples.
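We can double-check which priors were actually applied in the fit with prior_summary(), a standard brms helper:

prior_summary(brm_rating_null)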

These priors aren’t terrible. For instance, intercepts (the thresholds \(\theta_j\)) between -10 and +10 correspond to the following probabilities:

tibble(theta = seq(-10, 10)) %>%
  mutate(p = inv_logit_scaled(theta)) %>%
  ggplot(aes(x = theta, y = p)) +
  geom_line(size = 1)

Any prior values outside of this range correspond to essentially zero (or one) cumulative probability for a level \(\leq j\).

Now that we’ve thought about our (default) prior assumptions, let’s investigate the chains:

plot(brm_rating_null, ask = FALSE,
     # Only plot specified variables
     variable = "^b_|^sd_", regex = TRUE)

The intercept trace plots look good to me. There are some spikes in sd_judge__Intercept, but not enough to be concerning.

Print the model estimates:

brm_rating_null
 Family: cumulative 
  Links: mu = logit; disc = identity 
Formula: rating ~ 1 + (1 | judge) 
   Data: wine (Number of observations: 72) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Group-Level Effects: 
~judge (Number of levels: 9) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.73      0.41     0.07     1.64 1.00     1339     1614

Population-Level Effects: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1]    -2.79      0.54    -3.93    -1.81 1.00     2240     2206
Intercept[2]    -0.56      0.36    -1.30     0.12 1.00     2777     2750
Intercept[3]     1.10      0.38     0.39     1.87 1.00     2986     2927
Intercept[4]     2.42      0.49     1.54     3.47 1.00     3761     2958

Family Specific Parameters: 
     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc     1.00      0.00     1.00     1.00   NA       NA       NA

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The Rhat values are also a good sign of model convergence.
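If we want the Rhat values themselves rather than the printed summary, brms provides an rhat() method. A quick sketch:

# Extract Rhat for every parameter and flag anything above 1.01
rhats <- rhat(brm_rating_null)
rhats[rhats > 1.01]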

Compare the null Bayesian model estimates to the frequentist estimates:

# Can't figure out how to extract random effect SDs from a clmm model, use clmm2
clmm2_rating_null <-
  clmm2(
    rating ~ 1, random = judge,
    data = wine, link = "logistic", Hess = TRUE
  )
# Unfortunately, clmm2 doesn't have a broom::tidy() function
summary(clmm2_rating_null) %>%
  coef() %>%
  as_tibble() %>%
  mutate(term = str_c("b_Intercept[", 1:4, "]")) %>%
  bind_rows(
    tibble(
      Estimate = as.numeric(clmm2_rating_null$stDev),
      term = "sd_judge__Intercept"
    )
  ) %>%
  janitor::clean_names() %>%
  left_join(
    broom.mixed::tidyMCMC(brm_rating_null, conf.int = TRUE),
    by = "term"
  ) %>%
  relocate(term) %>%
  mutate(
    pr_z = scales::pvalue(pr_z),
    across(where(is.numeric), ~round(., 2))
  ) %>%
  gt() %>%
  tab_spanner(
    label = "ordinal::clmm",
    columns = c(estimate.x, std_error, z_value, pr_z)
  ) %>%
  tab_spanner(
    label = "brms::brm",
    columns = c(estimate.y, std.error, conf.low, conf.high)
  ) %>%
  fmt_missing(columns = everything(), missing_text = "")
                             ordinal::clmm                      brms::brm
term                estimate std_error z_value   pr_z  estimate std.error conf.low conf.high
b_Intercept[1]         -2.72      0.52   -5.25 <0.001     -2.76      0.54    -3.93     -1.81
b_Intercept[2]         -0.54      0.32   -1.72  0.085     -0.55      0.36    -1.30      0.12
b_Intercept[3]          1.10      0.34    3.24  0.001      1.10      0.38     0.39      1.87
b_Intercept[4]          2.35      0.46    5.13 <0.001      2.40      0.49     1.54      3.47
sd_judge__Intercept     0.57                               0.69      0.41     0.07      1.64

The frequentist estimates are pretty close to the Bayesian estimates under the default priors.

Choice of priors

So what are reasonable priors for this data and model? My go-to resource for this kind of thing is this page from the Stan wiki, but it isn’t much help here: under “Prior for cutpoints in ordered logit or probit regression,” it offers a couple of suggestions like “uniform priors typically should be ok,” but also notes “Need to flesh out this section with examples.”

I’ll consider the following priors on the thresholds:

\[ \begin{align} \theta_j &\sim \text{Student-}t(3, 0, 5.0) \\ \theta_j &\sim \text{Student-}t(3, 0, 2.5) \\ \theta_j &\sim \text{Student-}t(3, 0, 1.0) \\ \theta_j &\sim \text{Normal}(0, 5.0) \\ \theta_j &\sim \text{Normal}(0, 2.5) \\ \theta_j &\sim \text{Normal}(0, 1.0) \\ \end{align} \]

and simulate some corresponding cumulative probabilities:

bind_rows(
  tibble(sigma = c(1, 2.5, 5), dist = "Normal") %>%
    mutate(
      prior = str_c("Normal(0, ", sigma, ")"),
      samples = map(sigma, ~rnorm(2000, mean = 0, sd = .x))
    ),
  tibble(sigma = c(1, 2.5, 5), dist = "Student-t") %>%
    mutate(prior = str_c("Student-t(3, 0, ", sigma, ")"),
           samples = map(sigma, ~rstudent_t(2000, df = 3, mu = 0, sigma = .x)))
  ) %>%
  #mutate(prior = fct_reorder2(prior, sigma, dist)) %>%
  unnest(samples) %>%
  mutate(p = inv_logit_scaled(samples)) %>%
  ggplot(aes(x = p)) +
  geom_histogram(fill = wine_red, binwidth = 0.1) +
  facet_wrap(~prior) +
  scale_y_continuous(expand = c(0, 0)) +
  add_facet_borders()

The \(\text{Normal}(0, 5)\) and \(\text{Student-}t(3, 0, 5)\) priors place most of the samples at the extremes (cumulative probabilities of 0 and 100%). The \(\text{Normal}(0, 2.5)\) and \(\text{Student-}t(3, 0, 2.5)\) are sensible default priors because they are fairly uniform in the probability space, although they do have slight peaks at the extremes. The \(\text{Normal}(0, 1)\) and \(\text{Student-}t(3, 0, 1)\) prior distributions are bell-shaped around \(p = 0.5\), and might be appropriate if we have reason to believe that there will be no extreme probabilities.

As for the scale parameter \(\sigma_u\) describing the variation in the judge-specific random effects, I’ve seen the half-Cauchy distribution recommended:

tibble(scale = c(1, 2.5, 5)) %>%
  crossing(x = seq(0, 10, 0.1)) %>%
  mutate(
    prior = str_c("Half-Cauchy(0, ", scale, ")"),
    dens = dcauchy(x, location = 0, scale = scale)
  ) %>%
  ggplot(aes(x, y = dens)) +
  geom_line(aes(color = prior), size = 1) +
  theme(legend.position = c(0.6, 0.6), axis.ticks.y = element_blank(),
        axis.text.y = element_blank())

This distribution is fairly conservative, with a long right tail that allows for large values.
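To put numbers on that tail (a quick check, using the fact that the \(p\)-th quantile of a half-Cauchy is the \((1 + p)/2\) quantile of the corresponding Cauchy):

# Half-Cauchy(0, 2.5) quantiles: 50th, 90th, and 99th percentiles
p <- c(0.5, 0.9, 0.99)
round(qcauchy((1 + p) / 2, location = 0, scale = 2.5), 1)

Half of the prior mass is below 2.5, but 1% of it lies beyond ~159, so very large judge effects are not ruled out.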

I don’t think there is anything wrong with the default priors, but just to show how it is done, I’ll fit the null model with the following:

\[ \begin{align} \text{logit}(p(y_i \leq j)) &= \theta_j - u( \text{judge}_i) \\ \theta_j &\sim \text{Normal}(0, 1.5) \\ u(\text{judge}_i) &\sim \text{Normal}(0, \sigma_u) \\ \sigma_u &\sim \text{Half-Cauchy}(0, 2.5) \end{align} \]

prior_rating_null <- c(
  prior(normal(0, 1.5), class = Intercept),
  prior(cauchy(0, 2.5), class = sd)
)
brm_rating_null_alt_prior <-
  brm(
    f_rating_null,
    prior = prior_rating_null,
    data = wine,
    family = cumulative("logit"),
    file = "brm-rating-null-alt-prior"
  )
plot(brm_rating_null_alt_prior, ask = FALSE,
     variable = "^b_|^sd_", regex = TRUE)

As expected, these priors don’t noticeably improve the model convergence (which was already good). Likewise, the model estimates changed only slightly:

tidyMCMC(brm_rating_null) %>%
  mutate(priors = "default priors") %>%
  bind_rows(
    tidyMCMC(brm_rating_null_alt_prior) %>%
      mutate(priors = "alternative priors")
  ) %>%
  filter(!str_starts(term, "r_|prior_|disc")) %>%
  transmute(
    term,
    estimate_se = str_c(round(estimate, 2), " (", round(std.error, 2), ")"),
    priors
  ) %>%
  pivot_wider(names_from = priors, values_from = estimate_se) %>%
  gt()
term                default priors alternative priors
b_Intercept[1]      -2.76 (0.54)   -2.63 (0.48)
b_Intercept[2]      -0.55 (0.36)   -0.53 (0.33)
b_Intercept[3]      1.1 (0.38)     1.04 (0.33)
b_Intercept[4]      2.4 (0.49)     2.28 (0.43)
sd_judge__Intercept 0.69 (0.41)    0.64 (0.39)

Fixed effects

We now add the “treatment” effects of temp and contact:

\[ \begin{align} \text{logit}(p(y_i \leq j)) &= \theta_j - \beta_1 \text{temp}_i - \beta_2 \text{contact}_i - u( \text{judge}_i) \\ \end{align} \]

This introduces two new priors we can specify:

get_prior(f_rating_contact_temp, data = wine, family = cumulative("logit")) %>%
  as_tibble() %>%
  select(prior, class, coef, group) %>%
  print(n = 11)
# A tibble: 11 x 4
   prior                  class     coef         group  
   <chr>                  <chr>     <chr>        <chr>  
 1 ""                     b         ""           ""     
 2 ""                     b         "contactyes" ""     
 3 ""                     b         "tempwarm"   ""     
 4 "student_t(3, 0, 2.5)" Intercept ""           ""     
 5 ""                     Intercept "1"          ""     
 6 ""                     Intercept "2"          ""     
 7 ""                     Intercept "3"          ""     
 8 ""                     Intercept "4"          ""     
 9 "student_t(3, 0, 2.5)" sd        ""           ""     
10 ""                     sd        ""           "judge"
11 ""                     sd        "Intercept"  "judge"

We know from part 1 that tempwarm (\(\beta_1\)) and contactyes (\(\beta_2\)) are associated with higher ratings, but we shouldn’t bias our priors by using the same data we are modeling. Instead, use weakly regularizing normal distributions centered at 0:

\[ \begin{align} \text{logit}(p(y_i \leq j)) &= \theta_j - \beta_1 \text{temp}_i - \beta_2 \text{contact}_i - u( \text{judge}_i) \\ \beta_1 &\sim \text{Normal}(0, 5) \\ \beta_2 &\sim \text{Normal}(0, 5) \\ \theta_j &\sim \text{Normal}(0, 1.5) \\ u(\text{judge}_i) &\sim \text{Normal}(0, \sigma_u) \\ \sigma_u &\sim \text{Half-Cauchy}(0, 2.5) \end{align} \]
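As a quick sanity check that \(\text{Normal}(0, 5)\) really is weak on the log-odds scale: a coefficient one prior standard deviation away from zero already implies an enormous odds ratio.

# Odds ratios implied by a coefficient of -5 or +5 (one prior SD)
exp(c(-5, 5))

These work out to about 0.007 and 148, so the prior does very little to constrain the treatment effects. Define the priors and fit: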

prior_rating_contact_temp <-
  c(prior_rating_null,
    prior(normal(0, 5), class = b))
brm_rating_contact_temp <-
  brm(
    f_rating_contact_temp,
    prior = prior_rating_contact_temp,
    data = wine,
    family = cumulative("logit"),
    file = "brm-rating-contact-temp-weak-prior"
  )
# Also fit using the default priors
brm_rating_contact_temp_default_prior <-
  brm(
    f_rating_contact_temp,
    data = wine,
    family = cumulative("logit"),
    sample_prior = TRUE,
    file = "brm-rating-contact-temp-default-prior"
  )
brm_rating_contact_temp
 Family: cumulative 
  Links: mu = logit; disc = identity 
Formula: rating ~ 1 + contact + temp + (1 | judge) 
   Data: wine (Number of observations: 72) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Group-Level Effects: 
~judge (Number of levels: 9) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.14      0.45     0.46     2.20 1.00     1553     2226

Population-Level Effects: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1]    -1.40      0.57    -2.55    -0.31 1.00     3148     3209
Intercept[2]     1.30      0.53     0.26     2.33 1.00     3604     3207
Intercept[3]     3.66      0.67     2.40     5.02 1.00     2992     2650
Intercept[4]     5.34      0.78     3.88     6.91 1.00     2974     3058
contactyes       1.61      0.47     0.69     2.54 1.00     4020     2752
tempwarm         2.69      0.51     1.72     3.71 1.00     3161     2589

Family Specific Parameters: 
     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc     1.00      0.00     1.00     1.00   NA       NA       NA

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Compare these estimates to those from clmm and from the model with default priors:

clmm2_rating_contact_temp <-
  clmm2(
    rating ~ contact + temp, random = judge,
    data = wine, link = "logistic", Hess = TRUE
  )
tab_brm_clmm_rating_contact_temp <- tidyMCMC(brm_rating_contact_temp) %>%
  mutate(model = "brm weak priors") %>%
  bind_rows(
    tidyMCMC(brm_rating_contact_temp_default_prior) %>%
      mutate(model = "brm default priors")
  ) %>%
  filter(!str_detect(term, "r_judge|lp__|prior|disc")) %>%
  transmute(
    term, model,
    estimate_se = str_c(round(estimate, 2), " (", round(std.error, 2), ")")
  ) %>%
  bind_rows(
    summary(clmm2_rating_contact_temp) %>%
      coef() %>%
      as_tibble() %>%
      mutate(term = c(str_c("b_Intercept[", 1:4, "]"),
                      "b_contactyes", "b_tempwarm")) %>%
      bind_rows(
        tibble(
          Estimate = as.numeric(clmm2_rating_contact_temp$stDev),
          term = "sd_judge__Intercept"
        )
      ) %>%
      janitor::clean_names() %>%
      transmute(
        model = "clmm", term,
        estimate_se = ifelse(
          !is.na(std_error),
          str_c(round(estimate, 2), " (", round(std_error, 2), ")"),
          round(estimate, 2)
        )
      )
  ) %>%
  pivot_wider(names_from = model, values_from = estimate_se)
gt(tab_brm_clmm_rating_contact_temp)
term                brm weak priors brm default priors clmm
b_Intercept[1]      -1.39 (0.57)    -1.7 (0.78)        -1.62 (0.68)
b_Intercept[2]      1.3 (0.53)      1.45 (0.65)        1.51 (0.6)
b_Intercept[3]      3.65 (0.67)     4.17 (0.83)        4.23 (0.81)
b_Intercept[4]      5.32 (0.78)     6.06 (0.97)        6.09 (0.97)
b_contactyes        1.6 (0.47)      1.85 (0.52)        1.83 (0.51)
b_tempwarm          2.67 (0.51)     3.1 (0.6)          3.06 (0.6)
sd_judge__Intercept 1.08 (0.45)     1.36 (0.56)        1.13

The estimates from the Bayesian regression with default priors are very close to the frequentist estimates.

But is either model actually a better fit to the data? We can compare them with leave-one-out cross-validation (LOOCV), based on the pointwise log-likelihoods of the posterior draws:

loo(brm_rating_contact_temp, brm_rating_contact_temp_default_prior)
Output of model 'brm_rating_contact_temp':

Computed from 4000 by 72 log-likelihood matrix

         Estimate  SE
elpd_loo    -85.2 4.3
p_loo        10.3 0.9
looic       170.4 8.5
------
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

Output of model 'brm_rating_contact_temp_default_prior':

Computed from 4000 by 72 log-likelihood matrix

         Estimate   SE
elpd_loo    -85.3  5.3
p_loo        12.7  1.3
looic       170.5 10.6
------
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

Model comparisons:
                                      elpd_diff se_diff
brm_rating_contact_temp                0.0       0.0   
brm_rating_contact_temp_default_prior -0.1       1.1   

The output here is the expected log predictive density (elpd_loo), the estimated effective number of parameters (p_loo), and the LOOCV information criterion (looic). Lower values of looic indicate better model fit. These are essentially equal, so the weak priors have not made a large difference to the model fit.
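Relatedly (an extra check, not part of the comparison above), LOO-based stacking weights tell the same story; roughly equal weights would mean neither model predicts clearly better:

# Stacking weights for the two fits, based on LOO
loo_model_weights(brm_rating_contact_temp, brm_rating_contact_temp_default_prior)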

Aside: adjacent-category models

Here is how Bürkner and Vuorre (2019) describe the adjacent-category class of ordinal models.

Mathematically:

\[ \text{Pr}(Y = k | Y \in \{k, k+1\}) = F(\tau_k) \]

Suppose the latent variable \(\tilde{Y}_2\) is standard normally distributed with distribution function \(\Phi\), and \(\tau_2\) = 1. Then the probability of choosing \(Y\) = 2 over \(Y\) = 3 would be written as:

\[ \text{Pr}(Y = 2 | Y \in \{2, 3\}) = \Phi(\tau_2) = \Phi(1) = 0.84 \]

Try fitting the wine rating model (with contact and temp) using the acat family in brms:

brm_rating_contact_temp_default_prior_acat <-
  brm(
    f_rating_contact_temp,
    family = acat(link = "logit"),
    data = wine,
    file = "brm-rating-contact-temp-default-prior-acat"
  )

How do these estimates compare to the cumulative model?

tidyMCMC(brm_rating_contact_temp_default_prior) %>%
  mutate(model = "cumulative") %>%
  bind_rows(
    tidyMCMC(brm_rating_contact_temp_default_prior_acat) %>%
      mutate(model = "adjacent-category")
  ) %>%
  filter(!str_detect(term, "r_judge|lp__|prior|disc")) %>%
  transmute(
    term, model,
    estimate_se = str_c(round(estimate, 2), " (", round(std.error, 2), ")")
  ) %>%
  pivot_wider(names_from = model, values_from = estimate_se) %>%
  gt()
term                cumulative   adjacent-category
b_Intercept[1]      -1.7 (0.78)  -1.28 (0.67)
b_Intercept[2]      1.45 (0.65)  1.19 (0.57)
b_Intercept[3]      4.17 (0.83)  3.36 (0.8)
b_Intercept[4]      6.06 (0.97)  4.02 (1.03)
b_contactyes        1.85 (0.52)  1.32 (0.4)
b_tempwarm          3.1 (0.6)    2.3 (0.53)
sd_judge__Intercept 1.36 (0.56)  1 (0.43)

brms has a convenient function, conditional_effects(), for quickly plotting estimated effects. For example, the effect of contact under each model:

ce_rating_contact_temp_default_prior_acat <-
  conditional_effects(brm_rating_contact_temp_default_prior_acat,
                      categorical = TRUE, re_formula = NA, ask = FALSE)
ce_rating_contact_temp_default_prior <-
  conditional_effects(brm_rating_contact_temp_default_prior,
                      categorical = TRUE, re_formula = NA, ask = FALSE)
ce_rating_contact_temp_default_prior_acat$`contact:cats__` %>%
  mutate(model = "adjacent-category") %>%
  bind_rows(
    ce_rating_contact_temp_default_prior$`contact:cats__` %>%
      mutate(model = "cumulative")
  ) %>%
  ggplot(aes(x = contact, y = estimate__, color = effect2__)) +
  geom_point(position = position_dodge(1), size = 3) +
  geom_linerange(aes(ymin = lower__, ymax = upper__),
                 position = position_dodge(1), size = 1) +
  facet_wrap(~model) +
  scale_color_viridis_d() +
  labs(y = "Estimated probabilities", color = "rating")

Category-specific effects

In all of the models specified so far, the fixed effects were presumed to affect all response categories equally. For example, temp = warm shifted each of the thresholds \(\theta_j\), \(j = 1, 2, 3, 4\), by the same mean effect \(\beta_1\) = 2.67.

This may not be an appropriate assumption. For example, temp = warm might have little relation to the highest rating, but strongly predict ratings of 3 relative to 1 or 2.

If this is a possibility, then we can model the predictor as having a category-specific effect by estimating \(K-1\) coefficients for it. The reason we’ve introduced the adjacent-category model is that it is straightforward to incorporate these effects (sequential models work as well). Cumulative models, however, can lead to negative probabilities, and so should be avoided when using category-specific effects.
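To see why, here is a small sketch with made-up numbers: if a category-specific effect shifts one threshold much more than its neighbours, the cumulative probabilities are no longer monotone, and a category “probability” goes negative.

# Hypothetical thresholds and category-specific effects (predictor x = 1)
theta <- c(-1, 0, 1, 2)
beta_j <- c(0, 2, 0, 0) # a large effect on the second threshold only
p_cum <- inv_logit_scaled(theta - beta_j)
# Category probabilities are differences of adjacent cumulative probabilities
round(diff(c(0, p_cum, 1)), 2)

Here the second category comes out with “probability” -0.15. Adjacent-category models avoid this because each category-specific effect only enters its own pairwise comparison.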

Fit the adjacent-category model with category-specific effects on temp:

\[ \text{logit}(\text{Pr}(y_i = j \,|\, y_i \in \{j, j+1\})) = \theta_j - \beta_{1j} \text{temp}_i - u(\text{judge}_i) \]

f_rating_cs_temp <- rating ~ 1 + cs(temp) + (1|judge)
brm_rating_cs_temp_default_prior_acat <-
  brm(
    f_rating_cs_temp,
    family = acat(link = "logit"),
    data = wine,
    file = "brm-rating-cs-temp-default-prior-acat"
  )

There were many divergent transitions, which is clear from the ugly trace plots:

plot(brm_rating_cs_temp_default_prior_acat, ask = FALSE,
     variable = "^b_|^sd_", regex = TRUE)

Most of the divergence is coming from estimating the lowest coefficient \(\beta_{11}\) (bcs_tempwarm[1]). I will try some regularizing priors (previously defined) and increasing the adapt_delta argument:

brm_rating_cs_temp_weak_prior_acat <-
  brm(
    f_rating_cs_temp,
    prior = prior_rating_contact_temp,
    family = acat(link = "logit"),
    data = wine,
    file = "brm-rating-cs-temp-weak-prior-acat",
    control = list(adapt_delta = 0.9)
  )
plot(brm_rating_cs_temp_weak_prior_acat, ask = FALSE,
     variable = "^b_|^sd_", regex = TRUE)
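Rather than eyeballing trace plots, we can also count the divergent transitions directly (a sketch using brms::nuts_params()):

# Total divergent transitions in each fit
count_divergences <- function(fit) {
  np <- nuts_params(fit)
  sum(np$Value[np$Parameter == "divergent__"])
}
count_divergences(brm_rating_cs_temp_default_prior_acat)
count_divergences(brm_rating_cs_temp_weak_prior_acat)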

This makes a huge difference. Now compare this to a model without category-specific effects:

ce_rating_cs_temp_weak_prior_acat <-
  conditional_effects(brm_rating_cs_temp_weak_prior_acat,
                      categorical = TRUE, re_formula = NA, ask = FALSE)
brm_rating_temp_weak_prior_acat <-
  brm(
    rating ~ 1 + temp + (1|judge),
    prior = prior_rating_contact_temp,
    family = acat(link = "logit"),
    data = wine,
    file = "brm-rating-temp-weak-prior-acat"
  )
ce_rating_temp_weak_prior_acat <-
  conditional_effects(brm_rating_temp_weak_prior_acat,
                      categorical = TRUE, re_formula = NA, ask = FALSE)
ce_rating_temp_weak_prior_acat$`temp:cats__` %>%
  mutate(model = "constant effects") %>%
  bind_rows(
    ce_rating_cs_temp_weak_prior_acat$`temp:cats__` %>%
      mutate(model = "category-specific effects")
  ) %>%
  ggplot(aes(x = temp, y = estimate__, color = effect2__)) +
  geom_point(position = position_dodge(1), size = 3) +
  geom_linerange(aes(ymin = lower__, ymax = upper__),
                 position = position_dodge(1), size = 1) +
  facet_wrap(~model) +
  scale_color_viridis_d() +
  labs(y = "Estimated probabilities", color = "rating")

Or, put the probabilities for each model side-by-side, along with the empirical probabilities:

ce_rating_temp_weak_prior_acat$`temp:cats__` %>%
  mutate(model = "constant effects") %>%
  bind_rows(
    ce_rating_cs_temp_weak_prior_acat$`temp:cats__` %>%
      mutate(model = "category-specific effects")
  ) %>%
  ggplot(aes(x = effect2__, y = estimate__, color = model)) +
  geom_point(position = position_dodge(0.5), size = 3) +
  geom_linerange(aes(ymin = lower__, ymax = upper__),
                 position = position_dodge(0.5), size = 1) +
  geom_point(
    data = wine %>%
      group_by(temp, rating) %>%
      tally() %>%
      group_by(temp) %>%
      mutate(p = n / sum(n), model = "empirical"),
    aes(x = rating, y = p, color = model), size = 3
  ) +
  facet_wrap(~temp, ncol = 1) +
  labs(y = "Estimated probabilities", x = "Rating", color = "model")

The category-specific effects have not made a notable difference to the estimated probabilities. The constant effects model may even fit better, suggesting that constant effects are a reasonable assumption for these data and that category-specific effects are unnecessary.
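As one more check on fit (not shown in the comparison above), a posterior predictive check against the observed rating distribution; type = "bars" is suited to discrete outcomes:

# Posterior predictive check of the constant effects model
pp_check(brm_rating_temp_weak_prior_acat, type = "bars", ndraws = 100)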

Conclusion

The brms package is an awesome tool for fitting Bayesian models in Stan. Though it requires a more thoughtful (what are my priors?) and slower (Markov chain Monte Carlo can take a long time) approach, I find Bayesian inference far more intuitive than frequentist null hypothesis significance testing.

For instance, consider the cumulative link regression with contact and temp:

tab_brm_clmm_rating_contact_temp %>%
  select(-`brm default priors`) %>%
  gt()
term                brm weak priors clmm
b_Intercept[1]      -1.39 (0.57)    -1.62 (0.68)
b_Intercept[2]      1.3 (0.53)      1.51 (0.6)
b_Intercept[3]      3.65 (0.67)     4.23 (0.81)
b_Intercept[4]      5.32 (0.78)     6.09 (0.97)
b_contactyes        1.6 (0.47)      1.83 (0.51)
b_tempwarm          2.67 (0.51)     3.06 (0.6)
sd_judge__Intercept 1.08 (0.45)     1.13

With the Bayesian model, not only do we have the full posterior to work with, we can make conclusions like:

Under weakly regularizing priors, temperature probably increases wine bitterness ratings (\(\beta_{\text{temp}}\) = 2.67, 95% credible interval [1.72, 3.71]).

Versus the frequentist model:

There is evidence against the null hypothesis that \(\beta_{\text{temp}} = 0\) (\(p < 0.001\)). The point estimate from the model was 3.06, 95% confidence interval [1.90, 4.23].

With frequentist inference, we find the probability of the data assuming the null hypothesis is true, \(P(\text{data}|H_0)\). With Bayesian inference, we find the probability of a hypothesis given the data, \(P(H|\text{data})\), which means we don’t even have to consider the “null world.” This makes interpretation much easier.
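For example, we can query the posterior directly for the probability that the temperature effect is positive, using brms::hypothesis():

# Evidence ratio and posterior probability that beta_temp > 0
hypothesis(brm_rating_contact_temp, "tempwarm > 0")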

Reproducibility

Session info
 setting  value                       
 version  R version 4.1.2 (2021-11-01)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RTerm                       
 language (EN)                        
 collate  English_Canada.1252         
 ctype    English_Canada.1252         
 tz       America/Curacao             
 date     2022-01-10                  
Git repository
Local:    main C:/Users/tdunn/Documents/tdunn
Remote:   main @ origin (https://github.com/taylordunn/tdunn)
Head:     [da88cbd] 2022-01-08: Build on cloud

References

Bürkner, Paul-Christian, and Matti Vuorre. 2019. “Ordinal Regression Models in Psychology: A Tutorial.” Advances in Methods and Practices in Psychological Science 2 (1): 77–101. https://doi.org/10.1177/2515245918823199.
Randall, J. H. 1989. “The Analysis of Sensory Data by Generalized Linear Model.” Biometrical Journal 31 (7): 781–93. https://doi.org/10.1002/bimj.4710310703.

Citation

For attribution, please cite this work as

Dunn (2020, March 17). tdunn: Ordinal regression in R: part 2. Retrieved from https://tdunn.ca/posts/2020-03-17-ordinal-regression-in-r-part-2/

BibTeX citation

@misc{dunn2020ordinal,
  author = {Dunn, Taylor},
  title = {tdunn: Ordinal regression in R: part 2},
  url = {https://tdunn.ca/posts/2020-03-17-ordinal-regression-in-r-part-2/},
  year = {2020}
}