Bars = observed proportions. Red triangles = OLS-implied category probabilities (treating the OLS fit as a continuous normal and binning it at half-integer edges, k ± 0.5). Purple dots = proportional-odds (PO) cumulative probit predicted probabilities. Green diamonds (when 'Also fit category-specific cumulative probit' is on) = the category-specific (CS) probit's saturated fit. Orange squares (when 'Also fit heteroscedastic PO probit' is on) = heteroscedastic PO probit, which fits a per-arm latent variance and is the right tool for the variance-only-shift scenario in Tour 4. Gaps between the OLS triangles and the bars show where OLS is compressing the tails or missing the shape.
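As a rough illustration of how the red triangles could be computed, the sketch below bins a fitted normal at half-integer edges. The function name `ols_implied_probs` and the example values of `mu` and `sigma` are hypothetical, and it is an assumption here that each category k receives only the mass in [k − 0.5, k + 0.5], with the tails left unassigned.

```python
from statistics import NormalDist

def ols_implied_probs(mu, sigma, n_categories):
    """Mass of N(mu, sigma^2) in [k - 0.5, k + 0.5] for k = 1..n_categories.

    Assumption: tail mass outside [0.5, n_categories + 0.5] is not folded
    into the end categories, so the result can sum to less than 1.
    """
    dist = NormalDist(mu, sigma)
    return [dist.cdf(k + 0.5) - dist.cdf(k - 0.5)
            for k in range(1, n_categories + 1)]

# Hypothetical fitted mean and residual SD for one arm on a 1..5 scale
probs = ols_implied_probs(mu=3.0, sigma=1.0, n_categories=5)
for k, p in enumerate(probs, start=1):
    print(f"category {k}: {p:.3f}")
```

Comparing these values with the observed proportions category by category is what the gaps in the figure visualize.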

Coefficients

OLS and the PO probit each summarize the treatment effect in a single coefficient with a familiar p-value. The category-specific probit cannot, because its effect varies by threshold, so the contrast table below shows per-cell estimates with 95% CIs instead. For most applied decisions, those per-cell numbers (in proportions of users by response category) are the deliverable. The OLS coefficient and its p-value describe something else: a scale-point shift in the mean of an ordinal variable, which is commonly what decisions ride on but, as we argue throughout the series, rarely what they should ride on.

Interaction tests

Model fit (AIC / BIC on a common categorical scale)

Log-likelihood, degrees of freedom, AIC, and BIC for each model fit on the same simulated data. Lower AIC and BIC indicate a better fit after penalizing the number of parameters. OLS's log-likelihood is computed on the categorical scale (integrating the fitted normal over half-integer bins around each integer category), not on the native Gaussian scale, so all models are compared on a common likelihood scale. Under proportional odds with a near-normal baseline (e.g., Tour 1) the PO probit and OLS often score similarly; under bimodal, skewed, or PO-violating settings the cumulative probit family typically pulls ahead by a substantial margin.
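The categorical-scale likelihood for OLS can be sketched as follows. This is a minimal illustration, not the app's code: the function names, the toy data, and the parameter count of 3 (intercept, treatment slope, residual SD) are all assumptions made for the example.

```python
import math
from statistics import NormalDist

def categorical_loglik(y, mu, sigma):
    """Sum of log P(Y = y_i), where P is the mass of N(mu_i, sigma^2)
    in the half-integer bin [y_i - 0.5, y_i + 0.5]."""
    total = 0.0
    for y_i, mu_i in zip(y, mu):
        dist = NormalDist(mu_i, sigma)
        p = dist.cdf(y_i + 0.5) - dist.cdf(y_i - 0.5)
        total += math.log(max(p, 1e-300))  # guard against log(0)
    return total

def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_obs):
    return n_params * math.log(n_obs) - 2 * loglik

# Made-up ratings on a 1..5 scale with hypothetical fitted means per arm
y = [1, 2, 3, 3, 4, 5, 2, 4]
mu = [2.5, 2.5, 2.5, 2.5, 3.5, 3.5, 3.5, 3.5]
ll = categorical_loglik(y, mu, sigma=1.2)
n_params = 3  # assumed: intercept, treatment slope, residual SD
print(f"logLik={ll:.2f}  AIC={aic(ll, n_params):.2f}  "
      f"BIC={bic(ll, n_params, len(y)):.2f}")
```

Because every model's likelihood is then a product of category probabilities, the resulting AIC and BIC values are directly comparable across models.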

Treatment contrast by category (with 95% CIs)

The contrast is P(Y = k | Model B) − P(Y = k | Model A), by response category (and subgroup, if the covariate is on). Probit Δ uses whichever cumulative-probit variant is enabled: the category-specific probit when its checkbox is on, the heteroscedastic PO probit when that one is on, and the standard PO probit otherwise. ⚠ flags categories where the OLS-implied contrast has the opposite sign from the probit point estimate — those are the cases where the choice of model not only changes the size of the reported effect, but flips its direction.
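The contrast and the ⚠ flag can be illustrated with a small sketch. The function name `contrast_table` and all probabilities below are made up for the example, and the flag rule is assumed to be a strict sign disagreement between the two point estimates (a contrast of exactly zero is not flagged).

```python
def contrast_table(p_a_ols, p_b_ols, p_a_probit, p_b_probit):
    """Per-category contrast (arm B minus arm A) for OLS and probit,
    plus a flag where the two contrasts have strictly opposite signs."""
    rows = []
    per_category = zip(p_a_ols, p_b_ols, p_a_probit, p_b_probit)
    for k, (a_o, b_o, a_p, b_p) in enumerate(per_category, start=1):
        d_ols = b_o - a_o
        d_probit = b_p - a_p
        flipped = d_ols * d_probit < 0
        rows.append((k, d_ols, d_probit, flipped))
    return rows

# Made-up predicted probabilities for a 5-category outcome
rows = contrast_table(
    p_a_ols=[0.10, 0.20, 0.40, 0.20, 0.10],
    p_b_ols=[0.05, 0.15, 0.40, 0.25, 0.15],
    p_a_probit=[0.12, 0.18, 0.40, 0.18, 0.12],
    p_b_probit=[0.15, 0.10, 0.35, 0.22, 0.18],
)
flagged = [k for k, _, _, flipped in rows if flipped]
for k, d_ols, d_probit, flipped in rows:
    mark = " (sign flip)" if flipped else ""
    print(f"k={k}: OLS {d_ols:+.2f}, probit {d_probit:+.2f}{mark}")
```

In this toy input only the lowest category is flagged: OLS implies fewer responses there under arm B, while the probit implies more.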

100%-stacked bars show the conditional response distribution in each arm (and subgroup, if the covariate is on). For ordinal outcomes this is usually a cleaner summary than the mean.

What does each model predict at the observed treatment effect (1×) and at the dialed counterfactual multiplier (presets default to 3×; slide back toward 1× to collapse the comparison)? The layout depends on the sidebar configuration.

Covariate off: four panels. The top row shows OLS's predictive normal and the predicted CDF; the bottom row shows per-category bar comparisons for OLS and the cumulative-probit family, with observed proportions in orange. The cumulative-probit panel uses the standard PO probit by default and switches to the heteroscedastic PO probit when that checkbox is on. Out-of-bounds "<1" and ">K" columns are flagged grey; the cumulative probit always assigns 0 mass there, whereas OLS does not.
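To see why OLS can put mass in the "<1" and ">K" columns, the sketch below computes the tail mass of a fitted normal outside the response scale. The function name and the example values are hypothetical, and the cutoffs at 0.5 and K + 0.5 are an assumption carried over from the half-integer binning.

```python
from statistics import NormalDist

def ols_out_of_bounds_mass(mu, sigma, n_categories):
    """Mass a fitted N(mu, sigma^2) places below category 1 and above
    category K, assuming cutoffs at 0.5 and n_categories + 0.5."""
    dist = NormalDist(mu, sigma)
    below = dist.cdf(0.5)
    above = 1.0 - dist.cdf(n_categories + 0.5)
    return below, above

# Hypothetical arm with a high mean on a 1..5 scale
below, above = ols_out_of_bounds_mass(mu=4.2, sigma=1.3, n_categories=5)
print(f"P(<1) = {below:.4f}, P(>K) = {above:.4f}")
```

A cumulative probit has no analogue of these two numbers, because its outermost cutpoints are −∞ and +∞ and all mass falls in categories 1 through K.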

Covariate on: the figure switches to a 2×2 bar grid (one row per subgroup, one column per model). cf_mult scales each subgroup's treatment effect separately — β_treatment for the reference subgroup, β_treatment + β_interaction for the other — so the figure shows how the heterogeneous effect extrapolates. PDF and CDF panels are dropped in this mode; the per-subgroup bar contrasts carry the story.
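The per-subgroup scaling can be sketched for a PO probit as below. This is an illustration under stated assumptions: the parameterization P(Y ≤ k) = Φ(τ_k − η), the cutpoints, and the coefficient values are all hypothetical, and any subgroup main effect is omitted to keep the example short. Only the name `cf_mult` comes from the text above.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def po_probit_probs(thresholds, eta):
    """Category probabilities under P(Y <= k) = Phi(tau_k - eta)."""
    cum = [STD_NORMAL.cdf(t - eta) for t in thresholds] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

def scaled_effect(beta_treatment, beta_interaction, in_other_subgroup, cf_mult):
    """Treatment effect for one subgroup, multiplied by cf_mult."""
    effect = beta_treatment + (beta_interaction if in_other_subgroup else 0.0)
    return cf_mult * effect

thresholds = [-1.5, -0.5, 0.5, 1.5]  # hypothetical cutpoints, K = 5
for in_other in (False, True):
    for cf_mult in (1.0, 3.0):
        eta = scaled_effect(0.2, 0.3, in_other, cf_mult)
        probs = po_probit_probs(thresholds, eta)
        label = "other" if in_other else "reference"
        print(label, cf_mult, [round(p, 3) for p in probs])
```

With these made-up coefficients the non-reference subgroup starts from a larger effect, so multiplying by 3 shifts its predicted distribution toward the top category faster than the reference subgroup's.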