
Unmasking the Opt-In Trap: Using Propensity Scores for Causal Inference in AI Feature Experiments

Published: 2026-05-02 18:18:03 | Category: AI & Machine Learning

Introduction: The Hidden Bias Behind AI Feature Toggles

Every product team that launches an AI-powered feature behind a user-controlled toggle eventually confronts a stubborn reality: the users who opt in are fundamentally different from those who don’t. When your latest LLM-based assistant shows a 21-percentage-point increase in task completion among adopters, it’s tempting to celebrate. But that number conflates the feature’s true effect with the pre-existing characteristics of power users who always try new tools. This opt-in trap corrupts naive comparisons and demands a more rigorous approach to causal inference.

[Image source: www.freecodecamp.org]

Consider a typical scenario: your product ships an "Enable AI agent" toggle. Dashboard metrics glow green, but heavy-engagement users click toggles without hesitation, while light users ignore them. The observed difference reflects both the feature’s impact and the selection bias of who opts in. Without adjusting for this bias, you cannot trust the numbers.

Propensity score methods offer a statistical escape hatch. By reweighting or matching users based on their likelihood of opting in, you can approximate a randomized experiment and isolate the causal effect. This article walks through the complete pipeline—from estimation to diagnostics—using a synthetic SaaS dataset where the true effect is known.

Why Opt-In Features Break Naive Comparisons

When a feature sits behind a toggle, the act of opting in is itself a choice correlated with user behavior. Users who click "Try our AI assistant" differ in engagement, technical comfort, and prior usage. A simple t-test between opt-in and opt-out groups conflates the feature's effect with these pre-existing differences. This is the selection bias that plagues every generative AI rollout, whether it's smart replies, code suggestions, or chat assistants.

In a proper A/B test, random assignment ensures two groups are comparable except for the treatment. But opt-in toggles mimic observational studies, not experiments. The pre-existing gap between groups is the measurement problem, and raw dashboard numbers cannot fix it. You need statistical tools that adjust for the non-random selection.

What Propensity Scores Actually Do

A propensity score is the probability that a user opts in, given a set of observed covariates (e.g., past usage, tenure, feature interaction frequency). These scores serve as a balancing tool: they let you create a synthetic control group that resembles the treatment group in terms of observed characteristics.
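
Formally, writing T for the binary opt-in indicator and X for the vector of observed covariates, the propensity score is defined as:

```latex
e(x) = \Pr(T = 1 \mid X = x)
```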

Two common techniques use propensity scores:

  • Inverse-probability weighting (IPW): Each user is weighted by the inverse of the probability of the treatment they actually received: 1 / e(x) for users who opted in, 1 / (1 - e(x)) for those who did not. This reweights the sample so that the distribution of covariates mimics that of a randomized trial.
  • Nearest-neighbor matching: Each treated user is paired with an untreated user who has a similar propensity score, and the outcome difference is averaged across matched pairs.

Both methods aim to reduce bias from observable confounders, but they rely on two assumptions: ignorability (all confounders are measured) and positivity (every user has a nonzero probability of opting in or not). If unmeasured confounders exist, the estimate remains biased.
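
In potential-outcomes notation, with Y(1) and Y(0) denoting a user's outcome with and without the feature, the two assumptions read:

```latex
% Ignorability: opt-in is as good as random once we condition on X
\bigl(Y(1),\, Y(0)\bigr) \;\perp\!\!\!\perp\; T \mid X

% Positivity (overlap): no user is certain to opt in or certain not to
0 < e(X) < 1
```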

The Full Pipeline: A Step-by-Step Tutorial

We simulate a SaaS platform with 50,000 users. The true causal effect of the AI feature on task completion is known by construction (5 additional tasks), while the observed naive difference is inflated by selection bias. A companion notebook on GitHub contains all the code.

Prerequisites

  • Python with pandas, numpy, scikit-learn, statsmodels, matplotlib.
  • Basic familiarity with logistic regression and bootstrap sampling.

Setting Up the Working Example

Generate a dataset where treatment assignment depends on observed covariates (e.g., login frequency, tutorial completion, days since signup). The outcome (tasks completed) is influenced by both treatment and covariates. We know the ground truth: the average treatment effect on the treated (ATT) is 5 tasks.
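
The notebook's exact data-generating process isn't reproduced here; the sketch below is one plausible way to build such a dataset. The column names (login_freq, tutorial_done, days_since_signup, opted_in, tasks_completed), distributions, and coefficients are illustrative assumptions, not the article's actual values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 50_000

# Hypothetical covariates matching the article's description in spirit.
login_freq = rng.gamma(shape=2.0, scale=3.0, size=n)      # logins per week
tutorial_done = rng.binomial(1, 0.4, size=n)              # finished onboarding tutorial
days_since_signup = rng.integers(1, 730, size=n)          # account tenure in days

# Opt-in depends on the covariates; this is the source of selection bias.
logit = -2.0 + 0.25 * login_freq + 1.0 * tutorial_done + 0.002 * days_since_signup
opted_in = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Outcome depends on the same covariates plus a constant 5-task effect,
# so ATE = ATT = 5 by construction in this sketch.
tasks_completed = (
    10.0
    + 1.5 * login_freq
    + 3.0 * tutorial_done
    + 0.01 * days_since_signup
    + 5.0 * opted_in
    + rng.normal(0.0, 4.0, size=n)
)

df = pd.DataFrame(
    {
        "login_freq": login_freq,
        "tutorial_done": tutorial_done,
        "days_since_signup": days_since_signup,
        "opted_in": opted_in,
        "tasks_completed": tasks_completed,
    }
)

naive = (
    df.loc[df.opted_in == 1, "tasks_completed"].mean()
    - df.loc[df.opted_in == 0, "tasks_completed"].mean()
)
print(f"Naive difference: {naive:.2f} tasks (truth: 5.00)")
```

Because heavy users are both more likely to opt in and more likely to complete tasks, the printed naive difference lands well above the true 5-task effect.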

[Image source: www.freecodecamp.org]

Step 1: Estimate the Propensity Score

Fit a logistic regression model where the dependent variable is the binary opt-in indicator, and the independent variables are the observed confounders. The predicted probabilities are the propensity scores. Check for overlap: if scores cluster near 0 or 1 for some users, matching becomes unstable.
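
Continuing the sketch (and its hypothetical column names), the estimation step might look like this with scikit-learn:

```python
from sklearn.linear_model import LogisticRegression

covariates = ["login_freq", "tutorial_done", "days_since_signup"]

# Fit P(opt-in | X); the predicted probabilities are the propensity scores.
ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(df[covariates], df["opted_in"])
df["propensity"] = ps_model.predict_proba(df[covariates])[:, 1]

# Overlap diagnostic: scores piling up near 0 or 1 in either group
# make both weighting and matching unstable.
print(df.groupby("opted_in")["propensity"].describe()[["min", "mean", "max"]])
```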

Step 2: Inverse-Probability Weighting

Compute weights: for treated users, weight = 1 / propensity; for control users, weight = 1 / (1 - propensity). Then calculate the weighted difference in average outcomes. This targets the average treatment effect (ATE) and is unbiased if the propensity model is correct. To target the ATT instead (the quantity whose ground truth here is 5 tasks), leave treated users unweighted and weight controls by propensity / (1 - propensity).
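
A sketch of both estimators, reusing df and the propensity column fitted in Step 1:

```python
import numpy as np

t = df["opted_in"].to_numpy()
y = df["tasks_completed"].to_numpy()
e = df["propensity"].to_numpy()

# ATE weights: 1/e for treated users, 1/(1 - e) for controls.
w_ate = np.where(t == 1, 1.0 / e, 1.0 / (1.0 - e))
ate = (
    np.average(y[t == 1], weights=w_ate[t == 1])
    - np.average(y[t == 0], weights=w_ate[t == 0])
)

# ATT weights: treated users keep weight 1; controls are reweighted by
# the odds e / (1 - e) so they resemble the treated population.
w_att_control = e[t == 0] / (1.0 - e[t == 0])
att = y[t == 1].mean() - np.average(y[t == 0], weights=w_att_control)

print(f"IPW ATE: {ate:.2f} | IPW ATT: {att:.2f} (truth: 5.00)")
```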

Step 3: Nearest-Neighbor Matching

For each treated user, find the control user with the closest propensity score (within a caliper, e.g., 0.05). Discard unmatched treated users. The ATT is the mean outcome difference across matched pairs. Matching preserves interpretability but discards data.
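
One way to implement this with scikit-learn's NearestNeighbors, again reusing the sketch's df. Note that this variant matches with replacement, so a single control can serve multiple treated users:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

treated = df[df.opted_in == 1].reset_index(drop=True)
control = df[df.opted_in == 0].reset_index(drop=True)

# For each treated user, find the control with the closest propensity score.
nn = NearestNeighbors(n_neighbors=1).fit(control[["propensity"]])
dist, idx = nn.kneighbors(treated[["propensity"]])

# Enforce the caliper: drop treated users with no control within 0.05.
caliper = 0.05
keep = dist.ravel() <= caliper

matched_diff = (
    treated.loc[keep, "tasks_completed"].to_numpy()
    - control.loc[idx.ravel()[keep], "tasks_completed"].to_numpy()
)
print(f"Matched ATT: {matched_diff.mean():.2f} "
      f"({keep.sum()} of {len(treated)} treated users matched)")
```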

Step 4: Check Covariate Balance

Before trusting the estimate, assess balance: compute the standardized mean difference (SMD) for each covariate between treated and control groups, before and after weighting/matching. A common threshold is |SMD| < 0.1 after adjustment. Visualize the results with Love plots or balance tables. Poor balance indicates that the propensity model is misspecified.
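
A small helper for computing SMDs, reusing t and w_ate from the IPW sketch:

```python
import numpy as np

def smd(x1, x0, w1=None, w0=None):
    """Standardized mean difference between two groups, optionally weighted."""
    m1, m0 = np.average(x1, weights=w1), np.average(x0, weights=w0)
    v1 = np.average((x1 - m1) ** 2, weights=w1)
    v0 = np.average((x0 - m0) ** 2, weights=w0)
    return (m1 - m0) / np.sqrt((v1 + v0) / 2.0)

# Compare balance before and after IPW; post-adjustment |SMD| should be < 0.1.
for col in covariates:
    x1, x0 = df.loc[t == 1, col], df.loc[t == 0, col]
    raw = smd(x1, x0)
    adj = smd(x1, x0, w_ate[t == 1], w_ate[t == 0])
    print(f"{col:>18}: raw SMD = {raw:+.3f} | IPW SMD = {adj:+.3f}")
```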

Step 5: Bootstrap Confidence Intervals

Re-run the full pipeline, including the propensity model fit, on 1,000 bootstrap samples of the original dataset, and take the 2.5th and 97.5th percentiles of the resulting estimates as the 95% confidence interval. This quantifies uncertainty without parametric assumptions. In this synthetic example, where the truth is known, the interval should cover the 5-task effect; that coverage is the sanity check that the method is working.
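
A sketch that refits the propensity model inside every bootstrap iteration, reusing df and covariates from above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_att(data):
    """Refit the propensity model and compute the IPW ATT on one sample."""
    model = LogisticRegression(max_iter=1000)
    model.fit(data[covariates], data["opted_in"])
    e = model.predict_proba(data[covariates])[:, 1]
    t = data["opted_in"].to_numpy()
    y = data["tasks_completed"].to_numpy()
    w = e[t == 0] / (1.0 - e[t == 0])
    return y[t == 1].mean() - np.average(y[t == 0], weights=w)

rng = np.random.default_rng(0)
boot = [
    ipw_att(df.sample(frac=1.0, replace=True, random_state=rng))
    for _ in range(1000)  # a minute or two at this sample size
]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% bootstrap CI for the ATT: [{lo:.2f}, {hi:.2f}]")
```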

When Propensity Score Methods Fail

Propensity scores are not a magic wand. They fail when:

  • Unmeasured confounders exist: If a hidden variable (e.g., user mood, external incentives) drives both opt-in and outcomes, no adjustment can fix it.
  • Lack of overlap: If certain propensity score ranges have no treated or control users, matching extrapolates beyond the data.
  • Misspecified model: If the logistic regression omits interactions or nonlinear terms, weights may be biased.
  • Extreme weights: Very small propensities produce large weights that inflate variance and potentially bias the estimate.

Always perform sensitivity analyses to assess robustness. For example, use multiple propensity estimation methods (e.g., gradient boosting) or compare IPW and matching results.
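
As one such check, you might swap the logistic model for gradient boosting and see whether the estimate moves; a sketch reusing df, covariates, t, and y from the earlier snippets:

```python
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np

# Re-estimate the propensity scores with a flexible, nonparametric learner.
gbm = GradientBoostingClassifier(random_state=0)
gbm.fit(df[covariates], df["opted_in"])
e_gbm = gbm.predict_proba(df[covariates])[:, 1]

# Recompute the IPW ATT; a large shift relative to the logistic-regression
# estimate flags sensitivity to the propensity model specification.
w_gbm = e_gbm[t == 0] / (1.0 - e_gbm[t == 0])
att_gbm = y[t == 1].mean() - np.average(y[t == 0], weights=w_gbm)
print(f"IPW ATT with boosted propensities: {att_gbm:.2f}")
```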

What to Do Next

If you’re building an LLM-based feature behind a toggle, adopt causal inference from day one. Log rich covariates, pre-register your analysis plan, and test propensity score methods on historical data. The companion notebook provides a ready-to-run template.

For deeper reading, explore doubly robust estimation (combines outcome regression with propensity weighting) or instrumental variables if unmeasured confounders are suspected. Remember: the opt-in trap is inevitable, but it doesn’t have to undermine your product decisions.