2026-05-03
AI & Machine Learning

Causal Inference for LLM Features: Overcoming the Opt-In Bias with Propensity Scores in Python

Learn how to use propensity score methods to estimate causal effects of AI features behind user toggles, overcoming selection bias from opt-in behavior. Step-by-step Python tutorial with synthetic data.

The Opt-In Trap: Why Your AI Feature Metrics Mislead

When you ship a new AI-powered feature behind a user toggle, the numbers can look impressive at first. Users who click “Try our AI assistant” or “Enable smart replies” often show dramatically better outcomes—say, 21% more tasks completed. But this comparison is flawed from the start. The volunteers who opt in are not a random sample; they're typically your most engaged power users. Any naive metric comparing opt-in users to non-users conflates the feature's true causal effect with pre-existing differences between these groups. This is the Opt-In Trap, a persistent challenge in product experimentation for generative AI features.

Source: www.freecodecamp.org

How Propensity Scores Break the Bias

Propensity score methods offer a statistical remedy. A propensity score is the probability that a user chooses to opt in, estimated from observable characteristics (e.g., past engagement, account age, feature usage). By weighting or matching users based on these scores, we can create comparable groups that mimic a randomized experiment. The goal is to isolate the feature's causal effect from the bias introduced by self-selection.

The Full Pipeline: From Estimation to Inference

This walkthrough uses a synthetic SaaS dataset of 50,000 users, where the ground truth causal effect is known. You'll follow these steps:

  1. Estimate propensity scores
  2. Apply inverse-probability weighting (IPW)
  3. Perform nearest-neighbor matching
  4. Check covariate balance
  5. Compute bootstrap confidence intervals

All code runs end-to-end in the companion notebook on GitHub (file psm_demo.ipynb). Pre-executed outputs let you follow along before running anything locally.

Setting Up the Working Example

We work with a synthetic dataset containing user-level features: past_engagement_score, account_age_months, feature_usage_count, and a binary opt_in flag. The outcome is tasks_completed. A logistic regression model estimates the propensity score for each user.
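A minimal sketch of such a dataset can be generated directly; the column names follow the article, but the coefficients and the data-generating process here are my own illustrative assumptions. The key property is that past engagement drives both opt-in and the outcome, which is exactly the confounding the rest of the tutorial removes:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 50_000

# Hypothetical covariates (names from the article, distributions assumed)
past_engagement_score = rng.normal(0, 1, n)
account_age_months = rng.integers(1, 60, n)
feature_usage_count = rng.poisson(5, n)

# Engaged, older accounts opt in more often -> self-selection
logit = -1.0 + 1.2 * past_engagement_score + 0.02 * account_age_months
opt_in = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Known ground-truth causal effect of the feature on tasks completed
true_effect = 3.0
tasks_completed = (
    10 + 4 * past_engagement_score + 0.05 * account_age_months
    + true_effect * opt_in + rng.normal(0, 2, n)
)

df = pd.DataFrame({
    "past_engagement_score": past_engagement_score,
    "account_age_months": account_age_months,
    "feature_usage_count": feature_usage_count,
    "opt_in": opt_in,
    "tasks_completed": tasks_completed,
})
```

Because engagement raises both opt-in probability and tasks completed, the naive opt-in vs. non-opt-in comparison on this data overstates the true effect of 3.0, which is the Opt-In Trap in miniature.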

Step 1: Estimate the Propensity Score

We train a logistic regression model (or any classifier) using user features as predictors and the opt-in decision as the target. The resulting predicted probabilities are the propensity scores. In Python:

from sklearn.linear_model import LogisticRegression

# X: user covariates (past_engagement_score, account_age_months,
# feature_usage_count); y: the binary opt_in flag
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
propensity_scores = model.predict_proba(X)[:, 1]  # P(opt_in = 1 | X)

Step 2: Inverse-Probability Weighting

IPW assigns each user a weight: 1 / propensity_score for treated users, 1 / (1 - propensity_score) for control users. The weighted average difference in outcomes estimates the average treatment effect (ATE). Large weights can inflate variance, so trimming extreme scores is common.
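A compact sketch of this estimator, assuming you already have arrays of outcomes, treatment flags, and propensity scores (the helper name ipw_ate is mine, not from the tutorial). It uses the normalized (Hajek) form of the weighted means and clips extreme scores, as the trimming advice above suggests:

```python
import numpy as np

def ipw_ate(outcome, treated, ps, clip=(0.01, 0.99)):
    """Inverse-probability-weighted ATE with propensity clipping."""
    ps = np.clip(ps, *clip)          # trim extreme scores to stabilize weights
    treated = treated.astype(bool)
    w_t = 1.0 / ps[treated]          # weights for opted-in users
    w_c = 1.0 / (1.0 - ps[~treated]) # weights for everyone else
    mu_t = np.average(outcome[treated], weights=w_t)
    mu_c = np.average(outcome[~treated], weights=w_c)
    return mu_t - mu_c
```

On confounded data with a correctly specified propensity model, this recovers the true effect where the naive difference in means does not.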


Step 3: Nearest-Neighbor Matching

Instead of weighting, you can match each treated user with one or more control users who have a similar propensity score. Nearest-neighbor matching (with a caliper) ensures close matches. The average difference within matched pairs estimates the treatment effect on the treated (ATT).
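A sketch of 1:1 nearest-neighbor matching with a caliper, using scikit-learn's NearestNeighbors; the helper name nn_match_att and the default caliper of 0.05 are my own assumptions. Matching is done with replacement, and treated users with no control inside the caliper are dropped:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_match_att(outcome, treated, ps, caliper=0.05):
    """1:1 nearest-neighbor propensity matching (with replacement); returns ATT."""
    treated = treated.astype(bool)
    nn = NearestNeighbors(n_neighbors=1)
    nn.fit(ps[~treated].reshape(-1, 1))            # index the control pool
    dist, idx = nn.kneighbors(ps[treated].reshape(-1, 1))
    keep = dist.ravel() <= caliper                 # enforce the caliper
    y_t = outcome[treated][keep]
    y_c = outcome[~treated][idx.ravel()[keep]]
    return np.mean(y_t - y_c)                      # mean within-pair difference
```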

Step 4: Check Covariate Balance

After weighting or matching, check that covariates are similar across groups. Use standardized mean differences (SMD); values below 0.1 indicate good balance. Visualization with Love plots helps identify remaining bias.
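The SMD for a single covariate can be computed in a few lines; the helper name smd is mine, and the optional weights argument lets you check balance after IPW as well as after matching. It uses the pooled standard deviation in the denominator:

```python
import numpy as np

def smd(x, treated, weights=None):
    """Standardized mean difference of one covariate between arms."""
    treated = treated.astype(bool)
    if weights is None:
        weights = np.ones_like(x, dtype=float)
    m_t = np.average(x[treated], weights=weights[treated])
    m_c = np.average(x[~treated], weights=weights[~treated])
    v_t = np.average((x[treated] - m_t) ** 2, weights=weights[treated])
    v_c = np.average((x[~treated] - m_c) ** 2, weights=weights[~treated])
    return (m_t - m_c) / np.sqrt((v_t + v_c) / 2)  # pooled-SD denominator
```

Running this over every covariate, before and after adjustment, gives the numbers behind a Love plot: values below 0.1 in the adjusted column indicate good balance.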

Step 5: Bootstrap Confidence Intervals

To quantify uncertainty, bootstrap the entire estimation process: resample users with replacement, re-estimate the propensity scores, and recompute the treatment effect. The 2.5th and 97.5th percentiles of the bootstrapped effects form a 95% confidence interval.
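A sketch of a percentile bootstrap around the IPW estimator (the helper name bootstrap_ipw_ci and the clipping bounds are my own assumptions). The important detail is that the propensity model is re-fit inside every resample, so the interval reflects estimation uncertainty in the scores too:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bootstrap_ipw_ci(X, treated, outcome, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the IPW ATE, re-fitting the
    propensity model inside every resample."""
    rng = np.random.default_rng(seed)
    n = len(outcome)
    estimates = []
    for _ in range(n_boot):
        i = rng.integers(0, n, n)  # resample users with replacement
        Xb, tb, yb = X[i], treated[i], outcome[i]
        ps = LogisticRegression(max_iter=1000).fit(Xb, tb).predict_proba(Xb)[:, 1]
        ps = np.clip(ps, 0.01, 0.99)
        w = np.where(tb == 1, 1 / ps, 1 / (1 - ps))
        mu_t = np.average(yb[tb == 1], weights=w[tb == 1])
        mu_c = np.average(yb[tb == 0], weights=w[tb == 0])
        estimates.append(mu_t - mu_c)
    lo, hi = np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```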

When Propensity Score Methods Fail

Propensity score methods rely on the unconfoundedness assumption: no unmeasured confounders that affect both treatment and outcome. If a hidden variable (like user motivation) drives both opt-in and outcomes, the estimate remains biased. Also, extreme propensity scores (close to 0 or 1) can cause instability, and matching may fail if no similar controls exist. Always perform sensitivity analyses (e.g., E-value) to assess robustness.

What to Do Next

Propensity score methods are powerful but not a silver bullet. Combine them with other causal techniques (e.g., instrumental variables, difference-in-differences) when appropriate. For AI features behind toggles, always consider a randomized staged rollout (A/B test) if feasible. The companion notebook on GitHub includes more advanced diagnostics and variations.

Conclusion

When your product team celebrates a 21% lift from an AI feature, be skeptical—the Opt-In Trap may be inflating the numbers. Propensity score methods, applied correctly, can disentangle selection bias from true causal effects. This Python tutorial provides a reproducible framework for product experimentation teams to make better decisions about LLM-based features.