Sensitivity Checks in Pipeline Workflows#

Sensitivity checks help you assess whether a causal estimate is robust or fragile. In quasi-experimental work, they are best treated as design diagnostics that probe assumptions and modelling choices, not as proofs that identification succeeded [Cook et al., 2002, Reichardt, 2019].

CausalPy’s pipeline API makes sensitivity analysis a first-class step, so robustness checks can run alongside model fitting and report generation in a single, reproducible workflow.

Architecture overview#

The sensitivity framework has three main pieces:

  1. Check — a protocol that individual checks implement. Each check declares which experiment types it applies to (applicable_methods), validates preconditions, and returns a structured CheckResult.

  2. SensitivityAnalysis — a pipeline step that holds a list of Check objects and runs them against the fitted experiment.

  3. CheckResult — the output of a check, containing a pass/fail verdict (or None for informational checks), a prose summary, an optional diagnostics table, optional figures, and arbitrary metadata.

When a GenerateReport step follows SensitivityAnalysis, those results are included in the generated HTML report automatically.
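To make the three pieces concrete, here is a minimal, self-contained sketch of the shapes involved. The CheckResult fields mirror the list above, and applicable_methods and check_name appear elsewhere in this page; the EffectSignCheck class, its run method, and its bare-float argument are illustrative stand-ins, since a real CausalPy check receives the fitted experiment object and exact signatures may differ:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CheckResult:
    """Structured output of a sensitivity check (fields mirror the list above)."""
    check_name: str
    passed: Optional[bool]   # True / False, or None for informational checks
    text: str
    table: object = None     # e.g. a pandas.DataFrame of diagnostics
    figures: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


class EffectSignCheck:
    """Toy check: does the point estimate have the expected sign?"""

    applicable_methods = ("InterruptedTimeSeries", "SyntheticControl")

    def __init__(self, expected_sign: int = 1):
        self.expected_sign = expected_sign

    def run(self, point_estimate: float) -> CheckResult:
        ok = (point_estimate > 0) == (self.expected_sign > 0)
        return CheckResult(
            check_name="EffectSignCheck",
            passed=ok,
            text=f"Point estimate {point_estimate:+.2f}; expected sign {self.expected_sign:+d}.",
            metadata={"point_estimate": point_estimate},
        )


result = EffectSignCheck(expected_sign=1).run(point_estimate=2.5)
```

A SensitivityAnalysis step then simply iterates over its checks, validates applicability, and collects the resulting CheckResult objects for downstream reporting.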

Choosing checks#

Start with the default suite#

import causalpy as cp

result = cp.Pipeline(
    data=df,
    steps=[
        cp.EstimateEffect(
            method=cp.InterruptedTimeSeries,
            treatment_time=treatment_time,
            formula="y ~ 1 + t",
            model=cp.pymc_models.LinearRegression(),
        ),
        cp.SensitivityAnalysis.default_for(cp.InterruptedTimeSeries),
        cp.GenerateReport(),
    ],
).run()

SensitivityAnalysis.default_for(method) returns a pre-loaded step containing every check currently registered as a default for that experiment type. At present, PlaceboInTime is registered as the default check for InterruptedTimeSeries and SyntheticControl.

Compose a custom suite#

cp.SensitivityAnalysis(
    checks=[
        cp.checks.PlaceboInTime(n_folds=4),
        cp.checks.PersistenceCheck(),
        cp.checks.PriorSensitivity(
            alternatives=[
                {"name": "diffuse", "model": cp.pymc_models.LinearRegression(...)},
            ]
        ),
    ]
)

SensitivityAnalysis checks applicability as it runs. If a check does not support the fitted experiment type, CausalPy raises a clear error naming the methods that check supports.

Quick reference#

| Check | Applies to | Registered as default? | Main question |
|---|---|---|---|
| PlaceboInTime | ITS, SC (PyMC models) | Yes, for ITS and SC | Do pseudo-interventions in the pre-period also produce “effects”? |
| PriorSensitivity | ITS, DiD, SC, Staggered DiD, RD, RKink, PrePostNEGD, IPW, IV (PyMC models) | No | Do conclusions change materially under reasonable prior alternatives? |
| PersistenceCheck | Three-period ITS designs | No | Does the effect remain after the intervention ends? |
| ConvexHullCheck | SC | No | Is the treated unit supported by the donor pool, or are we extrapolating? |
| LeaveOneOut | SC | No | Does the result depend heavily on one donor unit? |
| PlaceboInSpace | SC | No | Are placebo effects in control units as large as the treated effect? |
| BandwidthSensitivity | RD, RKink | No | Does the estimate depend heavily on bandwidth choice? |
| McCraryDensityTest | RD | No | Is there evidence of manipulation around the cutoff? |
| PreTreatmentPlaceboCheck | Staggered DiD | No | Do pre-treatment event-study effects look close to zero? |

Check-by-check guide#

PlaceboInTime#

PlaceboInTime moves the intervention backward into the pre-treatment period and re-fits the model. If those pseudo-interventions often produce effects comparable to the observed one, the original result looks less credible. In synthetic control settings, placebo and falsification exercises are a standard part of design assessment [Abadie, 2021]; in interrupted time series settings, the same logic aligns with broader falsification practice in pre/post intervention designs [Lopez Bernal et al., 2017].

This check requires a PyMC-backed model because it works with posterior impact draws. In CausalPy it can also fit a hierarchical null model and, optionally, estimate Bayesian assurance for a user-supplied expected effect prior.
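The underlying logic can be illustrated with a NumPy sketch on simulated data. This is not CausalPy's implementation (which works with posterior draws); pseudo_effect and the simple trend fit below are illustrative. With a stable pre-period and no real break, pseudo-interventions should yield effects near zero:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100)
y = 0.5 * t + rng.normal(0, 1.0, size=t.size)  # stable trend, no real break
true_treatment_time = 80


def pseudo_effect(y, t, placebo_time):
    """Fit a linear trend before placebo_time, measure mean deviation after it
    (within the true pre-period)."""
    pre = t < placebo_time
    post = (t >= placebo_time) & (t < true_treatment_time)
    slope, intercept = np.polyfit(t[pre], y[pre], deg=1)
    return float(np.mean(y[post] - (slope * t[post] + intercept)))


placebo_times = [40, 50, 60, 70]  # analogous to n_folds=4
effects = [pseudo_effect(y, t, pt) for pt in placebo_times]
# With no real break, the pseudo effects should hover near zero.
```

If these pseudo effects were instead comparable in size to the estimated treatment effect, the observed "effect" could plausibly be an artifact of trend misspecification or noise.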

PriorSensitivity#

PriorSensitivity re-fits the same experiment with alternative prior specifications and compares the resulting effect summaries. Use it when prior choice could matter materially, especially in small samples or weakly identified models. Reporting how posterior conclusions change under reasonable alternatives is good Bayesian practice [Fan et al., 2023].

This is the broadest check in the current API, but it is only available for PyMC-backed experiments.
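The core idea, re-estimating under alternative priors and comparing conclusions, can be sketched with a conjugate normal-mean toy model (illustrative only; the actual check re-fits the full PyMC experiment):

```python
import numpy as np

rng = np.random.default_rng(8)
y = rng.normal(2.0, 1.0, size=10)  # small sample, known observation sd = 1


def posterior_mean(prior_mean, prior_sd, y, sigma=1.0):
    """Conjugate normal-normal posterior mean for the mean of y."""
    prec = 1 / prior_sd**2 + y.size / sigma**2
    return (prior_mean / prior_sd**2 + y.sum() / sigma**2) / prec


baseline = posterior_mean(0.0, 10.0, y)  # diffuse prior
skeptical = posterior_mean(0.0, 0.5, y)  # tight prior at zero
shift = abs(baseline - skeptical)        # large shift => prior-sensitive conclusion
```

In small samples the tight prior pulls the estimate noticeably toward zero; reporting that shift is exactly what this check automates for the full model.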

PersistenceCheck#

PersistenceCheck applies to three-period ITS designs with treatment_end_time. It wraps analyze_persistence() to ask whether the effect persists, fades, or reverses after the intervention ends. This is especially relevant when policy or campaign effects may decay after treatment is removed [Wagner et al., 2002].
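A simulated three-period series illustrates the question the check asks (a toy sketch, not the analyze_persistence() implementation): compare the average effect during treatment with the average effect after treatment_end_time.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(120)
treatment_time, treatment_end_time = 40, 80

# Three periods: effect of +3 during treatment that decays after it ends.
effect = np.where((t >= treatment_time) & (t < treatment_end_time), 3.0, 0.0)
post_end = t >= treatment_end_time
effect[post_end] = 3.0 * np.exp(-0.1 * (t[post_end] - treatment_end_time))
y = 10.0 + effect + rng.normal(0, 0.3, size=t.size)

baseline = y[t < treatment_time].mean()
during = y[(t >= treatment_time) & (t < treatment_end_time)].mean() - baseline
after = y[post_end].mean() - baseline
persistence_ratio = after / during  # ~1 persists, ~0 fades, < 0 reverses
```

Here the ratio lands well below one, correctly flagging an effect that fades once the intervention is removed.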

ConvexHullCheck#

ConvexHullCheck asks whether the treated unit sits within the support of the donor pool in the pre-treatment period. If not, the synthetic control fit relies on extrapolation rather than interpolation, which weakens design credibility.
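A cheap necessary condition for hull membership is a per-period range check: in every pre-treatment period, does the treated outcome fall between the minimum and maximum donor outcome? The sketch below is a simplified illustration, not CausalPy's implementation:

```python
import numpy as np

# Pre-treatment outcomes: rows are time periods, columns are donor units.
rng = np.random.default_rng(2)
donors = rng.normal(0, 1, size=(30, 5))
treated_inside = donors.mean(axis=1)        # lies within the donor range
treated_outside = donors.max(axis=1) + 1.0  # above every donor in every period


def within_donor_range(treated, donors):
    """Fraction of pre-periods where the treated unit is inside the donor range."""
    lo, hi = donors.min(axis=1), donors.max(axis=1)
    return float(np.mean((treated >= lo) & (treated <= hi)))


inside_share = within_donor_range(treated_inside, donors)
outside_share = within_donor_range(treated_outside, donors)
```

A low share signals that any synthetic control must extrapolate beyond the donor pool's support, which weakens the design.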

LeaveOneOut#

LeaveOneOut drops one control unit at a time and re-fits the synthetic control. If the estimated effect changes dramatically when a single donor is removed, the result depends too heavily on that donor rather than on the donor pool as a whole.
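The loop itself is simple; here is a toy version using an equal-weight donor average as a stand-in for a fitted synthetic control (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
n_pre, n_post, n_donors = 40, 20, 6
donors = rng.normal(0, 1, size=(n_pre + n_post, n_donors))
post = np.arange(n_pre + n_post) >= n_pre

# Treated unit tracks the donor average pre-treatment, then jumps by +2.
treated = donors.mean(axis=1) + rng.normal(0, 0.2, size=n_pre + n_post)
treated[post] += 2.0


def effect_without(drop):
    """Re-estimate the post-period effect with one donor excluded."""
    keep = [j for j in range(n_donors) if j != drop]
    synthetic = donors[:, keep].mean(axis=1)  # equal-weight "synthetic control"
    return float(np.mean(treated[post] - synthetic[post]))


loo_effects = [effect_without(j) for j in range(n_donors)]
spread = max(loo_effects) - min(loo_effects)  # large spread => fragile to one donor
```

A small spread across the leave-one-out estimates indicates the result rests on the donor pool as a whole rather than on any single unit.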

PlaceboInSpace#

PlaceboInSpace re-labels each control unit as though it were treated and compares those placebo effects to the observed treated effect. If many placebo units show effects as large as the treated unit, the original estimate looks less distinctive [Abadie et al., 2010].
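The comparison reduces to a rank: where does the treated effect sit in the distribution of placebo effects? A toy sketch of that permutation-style logic (not CausalPy's implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
n_units, n_post = 20, 15

# Post-period "gaps" (actual minus synthetic) for the treated unit and controls.
treated_gap = 2.0 + rng.normal(0, 0.5, size=n_post)
placebo_gaps = rng.normal(0, 0.5, size=(n_units, n_post))  # no true control effect

treated_effect = abs(treated_gap.mean())
placebo_effects = np.abs(placebo_gaps.mean(axis=1))
# Rank of the treated effect among placebo effects, as a permutation-style p-value.
p_value = (1 + np.sum(placebo_effects >= treated_effect)) / (1 + n_units)
```

If many placebo units matched or exceeded the treated effect, the p-value would be large and the original estimate would look unremarkable.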

BandwidthSensitivity#

BandwidthSensitivity re-fits RD or RKink models across a sequence of bandwidths. Because bandwidth choice drives the bias-variance trade-off in local designs, a result that flips across plausible bandwidths should be treated cautiously [Lee and Lemieux, 2010, Imbens and Lemieux, 2008].
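The check amounts to re-running the local fit over a grid of bandwidths and inspecting stability. A toy sharp-RD version with local linear fits on each side of the cutoff (illustrative; CausalPy re-fits the actual model):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=2000)  # running variable, cutoff at 0
tau = 1.0                          # true discontinuity at the cutoff
y = 0.8 * x + tau * (x >= 0) + rng.normal(0, 0.5, size=x.size)


def rd_estimate(bandwidth):
    """Local linear fit on each side of the cutoff within the bandwidth."""
    left = (x < 0) & (x > -bandwidth)
    right = (x >= 0) & (x < bandwidth)
    b_l = np.polyfit(x[left], y[left], deg=1)
    b_r = np.polyfit(x[right], y[right], deg=1)
    return float(np.polyval(b_r, 0.0) - np.polyval(b_l, 0.0))


bandwidths = [0.2, 0.4, 0.6, 0.8]
estimates = {h: rd_estimate(h) for h in bandwidths}
```

Here the estimates stay near the true discontinuity across all bandwidths; a sign flip or large swing across plausible bandwidths would be a warning.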

McCraryDensityTest#

McCraryDensityTest checks for a discontinuity in the density of the running variable at the threshold. A sharp jump suggests units may have manipulated their assignment variable, undermining the design’s local comparability assumption [McCrary, 2008].
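A crude version of the intuition compares observation counts in narrow bins just below and just above the cutoff (the actual McCrary test uses local polynomial density estimation; this is only a sketch):

```python
import numpy as np

rng = np.random.default_rng(6)
cutoff = 0.0
# A smooth running variable vs. one with bunching just above the cutoff.
smooth = rng.uniform(-1, 1, size=5000)
manipulated = np.concatenate([smooth, rng.uniform(0.0, 0.05, size=1000)])


def density_ratio(x, width=0.05):
    """Count observations just above vs. just below the cutoff."""
    below = np.sum((x >= cutoff - width) & (x < cutoff))
    above = np.sum((x >= cutoff) & (x < cutoff + width))
    return above / below


smooth_ratio = density_ratio(smooth)            # near 1 when density is continuous
manipulated_ratio = density_ratio(manipulated)  # well above 1 under bunching
```

A ratio far from one suggests units sorted themselves across the threshold, which breaks the local comparability the design relies on.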

PreTreatmentPlaceboCheck#

PreTreatmentPlaceboCheck examines pre-treatment event-study effects in staggered DiD. If negative event times are far from zero, the parallel trends story is harder to defend and the treatment effect may be biased [Goodman-Bacon, 2021, Borusyak et al., 2024].
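In practice this means inspecting estimated event-study coefficients at negative event times and flagging any that sit far from zero relative to their standard errors. A toy sketch with simulated coefficients (illustrative, not CausalPy's estimator):

```python
import numpy as np

rng = np.random.default_rng(7)
event_times = np.arange(-5, 6)  # relative event time, treatment begins at 0
true_effect = np.where(event_times >= 0, 1.5, 0.0)

# Estimated event-study coefficients = truth plus sampling noise.
coefs = true_effect + rng.normal(0, 0.1, size=event_times.size)
ses = np.full(event_times.size, 0.1)

pre = event_times < 0
# Flag any pre-treatment coefficient more than ~2 standard errors from zero.
violations = np.abs(coefs[pre]) > 2 * ses[pre]
parallel_trends_ok = not violations.any()
```

Clean pre-treatment coefficients do not prove parallel trends, but large pre-treatment "effects" are strong evidence against them.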

Working with check results#

Each check returns a CheckResult with the following fields:

  • passed — True if the check passed, False if it failed, or None for informational checks with no pass/fail criterion.

  • text — a prose summary describing the outcome.

  • table — an optional pandas.DataFrame with diagnostic statistics.

  • figures — an optional list of matplotlib figures.

  • metadata — a dict of arbitrary extra data for downstream steps.

You can inspect results programmatically:

for cr in result.sensitivity_results:
    status = (
        "PASS"
        if cr.passed is True
        else ("FAIL" if cr.passed is False else "INFO")
    )
    print(f"[{status}] {cr.check_name}: {cr.text}")

    if cr.table is not None:
        display(cr.table)  # display() requires IPython/Jupyter; use print(cr.table) in plain scripts

When a GenerateReport step follows SensitivityAnalysis in the pipeline, check results are automatically included in the HTML report.

Interpreting sensitivity results#

Important

Sensitivity checks are diagnostics, not definitive verdicts. A passing check does not prove your causal claim is correct, and a failing check does not prove it is wrong. They reveal where your analysis is robust and where it is fragile.

What a passing check tells you: The estimate survived a specific stress test. This increases confidence in the result, especially when multiple independent checks point in the same direction.

What a failing check tells you: The estimate is sensitive to a particular assumption or modelling choice. This does not invalidate the analysis; it tells you where to investigate further, justify your choices, or present stronger caveats.

General guidance:

  • Start with the defaults, then add method-specific checks that target the most plausible failure modes for your design.

  • Run more than one check. No single diagnostic is sufficient.

  • Report failures as well as passes. Selective reporting of only passing checks undermines credibility.

  • Use domain knowledge to decide which failures are consequential. A bandwidth warning at an extreme specification is different from strong placebo evidence.

  • Treat checks as part of a cumulative argument, not a mechanical accept/reject gate.

Next steps#

For the pipeline mechanics, see Pipeline Workflow. For HTML reporting of check results, see HTML Report Generation. More method-specific sensitivity walkthroughs will be added over time; where they already exist, they are linked above.

References#

[1]

Thomas D Cook, Donald Thomas Campbell, and William Shadish. Experimental and quasi-experimental designs for generalized causal inference. Volume 1195. Houghton Mifflin Boston, MA, 2002.

[2]

Charles S Reichardt. Quasi-experimentation: A guide to design and analysis. Guilford Publications, 2019.

[3]

Alberto Abadie, Alexis Diamond, and Jens Hainmueller. Synthetic control methods for comparative case studies: estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105(490):493–505, 2010.

[4]

Alberto Abadie. Using synthetic controls: feasibility, data requirements, and methodological aspects. Journal of Economic Literature, 59(2):391–425, 2021.

[5]

Andrew Goodman-Bacon. Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2):254–277, 2021.

[6]

Kirill Borusyak, Xavier Jaravel, and Jann Spiess. Revisiting event-study designs: robust and efficient estimation. Review of Economic Studies, 91(6):3253–3285, 2024.

[7]

Anita K Wagner, Stephen B Soumerai, Fang Zhang, and Dennis Ross-Degnan. Segmented regression analysis of interrupted time series studies in medication use research. Journal of Clinical Pharmacy and Therapeutics, 27(4):299–309, 2002.

[8]

James Lopez Bernal, Steven Cummins, and Antonio Gasparrini. Interrupted time series regression for the evaluation of public health interventions: a tutorial. International Journal of Epidemiology, 46(1):348–355, 2017.

[9]

David S Lee and Thomas Lemieux. Regression discontinuity designs in economics. Journal of Economic Literature, 48(2):281–355, 2010.

[10]

Justin McCrary. Manipulation of the running variable in the regression discontinuity design: a density test. Journal of Econometrics, 142(2):698–714, 2008.

[11]

Guido W Imbens and Thomas Lemieux. Regression discontinuity designs: a guide to practice. Journal of Econometrics, 142(2):615–635, 2008.

[12]

Fan Li, Peng Ding, and Fabrizia Mealli. Bayesian causal inference: a critical review. Philosophical Transactions of the Royal Society A, 381, 2023.