Experimentation (A/B Testing)

info

Experimentation in Flagsmith is currently in active development. This page describes the flow as it stands in the current prototype — naming, scope, and behaviour may change before general availability. Feedback welcome on the draft PR.

Screenshot placeholder — Hero image — wide shot of the Experiment Results dashboard with lift bars and the recommendation callout. Target path: /img/experimentation/hero-results-dashboard.png

Flagsmith Experimentation lets you run controlled A/B tests on your multivariate feature flags and read the results inside Flagsmith itself. You define a hypothesis, pick the flag to split traffic on, choose which metrics matter, and Flagsmith tracks the outcome by reading evaluation and event data from your data warehouse.

This replaces the earlier "bring your own analytics" approach where Flagsmith handled variation assignment but results lived in Amplitude or Mixpanel. Everything — configuration, running, and results — now lives in one place.

Key benefits

One workflow for the full experiment lifecycle. Configure, launch, monitor, and read results without leaving Flagsmith.
Structured hypothesis and metric tracking. Experiments are a first-class object with a hypothesis, primary / secondary / guardrail metrics, and a significance verdict — not a free-form flag configuration.
Reuses existing Flagsmith concepts. Assignment uses the same identity-based bucketing as multivariate flags. Target audiences use existing segments.
Warehouse-backed metrics. Results are computed from your data warehouse (Snowflake, BigQuery, Databricks), so metric definitions stay close to where your business already stores event data.

How it fits with Flagsmith

Experimentation builds on three concepts you already use:

Multivariate flags — every experiment runs on a multivariate flag. The flag's existing variations become the experiment's control and treatment variations.
Segments — experiments target a segment. Users outside the segment see the flag's environment default, unchanged.
Identities — users are bucketed by identity, the same way multivariate flag values are assigned today. A user consistently sees the same variation for the duration of the experiment.

1. Connecting a data warehouse

Before you can run experiments, connect the data warehouse where your flag-evaluation and custom-event data lives. Flagsmith reads from the warehouse to compute metric values per variation.

info

The data warehouse connection is organisation-scoped — one connection per organisation, available across all projects. Configure it once and every project inherits it.

Step 1: Open Organisation Integrations

Screenshot placeholder — Organisation Integrations page with the Data Warehouse card highlighted. Target path: /img/experimentation/integrations-list-warehouse-card.png

Go to Organisation Integrations from the organisation nav.
Locate the Data Warehouse card.
Click Add Integration.

Step 2: Choose a warehouse

Screenshot placeholder — Configuration form showing Snowflake / BigQuery / Databricks selector cards. Target path: /img/experimentation/warehouse-config-form.png

Pick your warehouse provider — Snowflake, BigQuery, or Databricks.
Fill in the connection details (account URL, database, schema, warehouse, user, authentication method).

Step 3: Test and connect

Screenshot placeholder — Test-passed state showing the inline success banner above the action row. Target path: /img/experimentation/warehouse-test-passed.png

Click Test Connection to verify your credentials without saving them. Flagsmith attempts to authenticate and reports the result inline.
If the test passes, click Connect to save the configuration and start streaming data.
If the test fails, review the error, correct your credentials, and re-run the test.

warning

Editing any connection field clears the last test result. Re-run the test before connecting so you aren't saving untested credentials.

Step 4: Verify data is flowing

Screenshot placeholder — Connected state — live stats card showing 24h flag evaluations and custom events, plus connection details grid. Target path: /img/experimentation/warehouse-connected.png

Once connected, the warehouse page shows:

24-hour flag evaluation count — confirms Flagsmith is writing to your warehouse
24-hour custom event count — confirms your app is writing events Flagsmith can read for metric computation
Connection details — read-only summary of what's configured, with Edit and Disconnect controls

2. Creating an experiment

All experiment workflows start on the Experiments page in the project sidebar (alongside Features, Segments, and Identities). The list is the hub: it's where you launch a new experiment and where you click in to read the results of an existing one.

Screenshot placeholder — Experiments list page with a mix of Running, Completed, and Draft rows, highlighting the Create Experiment button. Target path: /img/experimentation/experiments-list.png

Creating an experiment is a 5-step wizard. Each step validates before you continue, and you can jump back to any earlier step from the Review & Launch summary.

Step 1: Open the Experiments list

Go to Experiments in the project sidebar.
Click Create Experiment in the top right.

Step 2: Experiment Details

Screenshot placeholder — Experiment Details step — name field, hypothesis textarea, start/end date pickers. Target path: /img/experimentation/wizard-details.png

Name — a short identifier (e.g. checkout_paypal_button_test).
Hypothesis — required. State what you expect to happen and why. This becomes part of the experiment's permanent record so future team members can see the original intent.
Start and end dates — default to today + 14 days. Adjust if you want to schedule the experiment for later or run for a different window.

Step 3: Flag & Variations

Screenshot placeholder — Flag picker with multi-variant flags; the single-variant blocking banner below for context. Target path: /img/experimentation/wizard-flag-variations.png

Pick the multivariate flag to experiment on. Single-variant flags are not eligible — the wizard blocks them with an explanation and a link to add a variation.
Review the Variations table — read-only. These are the flag's existing multivariate values and become the experiment's control and treatment variations.

info

Experiments require at least one non-control variation. If the selected flag has none, the wizard shows a banner prompting you to add one on the flag's main page first.

Step 4: Select Metrics

Screenshot placeholder — Metric picker showing pre-selected metrics with the Primary / Secondary / Guardrail segmented control visible on a row. Target path: /img/experimentation/wizard-metrics.png

Pick the metrics this experiment will track. Each metric has a role that tells Flagsmith how to interpret it:

Primary — the metric that drives the verdict. Significance here determines whether the experiment succeeds.
Secondary — informational. Tracked and displayed alongside primary, but doesn't influence the recommendation.
Guardrail — a safety check. Used to detect regressions on metrics you don't want to break (e.g. page-load time, error rate).

Select metrics from the list.
For each selected metric, choose its role using the three-way segmented control.
If you select multiple primary metrics, a soft warning appears — this is allowed but harder to interpret statistically.

Step 5: Segments & Traffic

Screenshot placeholder — Segment selector with an active conflict banner shown for context; traffic-split inputs below. Target path: /img/experimentation/wizard-segment-traffic.png

Pick the segment whose users will be eligible for the experiment. Everyone outside the segment sees the flag's environment default, unchanged.
Set traffic weights — the percentage of eligible users assigned to each variation. Weights auto-balance: adjusting one rebalances the others proportionally. Click Split evenly to reset.

warning

If the chosen flag already has a segment override for this segment, the wizard shows a conflict banner. Resolve the conflict (pick a different segment, or remove the override) before continuing — a running experiment on a segment with an override produces incorrect assignment.

Step 6: Review & Launch

Screenshot placeholder — Review summary with per-section edit links. Target path: /img/experimentation/wizard-review-launch.png

Review the full configuration — details, flag, variations, metrics with roles, segment, traffic split, and dates.
Click any section's Edit link to jump back to that step.
Click Launch Experiment.
Confirm the launch in the modal. Once confirmed, the experiment starts assigning traffic immediately and the flag begins splitting variations by the configured weights within the chosen segment.

info

Once launched, experiment configuration is locked for the duration of the run to preserve statistical validity. To change the setup, stop the experiment and create a new one.

3. Reading experiment results

From the Experiments list, click any running or completed experiment to open its Results dashboard.

Screenshot placeholder — Full results dashboard — stat cards, recommendation callout, metrics comparison table, and trend chart stacked. Target path: /img/experimentation/results-dashboard-full.png

Read the dashboard top-down:

Summary cards

Lift vs. control on the primary metric
Probability of being best — how confident the statistical engine is that the leading variation truly wins
Sample size per variation — how many assigned identities have contributed data so far

Recommendation callout

Screenshot placeholder — Recommendation callout — "Treatment B is outperforming Control with 94% probability" style. Target path: /img/experimentation/results-recommendation-callout.png

A plain-language summary of the current state — which variation is leading, how confident the verdict is, and a suggested next action (continue, declare a winner, investigate).

Metrics comparison table

Screenshot placeholder — Metrics comparison table — primary row emphasised, guardrail badge visible, zero-centred lift bars. Target path: /img/experimentation/results-comparison-table.png

One row per metric, with the primary row visually emphasised so your eye lands on the verdict-driving metric first. Each row shows:

Role badge — Primary / Secondary / Guardrail
Control value and Treatment value
Lift bar — zero-centred, showing the relative change and its direction
Significance — statistical confidence in the observed lift

Trend over time

Screenshot placeholder — Trend line chart with metric selector above it — control vs. treatment lines. Target path: /img/experimentation/results-trend-chart.png

Below the table, a line chart plots each variation's metric value over the duration of the experiment. The metric selector lets you inspect any of the experiment's metrics independently.

This helps you distinguish a stable result from one that only recently flipped — if lines have been consistently separated for days, the verdict is more trustworthy than if they crossed yesterday.

What's next

Learn more about multivariate flags — the building block every experiment sits on top of.
Explore segments to define the audience for an experiment.
Read about identities, the unit Flagsmith buckets users by when assigning variations.

Key benefits​

How it fits with Flagsmith​

1. Connecting a data warehouse​

Step 1: Open Organisation Integrations​

Step 2: Choose a warehouse​

Step 3: Test and connect​

Step 4: Verify data is flowing​

2. Creating an experiment​

Step 1: Open the Experiments list​

Step 2: Experiment Details​

Step 3: Flag & Variations​

Step 4: Select Metrics​

Step 5: Segments & Traffic​

Step 6: Review & Launch​

3. Reading experiment results​

Summary cards​

Recommendation callout​

Metrics comparison table​

Trend over time​

What's next​

Key benefits

How it fits with Flagsmith

1. Connecting a data warehouse

Step 1: Open Organisation Integrations

Step 2: Choose a warehouse

Step 3: Test and connect

Step 4: Verify data is flowing

2. Creating an experiment

Step 1: Open the Experiments list

Step 2: Experiment Details

Step 3: Flag & Variations

Step 4: Select Metrics

Step 5: Segments & Traffic

Step 6: Review & Launch

3. Reading experiment results

Summary cards

Recommendation callout

Metrics comparison table

Trend over time

What's next