Uplift Modeling at (un)Common Logic

10 April 2026

Performance marketing does not reward the prettiest model. It rewards decisions that move dollars. That is why uplift modeling has become a cornerstone in how we evaluate, prioritize, and bid across channels at (un)Common Logic. When you optimize to correlation, you end up rewarding the ad that shows up right before an already motivated buyer clicks Purchase. When you optimize to incrementality, you learn which intervention actually changed behavior. That second path is harder. It is also where disproportionate returns hide.
Why incrementality beats correlation
Most ad platforms are masterful at finding people who convert whether or not you spend money. If your KPI is last click ROAS, the machines will allocate budget to harvest demand and call it success. This is not inherently wrong. It is incomplete, and over any meaningful time horizon it leads to two mistakes. First, you overpay for credit on customers you would have acquired anyway. Second, you starve the touchpoints that actually create net-new demand.

Incrementality reframes the question from did a user convert to did the treatment change the probability of conversion for this user. Uplift modeling goes one level deeper. Instead of predicting the outcome in isolation, it predicts the difference in outcome with and without treatment for each user. That difference is the individual treatment effect. It is the quantity you want when deciding whether to show an impression, raise a bid, or send an offer.
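In the standard potential-outcomes notation (a sketch in our notation, not language from the original post), that quantity reads:

```latex
\tau_i = Y_i(1) - Y_i(0) \qquad \text{(individual treatment effect)}
```

and the estimable, feature-conditional version that uplift models actually target:

```latex
\tau(x) = \mathbb{E}[\,Y \mid T=1,\, X=x\,] \;-\; \mathbb{E}[\,Y \mid T=0,\, X=x\,]
```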

In practice we do not need perfect individual effects to make progress. We need reliable relative comparisons between audience segments so we can sort and act. If a model can tell us that one cohort stands to increase purchase probability by 90 basis points while another barely moves by 5, we can budget accordingly even if the absolute levels are off by a hair.
The four response types we care about
Marketers intuitively understand that not all conversions are equal. Uplift formalizes that intuition by sorting people into four behavioral buckets that show up over and over in data.
Persuadables: treatment increases the chance they convert. These are the people we are trying to find.

Sure things: they will convert whether or not they see the ad or get the discount. Spend here is mostly waste.

Lost causes: they will not convert regardless. Spend here is pure waste.

Do not disturb: the rare group whose probability of conversion drops under treatment. Think of the shopper who sees a low quality retargeting ad and decides the brand feels spammy.
This taxonomy forces honest accounting. If a campaign posts a high ROAS because its retargeting is swamped with sure things, it lacks leverage. If an email cadence leads to more unsubscribes and lower future purchase rates among your best customers, you may be manufacturing do not disturb effects without realizing it. Our goal at (un)Common Logic is to push budget toward persuadables, and to design messaging that avoids creating do not disturb reactions.
Where uplift lives in our stack
Uplift modeling is not a single tool so much as a disciplined way of answering questions. We use it in three layers.

At the strategy layer, uplift clarifies whether a channel is adding net-new value or merely soaking up credit. If organic clicks on branded search terms rise when we cut paid spend, the paid spend was likely harvesting sure things. If retargeting raises new customer rate and LTV by cohort, it earns more budget. The strategy layer is about placement and scale.

At the audience layer, uplift helps us rank micro-segments by incremental response. For example, recent cart abandoners with fewer than two prior purchases respond differently than long-lapsed buyers with high AOV. A blended ROAS across both groups tells you nothing. An uplift view reveals where to invest, and where to back off.

At the activation layer, uplift connects directly to levers. We export high-uplift audiences to platforms, set bid multipliers by uplift decile, and adjust message or offer intensity to match predicted treatment effect. It stays theory until it changes how the auction sees you.
Designing experiments that can support uplift
The starting point is data with a clear notion of treatment and control. You can learn a lot from natural experiments and platform holdouts, but planned tests build trust faster. A few details matter more than they seem at first glance.

Randomization must align with the decision unit. If the decision is whether to show a specific ad to a specific user at a specific moment, then the cleanest path is randomization at the user level or, if that is impossible, at a stable identifier like hashed email. Geo holdouts can work in a pinch for upper funnel media, but they introduce noise from local effects and seasonality that has to be modeled.
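A minimal sketch of what stable identifier-level assignment can look like; the experiment name, salting scheme, and 50/50 split here are illustrative, not our production setup.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treated_share: float = 0.5) -> str:
    """Deterministically assign a user to treatment or control.

    Hashing (experiment salt + user_id) keeps assignment stable across
    sessions and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    return "treatment" if bucket < treated_share else "control"

print(assign_variant("hashed_email_abc123", "retargeting_holdout_q2"))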

The outcome must reflect the business goal and the time window where treatment can act. If you are promoting a subscription product purchased weekly, a 7 to 14 day conversion window might capture both quick signups and reasonable deliberation. For higher ticket purchases, look at both leading indicators like add-to-cart and lagging outcomes like closed sale over a longer horizon. Uplift models need a consistent target.
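As a sketch of what a consistent target means in practice, here is one way to label conversions against a fixed post-exposure window with pandas; the frames, column names, and the 14 day window are all illustrative.

```python
import pandas as pd

# Illustrative frames: one row per exposed user, one row per conversion event.
exposures = pd.DataFrame({
    "user_id": [1, 2],
    "exposed_at": pd.to_datetime(["2026-03-01", "2026-03-02"]),
})
conversions = pd.DataFrame({
    "user_id": [1],
    "converted_at": pd.to_datetime(["2026-03-09"]),
})

WINDOW = pd.Timedelta(days=14)  # must match the window the test was designed around

df = exposures.merge(conversions, on="user_id", how="left")
df["label"] = (df["converted_at"].notna()
               & (df["converted_at"] >= df["exposed_at"])
               & (df["converted_at"] <= df["exposed_at"] + WINDOW))
```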

Negative outcomes matter. Many programs optimize to conversion and forget to encode churn, unsubscribes, or returns. For a fair read on net effect, push those outcomes into the label or at least track them at the cohort level. A campaign that raises orders by 4 percent while pushing returns up by 6 percent can be losing money; it just does not look like it in the platform UI.

Treatment integrity is worth guarding. If control users keep getting similar messages from adjacent campaigns, the genuine incremental effect will be muted in the data. Coordination across teams keeps signals clean. At (un)Common Logic, we build test calendars and traffic shaping rules so that the same users are not in competing experiments without documentation.
From raw responses to true lift
You do not need exotic methods to begin. The simplest route estimates two outcome models, one for treated users and one for control, then subtracts their predictions at the individual level. That two model approach can get you 70 percent of the value if you pair it with thoughtful features and a strict evaluation protocol.
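A minimal sketch of that two model approach with scikit-learn; the gradient boosting choice and the synthetic data are placeholders for whatever learner and features fit your stack.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def two_model_uplift(X, y, treated):
    """Fit separate outcome models on treated and control rows, then score
    uplift as the difference in predicted conversion probability."""
    model_t = GradientBoostingClassifier().fit(X[treated == 1], y[treated == 1])
    model_c = GradientBoostingClassifier().fit(X[treated == 0], y[treated == 0])
    return model_t.predict_proba(X)[:, 1] - model_c.predict_proba(X)[:, 1]

# Synthetic example; in practice X holds the features discussed below.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 5))
treated = rng.integers(0, 2, size=5000)
y = rng.binomial(1, 0.05 + 0.02 * treated * (X[:, 0] > 0))
uplift_scores = two_model_uplift(X, y, treated)
```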

As needs mature, we often move to direct uplift learners. Uplift decision trees split on features to maximize separation in treatment effects, not just in baseline conversion rates. These models tend to produce stable, interpretable segments that sit well with media planners. You can read off a rule like new visitors on mobile with multiple category pageviews show high uplift to dynamic creative A, then turn that into a targeting or messaging plan without an extra layer of translation.

Meta learners like the T-learner, S-learner, and X-learner add flexibility. The X-learner, for example, builds separate models of response in treated and control, imputes individual treatment effects for each side, then learns a final model on those imputed effects. When treatment allocation is unbalanced or propensities vary a lot, these methods hold up better.
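A compact sketch of those X-learner steps, assuming a known 50/50 assignment so the propensity is a constant; in observational settings you would fit a propensity model instead.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

def x_learner_uplift(X, y, treated, propensity=0.5):
    """X-learner sketch following the steps described above."""
    t, c = treated == 1, treated == 0
    mu1 = GradientBoostingClassifier().fit(X[t], y[t])   # response under treatment
    mu0 = GradientBoostingClassifier().fit(X[c], y[c])   # response under control
    # Impute individual treatment effects on each side.
    d1 = y[t] - mu0.predict_proba(X[t])[:, 1]            # treated: actual minus counterfactual
    d0 = mu1.predict_proba(X[c])[:, 1] - y[c]            # control: counterfactual minus actual
    tau1 = GradientBoostingRegressor().fit(X[t], d1)
    tau0 = GradientBoostingRegressor().fit(X[c], d0)
    # Weight the two effect models by propensity; with g = 0.5 this is a plain average.
    g = propensity
    return g * tau0.predict(X) + (1 - g) * tau1.predict(X)
```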

Causal forests and doubly robust methods push further by combining outcome models with propensity models to reduce bias. They help when assignment is not strictly random, which is often true in production where certain users are more likely to see an impression or receive an email. With doubly robust estimation, an error in the outcome model can be partially offset by a correct propensity model, and vice versa.
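For reference, the usual doubly robust (AIPW) score that produces this offsetting behavior, in our notation, where the mu-hats are the outcome models and e-hat is the propensity model:

```latex
\hat{\tau}_i \;=\; \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)
\;+\; \frac{T_i \left(Y_i - \hat{\mu}_1(X_i)\right)}{\hat{e}(X_i)}
\;-\; \frac{(1 - T_i)\left(Y_i - \hat{\mu}_0(X_i)\right)}{1 - \hat{e}(X_i)}
```

Averaging these scores remains consistent if either the outcome models or the propensity model is correct, which is exactly the offset described above.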

The right choice depends on the stakes and the data regime. For a rapidly changing ecommerce catalog with lots of seasonal churn, a pragmatic two model approach refreshed weekly might be best. For a B2B pipeline with lower volume and longer consideration, a more statistically efficient learner could extract signal without overfitting.
Features that carry weight
We resist feature bloat. Models improve fastest when features summarize the choice context that humans already use to make decisions.

Recency, frequency, and monetary value still earn their keep, but only if defined around the decision window. Recency since last site visit can matter more for media timing than recency since last purchase. Frequency of micro actions like product detail views in the past 72 hours often predicts uplift better than all-time order count.

Ad exposure history needs nuance. A binary seen an ad in the past day is less useful than counts by creative family and recency by channel. Uplift often rises when the next impression will introduce new information, and falls when it will repeat what the user has already ignored.
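A sketch of the kind of exposure features we mean, built from an impression log with pandas; every name and value here is a placeholder.

```python
import pandas as pd

# Illustrative impression log.
imps = pd.DataFrame({
    "user_id":         [1, 1, 1, 2],
    "channel":         ["display", "display", "social", "social"],
    "creative_family": ["spring_sale", "spring_sale", "brand", "brand"],
    "ts": pd.to_datetime(["2026-04-01", "2026-04-05", "2026-04-06", "2026-04-07"]),
})
now = pd.Timestamp("2026-04-08")

# Impression counts by creative family, and recency in days by channel.
counts = (imps.groupby(["user_id", "creative_family"]).size()
              .unstack(fill_value=0).add_prefix("imps_"))
recency = ((now - imps.groupby(["user_id", "channel"])["ts"].max())
               .dt.days.unstack().add_prefix("days_since_"))
features = counts.join(recency)
```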

Offer sensitivity varies by user and context. If your brand runs promotions, features like historical response to discounts by size or by category help separate persuadables from sure things hunting for a deal they would have taken anyway.

Device and speed add color. Mobile visitors on average behave differently, but lift can be especially sensitive to page load time for certain treatments. If the promoted landing experience is heavier, you may observe negative uplift for slower connections. Encoding page performance metrics around the moment of treatment can catch this.

Context trumps demographics. Time of day, weekday versus weekend, and adjacency to offline events like store visits explain a lot of the lift we see in omnichannel engagements. We capture these with light touch features rather than bloated profiles.
Evaluating uplift models without fooling yourself
Metrics that look fine for response prediction can mislead with uplift. We do not chase AUC on conversion. We track uplift at k, Qini and uplift curves, and the expected value of deploying the model as a policy.

The Qini curve sorts users by predicted uplift, then plots cumulative incremental conversions relative to a random sort. A healthy curve rises steeply at the left, which means top ranked users deliver disproportionate incremental impact. The area under the Qini curve summarizes that advantage. It is a compact way to compare models.
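A minimal sketch of that computation under one common Qini formulation, assuming numpy arrays of scores, outcomes, and treatment flags:

```python
import numpy as np

def qini_curve(uplift_scores, y, treated):
    """Cumulative incremental conversions when users are served in
    descending order of predicted uplift."""
    order = np.argsort(-uplift_scores)
    y, t = y[order], treated[order]
    n_t = np.cumsum(t)
    n_c = np.cumsum(1 - t)
    # Treated conversions minus control conversions, scaled to the treated count.
    incr = np.cumsum(y * t) - np.cumsum(y * (1 - t)) * n_t / np.maximum(n_c, 1)
    return incr  # plot against np.arange(1, len(y) + 1) and compare to a random sort
```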

Uplift at k asks a practical question: if we only have budget to treat the top 10 percent of users by predicted uplift, what incremental gain do we get versus not treating, or versus treating at random? Because budgets are finite, uplift at a few k levels makes deployment decisions more grounded.
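The same arrays give uplift at k directly; a sketch, with k = 0.10 for the top decile, reusing the synthetic arrays from the two model example above:

```python
import numpy as np

def uplift_at_k(uplift_scores, y, treated, k=0.10):
    """Observed uplift among the top-k fraction of users by predicted uplift:
    treated conversion rate minus control conversion rate inside that slice."""
    top = np.argsort(-uplift_scores)[: int(len(uplift_scores) * k)]
    y_top, t_top = y[top], treated[top]
    rate_t = y_top[t_top == 1].mean() if (t_top == 1).any() else 0.0
    rate_c = y_top[t_top == 0].mean() if (t_top == 0).any() else 0.0
    return rate_t - rate_c

print(uplift_at_k(uplift_scores, y, treated, k=0.10))
```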

Calibration matters. If a decile is predicted to have a 0.6 percent uplift and the observed comes back at 0.2 to 0.4 percent under a new creative, the model may still be useful for ranking, but we will not use the absolute scores to set offer sizes. We log these differences and correct either the model or the actions tied to it.

Policy risk is the quiet trap. A model can look great in validation but still lead to worse business outcomes if it sends the wrong message to the wrong people. We run policy simulations that restrict actions to certain guardrails, then measure not just conversions but downstream metrics like return rate and unsubscribe. If a policy creates meaningful do not disturb effects in any segment, we rework it even if short term uplift looks appealing.
Bringing uplift into media buying
Uplift only pays when it hits the auction. We act on it through audiences, bids, and creative decisions.

For audience work, we export high uplift cohorts into platforms as inclusion lists and throttle low uplift cohorts with exclusions or reduced frequency caps. In Google Ads and Meta, this can look like building ten deciles of predicted uplift and aligning bid multipliers accordingly. The top decile receives higher bids and more exploratory creative tests, while the bottom deciles receive lower bids or are put on a slower drip.
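Mechanically, the decile-to-multiplier step can be as small as this sketch; the 0.6 to 1.4 multiplier range is illustrative and should come from your own margin math.

```python
import numpy as np
import pandas as pd

scores = pd.Series(uplift_scores)           # predicted uplift from the modeling step
decile = pd.qcut(scores, 10, labels=False)  # 0 = lowest uplift, 9 = highest
# Illustrative linear ramp of bid multipliers across deciles.
multiplier = np.linspace(0.6, 1.4, 10)
bid_adjustment = decile.map(lambda d: multiplier[d])
```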

For creative, we pair treatments with predicted mechanisms. If uplift emerges from information scarcity, then dynamic product ads with fresh inventory make sense for those users. If uplift emerges from decision friction, then creative that simplifies choice and reduces friction might perform better. The model tells you where lift is available, not what to say. You still have to craft the message.

For remarketing frequency, we let negative uplift speak. If a user segment shows evidence that an extra impression reduces purchase probability or harms brand favorability, we cap them tighter. It feels uncomfortable to back off spend on an audience that looks large and near purchase, but the data tend to reward that discipline with healthier profit per impression.
Handling small samples and cold starts
New campaigns and thin data environments rarely give you the volume to build a stable uplift model on day one. You still have options.

Start with simple rules derived from prior tests. If cart abandoners within 24 hours respond strongly to free shipping reminders and that effect does not show up for lapsed buyers, codify that split. As data accumulates, let the model take over.

Borrow strength across similar treatments using hierarchical models. If you are testing several related versions of an offer, you can estimate a shared baseline and allow each creative to deviate based on its own evidence. This dampens wild swings in small groups without smearing everything together.
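A lightweight empirical Bayes version of that borrowing, as a sketch; prior_strength is an illustrative pseudo-count, not a fitted hyperparameter.

```python
import numpy as np

def shrink_lift(lift_by_variant, n_by_variant, prior_strength=200.0):
    """Partial pooling sketch: shrink each variant's observed lift toward the
    shared (sample-size weighted) mean, with small variants shrunk hardest."""
    lift = np.asarray(lift_by_variant, dtype=float)
    n = np.asarray(n_by_variant, dtype=float)
    shared = np.average(lift, weights=n)
    w = n / (n + prior_strength)
    return w * lift + (1 - w) * shared

# The 150- and 90-user variants get pulled strongly toward the shared baseline.
print(shrink_lift([0.012, 0.031, -0.004], [4000, 150, 90]))
```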

Use proxy outcomes when final conversions are sparse. Add-to-cart or lead form completion often shows lift in the same direction as revenue, and will appear sooner. Keep a close eye on situations where proxy and final diverge, for instance a discount that drives low quality leads.

Design shorter, repeated tests instead of one grand experiment. Rotate in and out of holdout for certain geos or cohorts to gather more uplift observations over time. The consistency of effect across runs builds confidence.
Beyond media, where uplift earns its keep
Email and SMS programs are fertile ground. A simple recency-based cadence can turn into a smarter program that prioritizes outreach to persuadables and avoids fatiguing sure things. If a model says that sending a reminder to a first time purchaser two days after delivery increases repeat purchase probability by 0.8 percent, while the same message sent at seven days produces almost no lift, you can adjust timing.

On-site promotions benefit as well. Generic 10 percent off banners make you feel busy and drive visible conversions, but much of that is sure thing leakage. A targeted approach where visitors who fit a persuadable profile see the offer, and others receive value framing without a coupon, preserves margin without sacrificing volume.

Sales outreach in B2B sees the same patterns. Not every marketing qualified lead should trigger the same sequence. If SDR time is scarce, route it toward accounts with positive uplift to a human nudge and rely on nurture for the rest. This does not require a crystal ball, just a comparative sense of where a call changes the outcome.
Trade-offs and the realities that do not fit a slide
Uplift is not free. It demands more disciplined measurement and more patience. You will spend time on instrumentation, on cleaning identifiers, on setting up holdouts that leave money on the table short term. That is the price of learning. If a business is under intense quarterly pressure, it may be wiser to pilot uplift in one channel where you can protect the experiment than to rip through everything at once.

Models age faster than you expect. Creative that once produced strong uplift will decay as the market adapts. Routine retraining and honest re-evaluation are part of the work. We have retired models we liked because their decisions no longer produced the advantage they once did, even though the validation metrics looked fine.

Fairness and brand effects deserve attention. If uplift models point toward aggressive frequency for a vulnerable group or overuse of urgency tactics that conflict with brand values, you should say no. A clean Qini curve is not a mandate; it is evidence to weigh against other principles.

Finally, none of this replaces craft. A quiet truth about uplift is that it magnifies the quality of the creative and the offer design. If your message is dull, there is little uplift to allocate. If you make something people care about, uplift modeling helps you aim it.
What success looks like
Teams that adopt uplift thinking start to ask different questions. Budget reviews pivot from which campaign had the highest ROAS to which actions created the most incremental profit. Media planners look at audience definitions and ask whether their segments isolate persuadables or collect sure things. Analytics roadmaps prioritize instrumentation that unlocks cleaner treatment labels and consistent outcome definitions.

You also see a more mature conversation with platforms. Rather than arguing with last click reports, you come to the table with holdout results and uplift deciles that show where the algorithm’s appetite aligns with your margin structure and where it does not. You stop paying for credit and start paying for change.

On the shop floor at (un)Common Logic, that looks like cross functional rhythms. Paid media, CRM, analytics, and creative sit around the same table to set test priorities. We keep a shared ledger of experiments and the implied uplift we are trying to measure. When results land, we convert them into policies that the buying teams can apply without new meetings. The tech is there to help, but the operating model makes it stick.
A compact path to getting started

Nail the measurement basics. Define treatment and control cleanly, pick outcomes and windows that reflect your economics, and set up at least one holdout that you will live with for a quarter.

Build a simple two model uplift baseline. Train treated and control outcome models with shared features, then subtract predictions to rank users by uplift and validate with Qini curves and uplift at k.

Operationalize one decision. Export the top uplift deciles as audiences, apply measured bid or frequency changes, and protect a low uplift group to watch for do not disturb effects.

Keep a weekly ritual. Refresh the model, review uplift by decile against observed behavior, adjust creative and offers where uplift seems mechanism driven, and retire what no longer moves the needle.

Expand deliberately. Add channels or treatments only when you have the instrumentation and bandwidth to run them cleanly, and document results so future teams understand the why, not just the what.

The mindset that endures
Uplift modeling rewards curiosity and restraint. Curiosity to ask where change truly happens, to dig past attractive but hollow metrics, to examine mechanisms rather than rely on averages. Restraint to hold out traffic when it hurts, to stop campaigns that look good in the UI but do not help the business, to say no to actions that create negative effects downstream.

As with most durable advantages, the math matters, but the habit matters more. Teams that measure incrementality and act on it make better bets. They spend less rescuing sure things and more creating growth. At (un)Common Logic, we view uplift not as a project, but as part of how we make decisions. It keeps us honest. It keeps clients from paying for stories when what they need are results.
