1. The Setup
This report covers a simulated Instagram in-feed test for Bears with Benefits, comparing two creatives head-to-head across four audience groups: Multitasking Working Mum, Gen-Z Self-Care Urbanite, Athletic Performance and Recovery Buyer, and Quantified-Self Stack-Builder. Across 101 rounds and 800 agents, the simulation captured not only who clicked and converted, but how agents reacted, what they objected to, and which creative patterns held up over repeated exposure.
The central question was straightforward: which of the two ads performs better, for whom, and why? More specifically, where does each ad earn attention, where does it lose trust, and which objections are creative problems rather than product or brand problems?
After the simulation concluded, we also went back and interviewed agents directly to probe the reasoning behind their reactions. That post-simulation layer is important here because it clarifies what the behavioural data alone can only suggest: for some groups the issue is the ad format, for others it is the missing credibility cues, and for two of the four groups the real limitation is the product category itself rather than the copy.
The Ad Creatives
Both ads sell the same gummy line. The way they sell it is where they diverge.
Variant A: "Your Problems | Our Solutions" is a numbered problem-solution layout. Five common wellness complaints (thin hair, brittle nails, winter immunity, post-workout drain, persistent fatigue) are listed down one side and mapped on the other to five specific Bears with Benefits SKUs. The tone is functional and direct. It treats the reader as someone with a problem to solve and offers a product against each one.
Variant B: "DON'T BUY THESE VITAMINS" opens with a pattern-interrupt headline, then resolves with a softer payoff: "…unless you want glowy skin, strong hair & nails." Rather than a product range, it focuses on a single hero SKU, You Glow, Girl, with the active ingredients (biotin, zinc, hyaluronic acid) called out alongside sugar-free and vegan badges. The tone is contrarian, the aesthetic is closer to a beauty brand than a supplement brand, and the ad treats the reader less like a patient and more like a peer.
Both creatives ran as static Instagram in-feed posts on the brand's own account, with no explicit call-to-action button.
The question Imago set out to answer: could a simulation have told the brand which ad to focus on, and so helped it achieve a better return on ad spend? Which customer group resonates with which ad, and which emotions do these creatives evoke?
How Imago Approached This
Imago's method is built around the idea that before you test an ad in market, you can test it against simulated audiences that behave like real people. Not simple A/B testing or focus groups. A simulation that generates the kind of reactions, hesitations, and language your actual customers would have when they scroll past your ad. Here is what that looked like in practice for this campaign.
1. Inputs
We fed the system the two ad creatives, the full copy, the brand context, and the campaign brief. This includes everything a real person would see: the image, the headline, the supporting copy, the product price range, and the platform context. The simulation needs enough signal to construct a credible impression of the ad as it would appear in a real feed.
2. Defining the Personas
Based on the brand context and campaign targeting, we defined four audience segments. Each one represents a meaningfully different type of buyer the ad was likely to reach.
- Athletic Performance & Recovery Buyer: gym-driven adults who evaluate every supplement on dosage, ingredient form, and price-per-milligram, and won't give a gummy brand a pass just because it's convenient.
- Gen-Z Self-Care Urbanite: early-career women in major DACH and UK cities who treat supplements as identity curation, buying what looks right as much as what works, on a budget that still stretches for the right brand.
- DACH Multitasking Working Mum: working mothers in their mid-thirties to early forties buying supplements for themselves and their kids, driven by convenience, clean ingredients, and anything credible enough to recommend to other mums.
- Quantified-Self Stack-Builder: higher-income professionals running a structured personal stack who apply the same evidence-led lens to a German gummy brand as they would to a US longevity capsule, dosage and third-party testing first, marketing copy never.
Each persona is a fully described individual with specific motivations, prior experiences, and decision criteria.
3. Running the Simulation
800 AI agents, 200 per group, were exposed to both ad variants across 101 rounds and six contextual states (morning commute stress, mid-morning at the office, lunchtime browsing, post-workout fatigue, evening wind-down, and weekend leisure) simulating the situational breadth of a real Instagram in-feed placement. Each agent could react the way a real person would: posting, commenting, liking, ignoring, clicking, or purchasing, with every reaction driven by how well the ad landed against their individual profile.
Both variants were rotated across the agent pool in counterbalanced order so that every agent contributed paired evidence (the same person's reaction to Variant A and to Variant B), which controls for order effects in the comparison. Agents weren't exposed to the ad every round, mirroring real-world impression frequency. The simulation also tracked each agent's accumulating purchase intent over time and gated conversion on a prior click having occurred, the same way paid-ad attribution works in practice.
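The rotation and click-gating logic described above can be sketched roughly as follows. This is a minimal illustration, not Imago's implementation: the `Agent` fields, the alternating-round rotation, and every probability and threshold (`exposure_prob`, `click_prob`, `intent_gain`, `buy_threshold`) are hypothetical placeholders chosen only to make the mechanics visible.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Agent:
    agent_id: int
    first_variant: str                      # "A" or "B", fixed by counterbalancing
    intent: float = 0.0                     # accumulating purchase intent (assumed 0..1 scale)
    clicked: set = field(default_factory=set)
    purchased: set = field(default_factory=set)


def counterbalance(n_agents):
    """Alternate the starting variant so half the pool sees A first and half sees B first."""
    return [Agent(i, "A" if i % 2 == 0 else "B") for i in range(n_agents)]


def variant_for_round(agent, round_idx):
    """Rotate variants every round, starting from the agent's assigned first variant."""
    order = ("A", "B") if agent.first_variant == "A" else ("B", "A")
    return order[round_idx % 2]


def step(agent, round_idx, rng, exposure_prob=0.5, click_prob=0.04,
         intent_gain=0.1, buy_threshold=0.3):
    """Run one round for one agent; all numeric defaults are illustrative assumptions."""
    if rng.random() >= exposure_prob:
        return None                         # no impression this round
    variant = variant_for_round(agent, round_idx)
    had_prior_click = variant in agent.clicked
    if rng.random() < click_prob:
        agent.clicked.add(variant)
        agent.intent = min(1.0, agent.intent + intent_gain)
    # Conversion is gated: it can only happen after an earlier click on this variant.
    if had_prior_click and agent.intent >= buy_threshold:
        agent.purchased.add(variant)
    return variant


pool = counterbalance(800)
rng = random.Random(0)
for round_idx in range(101):
    for agent in pool:
        step(agent, round_idx, rng)
```

Because each agent alternates between variants, every agent can contribute a paired A/B observation, and the gating check guarantees that no purchase is ever recorded for a variant the agent has not previously clicked.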
4. What the Simulation Captures
The output is two things working together. First, a set of quantitative metrics (CTR, engagement rate, purchase rate, and sentiment scores per segment) measured separately for each variant and then put head-to-head. Second, and more importantly, the verbatim language agents used, both when they reacted and when they were asked directly which variant they would click, which felt more aligned with them, and which felt more credible.
This shows more than which variant edged ahead on CTR. It shows exactly why Variant B stopped the scroll when Variant A didn't, which objection neither ad ever answered, and where the two creatives pulled different audiences in opposite directions. That combination is what the following sections break down.
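For readers who want to see how per-variant rates of this kind are typically derived, here is a small sketch. It assumes a hypothetical event log in which each impression is a dict with a `variant` and an `action`, and it assumes that "response rate" means any action other than ignoring the ad; the report does not spell out either detail, so treat both as illustrative.

```python
from collections import Counter

NON_RESPONSE = "ignore"  # assumed label for an impression with no reaction


def variant_metrics(events):
    """Return response rate and CTR per variant from a list of impression events."""
    impressions, responses, clicks = Counter(), Counter(), Counter()
    for event in events:
        variant, action = event["variant"], event["action"]
        impressions[variant] += 1
        if action != NON_RESPONSE:
            responses[variant] += 1     # any visible reaction counts as a response
        if action == "click":
            clicks[variant] += 1
    return {
        v: {
            "response_rate": responses[v] / impressions[v],
            "ctr": clicks[v] / impressions[v],
        }
        for v in impressions
    }


# Toy data, not simulation output.
toy_events = [
    {"variant": "A", "action": "ignore"},
    {"variant": "A", "action": "ignore"},
    {"variant": "A", "action": "like"},
    {"variant": "A", "action": "click"},
    {"variant": "B", "action": "ignore"},
    {"variant": "B", "action": "comment"},
    {"variant": "B", "action": "click"},
    {"variant": "B", "action": "click"},
]
metrics = variant_metrics(toy_events)
```

Both rates share the same denominator (impressions), which is what makes the head-to-head comparison between variants meaningful.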
2. What the Simulation Found
Head-to-Head: How the Two Variants Performed
Across all four groups and 101 simulation rounds, Variant B ("DON'T BUY THESE VITAMINS") outperformed Variant A ("Your Problems | Our Solutions") on every primary behavioural metric, though the margins are modest. It stopped the scroll more often, drove more clicks per impression, and landed slightly more positive sentiment, while both variants sat in broadly similar negative-leaning territory overall.
| Metric | Variant A | Variant B |
|---|---|---|
| Response rate | 23.6% | 24.9% |
| CTR | 3.96% | 4.17% |
| Positive sentiment | 21.8% | 23.9% |
| Negative sentiment | 33.8% | 33.1% |
| Forced-choice click preference | 25% | 75% |
The forced-choice phase made the gap starker: 75% of agents said they would click Variant B, a margin that held consistently across groups. The one genuine counterpoint is credibility. Variant A was rated more trustworthy by 55.6% of agents overall, a lead it held in three of four groups. But that credibility advantage didn't translate into more clicks or warmer sentiment in the behavioural data, pointing to a creative that reads as reliable without compelling action. One agent put it plainly:
"Variant A tries to solve every problem at once and looks like a generic shopping list, which makes me question quality control on any single product."
That read recurred across multiple groups. Variant B's hook, by contrast, generated the kind of curiosity that converted to clicks even among audiences who weren't sold on the brand, with agents noting that "the 'DON'T BUY THESE' hook would actually make me stop scrolling, even though I know it's a gimmick."
The performance split also maps cleanly onto audience orientation: lifestyle-driven groups responded better to both variants overall, while analytically-driven groups were sceptical of both, a dynamic explored in detail in the group breakdowns below.
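To put "modest" in concrete terms, the margins in the head-to-head table can be restated as relative lift and net sentiment. The sketch below uses only the figures from that table; the helper names `relative_lift` and `net_sentiment` are introduced here for illustration, and net sentiment is assumed to be simply positive share minus negative share.

```python
# Overall figures from the head-to-head table, as (Variant A, Variant B) in percent.
overall = {
    "response_rate": (23.6, 24.9),
    "ctr": (3.96, 4.17),
    "positive_sentiment": (21.8, 23.9),
    "negative_sentiment": (33.8, 33.1),
}


def relative_lift(a, b):
    """Percentage by which B exceeds A, relative to A."""
    return (b - a) / a * 100


def net_sentiment(positive, negative):
    """Positive share minus negative share, in percentage points."""
    return positive - negative


lifts = {
    name: round(relative_lift(*overall[name]), 1)
    for name in ("response_rate", "ctr", "positive_sentiment")
}
net_a = round(net_sentiment(overall["positive_sentiment"][0], overall["negative_sentiment"][0]), 1)
net_b = round(net_sentiment(overall["positive_sentiment"][1], overall["negative_sentiment"][1]), 1)

print(lifts)         # {'response_rate': 5.5, 'ctr': 5.3, 'positive_sentiment': 9.6}
print(net_a, net_b)  # -12.0 -9.2
```

On these numbers Variant B's relative lift is roughly 5% on response rate and CTR and just under 10% on positive sentiment, while net sentiment stays negative for both variants.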
The Four Groups at a Glance
| Segment | Purchase Rate | Positive Sentiment | CTR | Variant A Key Signal | Variant B Key Signal |
|---|---|---|---|---|---|
| Multitasking Working Mum | 17.5% | 23.8% | 4.8% | CTR 4.71%, highest A CTR of any group, but sharpest sentiment collapse | CTR 4.17%, more stable sentiment, 82.5% forced-choice click preference |
| Gen-Z Self-Care Urbanite | 22.5% | 34.6% | 4.7% | CTR 4.32%, strong early resonance, steep sentiment wear-out | CTR 5.00%, shallower decay, most durable performer |
| Quantified-Self Stack-Builder | 5.0% | 18.5% | 3.3% | CTR 3.61%, only group where A held a meaningful click-intent edge | CTR 2.92%, high engagement but lowest absolute CTR, sceptical of both |
| Athletic Performance & Recovery Buyer | 2.5% | 13.4% | 3.5% | CTR 3.20%, lowest response rate, 44.3% negative sentiment, actively alienated | CTR 3.79%, initial hostility, weakest overall converter |
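As a quick cross-check on the table above, the per-variant CTRs can be compared group by group. The sketch uses only the CTR figures quoted in the two key-signal columns; the helper `ctr_leader` is a name introduced here for illustration.

```python
# Per-group CTRs from the table above, as (Variant A, Variant B) in percent.
group_ctr = {
    "Multitasking Working Mum": (4.71, 4.17),
    "Gen-Z Self-Care Urbanite": (4.32, 5.00),
    "Quantified-Self Stack-Builder": (3.61, 2.92),
    "Athletic Performance & Recovery Buyer": (3.20, 3.79),
}


def ctr_leader(a, b):
    """Name the variant with the higher CTR, or 'tie' if they are equal."""
    if a > b:
        return "A"
    if b > a:
        return "B"
    return "tie"


leaders = {group: ctr_leader(a, b) for group, (a, b) in group_ctr.items()}
gaps_pp = {group: round(abs(b - a), 2) for group, (a, b) in group_ctr.items()}

for group in group_ctr:
    print(f"{group}: Variant {leaders[group]} leads by {gaps_pp[group]} pp")
```

On raw CTR, Variant A leads in the Working Mum and Stack-Builder groups and Variant B leads in the Gen-Z and Athletic groups, with every gap under one percentage point.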
"Multitasking Working Mum"
The problem-recognition framing worked on this group. Variant A recorded its highest CTR of any group here (4.71%), and the headline's direct structure resonated with an audience that buys supplements to solve specific, tangible problems. Variant B came in slightly lower on raw CTR (4.17%) but was the clear preference in the forced-choice phase: 82.5% of agents in this group chose it for click intent, and its sentiment held more stable across the run.
One agent put the split plainly:
"Neither of these really wins me over, but if I'm being honest, Variant B's 'DON'T BUY THESE' hook would actually make me stop scrolling, even though I know it's a gimmick. That said, as a mum who buys supplements at dm every 6 weeks, Variant A at least lists specific concerns and product names so I could actually figure out what I'm buying."
DACH Multitasking Working Mum
That quote points directly to the underlying friction both variants failed to resolve. This group's purchase decisions are gated on strong transparency and neither creative supplied it. Agents consistently benchmarked Bears with Benefits against alternatives not out of aesthetic preference, but because those products provide what the ad currently withholds: clear dosing, recognisable ingredient forms, and a trust marker verifiable without leaving the feed.
When asked in follow-up whether a Made-in-Germany label and supplement facts panel would close the gap between clicking and buying, the answer was unambiguously yes. The hook is doing its job (Variant B gets the scroll-stop), but conversion requires the regulatory cue that neither variant currently carries. A Made-in-Germany badge paired with on-creative ingredient transparency would directly address the stated objection. Given that this group already clicks at above-benchmark rates, closing that gap is the most concrete creative fix this simulation surfaces.
"Gen-Z Self-Care Urbanite"
This was the strongest-performing group in the simulation: the highest purchase rate, the highest positive sentiment, and the highest CTR of any segment. Both variants outperformed their cross-group averages here, but Variant B was the clear winner: a 5.0% CTR versus 4.32% for Variant A, positive sentiment of 36.4% versus 32.4%, and an 85% forced-choice click preference. The "DON'T BUY THESE VITAMINS" hook landed well with an audience that encounters disruptive, self-aware creative formats daily on TikTok and Instagram, and the tongue-in-cheek tone matched how this group already relates to the wellness category, with enthusiasm, but also with knowing irony.
Variant A wasn't without appeal. Its response rate of 25.5% was the highest of any group-variant combination, and its early positive sentiment (+0.372) indicated a genuine first-impression spark. The problem-solution format mapped recognisably onto real concerns this group holds (hair, nails, skin, fatigue) and for a brief window, it worked. But the five-problem, five-product structure read quickly as a cluttered shopping list rather than a curated recommendation, and the novelty wore off. Agents articulated the aesthetic mismatch directly:
"Variant A tries to solve every problem at once and looks like a generic shopping list, which makes me question quality control on any single product."
When asked directly at what point the ad moved from "I'd consider this" to "I wouldn't buy this," this group located the tipping point in the ad itself. These agents evaluate supplements on vegan and clean-label cues and on peer-aligned aesthetics, and when those signals were present without the clutter of Variant A, they were more inclined to click.
"Quantified-Self Stack-Builder"
This group was broadly sceptical of both variants, but it produced one of the most interesting results in the simulation: it is one of only two segments where Variant A beat Variant B on CTR (the other is Multitasking Working Mum), and the only one where A also held a meaningful click-intent edge. Variant A generated a 3.61% click-through rate versus 2.92% for Variant B. The most plausible explanation is that Variant A's structured, problem-mapped format gave this group something to evaluate: a starting point for the cost-per-dose analysis they apply to every supplement decision.
The implication is worth sitting with. The Quantified-Self Stack-Builder is the audience that most clearly responds to information density over creative disruption, and while neither variant gave them what they actually needed (dosage, ingredient forms, CoA links), Variant A's structured list at least mapped to the format of how they think. One agent was candid about this:
"Variant A at least lists the specific problems and names the products with active ingredients, so I could partially evaluate the stack."
That quote is also a roadmap: this is a group that would respond to an ad that leads with data rather than creative, and Variant A's relative advantage here is the strongest signal in the simulation that a more information-led creative direction could unlock meaningful performance with analytically-oriented audiences.
"Athletic Performance & Recovery Buyer"
The numbers for this group are unambiguous: one converted agent out of forty, 43.2% negative sentiment, and an average final purchase intent of just 7.3%, the weakest purchase and sentiment results in the simulation. Both variants landed badly. Variant A's problem-solution list triggered immediate scepticism (44.3% negative sentiment, CTR 3.20%) and Variant B fared only marginally better (42.3% negative, CTR 3.79%).
The qualitative evidence is where the picture becomes genuinely useful. The objections this group raised were not about the ad format or creative execution. They were structural rejections of the product category as Bears with Benefits currently positions it. Agents consistently dismissed both variants on the same grounds: no dosage disclosure, no ingredient forms, no third-party certification, and a gummy format they associate with underdosing and sugar load rather than performance nutrition. One agent calculated a full cost-per-therapeutic-dose comparison mid-simulation and concluded the product was "premium-priced sugar with good branding." Another went further, in openly hostile terms:
"Any supplement ad that relies on reverse psychology instead of showing me third-party testing and exact dosages is an immediate pass. If the science is solid, you don't need a hook."
What this simulation surfaces is that the poor performance here is not a creative problem. It is a targeting problem. This group evaluates every supplement on the same criteria they'd apply to a sports-nutrition product: elemental mg, bioavailability, ingredient form, third-party CoA. Bears with Benefits, as currently formulated and positioned, does not clear that bar, and no version of this ad creative would change that. The simulation gives Bears with Benefits something valuable: clear, evidence-backed confirmation that this audience segment should be deprioritised in media spend, and that capturing them would require a fundamentally different product proposition, not a better headline.
3. How the Simulation Compared to Reality
Variant B is the creative to run, against Gen-Z Self-Care Urbanites and DACH Multitasking Working Mums. These are the two groups with the purchase rates, sentiment profiles, and click intent to justify cold-audience spend. The single highest-leverage fix is also clearly defined: pair Variant B's hook with on-creative ingredient transparency, meaning explicit ingredient callouts, a Made-in-Germany badge, and vegan and sugar-free signals that are prominent at first interaction.
A real-world campaign cannot tell you any of the above. It can report that certain audiences didn't convert, but not that the Athletic Performance group rejected the product category rather than the creative, that Variant A's credibility lead never translated into clicks, or that the exact fix for the DACH working mum is a supplement facts panel next to a Made-in-Germany mark. The simulation named the two groups worth backing, identified the creative fix that closes the purchase gap, and ruled out a wasted targeting segment, all before a single euro of real spend was committed.