Our AI systems analyst spent two weeks inside the leading “AI visibility” dashboards (Peec.Ai, Scrunch, Writesonic, SEMRush, Profound and more) testing their capabilities.
He expected some reporting issues. Instead, he kept tripping the same faults:
- Scores that moved when inputs changed
- Prompts that didn’t sound like buyers
- Trends that collapsed once geography and versioning were controlled
One vivid moment: he added five real comparison prompts to our Writesonic instance, and the overall AI visibility score fell by 7 points overnight.
Tiny input changes shifted the test, which makes that score unreliable. To optimize for AI, you need a controlled test plan before anything else, so that’s where we started.
Setting the Stage: Key Definitions
Before we get into the meat of this blog, we wanted to clear up some definitions.
Prompt (AI visibility context)
The exact text a user types into an AI system (e.g., “Who are the best AI visibility firms for industrial manufacturers?”). In testing, a prompt is a controlled, buyer-authentic question string used to probe recognition, understanding, and recommendation.
Prompt engineering (AI visibility context)
Test design, not trick wording. The structured process of designing, versioning, and governing a fixed set of buyer-authentic prompts to measure three outcomes: recognition (does the engine know the entity), understanding (can it explain what you do), and recommendation (do you appear in decision queries).
Prompt set / benchmark
The fixed collection of prompts you run every week.
Branded prompts
Include your company or product name. Use them to test entity recognition, factual accuracy, and how engines describe you.
Unbranded prompts
Omit your name. Use them to test whether engines recommend you for category, problem, and comparison questions where buyers decide.
Archetypes
The shapes buyers actually use: definitions, comparisons, how-tos, pricing, alternatives, rankings, and problem statements. Balance prompt archetypes, or your results will skew.
Low AI visibility (working definition)
A measurable state: your brand appears in fewer than half of relevant unbranded answers, is misdescribed in branded answers, lacks credible citations, or repeatedly loses to the same competitors across engines over multiple weekly runs.
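The definitions above can be encoded as a simple record, so every prompt in the benchmark carries its test dimensions explicitly. A minimal sketch, assuming field names of our own choosing (no dashboard uses exactly these):

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: a benchmark prompt never mutates in place
class Prompt:
    prompt_id: str   # stable ID that survives across weekly runs
    text: str        # the exact string typed into the AI system
    branded: bool    # True = tests recognition/accuracy; False = tests recommendation
    archetype: str   # definition, comparison, how-to, pricing, alternatives, ranking, problem
    release: str     # version label of the benchmark this prompt belongs to

# "NF-001" and "v1" are illustrative labels, not a real benchmark entry.
p = Prompt(
    prompt_id="NF-001",
    text="Who are the best AI visibility firms for industrial manufacturers?",
    branded=False,
    archetype="ranking",
    release="v1",
)
```

Freezing the dataclass mirrors the governance rule: a prompt's wording is never edited in place, only replaced under a new release.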
Why Prompt Creation Must be Its Own Job (Not Tool-Generated)
Prompt engineering comes first. Most AI visibility dashboards mix tool-generated prompts with your ad-hoc additions, then score the blend. So do not treat the SEMrush, Scrunch, or Writesonic “AI visibility score” (or any other single number) as ground truth. In our tests, swings in “visibility” were caused by changes to the prompt inputs, not by the market.
- Modeled prompts, not buyer language
Many tools auto-generate “prompts” from keyword data or scraped summaries. They skew broad, brand-neutral, and category-heavy. Presence looks high; relevance to the business does not.
- No ICP or journey grounding
Prompts must be tied to awareness, consideration, decision, or post-purchase language. With autogenerated prompts, you get category chatter instead of decision-grade questions that reflect how your buyers actually ask.
- Archetype skew
One mixed list of prompts often blends definitions, comparisons, how-tos, pricing, and rankings without balance. Whatever archetype dominates steers the results and makes the score fragile.
- Geography masked by rollups
The same prompt produced different recommendations across the U.S., Europe, and the Middle East. A single “global” score hides where you compete and where you don’t.
- Prompt drift
Small wording changes move the trend line even when the market is static. If you can’t freeze phrasing, you can’t compare runs.
- Silent template changes
Tool-side updates to templates or indices can shift outcomes. If you aren’t capturing full answers with timestamps and engine versions, you cannot explain the swing.
- No version control
Tiny edits overwrite baselines. Without prompt IDs and release labels, movement appears to be a market change when it is a measurement error.
Without a locked, buyer-authentic prompt set, you are testing a moving target.
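The geography problem is easy to demonstrate with a toy calculation. All numbers below are invented for illustration; the point is that a blended rollup can sit in the middle while every region tells a different story:

```python
# Presence (1 = brand appeared in the answer) for one prompt, per weekly run.
# Invented numbers, purely to show how a rollup masks regional variance.
runs_by_region = {
    "US":          [1, 1, 1, 1],  # consistently recommended
    "Europe":      [0, 0, 0, 0],  # never recommended
    "Middle East": [1, 0, 1, 0],  # unstable
}

def presence_rate(runs):
    return sum(runs) / len(runs)

per_region = {r: presence_rate(v) for r, v in runs_by_region.items()}
global_rollup = presence_rate([x for v in runs_by_region.values() for x in v])

print(per_region)     # {'US': 1.0, 'Europe': 0.0, 'Middle East': 0.5}
print(global_rollup)  # 0.5 -- the "global" score hides both the win and the loss
```

A flat 0.5 “global visibility” line would report no movement, even though the US result is a solved problem and the Europe result is an open one.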
Next, we define the benchmark we’re using at No Fluff and the runbook that keeps inputs stable, so movement is real and explainable.
What a Credible Prompt Set Looks Like
A credible AI Visibility prompt set is intentional, structured, and fixed. It is not a keyword list or a tool-generated dump. It is a controlled benchmark designed to measure how AI recognizes, understands, and recommends a brand.
For No Fluff, this benchmark is intentionally built across six prompt clusters, each testing a different dimension of AI visibility:
1. Branded Direct:
Purpose: Entity recognition and brand recall.
These represent the highest-signal visibility category; the goal is definite: 100% recall across AI engines. Examples for No Fluff itself:
- What is No Fluff?
- What does No Fluff do?
- Is No Fluff a legitimate AI visibility company?
- How does No Fluff measure AI search visibility?
2. Semi-Branded:
Purpose: Brand-to-category and service association. These are crucial because engines must understand contextual associations around your business. Examples for No Fluff itself:
- No Fluff GEO services
- No Fluff AI search optimization for B2B service firms
- How does No Fluff help businesses get recommended by AI models?
3. Category:
Purpose: Non-branded competitive positioning, with no brand names in the prompts.
This is exactly what tools use to benchmark how often a brand is recommended in non-brand queries. Examples for No Fluff itself:
- Best AI search visibility companies
- Top generative engine optimization consultants
- Who helps industrial brands appear in AI search results?
4. Problem / Pain point / Use case:
Purpose: Problem-solution reasoning that mirrors real decision language. Examples for No Fluff itself:
- Why is my company not showing up in AI search engines?
- What affects whether AI models mention a brand?
- How do I fix low GEO visibility?
5. Comparison, Alternatives & Ranking:
Purpose: Decision-stage evaluation. Alternatives and “best of” list prompts surface the competitive set and your unique value proposition. Examples for No Fluff itself:
- Best companies for AI visibility and generative engine optimization
- What are the best alternatives to GEO agencies?
- Companies similar to AI search visibility providers
- Best PR + SEO + AI visibility hybrid agencies
6. Advanced Semantic Association:
Purpose: Deep methodological and authority understanding. These go beyond visibility into “does the model understand the playbook we sell,” which is ideal for proving deep authority. Examples for No Fluff itself:
- How do structured content frameworks improve AI visibility?
- What technical foundations are required for GEO?
- What is the future of AI visibility for B2B brands?
Together, these clusters cover the full AI-driven buyer journey: awareness, consideration, decision, and validation.
Six Criteria for Solid Prompt Building
To be credible, the prompt set must meet the following criteria:
- Buyer Aligned Language:
Prompts are short, plain-language, and reflect how real buyers ask questions inside AI tools, not internal marketing language or SEO abstractions.
- Full Journey Coverage:
- Branded and Category prompts test awareness and recall
- Problem prompts test solution framing
- Comparison and Ranking prompts test decision stage positioning
- Advanced Semantic prompts test authority and trust
- Mapped to Real Services:
Prompts are directly tied to your business’s actual offerings (For us, GEO, AEO, AI Visibility Sprint, Authority Building), ensuring visibility testing reflects commercial reality, not theoretical interest.
- Industry and Context Sensitivity:
Category and Problem prompts are replayed across buyer industries where AI interpretation differs by market.
- Revenue Weighted Prioritization:
High intent prompts that influence recommendations and vendor selection are prioritized over informational or low-impact queries.
- Locked and Version Controlled:
Our prompt set totaled 150 prompts and became a fixed benchmark. New prompts are introduced only through review and versioning, never ad hoc edits, preserving consistency and enabling meaningful week-over-week comparisons.
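The “locked and version controlled” criterion can be enforced mechanically: fingerprint the release and refuse to trend runs whose fingerprints differ. A minimal sketch under our own assumptions (the hashing approach is our illustration, not a feature of any tool):

```python
import hashlib

def release_fingerprint(prompts):
    """Hash the exact prompt texts in a fixed order; any edit changes the digest."""
    joined = "\n".join(sorted(prompts))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]

baseline = ["What is No Fluff?", "Best AI search visibility companies"]
this_week = ["What is No Fluff?", "Best AI search visibility companies "]  # trailing space

if release_fingerprint(baseline) != release_fingerprint(this_week):
    # Even a one-character edit breaks comparability: re-baseline, don't trend.
    print("Prompt set changed -- week-over-week comparison is invalid")
```

A one-character difference (here, a trailing space) is enough to change the digest, which is the point: drift gets caught before it is mistaken for market movement.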
Measurement with Controlled Fixed Prompts
With the prompts fixed, our analyst shifted to how we plan to test. The point wasn’t to build a perfect lab; it was to make week-over-week results comparable and useful for marketing reporting and content planning.
- Capture the full answer with timestamp and engine version
- Score brand presence, recommendation strength, and accuracy
- Log rank within the answer and competitor frequency
- Grade source authority in tiers
- Track geography and note regional variance
- Rerun on a set cadence so drift is visible
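The runbook above maps to a simple per-answer log record. A sketch of what one captured row might look like, assuming field names of our own invention (no real engine name or API is implied):

```python
import json
from datetime import datetime, timezone

def log_answer(prompt_id, engine, engine_version, region, answer_text,
               brand_present, in_answer_rank, competitors, source_tier):
    """One row per (prompt, engine, region) per run; full answer kept verbatim."""
    return {
        "prompt_id": prompt_id,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "engine": engine,
        "engine_version": engine_version,
        "region": region,
        "answer_text": answer_text,        # full text, so swings can be explained later
        "brand_present": brand_present,
        "in_answer_rank": in_answer_rank,  # None when the brand is absent
        "competitors": competitors,        # who appeared alongside (or instead of) us
        "source_tier": source_tier,        # graded authority of cited sources
    }

# Hypothetical row; "example-engine" and the values are placeholders.
row = log_answer("NF-001", "example-engine", "2025-01", "US",
                 "full answer text goes here", True, 2, ["Competitor A"], "tier-1")
print(json.dumps(row, indent=2))
```

Storing the full answer text alongside timestamp and engine version is what makes a later swing explainable instead of mysterious.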
What this Unlocks for the Business
Marketing gets a clean narrative: where we win recommendations, where we lose, and which competitors show up with us.
Content gets a backlog that maps to what models actually cite: missing proof, weak sections, and thin pages. That is the backlog you execute against.
Leadership sees cause and effect: a new case study enters the citations, rank rises, recommendation language improves.
Let’s build your AI visibility benchmark
Get in touch with us to help engineer the right prompt set, put reporting on stable ground, and show where lift will come from.
FAQ
What do you mean by “prompt engineering” here?
Test design. A governed, versioned set of buyer-authentic prompts used to measure three outcomes: recognition, understanding, and recommendation.
How big should the prompt set be?
Coverage first, size second. Balance six clusters (Branded, Semi-Branded, Category, Problem, Comparison, Advanced Semantic) and core archetypes. Lock the release (e.g., NF-150) for the sprint.
How often should we run it?
Weekly is enough when prompts are fixed and answers are logged with engine versions.
What exactly do we capture in each run?
Full answer text, timestamp, engine/version, presence, recommendation strength, in-answer rank, competitor frequency, cited sources (tiered), and geography.
How do we attribute movement?
Movement is real when engines or cited sources changed while the prompt inputs stayed frozen. Otherwise, treat swings as drift until corroborated across runs.
What technical foundations are required for GEO?
Clean markup and fast pages, Organization plus Service/Product schema, canonical URLs and sitemaps, stable hierarchy, precise About and Service pages, outcome-driven case studies, and consistent off-site profiles on trusted domains.
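As an illustration of the schema piece, Organization markup is typically emitted as JSON-LD in the page head. A minimal sketch built in Python; the names and URLs are placeholders, not real pages:

```python
import json

# Minimal Organization JSON-LD; all values here are placeholders.
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Co",
    "url": "https://www.example.com",
    "sameAs": [  # consistent off-site profiles on trusted domains
        "https://www.linkedin.com/company/example-co",
    ],
}

jsonld = json.dumps(organization, indent=2)
# Embed in the page as: <script type="application/ld+json"> ... </script>
```

Service or Product markup follows the same pattern with its own `@type`; the consistency of `name`, `url`, and `sameAs` across pages is what helps engines resolve the entity.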
How do we fix low GEO visibility?
If branded prompts fail, repair entity clarity and citations first. If unbranded prompts fail, add decision-grade comparisons and outcomes, and earn third-party coverage. Re-run the same set to confirm lift.
How does this connect to content planning?
Use findings to build a short backlog tied to what models cite: missing proof, unclear definitions, thin sections, and absent comparisons. Ship the smallest item likely to move a recommendation, then re-test.
Can we automate all of this?
Parts, yes. Logging, scheduling, screenshots, and scoring can be scripted. The judgment calls still need a human analyst who knows the buyer and the product.
Do we need daily runs?
No. Weekly is sufficient if prompts are fixed and engines are logged. Daily adds cost without better insight for most B2B teams.
How many prompts are “enough”?
Start lean with high-intent prompts per persona and region. Expand only when you see a signal that a new area can generate revenue. Keep versions locked so trends remain comparable.