
How to Measure AEO in 2026: 10-Prompt Framework & Rubric

[Cover image: holographic rubric connected to AI engines to measure AEO]

Key Takeaways

  • In AEO there are no clicks to count. The master metric is the citation rate: how often your brand is cited as a source in answers generated by ChatGPT, Perplexity, Google AI Overviews, Gemini and other engines.
  • Without your own set of test prompts, 'working on AEO' is working blind. This article provides a reproducible framework of 10 prompts + scoring rubric + 5 platforms, ready to apply to any brand in less than three hours the first time.
  • We applied the framework to our own agency (SciData) as a baseline case: 4 of the 5 platforms rank us at the top after four months of implementation. The fifth one doesn't list us but describes us correctly when asked directly — a pattern with a technical explanation and a concrete solution.

If you implemented any of the Answer Engine Optimization techniques on your site during the last month, you have an operational problem: you don’t know if it worked.

Unlike classic SEO — where Search Console tells you how many impressions, clicks, and positions you have — AEO takes place inside generative engines that do not expose their metrics in the same way. ChatGPT doesn’t show you how many times your brand was cited. Neither does Perplexity. Google AI Overviews is starting to show it partially, but coverage is uneven.

The honest way to measure AEO in 2026 is what we’re going to break down here: build your own set of test prompts, run them systematically on relevant platforms, and apply a reproducible scoring rubric. It’s manual work, but it’s manageable and replicable; above all, it tells you whether what you’re doing is working.

The master metric: citation rate

The core indicator of AEO is the citation rate, defined as the percentage of prompts representative of your category in which your brand is cited as a source in the response generated by an AI engine.

It’s a simple metric in concept and demanding in execution. Simple because it only counts appearances in a finite set of prompts. Demanding because it forces you to properly define three things: what prompts are representative of your category, what counts as a valid citation, and on which platforms to measure.

To this master metric, two useful secondary ones can be added:

  • Mention position: if you are cited, are you first, second, third or lower on the list?
  • Type of context: is the citation a direct recommendation (“I recommend X”) or a secondary mention (“there are options like X, Y, Z”)?

The two secondary metrics matter because appearing in first place as a direct recommendation is not the same as appearing fourth as a side mention. But the overall citation rate is the metric that tells you if the needle is moving.
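To make that trade-off concrete, here is a minimal Python sketch of a weighted visibility score that folds position and context into one number. The weights are entirely hypothetical and not part of the framework; the point is only to show that a 1st-place direct recommendation can be scored far above a 4th-place side mention.

```python
# Hypothetical weights, chosen only for illustration; the framework itself
# tracks position and context as separate fields.
POSITION_WEIGHT = {1: 1.0, 2: 0.7, 3: 0.5}  # 4th or lower falls back to 0.3
CONTEXT_WEIGHT = {
    "direct_recommendation": 1.0,
    "secondary_mention": 0.6,
    "technical_citation": 0.4,
}

def citation_value(position: int, context: str) -> float:
    """Score a single citation by where and how the brand appears."""
    return POSITION_WEIGHT.get(position, 0.3) * CONTEXT_WEIGHT[context]

print(citation_value(1, "direct_recommendation"))  # 1.0
print(citation_value(4, "secondary_mention"))      # 0.3 * 0.6 = 0.18
```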

The 10-prompt test framework

Design a set of 10 prompts representative of the real searches a potential buyer would make in your category. Mix three types:

Three direct brand prompts. These confirm that, when someone already knows you, the engines describe you accurately. Examples for an SEO agency:

  1. “What is [your brand] and what does it do?”
  2. “Who is [founder’s name] in digital marketing?”
  3. “What services does [your brand] offer?”

Four non-brand category prompts. This is where the most interesting game plays out: do you appear in a recommendation when someone doesn’t know you and only describes what they need? Examples:

  1. “What [your category] agencies do you recommend in [your market]?”
  2. “What are the best [specific service] companies in [region]?”
  3. “What [concrete problem] tools are there for sites in [language]?”
  4. “Who does [specific solution] that you know of?”

Three specific problem prompts. These mimic real top-of-funnel searches: people who have a concrete problem and don’t yet know what solution they’re looking for. Examples:

  1. “How to [solve core problem of your value proposition]?”
  2. “How do you do [specific technical task] in [context]?”
  3. “What metrics are used to measure [result you promise]?”

Ten is the minimum viable. Twenty is better if your category is broad. What matters is that the set reflects the real searches a buyer would do — not the searches you would like them to do. If you have a sales team, validate it with them before locking it in.
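If you prefer to keep the set under version control, here is a minimal sketch of the 10-prompt set as a Python structure. The bracketed placeholders mirror the templates above and get filled in per brand; nothing about the code is prescriptive beyond the 3 / 4 / 3 mix.

```python
# The 10-prompt set as data: 3 brand + 4 category + 3 problem prompts.
# Bracketed placeholders are filled in per brand before the first round.
PROMPT_SET = [
    {"type": "brand",    "text": "What is [your brand] and what does it do?"},
    {"type": "brand",    "text": "Who is [founder's name] in digital marketing?"},
    {"type": "brand",    "text": "What services does [your brand] offer?"},
    {"type": "category", "text": "What [your category] agencies do you recommend in [your market]?"},
    {"type": "category", "text": "What are the best [specific service] companies in [region]?"},
    {"type": "category", "text": "What [concrete problem] tools are there for sites in [language]?"},
    {"type": "category", "text": "Who does [specific solution] that you know of?"},
    {"type": "problem",  "text": "How to [solve core problem of your value proposition]?"},
    {"type": "problem",  "text": "How do you do [specific technical task] in [context]?"},
    {"type": "problem",  "text": "What metrics are used to measure [result you promise]?"},
]

assert len(PROMPT_SET) == 10  # the minimum viable set
```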

The five platforms to evaluate

Each engine cites differently and prioritizes different sources. If you only measure on one, you miss half the picture. The five we recommend for market coverage in 2026:

  1. ChatGPT with search enabled: the most widely used overall, high authority with general audiences.
  2. Perplexity: preferred engine for professionals and technical users. It prioritizes explicit definitions, evidence with figures, and fresh, structured content.
  3. Google AI Mode (on google.com with an unauthenticated session to avoid personalization): where the bulk of potential traffic still is.
  4. Gemini: relevant on mobile and within the Google Workspace ecosystem.
  5. Copilot: integrated into Office and Edge, important for corporate B2B audiences.

Five platforms × ten prompts = 50 measurements per round. The first round will take you between two and three hours; from the second on, about an hour and a half if you’re well organized.

The scoring rubric

For each prompt + platform combination, record five fields:

Field | Values
--- | ---
Does your brand appear cited? | Yes / No
In what order of mention? | 1st / 2nd / 3rd / 4th+
Context type | Direct recommendation / Secondary mention / Technical citation
Is the name spelled correctly? | Yes / No (note specific errors)
If you don’t appear, what two sources did the AI cite instead? | (free text)

The last field is the most strategic: it tells you who you are really competing against in each query. That list is your playing field. If a source consistently appears when you don’t, that source is doing something you are not doing. Looking at their content, their schema, and their domain authority tells you what to replicate.

Operational tip: build the rubric as a table in Google Sheets or Notion. One sheet per measurement round. The fifth column (“who you lose to”) deserves its own aggregated sheet to detect patterns over time.
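For teams that would rather script the rubric than keep it in a spreadsheet, a minimal sketch of one rubric row as a Python record, plus the aggregated “who you lose to” count, could look like this (the field and function names are our own, not part of any tool):

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Measurement:
    """One row of the rubric: a single prompt + platform combination."""
    platform: str
    prompt: str
    cited: bool                       # Does your brand appear cited?
    position: int | None = None      # 1, 2, 3, 4 (4 = 4th or lower); None if not cited
    context: str | None = None       # direct / secondary / technical; None if not cited
    name_correct: bool | None = None  # spelled correctly? note specific errors elsewhere
    competitors: list[str] = field(default_factory=list)  # sources cited instead of you

def who_you_lose_to(round_data: list[Measurement]) -> Counter:
    """Aggregate the fifth field across a round to surface recurring rivals."""
    losses = Counter()
    for m in round_data:
        if not m.cited:
            losses.update(m.competitors)
    return losses
```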

How to read the results

Once the first round is completed, calculate your overall citation rate: number of prompt + platform combinations where you appeared cited divided by the total (50 if you followed the minimum set).
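Given a list of Measurement records like the one sketched above, the calculation is a one-liner; this assumes the full round of combinations was recorded:

```python
def citation_rate(round_data: list[Measurement]) -> float:
    """Share of prompt + platform combinations where the brand was cited."""
    return sum(m.cited for m in round_data) / len(round_data)

# e.g. 18 citations out of 50 measurements:
#   citation_rate(round_data) == 18 / 50 == 0.36  -> the 20-40% "real traction" band
```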

Reference ranges for reading the result:

  • Above 40%: solid positioning. Your AEO work is paying off. Focus on maintaining content cadence and pushing the head queries where you are still weak.
  • Between 20% and 40%: real traction; the plan is working. Next 90 days: push the techniques that move the needle fastest (canonical definitions, FAQ structured data, author authority) and measure again.
  • Below 20%: either you’re a new brand in the category, or there’s a fundamentals problem. Review techniques 4.1 and 4.5 from the AEO guide (answer capsules and brand authority) before continuing to invest in content.
  • Near 0%: it’s not a content quantity problem, it’s a structural problem. Your site is likely self-categorizing differently from how buyers search for you. The fix is always the same: review the H1, Organization schema markup, and meta description of the home and key commercial pages.

Measurement cadence

  • Month 1 — baseline: a full round to set ground zero.
  • Months 2 to 3 — biweekly: two rounds per month to detect quick movements. The implementation phase is where the needle moves the most.
  • Month 4 onwards — monthly: the bulk of the change is already consolidating. A monthly round is enough to detect trends and react.
  • Quarterly — full audit: in addition to the 50 measurements, review which sources are winning the queries where you don’t appear. That refreshes your real competitive map.

Applied case: SciData

We applied this framework to our own agency (SciData Argentina) between January and April 2026, after four months of implementing the techniques described in the AEO guide. The most recent round, measuring prompt #4 of the set (“what SEO agencies with artificial intelligence do you recommend in Argentina?”), gave the following result across five platforms:

Platform | SciData’s Position | Reading
--- | --- | ---
ChatGPT (with search) | #1 | Direct recommendation for “complex B2B lead generation”.
Perplexity | #1 | Outstanding leader. Correctly attributes a real healthcare client case.
Copilot | #1 (shared) | High detail of services.
Grok | #2 | Behind only SEO Express.
Gemini | Absent from list, known when asked | Categorization case, see below.

Four of the five platforms ranked us at the top in the open question. The fifth, Gemini, did not include us in that list — but when asked directly about SciData, it returned an extensive and favorable description, accurately using terms like GEO, RAG, JSON-LD, social listening and predictive modeling. Its explanation was lucid: it categorizes us as a “Data Intelligence consultancy” rather than an “SEO agency”, and therefore did not add us to the list.

It wasn’t an authority problem. It was a semantic categorization problem. The technique in the AEO guide that explains this is 4.5 (brand authority + extraction-friendly structure): the bucket a search engine uses to classify you determines whether or not you enter the lists that matter.

The action we took was concrete: we changed the H2 of the home page to explicitly include the phrase “B2B SEO agency with applied AI”, reinforced the schema markup with Organization and Person correctly attributed, and consolidated the pages targeting similar queries into a single canonical one. We will measure again in two weeks to verify whether the change moved the categorization in Gemini.
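For reference, a minimal sketch of the kind of Organization markup that change involves, generated here with Python’s json module; every value is hypothetical and should be replaced with the brand’s real data:

```python
import json

# Every value below is hypothetical -- replace with the brand's real data.
organization_schema = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "SciData",
    "description": "B2B SEO agency with applied AI",  # mirrors the new home H2
    "url": "https://example.com",  # placeholder, not the real domain
    "founder": {
        "@type": "Person",
        "name": "[founder's name]",  # placeholder
    },
}

# Embedded in the homepage <head> as:
#   <script type="application/ld+json"> ... </script>
print(json.dumps(organization_schema, indent=2))
```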

The framework served two simultaneous functions: it confirmed what was working (ChatGPT, Perplexity, Copilot, Grok) and accurately diagnosed what kind of problem we had in Gemini. Without systematic measurement, the latter was invisible.

Common measurement mistakes

Three frequent traps we have seen in client work and on our own path:

Measuring only on one platform. “ChatGPT doesn’t cite me” is an incomplete conclusion if you didn’t also check Perplexity, Copilot, and Gemini. Each engine has different preferences and sometimes one punishes you while others reward you. The full picture requires all five.

Measuring brand prompts and declaring success. If you appear first in “what is [your brand]?”, it’s not news: it was expected. Real measurement is in the non-brand category and specific problem prompts. Those are the ones that bring you new clients.

Measuring once and archiving it. AEO moves. What was a good baseline in January may be obsolete in April. Measurement cadence is not optional; it’s part of the work.

What do you do with the data?

Data alone is useless. Measurement is valuable when it feeds concrete decisions:

  • If your citation rate rises into the upper band (above 40%), you can confidently invest more in AEO content — the system works.
  • If it remains stable or drops, there’s an operational bottleneck: review fundamentals before producing more volume.
  • If a specific platform doesn’t list you but knows you (the Gemini case we described), the problem is almost always categorization: review schema, H1, and meta on your homepage.
  • If the sources beating you are always the same, look deeply at their content: they have a technique you don’t, and replicating it accelerates your progress.

Next steps

If you’ve never measured AEO on your brand or your client, do a full round with the framework this week. Three hours. It will give you more clarity on where you stand than a month of loose hypotheses.

If you want us at Planet Communities to help you define your prompt set and teach you how to read the results as part of a sustained AEO program, let’s talk.

Frequently Asked Questions

How long does it take to measure AEO the first time? Between 2 and 3 hours for 10 prompts × 5 platforms (50 measurements). From the second round on, an hour and a half if well organized. The initial investment is in building the prompt set and the rubric; after that it’s execution.

Is it useful to use automated tools to measure AEO? There are emerging tools (Profound, Otterly, AthenaHQ, LLMrefs) that automate part of the process. They are useful past a certain scale (multiple projects, multiple categories), but manual measurement is still the best starting point because it forces you to read the full answers and understand why engines cite certain sources and not others. Automation does not replace that qualitative reading.

Why include five platforms and not three or two? Because each engine has different preferences in how it chooses sources. A brand might appear #1 on Perplexity and absent on Gemini, or vice versa. Only by measuring five do you get the full picture of your visibility in generative responses in 2026.

When should I measure again? The recommended cadence is: full round at the beginning (baseline), biweekly during active implementation months 2-3, monthly starting from month 4, and quarterly with deep competitive audit. Without cadence, data loses value quickly.

What do I do if I don’t appear in any prompt? A citation rate near zero is rarely a content quantity problem. It’s almost always structural: your site isn’t self-positioning in the correct category, it lacks proper schema markup, or it has low domain authority. Review techniques 4.1 (canonical definitions), 4.2 (schema markup), 4.5 (authority), and 4.7 (extraction structure) from the AEO guide before continuing to invest in new content.


This article was produced by Gustavo Papasergio with Claude (Anthropic) as an editorial and analysis co-pilot. The measurement of the SciData case is our own and reproduces the framework described in this post.
