The skepticism is fair, and most tools selling "AI visibility scores" deserve it. But the challenge is methodological, not fundamental.
The variance you are describing is real. Memory, personalization, subscription tier, and query phrasing all affect outputs. Where people go wrong is treating a single snapshot as ground truth. What actually gives you signal is running the same prompts repeatedly, across multiple engines, with consistent phrasing, and looking at trend lines, not point-in-time scores. That levels out most of the stochastic noise.
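In rough terms, the loop is something like the sketch below. Everything here is illustrative, not any vendor's actual code: `query_engine()` is a stand-in for whatever client you use, and the prompts, brand name, and run counts are placeholders.

```python
# Minimal sketch of the repeated-run approach: same prompts, multiple engines,
# many runs, and you read the trend line rather than any single data point.
from collections import defaultdict
from datetime import date

ENGINES = ["chatgpt", "gemini", "perplexity"]
PROMPTS = ["best crm for small teams", "top crm tools compared"]  # keep phrasing fixed across runs
BRAND = "YourProduct"
RUNS_PER_PROMPT = 10  # more runs -> less stochastic noise in the estimate

def query_engine(engine: str, prompt: str) -> str:
    """Placeholder: call the engine's API with memory/personalization disabled."""
    raise NotImplementedError

def daily_mention_rate() -> dict:
    hits = defaultdict(int)
    totals = defaultdict(int)
    for engine in ENGINES:
        for prompt in PROMPTS:
            for _ in range(RUNS_PER_PROMPT):
                answer = query_engine(engine, prompt)
                totals[engine] += 1
                hits[engine] += int(BRAND.lower() in answer.lower())
    # One point per engine per day; the series over weeks is the signal.
    return {e: (date.today().isoformat(), hits[e] / totals[e]) for e in ENGINES}
```

Run that on a schedule and plot the rates over time; a single day's number is still noisy, but the slope over weeks is not.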
The part that does remain genuinely hard is personalized memory. Short of running everything through the API with memory disabled or using clean browser contexts, you are always measuring a slightly different surface. Honest tools will tell you that upfront.
I built something in this space, so I am obviously biased, but the way we approached it was to track mention rate and citation rate separately across ChatGPT, Gemini, Perplexity, and a few others, using standardized prompt sets tied to your product and buyer stage. Over enough runs the pattern becomes pretty reliable. Not perfect, but directionally useful.
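To make the distinction concrete: a "mention" is the brand showing up in the answer text, a "citation" is your domain showing up in the sources the engine links. Something like the below, where the field names and stages are just for illustration, not our actual schema:

```python
# Rough illustration of keeping mention rate and citation rate as separate metrics,
# bucketed by buyer stage. RunResult is a hypothetical record of one prompt run.
from dataclasses import dataclass

@dataclass
class RunResult:
    engine: str
    buyer_stage: str          # e.g. "awareness", "comparison", "decision"
    answer_text: str
    cited_urls: list[str]

def score(runs: list[RunResult], brand: str, domain: str) -> dict:
    out = {}
    for stage in {r.buyer_stage for r in runs}:
        subset = [r for r in runs if r.buyer_stage == stage]
        mentions = sum(brand.lower() in r.answer_text.lower() for r in subset)
        citations = sum(any(domain in u for u in r.cited_urls) for r in subset)
        out[stage] = {
            "mention_rate": mentions / len(subset),
            "citation_rate": citations / len(subset),
        }
    return out
```

The two rates move independently often enough that collapsing them into one score hides the interesting part.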
The tools that just give you a score with no methodology breakdown? Yeah, those are probably snake oil. Worth asking any vendor how they normalize for the variables you mentioned before trusting the number.