Synthetic Survey Data? It's Not Data

Chris Chapman

[I write this reluctantly, because (1) AI is split into factions and this post may not change any minds; (2) it may upset some well-intentioned researchers. On the other hand, I’ve been asked and hope it is helpful.]

There is widespread discussion of LLM-generated synthetic survey data and its utility. Some survey panel providers promote synthetic data as being fast and cheap and, they claim, as a way to “boost the prevalence” of infrequent respondent groups.

In this post I share my point of view on synthetic survey data — not because I want to wade into any AI wars, but because the topic arises repeatedly and colleagues have asked me to write a post.

Among other things, I argue in this article that:

  • the concept of synthetic data is logically flawed

  • synthetic data fails empirically to emulate human data

  • synthetic data cannot overcome sampling limitations

I include some references, but this is not a comprehensive review. It only compiles my thoughts, in varying degrees of completeness, plus a few pointers to other folks’ work. You may agree with some arguments more than others — and that’s fine, I only hope you agree with something :)

Side note: this post discusses the utility of synthetic data, but that is only one consideration. I’ve described elsewhere why it is important to consider ethics, externalities, aesthetics, and social structures alongside the utility of AI. However, proponents of synthetic data discuss it in terms of utility, so that’s where I focus discussion in this article.


The Concept of Synthetic Data is Logically Flawed

I see three fundamental logical problems with LLM synthetic survey data:

  • It rests on the common, yet I believe incorrect, assumption that surveys are about sampling a “true state of affairs” in the world. That view is imported from classical psychometrics where it is a simplifying assumption. But it is not what surveys do in the real world. As I have explained elsewhere, surveys are a form of motivated communication and they must be interpreted as such. They cannot be viewed as measurements of any particular “reality” that is accessible or meaningful apart from considering motivation.

Thus, unless we are interested in the motivations of LLM systems, their data logically cannot replace human data. It doesn’t even matter whether their answers might in some way be “the same” … because the point of surveys is not to measure “the answer” in the first place. The point is to listen to people, and that requires … well, listening to people.

  • The second logical problem is temporal: LLMs are trained on past data, whereas the goal of a survey should be to listen to people now. Even if an LLM’s training data happen to align with our question, those data are outdated as soon as the model is trained. Thus, synthetic data has no determinable relationship to what we want to know now. It might or might not be relevant; we don’t know.

You might object, “what if I want data about something that other researchers have asked in the past?” That’s fine … but if those data already exist, you don’t need to go through the convoluted process of writing a survey that you hope will align with those data, and then subject that survey to an LLM that you hope will use the data to give responses to your survey, which you then hope will recreate the data. Instead, simply Google your question and access existing data directly.

  • The third problem concerns the domain of inquiry: the space of potential business questions is infinite, but existing data is finite. Thus we may expect that most questions (the infinite space) have not been answered … especially when we are working in a new product area. Although LLMs create novel data on demand, there is no logical reason to expect that their statistical models will infer any particular novel truth — and, indeed, LLMs are not even designed to represent truth within their training sets. [Mathematical note: there are infinities of differing sizes. Even if an LLM can infer within one infinity of data, it doesn’t necessarily infer within all infinities of data. I’ll set that aside.]

To summarize: there is no need to do research when the answer to your question already exists. But you can’t find out whether an answer to your question exists by asking an LLM to create an answer. You need to find an actual data set or else collect new data.


Synthetic Data Fails Empirically (yet that’s not the right question)

There have been various empirical studies assessing whether LLM synthetic data aligns with real data. I believe that is somewhat the wrong question, as I’ll explain later. Meanwhile, the empirical results contradict the claim that synthetic data is similar to human data. Here are a few examples.

  • Bisbee et al (2024) demonstrated that ChatGPT survey results are unstable and are not representative of human survey answers. They found that, “sampling by ChatGPT is not reliable for statistical inference … [Also] the distribution of synthetic responses varies with minor changes in prompt wording, and … the same prompt yields significantly different results over a 3-month period.”

  • Paxton & Yang (2024) found that LLMs and humans report strongly differing “attitudes” about technology products. They found that “language model responses diverge from human responses—often dramatically … [These] divergent results are robust to multiple prompt variations, model families (Gemini, GPT, etc.), and major updates to the models … [Therefore] language model responses should not be used to replace or augment human survey responses at this point in time.”

The following table is one snapshot from Paxton & Yang, assessing the correlations in attitudinal ratings obtained from ChatGPT, Google Gemini, and human raters. Among 8 attitudes, only for ratings of “helpfulness” did the correlation exceed r = 0.20; and even then, it was only r = 0.28-0.30, a weak correlation. The median correlation between LLM and human attitudes was r = 0.10. This means that LLMs reproduced a median of only 1% (r² = 0.01) of the variance in human respondents’ attitudinal ratings.

[Figure: correlation matrix from Paxton & Yang (2024), comparing the emotional valence of ratings obtained from human respondents, Gemini, and ChatGPT. Across all 8 dimensions, the correlation between LLM and human ratings was low, in no case exceeding r = 0.3, with a median agreement of r = 0.10. By contrast, the two LLM models showed higher agreement with one another (median r = 0.19, but with high variance, ranging from r = -0.04 to 0.98).]

  • Samoylov (2024) noted multiple problems with LLM-created data, especially that LLM results vary dramatically, across many dimensions, in response to relatively simple rephrasing of prompts. Similar to my point above, he further noted that it is impossible to know when one’s domain of interest is covered by LLM training data. He wrote, “how [a prompt] was worded massively affected the results. This is a demonstration of the test-retest unreliability of using LLM-generated responses … because you do not know what most of the LLMs were trained on, you do not know what kind of knowledge they encode … this one observation is enough to make anyone interested in getting real data look the other way”.

An especially nice feature of Samoylov’s article is the inclusion of R code to demonstrate the unreliability of LLM synthetic data. You can update his prompts and examine for yourself whether, in your domain, the answers from an LLM are reliable. The code is at the end of his article.
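In that spirit, here is a minimal R sketch of the test-retest idea (this is not Samoylov’s code, and the response counts below are invented purely for illustration): gather responses to two rephrasings of the same question from the same source and compare the distributions. A reliable source should give similar distributions; in the studies above, LLMs did not.

```r
# Toy test-retest check: compare response distributions for two rephrasings of
# the "same" question. The counts below are invented for illustration only.
likert <- 1:5

prompt_a <- c(10, 25, 40, 60, 65)   # counts per scale point, wording A
prompt_b <- c(35, 55, 50, 40, 20)   # counts per scale point, wording B (reworded)

# If the source were reliable, the two distributions should be similar
chisq.test(rbind(prompt_a, prompt_b))   # tests whether the distributions differ

# A simple summary of the practical impact: difference in mean scale value
mean_a <- weighted.mean(likert, prompt_a)
mean_b <- weighted.mean(likert, prompt_b)
c(mean_a = mean_a, mean_b = mean_b, difference = mean_a - mean_b)
```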

In short, these empirical studies demonstrate that:

  • Results vary from LLM model to model, time to time, and prompt to prompt.
    Implication: synthetic data are not reliable in the way we expect data to be reliable.

  • Results do not agree with human responses or the patterns of human responses.
    Implication: synthetic data do not have construct validity to mimic human responses in the way they claim. (Read more about construct validity here.)

In the post so far, we’ve seen: (1) that the concept of synthetic data misunderstands how surveys work as motivated communication so the data don’t make sense logically; and, (2) even if we set that problem aside, the resulting data empirically do not align with human survey data. Next, I’ll examine why empirical claims about synthetic data are unscientific in any case.


… Yet empirical evaluation is not the right question

[tl;dr: This is a long sub-section, but I hope it will be worth your time to consider the logic here!]

Although the empirical work above is admirable, it reflects a whack-a-mole strategy. Empirical investigations have no particular theoretical justification; instead, they respond to claims by AI proponents. Such proponents claim synthetic data “works” and demand proof otherwise.

Let’s assume the opposite of the previous empirical results for a moment. Let’s suppose — for the sake of argument — that synthetic responses to survey items are expected to be universally reliable and valid. What would that mean? It would imply that an LLM could, in principle, be expected to answer any question, on any survey, similarly to how a human would answer.

To make that more explicit, let’s look again at the general claims for synthetic data. Suppose we want to know about the overall population likelihood that people will purchase our new product. A traditional survey of purchasing intent will:

  • Target a group of people and ask about each person’s intention to purchase our product

  • Expect that the intention will be somewhat indicative of future behavior

  • Sample a population so we can estimate the aggregate behavior of the targeted group

The claim for synthetic data is the same: that we can use synthetic responses to “sample” a group and estimate its aggregate intentions or behavior. For example, we might estimate whether synthetic “purchasers” are likely to purchase our product. And then we project that estimate to humans.

When we think about that for a moment, we might notice two things. First, it seems highly implausible. If a data provider has an LLM with such capability — to credibly answer any survey question about future behavior similarly to how humans would answer — why would they use it to sell survey panel responses?

Imagine, as one example, that we can accurately assess the intentions of executives and stock traders. For instance, we could envision survey items about their intended behaviors, such as these:

  • Tomorrow, will you sell XYZ stock?

  • Today, will you increase your hedge of ABC currency?

  • In the next hour, will you short UVW stock?

Just to be clear, these questions do not ask for predictions about future stock prices or the like. They ask only about a person’s intended behavior. In exactly the same way we might survey consumers about intentions to purchase a product, we could ask traders about intentions to purchase stocks, commodities, or currency. We can use the aggregate of those responses to infer likely population behavior.

If we could accurately assess such behavioral intentions, we could use that aggregate information quite profitably to predict stock, currency, or commodity futures — and that would be a much better business than selling survey responses. It is easy to imagine similar examples of valuable data in the domains of politics, medicine, pharmaceuticals, military affairs, logistics, shipping, and the like. If we could use synthetic data to estimate intended behaviors of groups of people in those domains, it would have value that far exceeds that of selling survey panel data.

“But wait!”: A side note about such a hypothetical survey

Let’s set aside LLMs for a moment. Do you object that executives and traders would not answer such items on a survey? That they would not report their intended trades? Or that they would not answer honestly? Do you worry that any answers they give wouldn’t predict real behavior?

Good! If you think that, your views may align with interpreting surveys as being about motivated communication, and not as about sampling any particular “reality” apart from that. And that implies that the core premise of synthetic data is illogical, as I noted at the outset of this post (you might return to the top of this post and re-read my discussion; or see this separate post for more.)

For this first point, my inference is this: synthetic data providers do not believe that their systems can do anything like this, or that their data have such value. Instead, they sell such data because it has low value. Beyond that behavior itself, providers reveal how much they believe the data are worth through their pricing and through their own statements about how cheap it is.

Now, maybe you’re thinking, “I understand the argument, but perhaps providers do use the LLMs profitably in the way you describe, but do so secretly while also selling survey responses as a second business.” That is logically possible, yet: (1) why would they give away a competitive advantage of that kind and let others match what they are doing? (2) where are the case studies to prove such capability? (3) why is their pricing so low? Overall, this possibility seems extremely unlikely, for all the reasons outlined here.

Second, the claim that any survey question might credibly be answered by an LLM is not a claim that can be tested empirically. How would one go about testing that? One would need to define the space of all possible survey questions, determine a sampling strategy for that space, and find the “ground truth” of human responses across that entire space to compare to LLM responses. And one would need to show that conclusions from any such sample generalize across the effectively infinite space of possible questions, not just the subspaces that happened to be tested.

I have no idea how one could do that in a general way. Therefore, the question of empirically assessing LLMs is necessarily limited to specific domains (as in each of the examples above) and cannot be evaluated generally.

Consider, third, that when we evaluate the reliability of synthetic data in any particular domain, an LLM proponent can always claim, “that is just one example.” This shows that an empirical assessment strategy cannot answer the general claim about answering survey items, because that general claim is not an empirical claim. Instead, it is a belief disguised as a claim.

In summary, the “hypothesis” that is supposedly being tested in empirical evaluation of synthetic data — namely, that it can replace human data — is a marketing claim and not a scientific hypothesis.

Next we’ll look at another marketing claim: that synthetic data can replace hard-to-reach subgroups of respondents.


Synthetic Data Cannot Overcome Sampling Limitations

LLM-generated data providers postulate that synthetic data can emulate “hard to reach audiences” (example). The term “hard to reach” is rarely defined by providers. Instead, they use vague references to imprecise notions such as “increasing the diversity” of respondents.

From discussions I’ve had, this is often interpreted in two ways. First, it may mean niche audiences who have low prevalence in panels (corporate executives, physicians, developers, users of one specific product, etc). Second, it may mean historically marginalized respondents such as ethnic minority groups, people with disabilities, different language groups, and others.

In either case, the premise is that these are people who do not answer our surveys as often as we researchers would like them to. (Remember my argument that surveys reflect motivated communication? This premise also aligns with that view.) Platforms claim that they can “boost” the responses for such people using synthetic data that either supplements or completely replaces real data.

Unfortunately, there is no reason to believe that LLMs can accomplish the goal of representing difficult-to-reach audiences, and there are several reasons to expect otherwise:

  • We saw above that LLM data does not agree with general population responses. Why would an LLM give better data for subgroups than it does for larger populations? There is no logical or statistical reason to expect that.

  • We know that LLM training data over-represent some groups (English-speaking, white, educated, Western, technology-interested, affluent) and under-represent other groups (non-English-speaking, other than white, non-Western, etc.). There are algorithmic approaches to reduce bias, but — even if a provider has implemented some of them — how would we know that bias has been eliminated for some potentially unique group of interest that we want to sample? There is no a priori bias reduction that works in advance for targeted samples that will only be specified later. (This is similar to the empirical questions above; it cannot be answered in a general way, only on a per-domain basis. The toy simulation after this list illustrates the representation problem.)

  • Even when subgroup bias has been addressed in some way, the approaches are fragile and non-generalizable. Ferrara (2024) describes this as the “butterfly effect”: small changes to algorithms, training data, or prompts can lead to substantial changes in the output of LLM models that magnify the effects of biases. What does that imply? Even if an LLM system generates reliable synthetic data that overcomes biases at one point in time and for one group, we have no reason to expect that it will do so at a later time, or with a different group, or after an algorithm update.
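As referenced in the second bullet above, here is a toy R simulation of the representation problem. This is only a statistical illustration, not a model of how any particular LLM works: a rare subgroup genuinely differs from the majority on some attitude, but the corpus the generator learned from over-represents the majority, so a generator that reproduces corpus-wide patterns misstates the subgroup no matter how many synthetic “respondents” we draw.

```r
# Toy illustration (not a model of any real LLM): a rare subgroup differs from
# the majority, but the corpus the generator reproduces over-represents the majority.
set.seed(123)

# True population: group B is 10% of people and much less favorable (mean 2.5 vs 4.0)
pop <- data.frame(group = sample(c("A", "B"), 100000, replace = TRUE, prob = c(0.9, 0.1)))
pop$rating <- ifelse(pop$group == "A",
                     rnorm(nrow(pop), mean = 4.0, sd = 0.7),
                     rnorm(nrow(pop), mean = 2.5, sd = 0.7))

# "Training corpus": group B is nearly absent (1% of records instead of 10%)
corpus <- rbind(pop[sample(which(pop$group == "A"), 9900), ],
                pop[sample(which(pop$group == "B"),  100), ])

# A generator that simply reproduces corpus-wide patterns: here we mimic that by
# drawing "synthetic group B" responses from the corpus overall (an assumption of
# this toy example; real systems are more complex, but the representation gap remains)
synthetic_B <- sample(corpus$rating, 1000, replace = TRUE)

c(true_group_B_mean      = mean(pop$rating[pop$group == "B"]),
  synthetic_group_B_mean = mean(synthetic_B))   # far from the true subgroup mean
```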

In short, the claim that LLM synthetic data can amplify otherwise underrepresented groups appears to be highly unlikely as a statistical matter, and is impossible to prove as a general expectation.

There is another sampling problem that I’ve written about separately: the basic concepts and statistics of sampling do not apply to LLMs. Beyond the concerns in this post, this means that any “sample” from an LLM has no particular statistical meaning. It is, in fact, not a sample at all, but the output of a mostly uncharacterized stochastic process. (This implies, for example, that we cannot place meaningful confidence intervals on statistics from synthetic data; the estimates certainly are not exact values, yet we have no way to assess our degree of sampling uncertainty about them.)
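To make the confidence-interval point concrete, here is a minimal R sketch (a generic textbook interval for a proportion, not any provider’s method). The arithmetic will happily run on any 0/1 responses, including synthetic ones, but its interpretation rests on the data being a probability sample of independent respondents from a defined population, which LLM output is not.

```r
# Standard normal-approximation interval for a proportion from a probability sample.
# The math runs on any 0/1 data, but its meaning assumes actual random sampling.
prop_ci <- function(successes, n, conf = 0.95) {
  p  <- successes / n
  z  <- qnorm(1 - (1 - conf) / 2)
  se <- sqrt(p * (1 - p) / n)   # sampling variance formula assumes a random sample
  c(estimate = p, lower = p - z * se, upper = p + z * se)
}

# Example: 180 of 400 sampled respondents say they intend to purchase
prop_ci(180, 400)

# The same call on 400 synthetic "respondents" produces numbers that look identical,
# but there is no population or sampling process for the interval to refer to.
```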


Side Note: Questions for Research Ethics

Although I don’t have space here to consider all aspects of research ethics, there is a core question that I encourage every researcher to ask: does synthetic data meet our ethical obligations to society and to ourselves?

Various governmental and professional organizations outline ethical requirements for research. These include legal definitions such as the US Code of Federal Regulations (42 CFR 93.234) and professional standards such as the Ethics Code of the American Psychological Association (2017). The legal definition in 42 CFR is relatively typical:

Research misconduct means fabrication, falsification, or plagiarism in proposing, performing, or reviewing research, or in reporting research results.

So the question is: are synthetic data fabricated, falsified, or plagiarized?

I don’t propose an answer here because the set of considerations — from models of ethics such as deontology vs. consequentialism, to the definition of individual words and the functioning of specific LLM systems with respect to plagiarism (e.g., Reisner, 2025) — is too large to tackle in this post.

Instead, I believe that the question is important (and sometimes mandated) for each of us to consider and answer for ourselves. We should be confident that our research aligns with ethical requirements!


Other Uses for Synthetic Data: Unfortunately Unconvincing

Here are four common claims for synthetic data other than reporting it, with my brief rebuttals.

  • Synthetic data accelerates research. Rebuttal: synthetic data is not data; using it is not research.

  • Synthetic data can pre-test a survey and analyses. Rebuttal: random responses, as many platforms provide, are preferable (see the sketch after this list). Random responses are free of assumptions and correlated patterns, giving more comprehensive and unbiased tests. Random data can also be used to help assess data validity.

  • Synthetic data can preview expected results with stakeholders. Rebuttal: such data doesn’t preview anything. Better — and much less risky — is for a researcher to use a combination of domain knowledge and stakeholder engagement to create a few scenarios reflecting potential outcomes. We can discuss those with high specificity and relevance, without fabricating data.

  • Other colleagues will use such data, but I can use it more carefully. Rebuttal: they shouldn’t use it, either. We can’t let research ethics and practice be defined by what non-researchers might do.
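As referenced in the second rebuttal above, here is a minimal R sketch of pre-testing with random responses (the item names and scale ranges are hypothetical placeholders; substitute your own survey structure). Because the values carry no assumptions, they exercise the whole analysis pipeline without tempting anyone to read meaning into the output.

```r
# Generate purely random responses to pre-test a questionnaire and its analysis code.
# Item names and scale ranges here are hypothetical placeholders.
set.seed(42)
n_resp <- 300

pretest <- data.frame(
  resp_id      = seq_len(n_resp),
  segment      = sample(c("consumer", "smb", "enterprise"), n_resp, replace = TRUE),
  intent_q1    = sample(1:5, n_resp, replace = TRUE),   # 5-point Likert items
  intent_q2    = sample(1:5, n_resp, replace = TRUE),
  satisfaction = sample(1:7, n_resp, replace = TRUE)    # 7-point item
)

# Run the planned analyses end to end; with random data, "findings" should be null,
# which also helps confirm that the analysis code does not manufacture effects.
summary(pretest)
with(pretest, cor(intent_q1, intent_q2))
aggregate(satisfaction ~ segment, data = pretest, FUN = mean)
```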

Overall, I don’t view any of these claims as a good reason to use synthetic data. Let’s consider the benefits and risks. On the benefits side, synthetic data offers little value and there are preferable alternatives. On the risks side, creating synthetic data could lead to it being accidentally used, reported, or demanded by executives. In my opinion, the small benefits are strongly outweighed by the substantial risks.

Although it is great to pre-test a survey and its analyses, and it’s also great to preview potential results with stakeholders, those goals do not require synthetic data and are accomplished better without it.


Conclusion

My personal conclusion — as a matter of logic, empirical findings, statistical reasoning, and scientific principles — is that synthetic data has no place in survey research. I also believe the purported use cases for synthetic data are unconvincing; alternative approaches are superior in their results and are less risky in practice. And there are important questions about research ethics that each of us should consider.

Disagree? Publish your reasoning and results! (For a venue, check out quantuxcon.org)

All of this poses a question: why is there so much interest in synthetic data? Samoylov (2024) argues that it reflects a “snake oil” industry intent on selling products to naive customers.

For my part, I see the hope placed in synthetic data as a form of magical thinking. It is certainly appealing to believe that a data genie can magically create the data I need! And it is even more appealing to believe that it can free me from the difficulties of collecting real data, while getting results faster.

But more likely, we can’t escape the real work of survey research: collecting good data that informs unique decisions. On a happy note, actual research — which is to say, learning from people — is not only informative about the real world, it is also an enjoyable and rewarding enterprise.

And that process of learning from people will always deliver enough value to exist!


References

American Psychological Association (2017). Ethical Principles of Psychologists and Code of Conduct. At https://www.apa.org/ethics/code

Bisbee J, Clinton JD, Dorff C, Kenkel B, Larson JM. Synthetic Replacements for Human Survey Data? The Perils of Large Language Models. Political Analysis. 2024; 32(4):401-416. doi:10.1017/pan.2024.5

Ferrara E (2024). The Butterfly Effect in artificial intelligence systems: Implications for AI bias and fairness. Machine Learning with Applications, Volume 15. https://doi.org/10.1016/j.mlwa.2024.100525. At https://www.sciencedirect.com/science/article/pii/S266682702400001X.

Paxton J, Yang Y. (2024). “Do LLMs simulate human attitudes about technology products?” In Proceedings of the 2024 Quantitative User Experience Conference. At https://drive.google.com/file/d/16F_JZv4eHNiDMJT6BT7F6m97C2rBX8-7/view?usp=sharing

Reisner A (2025). “Search LibGen, the Pirated-Books Database That Meta Used to Train AI”. The Atlantic, at https://www.theatlantic.com/technology/archive/2025/03/search-libgen-data-set/682094/

Samoylov N (2024). Synthetic respondents are the homoeopathy of market research. At https://conjointly.com/blog/synthetic-respondents-are-the-homeopathy-of-market-research/

US Code of Federal Regulations (2024). Research misconduct. 42 CFR 93.234. https://www.ecfr.gov/current/title-42/section-93.234


Finally, as always, this post was … written by a human, not by AI.

Written by Chris Chapman

President + Executive Director, Quant UX Association. Previously: Principal UX Researcher @ Google; Amazon Lab126; Microsoft. Author of "Quantitative User Experience Research" and "[R | Python] for Marketing Research and Analytics".