AI, Synthetic Data, and the Future of Data Quality

Exploring the transformative impact of AI and synthetic data on data quality, with practical guidance for consumer and sensory research, as featured on the Research Live podcast


Alex Kuzmina, Innovation Associate Director, NOVA

26 Nov, 2024 | 7 minutes

I had the privilege of being interviewed by Liam Kay-McClean, deputy editor of Research Live, for the final episode of the MRS podcast series on synthetic data and artificial intelligence. In this episode, I discussed three important questions:

  1. What impact will AI and synthetic data have on data quality in the research industry?
  2. Is better data quality a prerequisite for introducing synthetic data models?
  3. What is the one data quality rule market research organizations should keep in mind when expanding their use of AI?

You can listen to the podcast here:

TUNE IN ON SPOTIFY

TUNE IN ON APPLE PODCASTS

MRS Research Live Podcast: The Role of AI and Synthetic Data in Market Research

Wondering about the impact of AI and synthetic data on consumer and sensory research? Tune into this podcast hosted by Research Live and read full answers for some practical guidance from NOVA's Alexandra Kuzmina.


While the podcast gives a snapshot of these discussions, this page provides the full responses I shared during the interview, offering more detail on the challenges and opportunities for our industry. Tune in to the Research Live podcast to hear the full conversation.

What do you think the impact of AI and synthetic data will be on data quality in the research industry?

It's very important to define what we're talking about when we discuss synthetic data, as this space is not mature yet, and common definitions could even help ease some very heated debates. Ray Poynter has done some great work to establish a common language and ensure transparency, which I think is very important.


For simplicity, let's break synthetic data down into two areas. These are not exhaustive, but they can be helpful:


1. Synthetic Data (Quant) or Synthetic Survey Responses

This is about augmenting and extrapolating from existing data. For example, if you run booster samples today to meet quotas for specific sub-groups, synthetic data could be used in situations where boosts can't be achieved or are simply cost-prohibitive. This is what most suppliers of synthetic data are talking about. The idea here is to stretch and, in a sense, magnify known data without needing new real-world responses. But whether this promise holds true is something we're still exploring, along with a more fundamental question: is it really needed?
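To make the idea concrete, here is a minimal and deliberately naive sketch of what "boosting" a sub-group with synthetic rows could look like; it simply resamples each question from the answers the real sub-group gave. The column names and toy data are assumptions for illustration only, and real suppliers use far more sophisticated generative models.

```python
# Naive illustration of synthetic boosting: new rows are sampled question-by-question
# from the answers the real sub-group actually gave. Column names and data are invented.
import random
import pandas as pd

def naive_synthetic_boost(subgroup: pd.DataFrame, target_n: int, seed: int = 42) -> pd.DataFrame:
    """Generate extra rows by resampling each column of the real sub-group independently."""
    rng = random.Random(seed)
    extra_needed = max(target_n - len(subgroup), 0)
    synthetic_rows = {
        col: [rng.choice(subgroup[col].tolist()) for _ in range(extra_needed)]
        for col in subgroup.columns
    }
    return pd.concat([subgroup, pd.DataFrame(synthetic_rows)], ignore_index=True)

# Toy example: a sub-group of 3 real respondents "boosted" to 6 rows.
real = pd.DataFrame({"purchase_intent": [4, 5, 3], "liking": [4, 4, 5]})
print(naive_synthetic_boost(real, target_n=6))
```

Even this toy version exposes a pitfall: sampling each column independently destroys the correlations between answers that make a real respondent's record coherent, which is one reason we remain cautious about the promise.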

2. Synthetic Personas

These are also called synthetic respondents, users, or AI personas because, again quoting Ray Poynter, the word "synthetic" may have negative connotations for some. The term might not be perfect either, but it's about interacting with existing data in a conversational way: think of it as "chatting with your data." By feeding project data into a machine learning model, you can interact with that data through a conversational interface in a more qualitative manner, asking questions you may not have thought of during primary data collection. The key is that the model generates answers grounded in responses from real people, whether that is project or census data.
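As a rough illustration of the "grounded in real responses" idea, here is a minimal sketch of how a persona prompt could be assembled from verbatim project data before being sent to a language model. The call_llm function, column names, and prompt wording are placeholders, not a description of any specific supplier's product.

```python
# Sketch of "chatting with your data": the persona only sees verbatim survey answers,
# so its replies stay anchored to what real participants said. All names are assumed.
import pandas as pd

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your preferred large-language-model client here."""
    raise NotImplementedError("Connect this stub to your LLM provider of choice.")

def build_persona_prompt(responses: pd.DataFrame, segment: str, question: str) -> str:
    """Assemble a prompt that grounds the persona in verbatim answers from real people."""
    verbatims = responses.loc[responses["segment"] == segment, "open_end"]
    quoted = "\n".join(f'- "{v}"' for v in verbatims.head(50))  # cap the prompt length
    return (
        f"You are a persona representing the '{segment}' segment.\n"
        "Answer ONLY using the survey responses below; say 'not covered' otherwise.\n\n"
        f"Survey responses:\n{quoted}\n\n"
        f"Question: {question}\nAnswer:"
    )

def ask_persona(responses: pd.DataFrame, segment: str, question: str) -> str:
    """'Chat with your data': answers are generated from the grounded prompt."""
    return call_llm(build_persona_prompt(responses, segment, question))
```

The design point is that the model is only shown what real participants said, so its answers stay anchored to the primary data rather than to whatever the model learned elsewhere.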


This is what we have available as synthetic personas now. And they have huge potential for the future: they could evolve into digital twins of real participants, answering questions or talking to each other, really transforming the relationship that brands, as end clients, have with consumer data today.

I think your question relates more to synthetic quant data or synthetic survey responses. As an idea, synthetic data can have a big impact in specific areas, such as medical research, where finding enough real-world participants for rare conditions can be challenging. In these contexts, synthetic data can fill critical gaps and help move research forward.

However, in CPG and FMCG research, it's a different story. We're not dealing with inaccessible groups; we're researching everyday consumers. In these cases, from what we've seen so far, at least for our niche use cases, synthetic data raises significant questions about its reliability. But we are continuing to explore this area.

There are loads of suppliers out there promising this. From our experience, you need to make sure the people you are talking to understand the types of data you collect, because consumer data can be highly variable; this is especially true when we're working with survey data in FMCG, where consumer behaviors, attitudes and preferences vary a lot.


So, to answer your question, synthetic data debates might be a blessing in disguise for our industry. This is mainly because they force us to essentially rethink how we view data quality in market research. Beyond this, they shine a light on long-standing challenges, like declining engagement in research, which has been a major contributor to suboptimal data quality.

Yes, engaging participants in meaningful ways at scale is expensive, but duplicating data rows or artificially boosting samples might not be the answer. There’s a lot of buzz around how synthetic data could solve issues like small sample sizes or hard-to-reach sub-groups, but from what we've seen so far, it hasn't delivered on these promises in the CPG space yet.

But let's imagine that synthetic data does work; then there is a lot to think through. If you can suddenly have base sizes of 1,000, then significance testing becomes a bit irrelevant, as small differences will be significant. If you can boost one sub-group to 1,000, why not boost them all? I'm not sure the MRS has drawn up guidelines yet on how to manage, analyze and report synthetic data (like they have for weighted data).
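To illustrate the point about base sizes, here is a small worked example, using made-up percentages, showing how the same gap between two groups flips from "not significant" to "significant" once each base is inflated to 1,000, even though no new consumers have been interviewed.

```python
# Illustrative only: the same 6-point gap (53% vs 47%) tested at n=100 and n=1000 per group.
from math import sqrt
from statistics import NormalDist

def two_prop_z(p1: float, p2: float, n1: int, n2: int) -> tuple[float, float]:
    """Classic two-proportion z-test; returns (z statistic, two-sided p-value)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

for n in (100, 1000):
    z, p = two_prop_z(0.53, 0.47, n, n)
    print(f"n={n} per group: z={z:.2f}, p={p:.3f}")

# Output:
#   n=100 per group:  z=0.85, p=0.396  -> not significant
#   n=1000 per group: z=2.68, p=0.007  -> "significant", same 6-point gap
```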

We think the more interesting area to look at is the persona side, because personas can be fed different sources of information and alternative data, and they can be augmented in new ways that can be really powerful for both clients and agencies.

By alternative data, we mean looking beyond traditional survey data to sources like social media activity, geolocation data, open data from governments and local authorities, or even weather patterns. We could also apply frameworks from behavioral science to our existing data, which may not boost sample sizes but could offer deeper, richer, more meaningful consumer insights. Maybe this approach could be a more valuable use of time and effort.


Is better data quality a prerequisite for the introduction of synthetic data models?

Absolutely, and as any data scientist will tell you, it’s the classic 'garbage in, garbage out' scenario. Synthetic data models are only as good as the real-world data they're trained on. If the underlying data is flawed or incomplete, synthetic outputs will simply magnify those issues.

There is a risk that synthetic data can make the data more homogeneous, which can lead to oversimplification or a loss of important variability. If you're starting with a small dataset (say, 20 people) and extrapolating that to 50, you're still limited by the quality and representativeness of those original 20 responses unless you're integrating other, high-quality data sources. Without that, you're just applying statistics to a flawed foundation.
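Here is a small simulated sketch of that limitation: resampling 20 real answers up to a "sample" of 50 barely moves the estimate, but the naive standard error shrinks, giving a false sense of precision. The ratings are random toy data, not from any real study.

```python
# Illustrative simulation of the "garbage in, garbage out" point: extrapolating
# from 20 responses to 50 rows adds no new information, only apparent precision.
import random
from math import sqrt
from statistics import mean, stdev

random.seed(7)
real_20 = [random.randint(1, 5) for _ in range(20)]         # 20 simulated 5-point ratings
synthetic_50 = [random.choice(real_20) for _ in range(50)]   # "extrapolated" to 50 rows

for label, data in (("real n=20", real_20), ("synthetic n=50", synthetic_50)):
    naive_se = stdev(data) / sqrt(len(data))
    print(f"{label}: mean={mean(data):.2f}, naive standard error={naive_se:.2f}")

# The means are nearly identical, but the n=50 standard error looks noticeably
# smaller, even though no new consumer has been interviewed.
```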

What is the one data quality rule market research organizations should keep in mind when expanding their use of AI?

If I had to pick just one, I’d say it’s to never lose sight of the real consumers behind the data. AI should help us get closer to the consumer, not take us further away.

This means that regardless of how advanced your AI models are, they must be grounded in high-quality, real-world data that truly reflects the behaviors, attitudes, and nuances of actual people.

In practice, this means market research agencies should have rigorous data validation processes. AI might automate certain aspects of our processes, but we still need to ensure that what it’s working with is clean, accurate, and representative. There's also the benefit of AI helping to advance data cleaning checks.
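As an illustration of what such validation could include, here is a minimal sketch of two common automated checks: flagging "speeders" (implausibly fast completes) and "straight-liners" (identical answers across a rating grid). The column names and thresholds are assumptions to adapt to your own data; this is a starting point, not a complete quality framework.

```python
# Sketch of basic survey data-quality flags. Column names and thresholds are assumed.
import pandas as pd

GRID_COLS = ["q1", "q2", "q3", "q4", "q5"]   # assumed rating-grid columns
MIN_SECONDS = 180                             # assumed minimum plausible interview length

def flag_quality_issues(df: pd.DataFrame) -> pd.DataFrame:
    """Add boolean flags for speeders and straight-liners."""
    out = df.copy()
    out["speeder"] = out["duration_seconds"] < MIN_SECONDS
    out["straight_liner"] = out[GRID_COLS].nunique(axis=1) == 1
    return out

# Example usage on a toy dataset:
sample = pd.DataFrame({
    "duration_seconds": [95, 640, 420],
    "q1": [3, 4, 2], "q2": [3, 5, 2], "q3": [3, 2, 2], "q4": [3, 4, 2], "q5": [3, 1, 2],
})
print(flag_quality_issues(sample)[["speeder", "straight_liner"]])
```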

But don’t just rely on AI outputs—constantly question whether the insights align with real-world behavior, especially when dealing with consumer-driven industries like FMCG or CPG. At the end of the day, AI is a tool to enhance research, not a substitute for high-quality, authentic human data.

AI should help us get closer to the consumer, and, ironically, AI has huge potential here. I say ironically because we are using technology to get closer to people, precisely because technology is what allows us to do that at scale.


As AI and synthetic data evolve, they will continue to shape the future of market research—for better or worse. Staying informed and engaged is key to navigating these changes. If you’re keen to explore more about these trends and other innovations, check out December’s edition of the NOVA Partnership Initiative Newsletter.