Editor Tim King introduces this collection of insights on generative AI data quality from Solutions Review’s featured contributors. Articles appearing in this space were originally published on Insight Jam, an enterprise computing community enabling human conversation about AI.
As GenAI models become increasingly integrated into various industries, the accuracy, reliability, and integrity of the data they consume directly impacts their performance and the trust placed in their results. Poor data quality can lead to biased, inaccurate, or even harmful results, undermining the potential benefits of AI and posing significant risks to organizations and consumers.
Navigating these complexities requires leveraging the knowledge and experience of those at the forefront of these fields. The field of GenAI and data management is constantly evolving, with new methodologies and technologies emerging at a rapid pace. As a result, strategies for maintaining data quality are also evolving, requiring continuous learning and adaptation.
Whether addressing issues of bias and data representation or discussing the technical nuances of data preprocessing and validation, their expertise not only underscores the importance of rigorous data standards, but also provides concrete strategies for ensuring that generative models are built on a high-quality data foundation.
In this article, we have gathered the views of a distinguished group of experts in the field, each offering a unique perspective on the challenges and best practices related to data quality in the age of AI.
The thought leaders profiled in this article have been instrumental in shaping the discourse on data quality in generative AI. Their work not only addresses current challenges but also anticipates future developments, providing a roadmap for how to approach data quality in a way that is both forward-thinking and practical.
Generative AI Data Quality: Expert Opinions
Tola Capital Vice President Jake Nibley, Partner Akshay Bhushan, and Founder Sinan Ozdemir discuss how fine-tuning and data quality are defining the AI arms race:
“The data generation process cost the team less than $500 using the OpenAI API. On their first run, fine-tuning the model took three hours on 8x 80GB A100s and delivered performance similar to text-davinci-003 for only $100 in cloud computing costs. For less than $1,000, the team created a language model that won a blind pairwise comparison evaluation against text-davinci-003 (90 vs. 89). This shows us that it is more than possible for open source models to catch up quickly; it is inevitable.”
We’re moving toward a world where everyone has access to these models, and businesses and individuals alike will derive tremendous value from them. It’s up to businesses to decide not only how they create the next breakthrough technology using proprietary or open source models, but also how they refine them to deliver better outcomes for their specific use case. This type of innovation isn’t a binary approach: open source and proprietary models can work in harmony.
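To make the process described above more concrete, here is a minimal sketch of generating synthetic instruction/response pairs with the OpenAI API to assemble a fine-tuning dataset. The model name, prompt wording, and seed_instructions list are illustrative assumptions, not the exact pipeline the Tola Capital team describes.

```python
# Minimal sketch: generate synthetic instruction/response pairs with the
# OpenAI API and write them to a JSONL file for later fine-tuning.
# Model name, prompts, and seed_instructions are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

seed_instructions = [
    "Explain the difference between supervised and unsupervised learning.",
    "Summarize the main causes of poor data quality in analytics pipelines.",
]

with open("finetune_data.jsonl", "w") as f:
    for instruction in seed_instructions:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model; any capable model works
            messages=[{"role": "user", "content": instruction}],
        )
        answer = response.choices[0].message.content
        # Store each pair in the JSONL format commonly used for fine-tuning.
        f.write(json.dumps({"instruction": instruction, "output": answer}) + "\n")
```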
Read more at Solutions Review
Petr Nemeth, Co-Founder and CEO of Dataddo, offers insights on several solutions to improve data quality for AI initiatives:
“People-centric solutions for data quality, such as implementing a global data governance policy, will remain important, but they must be complemented by technological solutions that allow for the standardization and flagging of questionable data as early as possible in the AI lifecycle. This is why organizations that do not have the appropriate technologies and tools in place struggle to move AI initiatives into production.
For as long as humans have been collecting data, organizational solutions such as policies and methodologies have been essential to maintaining its quality. And they still are. However, on their own, they are clearly insufficient for AI workloads; they must be implemented alongside the appropriate technologies and tools.”
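As a rough illustration of the kind of technological complement Nemeth describes, the sketch below uses pandas to standardize records and flag questionable rows early, before they reach an AI pipeline. The file and column names are assumptions made for the example.

```python
# Minimal sketch: standardize incoming records and flag questionable rows
# (rather than silently dropping them) so they can be reviewed under a
# governance policy before feeding an AI pipeline. Columns are assumed.
import pandas as pd

df = pd.read_csv("customer_records.csv")  # assumed input file

# Standardize obvious formatting differences.
df["email"] = df["email"].str.strip().str.lower()

# Flag rows that look questionable.
df["flag_missing_email"] = df["email"].isna()
df["flag_duplicate"] = df.duplicated(subset=["customer_id"], keep=False)
df["flag_bad_age"] = ~df["age"].between(0, 120)

flag_cols = ["flag_missing_email", "flag_duplicate", "flag_bad_age"]
questionable = df[df[flag_cols].any(axis=1)]
print(f"{len(questionable)} of {len(df)} rows flagged for review")
```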
Read more at Solutions Review
Syniti’s Rex Ahlstrom offers a quick commentary on GenAI and data quality, and how to deploy a successful data strategy:
“Organizations must begin collecting and documenting data, metadata, procedures, processes, and business rules as part of their data quality programs. These are essential elements required for AI models to produce accurate and relevant results. By investing in initiatives to improve data quality, companies can establish a solid foundation for the application of AI.”
The context of the data is also important. How can you be sure that you are selecting the right datasets and inputs? Your results will be useless if you have high-quality data but poor data curation. To ensure that you get the most out of generative AI, you need to combine good data curation with high-quality data.
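One way to picture the pairing of documented business rules with curated data is the short pandas sketch below. The rules, columns, and file name are illustrative assumptions, not Syniti’s approach.

```python
# Minimal sketch: keep documented business rules alongside the data so the
# same curation checks can be rerun whenever datasets are selected as
# inputs for a GenAI model. Rules and column names are assumed examples.
import pandas as pd

business_rules = {
    "order_total": lambda s: s >= 0,                      # totals cannot be negative
    "currency": lambda s: s.isin(["USD", "EUR", "GBP"]),  # approved currencies only
    "order_date": lambda s: pd.to_datetime(s, errors="coerce").notna(),  # parseable dates
}

orders = pd.read_csv("orders.csv")  # assumed input file

for column, rule in business_rules.items():
    violations = ~rule(orders[column])
    print(f"{column}: {violations.sum()} rows violate the documented rule")
```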
Read more at Solutions Review
Bigeye’s Kyle Kirwan offers insight into the critical importance of data quality in this in-depth resource:
“Data quality is a critical aspect of modern business operations, impacting everything from day-to-day decisions to long-term strategic planning. By understanding the importance of data quality and implementing preventative measures, organizations can ensure their data is reliable, accurate, and fit for purpose. High-quality data not only drives effective and efficient decision-making but also builds trust among stakeholders. As the saying goes, ‘garbage in, garbage out.’ Ensuring data quality from the start can prevent costly mistakes and pave the way for successful data-driven initiatives.”
Read more at Solutions Review
Solutions Review expert Nicola Askham gives us her take on data quality as the secret ingredient to AI and generative AI success:
“We often marvel at the scale of large language models (LLMs). These behemoths owe their “scale” to the vast volumes of data they are trained on, collected from myriad sources. The quality of this data is the lifeblood of these models. It is through this data that the models learn the complex dance of language patterns, allowing them to generate coherent and contextually accurate answers.
As a data leader, I’ve often found myself fascinated by the intricacies of artificial intelligence and its relationship to data quality. However, it’s important to remember that AI, like any tool, is only as effective as the data it is trained on.”