Data democratization: the process of making data accessible to everyone in an organization, regardless of their technical skills.
Data democratization is a conundrum that old-school Ralph Kimball acolytes like me have been trying to solve for decades. Starting with user-friendly data models (data warehouses) and then moving on to the plethora of highly evolved, user-friendly business intelligence tools now available, we have come a long way.
And yet, the ability to derive new insights from data remains largely the domain of data analysts, data scientists, and business analysts. For the vast majority of everyone else in professional organizations, the technical gap around data (real or perceived) persists.
A glimmer of hope?
In late November 2022, OpenAI’s release of ChatGPT allowed the general (read: non-technical) public to interact with a large language model (LLM) by simply typing a query (prompt) in natural language. Through this conversational user interface, users could prompt the LLM to answer questions about the data it had been “trained” on. In ChatGPT’s case, it was trained on, well… the Internet.
ChatGPT put incredible data processing power in the hands of everyone who had access to it. As we became aware of the possibilities of this mechanism, many of us in the data analytics field quickly began to think about its potential impact on our own space.
We didn’t have to think long…
Just four months after ChatGPT’s initial release to the general public, OpenAI released an alpha version of a ChatGPT plugin called Code Interpreter. With it, anyone could load a dataset into ChatGPT, enter a few prompts, and have it run Python to perform regression analysis, descriptive analysis, and even create visualizations. All this without having to write any code!
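Under the hood, Code Interpreter simply writes and executes ordinary Python on your behalf. As a minimal sketch of the kind of code it might generate for a prompt like "summarize this data and fit a regression" (the dataset and column names here are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a file a user might upload (invented numbers).
df = pd.DataFrame({
    "ad_spend": [10.0, 20.0, 30.0, 40.0, 50.0],
    "revenue": [25.0, 44.0, 66.0, 83.0, 105.0],
})

# Descriptive analysis: summary statistics for every column.
summary = df.describe()
print(summary)

# Regression analysis: ordinary least-squares fit of revenue on ad_spend.
slope, intercept = np.polyfit(df["ad_spend"], df["revenue"], deg=1)
print(f"revenue ~= {slope:.2f} * ad_spend + {intercept:.2f}")
```

The point is that a non-technical user never sees any of this; they only see the prompt they typed and the resulting summary table, regression line, or chart.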
The release of Code Interpreter gave us all a glimpse of how AI-powered conversational data analysis could work. It was amazing!
Soon after, citing ChatGPT’s already established ability to write code (SQL, R, and Python, to name a few) as well as the nascent capabilities of Code Interpreter, many began predicting the eventual demise of the role of data analyst. (At the time I disagreed and even wrote an article about it).
Will generative AI replace the need for data analysts? (Galen Okazaki, Towards Data Science)
Certainly, such a prediction didn’t seem far-fetched when you consider the possibility that even the least technically inclined people in your organization might be able to learn from their data simply by typing, or even speaking, their questions.
So could data analysis based on conversational AI be the key to bridging the technical gap between data and its democratization?
Let’s take a closer look.
The Current State of AI-Driven Conversational Data Analytics
So… it’s been almost a year and a half since the release of that alpha version of Code Interpreter and what progress have we made with AI-powered conversational data analytics? Probably not as much as you might have expected.
For example: in July 2023, ChatGPT’s Code Interpreter was rebranded and re-released as Advanced Data Analysis. Not only did Code Interpreter get a new name, it also… uh… uh… Well, at least its new name provides a more accurate description of what it actually does. 🤷♂️
In all honesty, Code Interpreter/Advanced Data Analysis is a great tool, but it was never intended to be an enterprise-wide analytics solution. It still works only with static files that you upload to it; you cannot connect it to a database.
For a better perspective, let’s look at some currently available analytics tools that have integrated conversational AI interfaces.
Power BI Q&A
The first attempt to implement conversational data analytics actually predates the release of ChatGPT. In 2019, the omnipresent Microsoft Power BI released a feature called “Q&A.” It allowed users to type questions about their data in natural language, as long as that language was English (currently the only one supported).
This is done through a text-box interface embedded in an existing dashboard or report. Through this interface, users ask questions in natural language about the dataset behind that particular dashboard or report. Power BI uses natural language query (NLQ) technology to translate the typed questions into queries. The answers are rendered as visualizations.
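To illustrate the general idea behind NLQ (this toy pattern-matcher is not Power BI's actual implementation, which uses far more sophisticated linguistic models plus the dataset's metadata), here is a sketch of how a recognized question shape can be mapped onto a structured query:

```python
import pandas as pd

# Invented sample data standing in for the dataset behind a dashboard.
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "amount": [100, 150, 200, 50],
})

def answer(df: pd.DataFrame, question: str) -> pd.Series:
    """Naive NLQ sketch: parse 'total <measure> by <dimension>'
    and map it onto a group-by aggregation."""
    words = question.split()
    measure, dimension = words[1], words[3]
    return df.groupby(dimension)[measure].sum()

result = answer(sales, "total amount by region")
print(result)  # East -> 300, West -> 200
```

A real NLQ engine handles synonyms, ambiguity, and many question shapes, but the core question-to-query mapping is the same idea.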
While this feature has its uses, it has a significant limitation: Power BI Q&A can query only the dataset behind the report or dashboard being viewed, which is far too narrow if your ultimate goal is enterprise-wide data democratization.
Snowflake Cortex Analyst
A more fitting example of AI-powered conversational data analytics, one that could potentially support data democratization, is Snowflake Cortex Analyst.
To briefly summarize, Snowflake itself is a continually evolving, cloud-based SaaS data warehousing and analytics platform that gives customers the ability to scale their storage and/or compute up or down according to their needs. Its architecture also supports high-speed data processing and querying.
Cortex Analyst is Snowflake’s version of AI-powered conversational data analytics. Right off the bat, it has a huge advantage over Power BI Q&A: instead of limiting users to the dataset behind an existing report or dashboard, Cortex Analyst allows the user to query the entire underlying database. It does this by relying on a semantic layer and semantic model to interpret user requests.
This brings us to a critical point.
Having a fully vetted semantic layer in place is an absolute prerequisite for data democratization. It makes perfect sense: before giving everyone in your company the ability to work with data, you need universally accepted definitions of the data and metrics being used. More on this later.
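As a sketch of what "universally accepted definitions" can look like in practice (this is an invented fragment, not Snowflake's actual semantic-model format), imagine a single shared source of truth that every tool resolves metric names against:

```python
# Hypothetical semantic-layer fragment: each business metric has exactly
# one agreed-upon definition that all tools and users share.
SEMANTIC_MODEL = {
    "tables": {
        "orders": {
            "dimensions": ["region", "order_date", "customer_id"],
            "measures": {
                "revenue": "SUM(order_total)",
                "order_count": "COUNT(DISTINCT order_id)",
            },
        },
    },
}

def measure_sql(table: str, measure: str) -> str:
    """Resolve a business metric name to its single agreed SQL expression."""
    return SEMANTIC_MODEL["tables"][table]["measures"][measure]

print(measure_sql("orders", "revenue"))  # SUM(order_total)
```

When a conversational interface asks "what was revenue by region last quarter?", it is this layer that guarantees "revenue" means the same thing for every user, every dashboard, and every query.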
While I’ve only covered two examples of AI-powered conversational data analytics here, they should be enough to help you imagine their potential role in the democratization of data.
The challenges of data democratization
While the ability to ask a question about your data in natural language and get an answer has significant potential, I believe the biggest challenges to data democratization are not technological.
Let’s start with the prerequisites for successful data democratization. These include a robust data infrastructure, the previously mentioned semantic layer and model, data literacy, data quality, and data governance. Each of these is a significant undertaking in its own right, and the reality is that, for many businesses, each remains a work in progress.
This is especially true for data literacy.
Indeed, while 92% of business decision-makers believe that data literacy is important, only 34% of companies currently offer data literacy training (source: the Data Literacy Index, Wharton School of Business).
Another challenge is one I’ve seen throughout my career in data analytics. In my experience, there has always been a group of users (including some at the C-level) who, for various reasons, refused to use the BI interfaces we built for them. Even though they were generally a minority, they reminded us that while bells and whistles are great, many people will stubbornly continue to work only with what they know best.
Summary
A successful data democratization effort cannot rely on any single technology, no matter how promising. It requires a visionary, multi-faceted approach that includes a strong data infrastructure and a data-driven organizational mindset, in addition to the appropriate technologies.
So while AI-powered conversational data analytics alone cannot solve the data democratization conundrum, it can most certainly play an important role in an overall effort.
Side note:
As someone who believes in empowering businesses to work with data, I see immense value in AI-powered conversational data analytics.
In my opinion, at least for the moment, the highest and best use of this tool is in the hands of business analysts. Given their combined knowledge of how the business works (domain knowledge) and their already established data fluency, they are best equipped to leverage conversational analytics to get answers without being encumbered by complex code.