What type of data analysis can AI perform?
ChatGPT is widely known as one of the most versatile AI tools, with plugins that extend it to a wide range of tasks. It can generate working code in Python, R, and many other languages, as well as complex SQL queries. Combined, these features let you use AI for almost every part of your data analysis work.
Use cases include:
- Querying
- Cleaning and other processing
- Visualization
When it comes to working with data, tools like Julius AI (for CSV files) or BlazeSQL (for SQL databases) are designed specifically for this purpose. Unlike ChatGPT, these tools do not require you to upload or connect your data and explain it again every time you open them.
ChatGPT works for quick analysis of a CSV file, but most companies store their data in SQL databases within private networks. Specialized tools can connect to these secure SQL databases and answer your questions by querying the database and visualizing the results.
How could AI replace data analysts?
Data analysis is about obtaining insights from data. Data analysts and data scientists are the people with the technical skills to provide stakeholders with the information they need. But things have changed, and AI tools can now handle certain tasks that previously only data analysts and data scientists could do.
In theory, a business user without technical skills could now connect their data to an AI tool and make a request like “Get monthly revenue grouped by product, for the top 3 products of the year”. The AI can then retrieve the data and even visualize it. The user would need only a few seconds to write the request. If they had asked a human colleague, they might have waited a few days or more for an answer.
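To make that request concrete, here is a minimal sketch of the kind of SQL an AI tool might generate for it. The `sales` table, its columns, and the sample rows are assumptions made up for illustration, and SQLite is used only so the example runs on its own.

```python
import sqlite3

# Hypothetical schema and data: one `sales` table (product, sale_date, amount).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("A", "2023-01-15", 100.0),
        ("A", "2023-02-10", 150.0),
        ("B", "2023-01-20", 80.0),
        ("B", "2023-02-05", 60.0),
        ("C", "2023-01-03", 40.0),
        ("C", "2023-02-11", 30.0),
        ("D", "2023-01-09", 10.0),
    ],
)

# "Get monthly revenue grouped by product, for the top 3 products of the year"
QUERY = """
WITH top_products AS (
    -- The three products with the highest total revenue in 2023
    SELECT product
    FROM sales
    WHERE strftime('%Y', sale_date) = '2023'
    GROUP BY product
    ORDER BY SUM(amount) DESC
    LIMIT 3
)
SELECT product,
       strftime('%Y-%m', sale_date) AS month,
       SUM(amount) AS revenue
FROM sales
WHERE product IN (SELECT product FROM top_products)
  AND strftime('%Y', sale_date) = '2023'
GROUP BY product, month
ORDER BY product, month
"""

monthly_revenue = conn.execute(QUERY).fetchall()
for row in monthly_revenue:
    print(row)
```

Even this short query embeds decisions the user never stated, such as which year is meant and whether "top" means revenue or units sold.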
Seeing a result like this can be both surprising and worrying for data analysts, but replacing data analysts and data scientists is not that simple. Running an SQL query and graphing the result is only part of their work, and even that cannot always be done reliably by AI. It may have worked in the example above, but what if the result is wrong even though it looks correct?
It seems like it’s time to talk about some of the limitations of AI for working with data.
Limitation #1: AI hallucinations
Most people who have worked with ChatGPT and similar tools have heard the term “hallucination” in this context. When you ask them about something they don’t know, they sometimes just make things up.
The reason for these hallucinations is simple: LLMs are like highly advanced autocomplete algorithms. They predict the most probable next message in a conversation, based on the data they were trained on. Thanks to high-quality datasets and advanced training techniques, this “autocomplete” works so well that these tools can respond to complex queries with remarkably high-quality results. Unfortunately, when they face situations their training data has not prepared them for, the most probable next message may not actually make sense.
What happens if the AI generates code that runs but returns incorrect data? A business stakeholder using an AI data analyst may have no idea that the result is wrong, because they cannot spot the error without understanding the code.
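As an illustration of code that runs but quietly returns the wrong number, consider a join that duplicates rows. This is a sketch with made-up `orders` and `shipments` tables, not output from any particular AI tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, amount REAL);
CREATE TABLE shipments (shipment_id INTEGER, order_id INTEGER);
INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
-- Order 1 was shipped in two parcels, so it has two shipment rows.
INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
""")

# Plausible-looking query: the join repeats order 1 once per shipment,
# so its amount is summed twice. No error is raised.
wrong_total = conn.execute(
    "SELECT SUM(o.amount) FROM orders o "
    "JOIN shipments s ON s.order_id = o.order_id"
).fetchone()[0]

# Counting each shipped order once gives the intended total.
right_total = conn.execute(
    "SELECT SUM(amount) FROM orders "
    "WHERE order_id IN (SELECT order_id FROM shipments)"
).fetchone()[0]

print(wrong_total, right_total)
```

Both queries execute without complaint, and only someone who knows that an order can have several shipments would recognize that the first total is inflated.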
Limitation #2: Business knowledge
Usually, when a new data analyst starts working at a company, they need to learn the meaning of certain columns and values. That is because the data model was designed by the company itself. You can’t analyze data without understanding where it comes from, because general knowledge is not enough to understand most databases.
AI tools like BlazeSQL allow you to include this information for the AI to use, but a data analyst or data scientist will be needed to keep it up to date.
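In its simplest form, supplying this information can mean placing column definitions in front of the question. The sketch below shows that general idea only: the glossary entries and the `build_prompt` helper are hypothetical, and this is not how BlazeSQL or any specific tool implements it.

```python
# Hypothetical glossary: column meanings that only people inside the company know.
BUSINESS_GLOSSARY = {
    "orders.status": "1 = paid, 2 = refunded, 3 = cancelled",
    "orders.amount": "Gross amount in cents, VAT included",
    "customers.segment": "'S' = self-serve, 'E' = enterprise",
}


def build_prompt(question: str, glossary: dict[str, str]) -> str:
    """Prepend column definitions to the user's question (plain string building)."""
    notes = "\n".join(f"- {column}: {meaning}" for column, meaning in glossary.items())
    return f"Column definitions:\n{notes}\n\nQuestion: {question}"


prompt = build_prompt("What was last month's net revenue?", BUSINESS_GLOSSARY)
print(prompt)
```

The glossary is only useful while it matches the database, which is why someone who knows the data has to maintain it.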
Limitation #3: Sometimes the AI gets stuck (aka “blind spots”)
You may have seen examples of ChatGPT getting stuck on a very basic question. These questions are often very easy to answer, but they require the AI to reason in ways it isn’t very good at.
We can call these cases “blind spots”, and they also exist when writing code. For example, one common AI blind spot in generating SQL queries involves subqueries and CTEs. AI models often generate queries that select a column from a subquery even though that column does not exist in the subquery:
WITH recent_orders AS (
    SELECT
        customer_id,
        MAX(order_date) AS latest_order_date
    FROM
        orders
    GROUP BY
        customer_id
)
SELECT
    customer_id,
    product_id, -- (This column is not defined in the subquery)
    latest_order_date
FROM
    recent_orders;
Even when the error is reported, they often make the same mistake when trying again.
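To see what that error looks like in practice, and one way a human might repair the query, here is a small sketch using SQLite. The sample rows are invented, and the fix shown (joining back to `orders`) is one reasonable reading of what the query was meant to return, not the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer_id INTEGER, product_id INTEGER, order_date TEXT);
INSERT INTO orders VALUES
    (1, 101, '2023-01-05'),
    (1, 102, '2023-03-01'),
    (2, 103, '2023-02-14');
""")

CTE = """
WITH recent_orders AS (
    SELECT customer_id, MAX(order_date) AS latest_order_date
    FROM orders
    GROUP BY customer_id
)
"""

# The query from the text: product_id is not exposed by recent_orders.
try:
    conn.execute(
        CTE + "SELECT customer_id, product_id, latest_order_date FROM recent_orders"
    )
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)  # SQLite reports the unknown column
print(error_message)

# One possible fix: join back to orders to fetch the product of the latest order.
fixed_rows = conn.execute(CTE + """
SELECT o.customer_id, o.product_id, r.latest_order_date
FROM recent_orders AS r
JOIN orders AS o
  ON o.customer_id = r.customer_id
 AND o.order_date = r.latest_order_date
ORDER BY o.customer_id
""").fetchall()
print(fixed_rows)
```

At least this failure is loud: the database rejects the query, so the mistake cannot slip through unnoticed the way a wrong but valid query can.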
Limitation #4: AI models agree too much
AI models will tend to agree with you, even if you are wrong. This can pose a huge problem when the AI model is supposed to play the role of an expert, since an expert should be able to correct you when you are wrong.
Limitation #5: Input length
A human can spend months learning about a project and the database, collecting a lot of important information. An LLM, on the other hand, typically has a “token limit,” meaning it can only accept a certain amount of input.
This limit is often restrictive when dealing with complex tasks. How could you distill months of learning into a few pages and feed them to the AI model?
The widely available version of GPT-4 is limited to 12 pages of input and output combined. Keep in mind that a data analyst will attend hours of meetings and read documentation or reports. Everything GPT-4 produces (code and explanations) must also fit within those 12 pages, because the limit covers output, not just input.
This means that a major data analysis project that requires a lot of learning and exploration is simply not feasible within that limit.
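The arithmetic behind that shared budget is simple, as the sketch below shows. It uses the 12-page figure from the text, while the 3 pages reserved for output are a made-up example value.

```python
TOTAL_PAGES = 12  # combined input + output limit quoted in the text


def pages_left_for_input(expected_output_pages: float,
                         total_pages: float = TOTAL_PAGES) -> float:
    """Input budget is whatever remains after reserving room for the answer."""
    if not 0 <= expected_output_pages <= total_pages:
        raise ValueError("expected output must fit within the total limit")
    return total_pages - expected_output_pages


# Hypothetical example: reserve 3 pages for generated code and explanation.
input_budget = pages_left_for_input(3)
print(input_budget)
```

The longer the answer you expect, the less background you can supply, which is the opposite of what a complex project needs.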
Limitation #6: Soft skills
Last but not least, ChatGPT and other AI chatbots are… just chatbots. Human interaction and soft skills play an important role in data projects, whether that means gaining trust, managing office politics, or interpreting nonverbal communication. These skills are crucial to collaborating with stakeholders and completing a project successfully.
So what now?
As you can see, AI has a number of limitations that prevent it from being a fully competent data analyst. The list above covers just a few of the major ones, and there are many other obstacles when it comes to replacing a data expert. In other words, you don’t have to worry about AI replacing you!
That being said, AI is already having a significant impact on data analysts and data scientists. It may not be perfect, but it already offers incredible value.
Work faster with AI
Writing code, whether it’s Python, SQL, or R, can be time-consuming. These AI tools may not be 100% accurate, but they work well most of the time. Reviewing what they’ve generated is often 10 times faster than writing everything from scratch.
In cases where the AI struggles or often makes mistakes, it may be quicker to start from scratch. In other cases, the massive increase in productivity is worth the occasional debugging effort. The important thing is to experiment with different tools, know their strengths and weaknesses, and integrate them into your workflow accordingly.
What about the future?
Things are progressing extremely quickly, so some of the current limitations won’t necessarily last long. This is especially true now that AI tools are used by so many people, as they learn from their users. These interactions are used to train the models, and there are millions of interactions every day.
ChatGPT has the fastest-growing user base of all time, and it learns from this user base.
With competitors like Claude, Bard and others joining the race, we will surely see massive improvements soon.
Preparing for these changes is simple: keep an eye out for new tools and experiment with them. That way, you’ll know their strengths and weaknesses, and you can make sure you’re using the latest technologies and adapting as they evolve.
On that note, a few tools to watch include:
- BlazeSQL (for SQL databases)
- ChatGPT Advanced Data Analysis (for CSV and other files)
- PandasAI (adds generative AI to the pandas library)
Justus Mulli is a data scientist and founder, with experience in finance, healthcare and e-commerce. He leverages his expertise in data science and AI to implement disruptive AI solutions across various industries and professions.