Do you work with data in Python? If so, I bet most of you use Pandas to manipulate it.
If you don’t know it, Pandas is an open-source Python package built for data analysis and manipulation. It is one of the most widely used packages and usually the first one you learn when you start a data science journey in Python.
So, what is Pandas AI? I assume you are reading this article because you want to know more.
Well, as you know, we are in an era where generative AI is everywhere. Imagine if you could perform data analysis on your data using generative AI; things would be much easier.
This is what Pandas AI brings. With simple prompts, we can quickly analyze and manipulate our dataset without handing the whole dataset to the model (as the custom head section later shows, only a sample of rows is shared with the LLM).
This article explains how to use Pandas AI for data analysis tasks. In the article we will learn the following:
- Setting up Pandas AI
- Data Mining with Pandas AI
- Data Visualization with Pandas AI
- Advanced use of Pandas AI
If you’re ready to learn, let’s do it!
Pandas AI is a Python package that adds Large Language Model (LLM) capabilities to the Pandas API. We can keep using the standard Pandas API, with a generative AI enhancement that turns Pandas into a conversational tool.
We mainly want to use Pandas AI because of the simple workflow it provides. The package can analyze data automatically from a simple prompt, without requiring complex code.
Enough introduction. Let’s move on to concrete things.
First of all, we need to install the package.
pip install pandasai
Next, we need to configure the LLM we want to use for Pandas AI. There are several options, such as OpenAI GPT and HuggingFace. However, we will be using OpenAI GPT for this tutorial.
Defining the OpenAI model in Pandas AI is simple, but you will need an OpenAI API key. If you don’t have one, you can create one on the OpenAI website.
If everything is ready, let’s configure the Pandas AI LLM using the code below.
from pandasai.llm import OpenAI
llm = OpenAI(api_token="Your OpenAI API Key")
You are now ready to perform data analysis with Pandas AI.
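If you would rather not paste the key into your notebook, here is a minimal sketch of one way to read it from an environment variable instead. The helper name read_openai_key and the variable name OPENAI_API_KEY are my own assumptions, not part of Pandas AI; the Pandas AI lines are left as comments so the block runs without a real key.

```python
import os


def read_openai_key(env=os.environ, var_name="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable, or raise if it is missing.

    Hypothetical helper for illustration; the variable name is an assumption.
    """
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Usage with Pandas AI (commented out so this sketch runs without an API key):
# from pandasai.llm import OpenAI
# llm = OpenAI(api_token=read_openai_key())

# Demonstration with a stand-in mapping instead of the real environment
demo_key = read_openai_key(env={"OPENAI_API_KEY": "sk-demo"})
print(demo_key)
```

This keeps the key out of the code you share, and the helper fails early with a clear message when the variable is not set.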
Data Mining with Pandas AI
Let’s start with an example dataset and try data mining with Pandas AI. I will use the Titanic data from the Seaborn package in this example.
import seaborn as sns
from pandasai import SmartDataframe
data = sns.load_dataset('titanic')
df = SmartDataframe(data, config = {'llm': llm})
To start using Pandas AI, we pass the DataFrame to the SmartDataframe object, as in the code above. After that, we can hold a conversation with our DataFrame.
Let’s try a simple question.
response = df.chat("""Return the survived class in percentage""")
response
The percentage of passengers who survived is: 38.38%
From the prompt alone, Pandas AI works out a solution and answers our question.
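Because the answer comes from an LLM, it is worth cross-checking simple results by hand. The sketch below shows the plain Pandas calculation I would expect to sit behind such a prompt. It uses a tiny made-up DataFrame (an assumption for illustration, not the real Titanic data), so the number differs from the output above.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data: 1 = survived, 0 = did not survive
toy = pd.DataFrame({"survived": [1, 0, 0, 1, 0, 0, 1, 0]})

# The mean of a 0/1 column is the share of ones, so multiply by 100 for a percentage
survival_pct = toy["survived"].mean() * 100
print(f"The percentage of passengers who survived is: {survival_pct:.2f}%")
```

Running the same line on the real dataset gives you a quick sanity check on what the model returned.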
We can also ask Pandas AI questions whose answers come back as DataFrame objects. For example, here are several prompts for analyzing the data.
# Data summary
summary = df.chat("""Can you get me the statistical summary of the dataset""")
# Class percentage
surv_pclass_perc = df.chat("""Return the survived in percentage breakdown by pclass""")
# Missing data
missing_data_perc = df.chat("""Return the missing data percentage for the columns""")
# Outlier data
outlier_fare_data = df.chat("""Please provide me the data rows that contains outlier data based on fare column""")
Image by author
You can see from the image above that Pandas AI can return its results as DataFrame objects, even when the prompt is fairly complex.
However, Pandas AI cannot handle overly complex calculations, because the package is limited by the LLM that we pass to the SmartDataframe object. I expect Pandas AI will handle much more detailed analysis as LLM capabilities grow.
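Since the quality of the answer depends on the LLM, it helps to know what a manual version of these prompts could look like. Here is a minimal sketch in plain Pandas for the missing-data and outlier prompts. The toy data is made up, and the 1.5 × IQR rule is my own choice of outlier definition; the model may well pick a different one.

```python
import pandas as pd

# Hypothetical toy data standing in for two Titanic columns
toy = pd.DataFrame({
    "fare": [7.25, 8.05, 13.0, 26.0, 7.9, 512.33, 10.5, 15.0],
    "age": [22.0, None, 35.0, 54.0, None, 36.0, 27.0, 14.0],
})

# Missing data percentage per column: share of NaN values times 100
missing_pct = toy.isna().mean() * 100

# Outlier rows on the fare column using the 1.5 * IQR rule (an assumed definition)
q1 = toy["fare"].quantile(0.25)
q3 = toy["fare"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = toy[(toy["fare"] < lower) | (toy["fare"] > upper)]

print(missing_pct)
print(outliers)
```

If the rows returned by Pandas AI differ from a check like this, the usual reason is that the model used another outlier definition, so it is worth naming the rule in the prompt.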
Data Visualization with Pandas AI
Pandas AI is useful not only for data mining; it can also perform data visualization. As long as we describe the chart in the prompt, Pandas AI will produce the visualization.
Let’s try a simple example.
response = df.chat('Please provide me the fare data distribution visualization')
response
Image by author
In the example above, we ask Pandas AI to visualize the distribution of the Fare column. The result is the bar chart distribution of the dataset.
As with data mining, you can ask for many types of data visualization. However, Pandas AI still cannot handle more complex visualization tasks.
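When a request is too complex for Pandas AI, you can always fall back to plotting by hand. As a rough sketch, this is what a manual fare distribution plot could look like with Matplotlib. The toy values and the choice of 5 bins are assumptions for illustration, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy fares standing in for the Titanic fare column
toy = pd.DataFrame({"fare": [7.25, 8.05, 13.0, 26.0, 7.9, 71.28, 10.5, 15.0]})

fig, ax = plt.subplots()
# hist returns the count per bin, the bin edges, and the drawn bar patches
counts, bin_edges, _ = ax.hist(toy["fare"], bins=5)
ax.set_xlabel("fare")
ax.set_ylabel("count")
ax.set_title("Fare distribution")
```

In a notebook you can drop the backend line and the figure will render inline.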
Here are some more examples of data visualization with Pandas AI.
kde_plot = df.chat("""Please plot the kde distribution of age column and separate them with survived column""")
box_plot = df.chat("""Return me the box plot visualization of the age column separated by sex""")
heat_map = df.chat("""Give me heat map plot to visualize the numerical columns correlation""")
count_plot = df.chat("""Visualize the categorical column sex and survived""")
Image by author
The plots look nice and neat. You can keep asking Pandas AI for more detail if necessary.
Advanced use of Pandas AI
We can use several built-in Pandas AI APIs to improve the Pandas AI experience.
Clearing the cache
By default, all prompts and results from Pandas AI objects are cached in a local directory, which reduces processing time and the number of calls Pandas AI needs to make to the model.
However, this cache can sometimes make the Pandas AI results irrelevant because past results are taken into account. This is why it is recommended to clear the cache from time to time. You can clear it with the following code.
import pandasai as pai
pai.clear_cache()
You can also disable the cache from the start.
df = SmartDataframe(data, config={"enable_cache": False, "llm": llm})
This way no prompts or results are stored from the beginning.
Custom head
It is possible to pass a sample head DataFrame to Pandas AI. This is useful if you don’t want to share some private data with the LLM or just want to provide an example to Pandas AI.
To do this, you can use the following code.
from pandasai import SmartDataframe
import pandas as pd
# head df
head_df = data.sample(5)
df = SmartDataframe(data, config={
    "custom_head": head_df,
    "llm": llm
})
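Note that a head built with data.sample(5) still contains real rows. If the goal is to keep private values away from the LLM, one option is to write the head by hand with made-up values. The sketch below assumes a small hypothetical dataset; the only requirement I am relying on is that the fake head has the same columns and dtypes as the real data.

```python
import pandas as pd

# Hypothetical private data (stand-in for your real DataFrame)
data = pd.DataFrame({
    "name": ["Alice Smith", "Bob Jones", "Cara Lee"],
    "age": [34, 51, 27],
    "fare": [71.28, 8.05, 13.0],
})

# Hand-written head with invented values but the same structure
head_df = pd.DataFrame({
    "name": ["Passenger A", "Passenger B"],
    "age": [30, 40],
    "fare": [10.0, 20.0],
})

# Guard: the fake head must mirror the real columns and dtypes
same_columns = list(head_df.columns) == list(data.columns)
same_dtypes = bool((head_df.dtypes == data.dtypes).all())
print(same_columns, same_dtypes)

# Usage with Pandas AI (commented out so this sketch runs without an API key):
# df = SmartDataframe(data, config={"custom_head": head_df, "llm": llm})
```

The model then sees the column names and realistic-looking values without seeing any actual record.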
Pandas AI Skills and Agents
Pandas AI allows users to register a function as a skill and lets an agent decide when to execute it. For example, the code below builds two different DataFrames and registers an example plot function that the Pandas AI agent can call.
import pandas as pd
from pandasai import Agent
from pandasai.skills import skill

employees_data = {
    "EmployeeID": [1, 2, 3, 4, 5],
    "Name": ["John", "Emma", "Liam", "Olivia", "William"],
    "Department": ["HR", "Sales", "IT", "Marketing", "Finance"],
}
salaries_data = {
    "EmployeeID": [1, 2, 3, 4, 5],
    "Salary": [5000, 6000, 4500, 7000, 5500],
}
employees_df = pd.DataFrame(employees_data)
salaries_df = pd.DataFrame(salaries_data)

# Function doc string to give more context to the model for use of this skill
@skill
def plot_salaries(names: list[str], salaries: list[int]):
    """
    Displays the bar chart having name on x-axis and salaries on y-axis
    Args:
        names (list[str]): Employees' names
        salaries (list[int]): Salaries
    """
    # plot bars
    import matplotlib.pyplot as plt
    plt.bar(names, salaries)
    plt.xlabel("Employee Name")
    plt.ylabel("Salary")
    plt.title("Employee Salaries")
    plt.xticks(rotation=45)
    # Adding count above for each bar
    for i, salary in enumerate(salaries):
        plt.text(i, salary + 1000, str(salary), ha="center", va="bottom")
    plt.show()

# Pass both DataFrames to the agent as a list and register the skill
agent = Agent([employees_df, salaries_df], config={'llm': llm})
agent.add_skills(plot_salaries)
response = agent.chat("Plot the employee salaries against names")
The agent decides whether or not to use the function we registered with Pandas AI.
Combining skills with an agent gives you more controllable results for your DataFrame analysis.
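To make the agent's job more concrete: the names and the salaries live in two different DataFrames, so they have to be joined before the skill can be called. Here is a minimal sketch of that step in plain Pandas. Whether the agent generates exactly this code is an assumption on my part; it is simply the most direct way to get the two lists the skill expects.

```python
import pandas as pd

employees_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "Name": ["John", "Emma", "Liam", "Olivia", "William"],
    "Department": ["HR", "Sales", "IT", "Marketing", "Finance"],
})
salaries_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "Salary": [5000, 6000, 4500, 7000, 5500],
})

# Join the two tables on their shared key so each name lines up with its salary
merged = employees_df.merge(salaries_df, on="EmployeeID")

# These two lists match the arguments of the plot_salaries skill
names = merged["Name"].tolist()
salaries = merged["Salary"].tolist()
print(names)
print(salaries)
```

Passing these lists to plot_salaries by hand should give the same chart, which is a handy way to test a skill before handing it to the agent.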
We learned how easy it is to use Pandas AI to help with our data analysis work. By using the power of LLMs, we can reduce the coding part of data analysis and focus on the critical work instead.
In this article, we learned how to configure Pandas AI, perform data exploration and visualization with it, and use its advanced features. You can do so much more with the package, so visit its documentation to learn more.
Cornellius Yudha Wijaya is Deputy Director of Data Science and Data Editor. While working full-time at Allianz Indonesia, he loves sharing Python and data tips via social media and editorial. Cornellius writes on a variety of topics related to AI and machine learning.