It’s wise to be brief when asking artificial intelligence tools to mine massive data sets for insights, according to Cornell researcher Immanuel Trummer.
That’s why Trummer, an associate professor of computer science in Cornell’s Ann S. Bowers College of Computing and Information Science, has developed a new computing system, called Schemonic, that reduces the cost of using large language models (LLMs) like ChatGPT and Google Bard. Schemonic combs through large data sets and generates what amounts to a “CliffsNotes” version of the data’s structure that the models can understand, cutting the cost of using LLMs by up to tenfold, Trummer said.
“The monetary costs associated with using large language models are not insignificant,” said Trummer, the author of “Generating Succinct Descriptions of Database Schemata for Cost-Efficient Prompting of Large Language Models,” which was presented at the 50th International Conference on Very Large Data Bases (VLDB), held August 26-30 in Guangzhou, China. “I think this is a problem that everyone who uses these models faces.”
LLMs are the powerful algorithms that underpin generative AI. They have advanced to the point where they can crunch large data sets and show, through the computer code they generate, where to find patterns and insights in the data. Even those without a technical background can leverage these tools, Trummer said.
But getting LLMs to understand and process large data sets is difficult and potentially expensive, because the companies behind these models charge processing fees based on the number of individual “tokens” — words and numbers — within a data set. A large data set can contain billions of tokens or more, and the fees accrue each time users query the LLM, Trummer said.
“If you have hundreds of thousands of users all asking lots of questions about your dataset, you pay the price of repeatedly reading the data description for each request,” said Trummer, whose research explores how to make data analysis more efficient and user-friendly. “The costs can quickly add up.”
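To see why the costs add up, consider how token-based pricing scales with both prompt length and request volume. The sketch below is a hypothetical back-of-the-envelope illustration; the per-token price is an assumed placeholder, not any provider’s actual rate.

```python
# Hypothetical illustration: with token-based pricing, the cost of a
# prompt scales with its length and with how often it is resent.
PRICE_PER_MILLION_TOKENS = 10.0  # assumed placeholder price, in dollars

def prompt_cost(num_tokens: int, num_requests: int) -> float:
    """Total cost of sending a num_tokens-long data description
    once per request, at the assumed rate above."""
    return num_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS * num_requests

# A verbose 50,000-token data description sent with 100,000 queries:
verbose_total = prompt_cost(50_000, 100_000)
# The same workload with a 5,000-token compressed description:
compact_total = prompt_cost(5_000, 100_000)
print(f"verbose: ${verbose_total:,.0f}, compact: ${compact_total:,.0f}")
```

Under these assumed numbers, shrinking the description tenfold shrinks the description-related bill tenfold, consistent with the savings Trummer reports.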
The key is to provide the LLM with concise instructions, in as few tokens as possible, about what the dataset contains and how it is organized, he said.
That’s where Schemonic comes in. Its abbreviated descriptions of database structure are enough for LLMs to do their magic at a fraction of the cost, he said.
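As a rough illustration of what “abbreviated description” means, the snippet below contrasts a verbose SQL definition of a table with a compact one-line summary of the same structure. The compact format shown is invented for this sketch and is not Schemonic’s actual output format; whitespace splitting stands in for real LLM tokenization.

```python
# Illustrative only: two ways to describe the same table to an LLM.
# The compact format is a made-up sketch, not Schemonic's actual output.
verbose_schema = """
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    order_date DATE,
    ship_date DATE,
    total_price DECIMAL
);
"""

compact_schema = "orders(order_id*, customer_id, order_date, ship_date, total_price)"

# Crude token proxy: count whitespace-separated pieces.
print(len(verbose_schema.split()), "vs", len(compact_schema.split()))
```

Both versions tell the model which table and columns exist, but the compact one spends far fewer tokens doing it.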
“Schemonic detects a data structure pattern that can be summarized concisely,” he said. “This approach compresses the structured data in an optimal way to minimize the amount you would have to pay.”
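One simple pattern of this kind: when many columns share a data type, naming the type once for the whole group is shorter than repeating it per column. The toy sketch below shows only that grouping idea under assumed, simplified inputs; it is not Schemonic’s algorithm, which searches for a cost-minimizing encoding rather than applying one fixed rule.

```python
# Toy sketch (assumed, simplified): columns that share a type are
# grouped so each type name appears once in the description.
from itertools import groupby

columns = [
    ("order_id", "int"), ("customer_id", "int"), ("item_id", "int"),
    ("order_date", "date"), ("ship_date", "date"),
]

def compress(cols):
    """Group consecutive columns by type, emitting each type name once."""
    parts = []
    for col_type, group in groupby(cols, key=lambda c: c[1]):
        names = ",".join(name for name, _ in group)
        parts.append(f"{col_type}({names})")
    return " ".join(parts)

print(compress(columns))
# → int(order_id,customer_id,item_id) date(order_date,ship_date)
```

Five column-type pairs collapse into two groups, and the saving grows with the number of columns sharing each type.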
There is often a quality tradeoff when compressing information, but the descriptions generated by Schemonic are guaranteed to be semantically correct, Trummer said. Additionally, state-of-the-art LLMs like OpenAI’s GPT-4 model can understand Schemonic’s abbreviated descriptions without any negative impact on the quality of their output, he said.
“LLMs are used in many data analysis cases, from translating questions about data into formal queries, to extracting tabular data from text, to finding semantic relationships between different data sets,” Trummer said. “All of these cases require you to describe the structure of the data to the LLM, which is why Schemonic helps you save money in all of these use cases.”
Louis DiPietro is a writer at the Cornell Ann S. Bowers College of Computing and Information Science.