Joe makes a call from a payphone. It costs him 60 cents for each minute of the call. After 10 minutes, the price drops to 50 cents per minute. How much would a 30-minute call cost him?
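For reference, the intended calculation is simple two-rate arithmetic. The short Python sketch below (mine, not part of the test) spells it out:

```python
# Worked answer to the payphone problem: the first 10 minutes are billed at
# 60 cents each, the remaining 20 minutes at 50 cents each.
first_rate, later_rate = 0.60, 0.50   # dollars per minute
cutoff, total_minutes = 10, 30
cost = cutoff * first_rate + (total_minutes - cutoff) * later_rate
print(f"${cost:.2f}")  # $16.00
```

A 30-minute call therefore costs $16.00, a figure worth keeping in mind for the variations discussed below.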
Questions like these are part of a series of arithmetic tests used in U.S. elementary schools, typically aimed at children ages 10 to 11. Mathematical reasoning is central to problem solving, and it can therefore be used to measure the capabilities of an artificial intelligence (AI).
The Grade School Math 8K suite (GSM8K) has become a popular benchmark for AI large language models (LLMs), such as those behind ChatGPT. The suite contains 8,500 problems like the one above, split into a set used to train an LLM and a held-out set used to test it. OpenAI’s latest LLM behind ChatGPT, GPT-4o, scored 92.5% on GSM8K, while Google’s Gemini 1.5 Pro scored 91.7%. A smaller LLM with fewer parameters, Microsoft’s Phi-3-small, nevertheless achieved an impressive 88.5%.
However, a recent paper by six Apple researchers found significant weaknesses in the reasoning ability of 22 different cutting-edge LLMs, including those mentioned above. A simple name change (for example, from “Joe” to “Dave” in the problem above), leaving the rest of the test question completely unchanged, can lead an LLM to give a different answer. This is clearly surprising and would not be expected from a student with genuine mathematical understanding.
The fragility of the LLMs the researchers examined was even more pronounced when the numbers in the test problems were changed, rather than just the names.
For example, changing the base rate of the telephone call in the test above from 60 cents per minute to 70 cents per minute, along with similar numerical changes in the rest of the test problems, led to greater variance in the accuracy of responses. The researchers concluded that LLMs do not perform formal reasoning and hypothesized that they instead do their best to match patterns seen in their training problems.
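These perturbations amount to treating each test question as a template whose names and numbers can be varied while the ground-truth answer is recomputed. As a rough illustration only (this is not the researchers’ actual tooling), a variant generator for the payphone problem might look like this:

```python
import random

# Turn the payphone problem into a template: swap names and rates, and
# recompute the ground-truth answer for each generated variant.
TEMPLATE = ("{name} makes a call from a payphone. It costs him {rate1} cents for each "
            "minute of the call. After {cutoff} minutes, the price drops to {rate2} cents "
            "per minute. How much would a {total}-minute call cost him?")

def make_variant():
    name = random.choice(["Joe", "Dave", "Maria", "Chen"])
    rate1 = random.choice([60, 70, 80])   # base rate, in cents
    rate2 = random.choice([40, 50])       # reduced rate, in cents
    cutoff, total = 10, 30
    question = TEMPLATE.format(name=name, rate1=rate1, rate2=rate2, cutoff=cutoff, total=total)
    answer = (cutoff * rate1 + (total - cutoff) * rate2) / 100   # ground truth, in dollars
    return question, answer

question, answer = make_variant()
print(question)
print(f"Expected answer: ${answer:.2f}")
```

A student who actually understands the arithmetic should be indifferent to which variant is drawn; the Apple results suggest current LLMs are not.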
Even more intriguing, removing clauses or adding new ones had a significant impact on LLM performance. For example, removing the clause specifying a reduction in the call price after 10 minutes in the test problem above, or adding a new clause granting a 5% reduction for calls costing more than $10, often changed whether the models answered correctly.
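For the payphone problem, both clause changes have a well-defined effect on the correct answer; a quick check (again assuming the 30-minute call from the original problem):

```python
# Ground-truth answers for the two clause variants described above.
flat_cost = 30 * 0.60                     # discount-after-10-minutes clause removed: $18.00
base_cost = 10 * 0.60 + 20 * 0.50         # original two-rate cost: $16.00
with_rebate = base_cost * 0.95 if base_cost > 10 else base_cost   # 5% off calls over $10: $15.20
print(flat_cost, base_cost, with_rebate)  # 18.0 16.0 15.2
```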
The researchers noted that as test problems were made harder by adding more clauses, LLM performance deteriorated rapidly. They posited that pattern finding and matching becomes much more difficult for LLMs as problem difficulty increases, reinforcing their suggestion that genuine mathematical reasoning is not actually taking place.
In addition to changing the problems’ values and complexity, the researchers also tried adding clauses that seem relevant but are in fact completely inconsequential. For example, the phone call problem above might gain a clause stating that phone call prices were 10% cheaper last year, even though the question still asks only about the current cost of Joe’s call. LLMs, however, often apply the discount anyway. In these scenarios, the researchers observed catastrophic performance declines in all LLMs tested, which they tentatively attributed to the models’ overreliance on patterns from their training problems.
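Numerically, the trap is easy to see. Under the assumptions of the original problem, the irrelevant clause leaves the correct answer untouched, while naively applying the 10% discount produces a plausible-looking but wrong figure:

```python
# The "last year prices were 10% cheaper" clause changes nothing about today's call.
correct = 10 * 0.60 + 20 * 0.50   # still $16.00
trapped = correct * 0.90          # $14.40 -- what a model gets if it wrongly applies the 10%
print(correct, trapped)           # 16.0 14.4
```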
The researchers concluded: “Ultimately, our work highlights important limitations in the ability of LLMs to perform true mathematical reasoning. The high variance in LLM performance on different versions of the same question, their substantial decline in performance with a slight increase in difficulty, and their sensitivity to inconsequential information indicate that their reasoning is fragile. It may look more like sophisticated pattern matching than true logical reasoning.”
The text responses of ChatGPT and other LLMs captured the attention of the public and investors because they gave the impression of genuinely understanding the world. In practice, it appears that these models have grown so large that they absorb more information from their training data than any individual human could know or remember, and then recombine it in different ways. With enough input and training data, which requires considerable investment and energy, an LLM can give an illusion of intelligence, but it remains inherently limited in high-level reasoning and lacks a coherent conceptual model of the world.
One of the most influential figures in computing today is Linus Torvalds, the creator of the widely used Linux operating system. He recently said that while he finds AI genuinely interesting, he intends to ignore it for now. He observed that the entire tech industry around AI is 90% marketing and 10% reality, and that “in five years, things will change, and at that point we’ll see what AI is used for in real everyday workloads.”
I agree with him. Current LLMs are useful in text analysis and search, and can also produce stunning images and videos, but their true commercial impact has not yet been proven.