Janco van Niekerk, data scientist at KID.
Artificial intelligence (AI) – and more specifically large language models (LLMs) and generative AI (GenAI) – is the next big disruptor in the world of data science.
Although LLMs like ChatGPT have yet to be widely adopted in data science in South Africa, the technology is already making waves, as organisations and their data science teams explore its potential and use cases.
The future will undoubtedly see disruption and the way we work will change dramatically. Data scientists will need to learn to master the tools and ask the right questions about the technology, while staying on top of new approaches and strategies.
While LLM technology is remarkable, it is not yet effective in all areas. It is prone to hallucinations and does not excel at coding complex systems with many integrations. For high-risk decision-making and sensitive customer engagements, it cannot be allowed to run unsupervised.
This could mean that organisations will need to build machine learning models to “supervise” some of the LLM responses. And before they even consider these measures, they will need to ensure that their data will support the LLM and machine learning (ML) tools they deploy.
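To make the "supervision" idea concrete, here is a minimal sketch of screening an LLM response before it reaches a customer. The risk model is a trivial keyword heuristic standing in for a trained ML classifier; all names (flag_response, RISKY_TERMS) are illustrative, not from any real library.

```python
# Sketch: gate LLM responses behind a supervising model before release.
# risk_score is a placeholder for a trained ML classifier.

RISKY_TERMS = {"guarantee", "refund", "legal", "diagnosis"}

def risk_score(response: str) -> float:
    """Placeholder for a trained classifier: fraction of risky terms present."""
    words = set(response.lower().split())
    return len(words & RISKY_TERMS) / len(RISKY_TERMS)

def flag_response(response: str, threshold: float = 0.25) -> bool:
    """Return True if the LLM response should be routed to a human reviewer."""
    return risk_score(response) >= threshold

print(flag_response("We guarantee a full refund under our legal policy."))  # True
print(flag_response("Thanks for reaching out, happy to help."))             # False
```

In a real deployment the score would come from a model trained on examples of risky and safe responses, but the gating pattern is the same.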
AI-powered analytics tools will most likely become popular in the coming years. AI code-assist tools will make it faster to build and execute queries that retrieve relevant information from structured data. This will also democratize data literacy by giving non-experts convenient access to data through natural language queries.
Adopting GenAI LLMs could enable a company to easily access, process, and interpret structured and unstructured data. For example, an employee can ask such a system a question, and an LLM can build and run a database query to retrieve quantitative information and combine it with meeting notes to draw conclusions about the correlation between certain actions and key performance indicators.
In another example, the company could use an LLM to query quantitative sales information in a table and combine it with qualitative information, such as notes from all sales meetings, to map how weekly sales change as new sales strategies are implemented. LLMs can combine quantitative and qualitative information in a user-friendly way, something that was nearly impossible in the past.
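The sales example above can be sketched in a few lines. Here the LLM call is stubbed out (generate_sql is a hypothetical placeholder for a model-generated query), and the table and notes are invented for illustration.

```python
# Sketch: an LLM turns a natural-language question into SQL, and the result
# is joined with qualitative meeting notes. The LLM call is stubbed.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weekly_sales (week INTEGER, revenue REAL)")
conn.executemany("INSERT INTO weekly_sales VALUES (?, ?)",
                 [(1, 100.0), (2, 120.0), (3, 150.0)])

# Qualitative side: notes from sales meetings, keyed by week.
meeting_notes = {2: "Launched bundle pricing strategy.",
                 3: "Extended bundle pricing to all regions."}

def generate_sql(question: str) -> str:
    # In practice an LLM would produce this query from the question.
    return "SELECT week, revenue FROM weekly_sales ORDER BY week"

rows = conn.execute(generate_sql("How are weekly sales changing?")).fetchall()
for week, revenue in rows:
    note = meeting_notes.get(week, "no notes")
    print(f"week {week}: revenue={revenue} | {note}")
```

The interesting part is the join of the two worlds: the structured query result and the unstructured notes land in one answer for the user.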
This is a new paradigm in analytics that can deliver immense value – although businesses will still need to address many complexities.
AI/ML can be used in many different contexts. One of them is the automation of tasks that are traditionally done by humans and could not be automated using a traditional software approach.
Automation of these human tasks using AI can be done in a variety of business functions, but the technology function is often automated first. While professionals in other industries fear that their roles will be automated by AI, developers themselves have a “we’ll happily automate our own roles” attitude.
Technology professionals are typically the people most familiar with new technologies and can easily identify opportunities to increase their productivity and efficiency.
This is evident when using AI code assistance tools and even ChatGPT, where the focus is clearly on providing answers with well-formatted code snippets and correct syntax.
GenAI tools can also be used for data labeling without being explicitly trained to perform this labeling task. This is a task traditionally performed by humans, which requires a lot of effort and time.
Data labeling allows businesses to extract more relevant information from their data, which can then be used to make optimal decisions. Automating this process means that data labeling will become much more cost-effective and efficient.
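A zero-shot labeling workflow can be sketched as follows. The model call is stubbed out here (classify_with_llm simulates a response with a trivial heuristic); in practice it would go to an LLM API, and the labels and messages are invented examples.

```python
# Sketch: zero-shot data labeling with a GenAI model. The LLM is not
# trained on this task; it is simply prompted with the label set.

LABELS = ["complaint", "praise", "question"]

def build_prompt(text: str) -> str:
    """Assemble a zero-shot classification prompt for one message."""
    return (f"Classify the following customer message as one of "
            f"{', '.join(LABELS)}.\nMessage: {text}\nLabel:")

def classify_with_llm(prompt: str) -> str:
    # Stub standing in for a real LLM call; a keyword heuristic here.
    message = prompt.split("Message:")[1]
    if "?" in message:
        return "question"
    if "love" in message.lower():
        return "praise"
    return "complaint"

for text in ["I love this product!", "Where is my order?", "The app crashes."]:
    print(text, "->", classify_with_llm(build_prompt(text)))
```

The point is the shape of the pipeline: raw text in, prompt with the label set, label out – with no task-specific training step.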
Leveling the playing field
Machine learning research has long focused on improving algorithms. Some notable researchers, such as Andrew Ng, have argued for a more “data-centric” approach to machine learning and provided compelling arguments for why better data equals better models.
This means that instead of focusing on improved algorithms, we should focus on improved data (which can then be used to train these different algorithms).
The downside is that larger, more established companies with better data have an asymmetric competitive advantage over smaller companies with poorer data.
However, with the use of GenAI and data labeling, the available data can be enriched, meaning that machine learning algorithms can greatly benefit from this additional information.
This levels the playing field for newer and smaller companies and means that the accuracy of ML models now also depends on the ability to “ask the right questions”, label the data, and train the models using this approach. It works by extracting information that is “contained within” pre-trained LLMs in a desired format.
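The loop described above – LLM-produced labels feeding a conventional ML model – can be sketched end to end. The labels below would, in this scenario, come from an LLM; the tiny naive Bayes classifier is a toy stand-in for a real model, and all data is invented.

```python
# Sketch: train a small bag-of-words classifier on LLM-labeled text.
from collections import Counter, defaultdict
import math

# Training pairs whose labels would come from zero-shot LLM labeling.
llm_labeled = [
    ("great service fast delivery", "positive"),
    ("love the new dashboard", "positive"),
    ("order arrived broken", "negative"),
    ("support never replied", "negative"),
]

word_counts = defaultdict(Counter)
label_counts = Counter()
for text, label in llm_labeled:
    label_counts[label] += 1
    word_counts[label].update(text.split())

def predict(text: str) -> str:
    """Naive Bayes with add-one smoothing over the toy vocabulary."""
    vocab = {w for counts in word_counts.values() for w in counts}
    best, best_lp = None, float("-inf")
    for label in label_counts:
        lp = math.log(label_counts[label] / sum(label_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(predict("delivery was great"))  # positive
```

The classifier itself is deliberately simple; the enrichment comes from the labels, which is exactly the data-centric argument.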
As more text data is generated, businesses will leverage various natural language processing techniques in their operations. As technology improves, natural language processing techniques become easier to implement and more effective for certain tasks.
Technologies like chatbots, sentiment analysis, and vector databases are simplifying the process of leveraging text data in business processes. Chatbots on corporate websites are proliferating, and I expect this trend to continue and improve as the underlying technology develops.
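The vector-database idea behind many of these tools can be illustrated in miniature: texts become vectors, and a query is matched to the most similar one. Real systems use learned embeddings from a language model; the bag-of-words vectors and documents below are a simplification for illustration.

```python
# Toy illustration of vector-style retrieval over text.
from collections import Counter
import math

docs = ["refund policy for damaged goods",
        "how to reset your account password",
        "store opening hours and locations"]

def vectorize(text: str) -> Counter:
    """Bag-of-words vector; real systems would use learned embeddings."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = vectorize("reset password")
best = max(docs, key=lambda d: cosine(query, vectorize(d)))
print(best)  # the password-reset document scores highest
```

Swap the bag-of-words vectors for embeddings and the list for a vector database, and this is the retrieval step behind many corporate chatbots.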