The core models that form the foundation of much-vaunted generative AI tools are data-intensive. If companies want to stand out, they need to feed these models proprietary information, including customer and company data. But doing so can expose that sensitive information to the outside world, and to the bad actors operating within it, and potentially put the company in breach of the General Data Protection Regulation (GDPR).
Sharon Richardson, CTO and head of AI at engineering firm Hoare Lea, sums it up: “From day one, these models were very different from a security perspective. It’s hard to build security into the neural network itself because its strength comes from collecting millions of documents. It’s not a problem we’ve solved.”
The Open Worldwide Application Security Project (OWASP), a nonprofit foundation that works to improve cybersecurity, cites data leakage as one of the biggest threats to the large language models (LLMs) that underpin most GenAI technologies. The risk came to public attention last year when Samsung employees accidentally leaked sensitive corporate information via ChatGPT.
Review your input data carefully
Protecting data takes on new meaning with the latest GenAI tools, because it is difficult to control how information is processed once it enters a model. Training data can be exposed when these systems organize unstructured material. That’s why some companies are focusing their efforts on securing inputs. Swiss menswear company TBô, for example, carefully labels and anonymizes customer information before feeding it into its model.
“You want to make sure your AI doesn’t know things it’s not supposed to know,” advises Allan Perrottet, the company’s co-founder. “If you don’t prepare your data properly and you feed it directly to OpenAI or Gemini or any of these tools, you’re going to have problems.”
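As an illustration, that kind of preparation can be as simple as tokenizing direct identifiers and scrubbing free text before anything leaves the company. The Python sketch below shows one possible approach; the salt handling, field names and regex are illustrative assumptions, not TBô’s actual pipeline.

```python
import hashlib
import re

# Minimal sketch of preparing a customer record before it reaches an external
# GenAI API. The salt, field names and regex are illustrative assumptions.

SALT = "keep-this-secret-outside-the-codebase"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a stable, non-reversible token."""
    digest = hashlib.sha256((SALT + value).encode()).hexdigest()[:12]
    return f"cust_{digest}"

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def scrub_free_text(text: str) -> str:
    """Strip obvious direct identifiers from free-text fields."""
    return EMAIL_RE.sub("[email removed]", text)

record = {
    "email": "jane.doe@example.com",
    "feedback": "Reach me at jane.doe@example.com about sizing.",
}
prepared = {
    "customer_id": pseudonymize(record["email"]),
    "feedback": scrub_free_text(record["feedback"]),
}
print(prepared)  # only the prepared version is sent to the model provider
```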
Smart companies are taking a multi-pronged approach to managing the risk. One measure is restricting access to specific GenAI tools, so that only authorized individuals can view outputs derived from classified data. Another control is differential privacy, a statistical technique that allows aggregated data to be shared while protecting individual privacy. Companies can also feed pseudonymized, encrypted or synthetic data into their models, using tools that can effectively randomize datasets.
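To make the differential-privacy idea concrete, the classic building block is the Laplace mechanism, which adds calibrated noise to an aggregate before it is shared. The sketch below assumes a simple count query with illustrative epsilon and sensitivity values; it is not a production privacy policy.

```python
import numpy as np

# Minimal sketch of the Laplace mechanism, a standard way to release an
# aggregate statistic with differential privacy. The epsilon and sensitivity
# values are illustrative, not a recommended policy.

def private_count(items, epsilon: float = 1.0, sensitivity: float = 1.0) -> float:
    """Return a count with Laplace noise scaled to sensitivity / epsilon."""
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return len(items) + noise

purchases = ["order-101", "order-102", "order-103", "order-104"]
# The noisy aggregate, not the raw purchase records, is what gets shared.
print(round(private_count(purchases), 1))
```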
Data minimization is key, says Pete Ansell, CTO of IT consultancy Privacy Culture.
“Never feed more data into the large language model than you need,” he advises. “If you don’t have really mature data management processes, you won’t know what you’re sending to the model.”
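In practice, data minimization can be enforced with an explicit allow-list, so that unapproved fields never reach the prompt at all. The sketch below uses hypothetical field names.

```python
# Minimal sketch of data minimization before a prompt is assembled: an
# explicit allow-list decides which fields may reach the model at all.
# The field names are hypothetical.

ALLOWED_FIELDS = {"ticket_id", "product", "issue_summary"}

def minimize(record: dict) -> dict:
    """Drop every field that is not explicitly approved for the model."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}

ticket = {
    "ticket_id": "T-1042",
    "product": "base-layer tee",
    "issue_summary": "Seam came loose after two washes.",
    "customer_name": "Jane Doe",        # never leaves the company
    "home_address": "12 Example Lane",  # never leaves the company
}

prompt = f"Draft a reply to this support ticket: {minimize(ticket)}"
print(prompt)
```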
Retrieval-augmented generation
It is also important to understand the attack surface that an LLM can expose, which is why retrieval-augmented generation (RAG) is gaining popularity. This is a process in which LLMs reference authoritative data held outside their training sources before generating a response.
RAG users do not share large amounts of raw data with the model itself. Access goes through a secure vector database, a specialized store for the high-dimensional embeddings that represent documents. A RAG system retrieves sensitive information only when it is relevant to a query; it does not vacuum up countless data points.
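Conceptually, the retrieval step looks something like the sketch below: only the snippets most relevant to a query are pulled from the store and handed to the model. The toy embedding function and in-memory list stand in for a real embedding model and vector database. Because the index lives with the company, swapping in a different LLM provider leaves the data itself untouched.

```python
import numpy as np

# Minimal sketch of the retrieval step in a RAG pipeline. The letter-frequency
# "embedding" and the in-memory list stand in for a real embedding model and
# vector database; both are illustrative.

def embed(text: str) -> np.ndarray:
    """Toy embedding: normalized letter-frequency vector."""
    vec = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

documents = [
    "Retention policy: customer records are deleted after five years.",
    "Office access requires a badge between 7am and 7pm.",
    "Refunds are processed within fourteen days of a return.",
]
index = [(doc, embed(doc)) for doc in documents]  # stands in for the vector store

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return only the k documents most relevant to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: float(item[1] @ q), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# Only the matching snippet, not the whole corpus, is passed to the LLM.
print(retrieve("How long until a refund arrives?"))
```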
“RAG is a very attractive solution from a data security and intellectual property protection perspective, as the company retains the data and the library of information that the LLM references,” Ansell explains. “This is a double benefit, as it ensures that your strategic assets are kept closer to home.”
But he adds that “best practices for personally identifiable information and cybersecurity should also apply to enterprise-level data.”
These techniques not only protect sensitive content from cybercriminals; they also allow companies to transfer knowledge gained from one LLM to another, since in practice it is not possible to trace the source of data once it has been absorbed into a model.
There is no doubt that the data security challenges posed by LLM training come down to data maturity: managing information assets with the utmost integrity. In many ways, the issues surrounding GenAI are GDPR compliance challenges on steroids.
“If GDPR is the big stick, the race to use AI is a big carrot,” Ansell says.
Other steps a company can take to improve the security of its AI-related data include creating a multidisciplinary steering group, conducting impact assessments, providing AI awareness training, and keeping humans in the loop on all aspects of model development.
Is open source the answer?
One of the biggest challenges facing the industry is that sensitive corporate data still has to leave local servers to be processed in cloud data centers owned by the tech giants that control most of the popular AI tools.
“For a brief moment, the data could be stored on a server that you don’t control, which is a potential security breach. There’s always a weakness there,” Richardson says. “The reality is that we’re still in the Wild West phase of GenAI. There will be unintended consequences. You may think you have it under control, but you probably don’t.”
This is why open-source models are increasingly popular. They allow IT teams to independently audit LLMs, spot security vulnerabilities, and have them fixed by a community of developers.
Yash Raj Shrestha, assistant professor in the Department of Information Systems at the University of Lausanne, believes that open-source AI is “safer and more reliable than closed-source AI. Because when things are open, a lot of people can work together to find bugs, which can then be fixed. That’s the future.”