If data is the new oil, a London-based startup is vying to become the equivalent of the New York Mercantile Exchange: a marketplace where AI companies seeking data to train their models can strike deals with publishers and other companies that have data to sell.
The startup, called Human Native AI, recently hired a number of prominent former Google executives experienced in striking content licensing agreements and partnerships, as well as senior lawyers experienced in intellectual property and copyright matters.
To date, the companies building the large language models (LLMs) that have fueled the generative AI revolution have mostly harvested data for free by scraping the public internet, often with little regard for copyright.
But there are signs that this era is quickly coming to an end. In the United States, a number of lawsuits accusing AI companies of violating copyright law by training AI models on material taken from the internet without permission are making their way through the courts. While it’s possible that judges will rule that such activity qualifies as “fair use,” companies creating AI models would prefer not to risk being stuck in court for years.
In Europe, the EU’s new AI Act requires companies to disclose whether they trained AI models on copyrighted material, which could also open them up to legal action.
AI companies have already struck deals with major publishers and news organizations to license data, both for training and to ensure their models have access to accurate, up-to-date information. OpenAI signed a three-year licensing agreement with publisher Axel Springer, owner of Business Insider, Politico, and a number of German publications, reportedly worth “tens of millions of dollars.” It has also signed agreements with the Financial Times, The Atlantic, and Time. Google has similar agreements with many publishers. Fortune has a licensing agreement with generative AI startup Perplexity.
Startups may also struggle to obtain commercial insurance if their data collection practices expose them to potential legal risk, which further incentivizes many of these companies to license the data they need.
Scraping data is also becoming more difficult from a technical standpoint, as many companies have begun using technical measures to block bots from harvesting their data. Some artists have also started applying special digital masks to images they post online, which can corrupt AI models trained on that data without permission.
Furthermore, the largest language models – the type of AI that powers OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude – have already ingested the entirety of the publicly available data on the internet. Meanwhile, training smaller, more efficient AI models, especially those designed for specialized purposes – such as helping lawyers draft specific types of contracts, scientists design new drugs, or engineers create plans – requires curated datasets containing high-quality information relevant to the task. Very little specialized data of this kind is available on the public internet, so it can generally only be obtained through licensing agreements.
That’s why James Smith, a veteran Google and Google DeepMind engineer and product manager, decided to co-found Human Native with Jack Galilee, a software engineer who worked on machine learning systems at the medical technology company Grail. “We wondered why there wasn’t an easy way for companies to acquire the data they needed to train AI models,” said Smith, now CEO of Human Native.
Even when AI companies wanted to obtain data ethically and legally, he says, it was often difficult for them to know who had what data, and then to know who to approach at that company to strike a data licensing agreement. The time currently required to negotiate such deals can also be an obstacle for developers of fast-moving AI models – some believe that if they took the time to do the right thing, they would risk falling far behind their competitors commercially, he said.
Human Native intends to be a digital marketplace that lets those who need data for AI systems easily connect with those who have it and reach an agreement using relatively standardized legal contracts. In June, it raised a $3.6 million funding round led by London-based venture capital firms LocalGlobe and Mercuri to begin realizing this vision. It also counts among its advisors entrepreneur, AI developer, and musician Ed Newton-Rex, who led the audio team at genAI company Stability AI but has since become a prominent critic of AI companies’ disregard for copyright.
The startup is one of a handful of companies offering data brokerage services. And even Human Native is only in the early stages of establishing its marketplace, with a beta version of the platform currently available to select customers. Human Native plans to make money in several ways, including taking a commission on the deals it brokers and offering tools to help clients clean datasets and implement data governance policies. The company has not revealed whether it is currently generating revenue from its nascent platform.
Others already offer data for sale to AI companies, including Nomad Data and data analytics platform Snowflake. But Human Native may soon face more competition. For example, Matthew Prince, founder and CEO of internet infrastructure company Cloudflare, has talked about creating a similar marketplace for AI data.
To function, Human Native must build a critical mass of buyers and sellers on its platform and create those standardized contract terms. This is where the startup’s recent hiring of seasoned experts from the worlds of digital partnerships and intellectual property law comes into play.
The recruits include Madhav Chinnappa, who spent a decade in the BBC’s rights and development department and then 13 years at Google leading the search giant’s partnerships with news organizations, and who is now vice president of partnerships at Human Native; Tim Palmer, a veteran of Disney and Google, where he also spent 13 years working primarily on product partnerships, and who now advises Human Native on partnerships and business development; and Matt Hervey, a former partner at international law firm Gowling WLG, who co-chaired the American Intellectual Property Law Association’s AI subcommittee and edited a new book on AI legal issues. Hervey is now Human Native’s legal and policy lead.
Palmer and Chinnappa were both laid off from Google in its sweeping summer 2024 round of layoffs, underscoring how the tech giant’s belt-tightening has cost it experienced employees who are now helping to build a new generation of startups.
“Human Native is focused on what is perhaps the most interesting problem in technology right now,” Palmer told me, explaining why he wanted to help build the burgeoning data marketplace. He said that while the lawsuits represent one attempt to establish rules for how AI companies can use data, commercial licensing is a more productive approach.
Palmer said his experience acquiring content at Google means he has “a good idea of what’s out there, who has what content and who the professional licensors are, as well as a good idea of what is acceptable and what is not” when it comes to licensing terms.
Chinnappa said he sees Human Native as helping to level the playing field, especially for smaller publishers and rights holders, who he said might otherwise be excluded from any deals with AI companies.
“I helped write the playbook for this when I was at Google, and what you do, if you’re Google, OpenAI, Anthropic, Meta, or one of the other big AI model companies, is close a minimum number of big deals with the big media companies,” he said.
Human Native may be able to help smaller publishers find ways to monetize their data by helping them pool data from multiple publishers into packages large enough, or tailored enough, to be of interest to AI model builders, he said.
Hervey said Human Native could play a major role in helping establish standards and standardized contracts for AI data licensing. “The broader aspect here is not about the law but about market practices and the incredible opportunity we have to influence market practices,” he said.
Palmer said it will take time for Human Native to build a technology platform that makes purchasing data for AI models truly seamless. “It’s not eBay yet,” he said. “It’s not a hands-off proposition.”
For now, as Human Native’s staff work to source datasets for AI companies, they are still building toward the critical mass of buyers and sellers the platform needs to operate. And once it facilitates a match between a data seller and an AI model company, the startup’s staff also have to do a lot of work with both sides to help them close a deal.
Hervey said some business terms will always be bespoke, and Human Native wants to be able to support those custom licensing deals even as it works to standardize terms.