The Growing AI Battle Over Online Data Access
A new battle is brewing between artificial intelligence (AI) companies and newspaper publishers. The bone of contention is the use of digital news stories, a vital resource, to train AI models like OpenAI's ChatGPT.
When AI meets Journalism: A tussle for rights and revenues.
The AI-Publisher Standoff
For years, tech companies like OpenAI have freely used news stories to build data sets that teach their machines how to recognize and respond fluently to human queries about the world. However, newspaper publishers and other data owners now demand a share of the revenue from the rush to develop cutting-edge AI models. The market for generative AI is projected to reach $1.3 trillion by 2032.
Since August, at least 535 news organizations, including the New York Times, Reuters, and The Washington Post, have installed a blocker that prevents their content from being collected and used to train ChatGPT. Publishers' current discussions with OpenAI focus primarily on payments to publishers so that ChatGPT can surface links to individual news stories in its responses. Such an arrangement would pay newspapers directly and could increase traffic to their websites.
Other Data Sources Seek Compensation
It's not just newspapers that are seeking compensation. Reddit, the popular social message board, has discussed being paid for its data with top generative AI companies. If an agreement isn't reached, Reddit is considering making its content accessible only after logging in, which would be a first for the platform.
Moreover, in April, Elon Musk began charging $42,000 for bulk access to posts on Twitter, following his claim that AI companies had illegally used the data to train their models. This move reflects a growing urgency and uncertainty about who profits from online information, especially with generative AI transforming how users interact with the Internet.
The Impact on Content Providers
Generative AI's rapid growth is already impacting content providers. For instance, a month after OpenAI launched GPT-4, Stack Overflow, a coding community, witnessed a 15% decline in traffic as programmers turned to AI for answers to their coding questions. This week, the company laid off 28% of its staff.
Leading AI firms also face copyright lawsuits from individual book authors, artists, and software coders seeking damages for infringement and a share of profits. The decision by OpenAI and other tech companies to negotiate may reflect a desire to strike deals before courts can weigh in on whether they have a clear legal obligation to license and pay for content.
The Cost of Building AI
Building generative AI is expensive, with every component, from hardware to computing power, costly and difficult to acquire. So far, the only free and easy part has been the data. That is changing as tech companies face having to pay for the data they use, a reality they were previously reluctant to accept.
At a listening session on generative AI hosted in April by the U.S. Copyright Office, a lawyer representing the Silicon Valley venture capital firm Andreessen Horowitz acknowledged that these tools could only exist if they could be trained on massive amounts of data without licensing that data.
The Future of Data Access
The landscape of data access for AI companies is changing. More and more sites are developing or have launched paid portals for AI companies seeking training data. The media conglomerate IAC, which owns The Daily Beast, is trying to build a coalition of publishers to win billions of dollars from AI companies through a lawsuit or legislative action.
However, in this evolving climate, the data holders best positioned to make deals are still companies accustomed to asserting their intellectual property rights. It remains to be seen how this battle for online data access between AI companies and publishers will play out, and what it will mean for the AI industry and content providers.