Publishers Unblock OpenAI’s Crawler As TikTok’s Parent ByteDance Boasts A 25x Faster Web Scraper

October 09, 2024

OpenAI has been in the spotlight recently, not just for its impressive $6.6 billion funding round, but also for significant changes to its business model. The investment round, backed by major names like Microsoft, NVIDIA, and Thrive Capital, has brought OpenAI's valuation to a staggering $157 billion. Despite these achievements, OpenAI is facing substantial financial losses this year, projected to be over $5 billion. Nevertheless, the company remains confident in its growth, forecasting a remarkable $11 billion in revenue for next year. This growth is expected to come from its AI products, which rely heavily on data gathered from the web.

Central to this data gathering is OpenAI’s web crawler, GPTBot, which scrapes online content to train the generative AI models behind products like ChatGPT. Many websites initially resisted by blocking the crawler, but recent reports suggest a shift in sentiment: more than 33% of websites once barred GPTBot, and that figure has now dropped to about 25%. Among major news outlets, the block rate has fallen from 90% to 50%.

This change can be partly attributed to new partnerships OpenAI has struck with key publishers, including TIME, News Corp, Reddit, and Condé Nast. These collaborations suggest a growing willingness among some content providers to work with OpenAI, seeing potential value in allowing AI to access their content. However, not every publisher that unblocked GPTBot did so as part of a partnership. For instance, The Onion, a satirical news outlet, unintentionally unblocked the crawler due to a technical glitch. Its CEO, Ben Collins, emphasized that the company had no intention of doing business with OpenAI, referring to its AI tools as a “Plagiarism Machine.”

Whether to allow web crawlers like GPTBot access to content is a complex question. For AI companies like OpenAI, the data gathered through these crawlers is critical for improving products like ChatGPT. Publishers, on the other hand, are rightfully concerned about the potential risks to data privacy, content ownership, and copyright infringement. OpenAI’s practices have sparked a larger debate, especially because compliance with robots.txt files, which site owners use to tell web crawlers what they may not access, is not legally required.
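To make the robots.txt mechanism concrete, here is a minimal sketch in Python of how a well-behaved crawler might check a site's rules before fetching a page. The robots.txt contents, the example.com URL, and the `can_fetch` helper are illustrative assumptions, not taken from any real publisher or from OpenAI's or ByteDance's actual code. The point it shows is that the check happens on the crawler's side, so a crawler that skips it is not stopped by the file itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve: block GPTBot, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a live crawler would download them first.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/story.html"  # placeholder URL
    for agent in ("GPTBot", "Bytespider"):
        verdict = "allowed" if can_fetch(agent, url) else "blocked"
        print(f"{agent}: {verdict}")
```

Under these sample rules, a compliant GPTBot would skip the page, while a crawler with no matching entry falls through to the wildcard rule. Nothing in the file enforces either outcome; a crawler only honors it by choosing to run a check like this one.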

ByteDance, the parent company of TikTok, has further escalated concerns with its new web scraper, Bytespider. It reportedly crawls about 25 times faster than GPTBot and ignores robots.txt files entirely. Its ability to bypass publisher restrictions has caused a stir, particularly given current tensions between the U.S. and China over data security and technology sharing. Growing unease about Chinese companies like ByteDance accessing U.S. data through web scraping tools has raised new questions about how far these technologies should be allowed to go.

In light of these developments, news outlets and publishers face a tough choice. Should they allow AI web crawlers to access their content, potentially benefiting from AI partnerships? Or should they hold firm, insisting on better safeguards and clearer legal frameworks before opening the door to AI companies? The future of AI and content publishing may hinge on the answers to these questions.
