New York Times sues Microsoft, ChatGPT maker OpenAI over copyright infringement

bnew

Veteran
Joined
Nov 1, 2015
Messages
55,228
Reputation
8,195
Daps
156,160

New York Times sues Microsoft, ChatGPT maker OpenAI over copyright infringement​

PUBLISHED WED, DEC 27 20238:40 AM ESTUPDATED AN HOUR AGO


Ryan Browne @RYAN_BROWNE_

KEY POINTS


  • The New York Times on Wednesday filed a lawsuit against Microsoft and OpenAI, the company behind ChatGPT, accusing them of copyright infringement and abusing the newspaper’s intellectual property.
  • In a court filing, the publisher said it seeks to hold Microsoft and OpenAI to account for “billions of dollars in statutory and actual damages” it believes it is owed for “unlawful copying and use of The Times’s uniquely valuable works.”
  • The Times accused Microsoft and OpenAI of creating a business model based on “mass copyright infringement,” stating their AI systems “exploit and, in many cases, retain large portions of the copyrightable expression contained in those works.”

In this article
Follow your favorite stocks CREATE FREE ACCOUNT

New York Times sues OpenAI, Microsoft for copyright infringement


WATCH NOW

VIDEO01:04

New York Times sues OpenAI, Microsoft for copyright infringement

The New York Times on Wednesday filed a lawsuit against Microsoft and OpenAI, creator of the popular AI chatbot ChatGPT, accusing the companies of copyright infringement and abusing the newspaper’s intellectual property to train large language models.

Microsoft both invests in and supplies OpenAI, providing it with access to the company’s Azure cloud computing technology.

The publisher said in a filing in the U.S. District Court for the Southern District of New York that it seeks to hold Microsoft and OpenAI to account for the “billions of dollars in statutory and actual damages” it believes it is owed for the “unlawful copying and use of The Times’s uniquely valuable works.”

The Times said in an emailed statement that it “recognizes the power and potential of GenAI for the public and for journalism,” but added that journalistic material should be used for commercial gain with permission from the original source.

“These tools were built with and continue to use independent journalism and content that is only available because we and our peers reported, edited, and fact-checked it at high cost and with considerable expertise,” the Times said.


The New York Times Building in New York City on February 1, 2022.

The New York Times Building in New York City on February 1, 2022.

Angela Weiss | AFP | Getty Images

“Settled copyright law protects our journalism and content,” the Times added. “If Microsoft and OpenAI want to use our work for commercial purposes, the law requires that they first obtain our permission. They have not done so.”

CNBC has reached out to Microsoft and OpenAI for comment.

The Times is represented in the proceedings by Susman Godfrey, the litigation firm that represented Dominion Voting Systems in its defamation suit against Fox News that culminated in a $787.5 million million settlement.

Susman Godfrey is also representing author Julian Sancton and other writers in a separate lawsuit against OpenAI and Microsoft that accuses the companies of using copyrighted materials without permission to train several versions of ChatGPT.



‘Mass copyright infringement’​

The Times is one of numerous media organizations pursuing compensation from companies behind some of the most advanced artificial intelligence models, for the alleged usage of their content to train AI programs.

OpenAI is the creator of GPT, a large language model that can produce humanlike content in response to user prompts. It uses billions of parameters’ worth of information, which is obtained from public web data up until 2021.

Media publishers and content creators are finding their materials being used and reimagined by generative AI tools like ChatGPT, Dall-E, Midjourney and Stable Diffusion. In numerous cases, the content the programs produce can look similar to the source material.

OpenAI has tried to allay news publishers’ concerns. In December, the company announced a partnership with Axel Springer — the parent company of Business Insider, Politico, and European outlets Bild and Welt — which would license its content to OpenAI in return for a fee.

The financial terms of the deal weren’t disclosed.

In its lawsuit Wednesday, the Times accused Microsoft and OpenAI of creating a business model based on “mass copyright infringement,” stating that the companies’ AI systems were “used to create multiple reproductions of The Times’s intellectual property for the purpose of creating the GPT models that exploit and, in many cases, retain large portions of the copyrightable expression contained in those works.”

Publishers are concerned that, with the advent of generative AI chatbots, fewer people will click through to news sites, resulting in shrinking traffic and revenues.

The Times included numerous examples in the suit of instances where GPT-4 produced altered versions of material published by the newspaper.

In one example, the filing shows OpenAI’s software producing almost identical text to a Times article about predatory lending practices in New York City’s taxi industry.

But in OpenAI’s version, GPT-4 excludes a critical piece of context about the sum of money the city made selling taxi medallions and collecting taxes on private sales.

In its suit, the Times said Microsoft and OpenAI’s GPT models “directly compete with Times content.”

The AI models also limited the Times’ commercial opportunities by altering its content. For example, the publisher alleges GPT outputs remove links to products featured in its Wirecutter app, a product reviews platform, “thereby depriving The Times of the opportunity to receive referral revenue and appropriating that opportunity for Defendants.”

The Times also alleged Microsoft and OpenAI models produce content similar to that generated by the newspaper, and that their use of its content to train LLMs without consent “constitutes free-riding on The Times’s significant efforts and investment of human capital to gather this information.”

The Times said Microsoft and OpenAI’s LLMs “can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style,” and “wrongly attribute false information to The Times,” and “deprive The Times of subscription, licensing, advertising, and affiliate revenue.”

CNBC’s Rohan Goswami contributed to this report.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
55,228
Reputation
8,195
Daps
156,160

NY Times copyright suit wants OpenAI to delete all GPT instances​

Shows evidence that GPT-based systems will reproduce Times articles if asked.​

JOHN TIMMER - 12/27/2023, 2:05 PM

Image of a CPU on a motherboard with

Enlarge / Microsoft is named in the suit for allegedly building the system that allowed GPT derivatives to be trained using infringing material.
Just_Super

366

In August, word leaked out that The New York Times was considering joining the growing legion of creators that are suing AI companies for misappropriating their content. The Times had reportedly been negotiating with OpenAI regarding the potential to license its material, but those talks had not gone smoothly. So, eight months after the company was reportedly considering suing, the suit has now been filed.

The Times is targeting various companies under the OpenAI umbrella, as well as Microsoft, an OpenAI partner that both uses it to power its Copilot service and helped provide the infrastructure for training the GPT Large Language Model. But the suit goes well beyond the use of copyrighted material in training, alleging that OpenAI-powered software will happily circumvent the Times' paywall and ascribe hallucinated misinformation to the Times.

Journalism is expensive​

The suit notes that The Times maintains a large staff that allows it to do things like dedicate reporters to a huge range of beats and engage in important investigative journalism, among other things. Because of those investments, the newspaper is often considered an authoritative source on many matters.

All of that costs money, and The Times earns that by limiting access to its reporting through a robust paywall. In addition, each print edition has a copyright notification, the Times' terms of service limit the copying and use of any published material, and it can be selective about how it licenses its stories. In addition to driving revenue, these restrictions also help it to maintain its reputation as an authoritative voice by controlling how its works appear.

The suit alleges that OpenAI-developed tools undermine all of that. "By providing Times content without The Times’s permission or authorization, Defendants’ tools undermine and damage The Times’s relationship with its readers and deprive The Times of subscription, licensing, advertising, and affiliate revenue," the suit alleges.

Part of the unauthorized use The Times alleges came during the training of various versions of GPT. Prior to GPT-3.5, information about the training dataset was made public. One of the sources used is a large collection of online material called "Common Crawl," which the suit alleges contains information from 16 million unique records from sites published by The Times. That places the Times as the third most referenced source, behind Wikipedia and a database of US patents.

OpenAI no longer discloses as many details of the data used for training of recent GPT versions, but all indications are that full-text NY Times articles are still part of that process (Much more on that in a moment.) Expect access to training information to be a major issue during discovery if this case moves forward.

Not just training​

A number of suits have been filed regarding the use of copyrighted material during training of AI systems. But the Times' suit goes well beyond that to show how the material ingested during training can come back out during use. "Defendants’ GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples," the suit alleges.

The suit alleges—and we were able to verify—that it's comically easy to get GPT-powered systems to offer up content that is normally protected by the Times' paywall. The suit shows a number of examples of GPT-4 reproducing large sections of articles nearly verbatim.

The suit includes screenshots of ChatGPT being given the title of a piece at The New York Times and asked for the first paragraph, which it delivers. Getting the ensuing text is apparently as simple as repeatedly asking for the next paragraph.

ChatGPT has apparently closed that loophole in between the preparation of that suit and the present. We entered some of the prompts shown in the suit, and were advised "I recommend checking The New York Times website or other reputable sources," although we can't rule out that context provided prior to that prompt could produce copyrighted material.

Ask for a paragraph, and Copilot will hand you a wall of normally paywalled text.

Ask for a paragraph, and Copilot will hand you a wall of normally paywalled text.

John Timmer

But not all loopholes have been closed. The suit also shows output from Bing Chat, since rebranded as Copilot. We were able to verify that asking for the first paragraph of a specific article at The Times caused Copilot to reproduce the first third of the article.

The suit is dismissive of attempts to justify this as a form of fair use. "Publicly, Defendants insist that their conduct is protected as 'fair use' because their unlicensed use of copyrighted content to train GenAI models serves a new 'transformative' purpose," the suit notes. "But there is nothing 'transformative' about using The Times’s content without payment to create products that substitute for The Times and steal audiences away from it."

Reputational and other damages​

The hallucinations common to AI also came under fire in the suit for potentially damaging the value of the Times' reputation, and possibly damaging human health as a side effect. "A GPT model completely fabricated that “The New York Times published an article on January 10, 2020, titled ‘Study Finds Possible Link between Orange Juice and Non-Hodgkin’s Lymphoma,’” the suit alleges. "The Times never published such an article."

Similarly, asking about a Times article on heart-healthy foods allegedly resulted in Copilot saying it contained a list of examples (which it didn't). When asked for the list, 80 percent of the foods on weren't even mentioned by the original article. In another case, recommendations were ascribed to the Wirecutter when the products hadn't even been reviewed by its staff.

As with the Times material, it's alleged that it's possible to get Copilot to offer up large chunks of Wirecutter articles (The Wirecutter is owned by The New York Times). But the suit notes that these article excerpts have the affiliate links stripped out of them, keeping the Wirecutter from its primary source of revenue.

The suit targets various OpenAI companies for developing the software, as well as Microsoft—the latter for both offering OpenAI-powered services, and for having developed the computing systems that enabled the copyrighted material to be ingested during training. Allegations include direct, contributory, and vicarious copyright infringement, as well as DMCA and trademark violations. Finally, it alleges "Common Law Unfair Competition By Misappropriation."

The suit seeks nothing less than the erasure of both any GPT instances that the parties have trained using material from the Times, as well as the destruction of the datasets that were used for the training. It also asks for a permanent injunction to prevent similar conduct in the future. The Times also wants money, lots and lots of money: "statutory damages, compensatory damages, restitution, disgorgement, and any other relief that may be permitted by law or equity."
 

Buddy

Keep my name out of it
Joined
Apr 28, 2014
Messages
18,510
Reputation
5,608
Daps
77,130
I'd see what NYT is saying about this but they're just gonna hit me with a pay wall :martin:

Finna prompt ChatGPT to make an article for me :smugdraper:
 

↓R↑LYB

I trained Sheng Long and Shonuff
Joined
May 2, 2012
Messages
44,204
Reputation
13,743
Daps
171,150
Reppin
Pawgistan
I figured this was coming. Regardless of the outcome, the lawyers are bout to get PAID :whoo:
 

AVXL

Laughing at you n*ggaz like “ha ha ha”
Joined
May 1, 2012
Messages
40,755
Reputation
645
Daps
76,192
Reppin
Of course the ATL
I mentioned this in the AI thread but the NYT is the single biggest proprietary data source in GPT’s common crawl, when you look at the examples cited in the lawsuit, the GPT response is almost identical to NYT’s content.

What you’ll see is a quick settlement and a long standing royalty to NYT. At the heart of this is a copyright case and what this means for the future of LLMs is access & price to private data is going to be even more of a premium. Long term this will favor the big incumbents like OpenAI, Meta, Google, Anthropic who have the money to pay for access to pay wall, subscriber only type content
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
55,228
Reputation
8,195
Daps
156,160

The NY Times Lawsuit Against OpenAI Would Open Up The NY Times To All Sorts Of Lawsuits Should It Win​

Copyright

from the it's-okay-when-we-do-it,-we're-the-new-york-times dept​

Thu, Dec 28th 2023 09:24am - Mike Masnick

This week the NY Times somehow broke the story of… well, the NY Times suing OpenAI and Microsoft. I wonder who tipped them off. Anyhoo, the lawsuit in many ways is similar to some of the over a dozen lawsuits filed by copyright holders against AI companies. We’ve written about how silly many of these lawsuits are, in that they appear to be written by people who don’t much understand copyright law. And, as we noted, even if courts actually decide in favor of the copyright holders, it’s not like it will turn into any major windfall. All it will do is create another corruptible collection point, while locking in only a few large AI companies who can afford to pay up.

I’ve seen some people arguing that the NY Times lawsuit is somehow “stronger” and more effective than the others, but I honestly don’t see that. Indeed, the NY Times itself seems to think its case is so similar to the ridiculously bad Authors Guild case, that it’s looking to combine the cases.

But while there are some unique aspects to the NY Times case, I’m not sure they are nearly as compelling as the NY Times and its supporters think they are. Indeed, I think if the Times actually wins its case, it would open the Times itself up to some fairly damning lawsuits itself, given its somewhat infamous journalistic practices regarding summarizing other people’s articles without credit. But, we’ll get there.

The Times, in typical NY Times fashion, presents this case as thought the NY Times is the great defender of press freedom, taking this stand to stop the evil interlopers of AI.


Independent journalism is vital to our democracy. It is also increasingly rare and valuable. For more than 170 years, The Times has given the world deeply reported, expert, independent journalism. Times journalists go where the story is, often at great risk and cost, to inform the public about important and pressing issues. They bear witness to conflict and disasters, provide accountability for the use of power, and illuminate truths that would otherwise go unseen. Their essential work is made possible through the efforts of a large and expensive organization that provides legal, security, and operational support, as well as editors who ensure their journalism meets the highest standards of accuracy and fairness. This work has always been important. But within a damaged information ecosystem that is awash in unreliable content, The Times’s journalism provides a service that has grown even more valuable to the public by supplying trustworthy information, news analysis, and commentary

Defendants’ unlawful use of The Times’s work to create artificial intelligence products that compete with it threatens The Times’s ability to provide that service. Defendants’ generative artificial intelligence (“GenAI”) tools rely on large-language models (“LLMs”) that were built by copying and using millions of The Times’s copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more. While Defendants engaged in widescale copying from many sources, they gave Times content particular emphasis when building their LLMs—revealing a preference that recognizes the value of those works. Through Microsoft’s Bing Chat (recently rebranded as “Copilot”) and OpenAI’s ChatGPT, Defendants seek to free-ride on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment.

As the lawsuit makes clear, this isn’t some high and mighty fight for journalism. It’s a negotiating ploy. The Times admits that it has been trying to get OpenAI to cough up some cash for its training:

For months, The Times has attempted to reach a negotiated agreement with Defendants, in accordance with its history of working productively with large technology platforms to permit the use of its content in new digital products (including the news products developed by Google, Meta, and Apple). The Times’s goal during these negotiations was to ensure it received fair value for the use of its content, facilitate the continuation of a healthy news ecosystem, and help develop GenAI technology in a responsible way that benefits society and supports a well-informed public.

I’m guessing that OpenAI’s decision a few weeks back to pay off media giant Axel Springer to avoid one of these lawsuits, and the failure to negotiate a similar deal (at what is likely a much higher price), resulted in the Times moving forward with the lawsuit.

There are five or six whole pages of puffery about how amazing the NY Times thinks the NY Times is, followed by the laughably stupid claim that generative AI “threatens” the kind of journalism the NY Times produces.

Let me let you in on a little secret: if you think that generative AI can do serious journalism better than a massive organization with a huge number of reporters, then, um, you deserve to go out of business. For all the puffery about the amazing work of the NY Times, this seems to suggest that it can easily be replaced by an auto-complete machine.

In the end, though, the crux of this lawsuit is the same as all the others. It’s a false belief that reading something (whether by human or machine) somehow implicates copyright. This is false. If the courts (or the legislature) decide otherwise, it would upset pretty much all of the history of copyright and create some significant real world problems.

Part of the Times complaint is that OpenAI’s GPT LLM was trained in part with Common Crawl data. Common Crawl is an incredibly useful and important resource that apparently is now coming under attack. It has been building an open repository of the web for people to use, not unlike the Internet Archive, but with a focus on making it accessible to researchers and innovators. Common Crawl is a fantastic resource run by some great people (though the lawsuit here attacks them).

But, again, this is the nature of the internet. It’s why things like Google’s cache and the Internet Archive’s Wayback Machine are so important. These are archives of history that are incredibly important, and have historically been protected by fair use, which the Times is now threatening.

(Notably, just recently, the NY Times was able to get all of its articles excluded from Common Crawl. Otherwise I imagine that they would be a defendant in this case as well).

Either way, so much of the lawsuit is claiming that GPT learning from this data is infringement. And, as we’ve noted repeatedly, reading/processing data is not a right limited by copyright. We’ve already seen this in multiple lawsuits, but this rush of plaintiffs is hoping that maybe judges will be wowed by this newfangled “generative AI” technology into ignoring the basics of copyright law and pretending that there are now rights that simply do not exist.

Now, the one element that appears different in the Times’ lawsuit is that it has a bunch of exhibits that purport to prove how GPT regurgitates Times articles. Exhibit J is getting plenty of attention here, as the NY Times demonstrates how it was able to prompt ChatGPT in such a manner that it basically provided them with direct copies of NY Times articles.

In the complaint, they show this:

Image

At first glance that might look damning. But it’s a lot less damning when you look at the actual prompt in Exhibit J and realize what happened, and how generative AI actually works.

What the Times did is prompt GPT-4 by (1) giving it the URL of the story and then (2) “prompting” it by giving it the headline of the article and the first seven and a half paragraphs of the article, and asking it to continue.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
55,228
Reputation
8,195
Daps
156,160
Here’s how the Times describes this:[/SIZE]

Each example focuses on a single news article. Examples were produced by breaking the article into two parts. The frst part o f the article is given to GPT-4, and GPT-4 replies by writing its own version of the remainder of the article.

Here’s how it appears in Exhibit J (notably, the prompt was left out of the complaint itself):

Image

If you actually understand how these systems work, the output looking very similar to the original NY Times piece is not so surprising. When you prompt a generative AI system like GPT, you’re giving it a bunch of parameters, which act as conditions and limits on its output. From those constraints, it’s trying to generate the most likely next part of the response. But, by providing it paragraphs upon paragraphs of these articles, the NY Times has effectively constrained GPT to the point that the most probabilistic responses is… very close to the NY Times’ original story.

In other words, by constraining GPT to effectively “recreate this article,” GPT has a very small data set to work off of, meaning that the highest likelihood outcome is going to sound remarkably like the original. If you were to create a much shorter prompt, or introduce further randomness into the process, you’d get a much more random output. But these kinds of prompts effectively tell GPT not to do anything BUT write the same article.

From there, though, the lawsuit gets dumber.

It shows that you can sorta get around the NY Times’ paywall in the most inefficient and unreliable way possible by asking ChatGPT to quote the first few paragraphs in one paragraph chunks.

Image

Of course, quoting individual paragraphs from a news article is almost certainly fair use. And, for what it’s worth, the Times itself admits that this process doesn’t actually return the full article, but a paraphrase of it.

And the lawsuit seems to suggest that merely summarizing articles is itself infringing:

Image

That’s… all factual information summarizing the review? And while the complaint shows that if you then ask for (again, paragraph length) quotes, GPT will give you a few quotes from the article.

And, yes, the complaint literally argues that a generative AI tool can violate copyright when it “summarizes” an article.

The issue here is not so much how GPT is trained, but how the NY Times is constraining the output. That is unrelated to the question of whether or not the reading of these article is fair use or not. The purpose of these LLMs is not to repeat the content that is scanned, but to figure out the probabilistic most likely next token for a given prompt. When the Times constrains the prompts in such a way that the data set is basically one article and one article only… well… that’s what you get.

Elsewhere, the Times again complains about GPT returning factual information that is not subject to copyright law.

Image

But, I mean, if you were to ask anyone the same question, “What does wirecutter recommend for The Best Kitchen Scale,” they’re likely to return you a similar result, and that’s not infringing. It’s a fact that that scale is the one that it recommends. The Times complains that people who do this prompt will avoid clicking on Wirecutter affiliate links, but… um… it has no right to that affiliate income.

I mean, I’ll admit right here that I often research products and look at Wirecutter (and other!) reviews before eventually shopping independently of that research. In other words, I will frequently buy products after reading the recommendations on Wirecutter, but without clicking on an affiliate link. Is the NY Times really trying to suggest that this violates its copyright? Because that’s crazy.

Meanwhile, it’s not clear if the NY Times is mad that it’s accurately recommending stuff or if it’s just… mad. Because later in the complaint, the NY Times says its bad that sometimes GPT recommends the wrong product or makes up a paragraph.

So… the complaint is both that GPT reproduces things too accurately, AND not accurately enough. Which is it?

Anyway, the larger point is that if the NY Times wins, well… the NY Times might find itself on the receiving end of some lawsuits. The NY Times is somewhat infamous in the news world for using other journalists’ work as a starting point and building off of it (frequently without any credit at all). Sometimes this results in an eventual correction, but often it does not.

If the NY Times successfully argues that reading a third party article to help its reporters “learn” about the news before reporting their own version of it is copyright infringement, it might not like how that is turned around by tons of other news organizations against the NY Times. Because I don’t see how there’s any legitimate distinction between OpenAI scanning NY Times articles and NY Times reporters scanning other articles/books/research without first licensing those works as well.

Or, say, what happens if a source for a NY TImes reporter provides them with some copyright-covered work (an article, a book, a photograph, who knows what) that the NY Times does not have a license for? Can the NY Times journalist then produce an article based on that material (along with other research, though much less than OpenAI used in training GPT)?

It seems like (and this happens all too often in the news industry) the NY Times is arguing that it’s okay for its journalists to do this kind of thing because it’s in the business of producing Important Journalism™ whereas anyone else doing the same thing is some damn interloper.

We see this with other copyright disputes and the media industry, or with the ridiculous fight over the hot news doctrine, in which news orgs claimed that they should be the only ones allowed to report on something for a while.

Similarly, I’ll note that even if the NY Times gets some money out of this, don’t expect the actual reporters to see any of it. Remember, this is the same NY Times that once tried to stiff freelance reporters by relicensing their articles to electronic databases without paying them. The Supreme Court didn’t like that. If the NY Times establishes that merely training AI on old articles is a licenseable, copyright-impacting event, will it go back and pay those reporters a piece of whatever change they get? Or nah?


Filed Under: ai, ai training, copyright, fair use, generative ai, reading, restrictive prompts, summarizing, training

Companies: common crawl, microsoft, ny times, openai
 
Top