Salesforce Faces Lawsuit from Authors Over AI Training Data


Salesforce has been hit with a class-action lawsuit by two authors, who accuse the company of using thousands of books without permission to train its artificial intelligence software, according to Reuters.

Novelists Molly Tanzer and Jennifer Gilmore allege in their complaint that Salesforce infringed their copyrights by using their intellectual property to train its XGen AI models to process language.

Why Is Salesforce Being Sued?

In the lawsuit, the plaintiffs claim that Salesforce pirated hundreds of thousands of copyrighted books to develop its XGen series of large language models (LLMs).

The lawsuit claims that the CRM giant unlawfully downloaded, stored, copied, and used the ‘notorious’ RedPajama and The Pile datasets to improve its LLMs.

Moreover, the plaintiffs argue that Salesforce has benefited commercially from this alleged mass copyright infringement by winning more enterprise customers for products built on its LLMs – most notably Agentforce.

Although the two named authors’ novels have nothing to do with Salesforce specifically, such books offer exactly the kind of rich, varied language and long-form narrative structure that AI models benefit from.

They would potentially help systems like Salesforce’s XGen learn to generate more coherent, contextually nuanced, and stylistically human-sounding text – enabling more appropriate responses to human queries and more natural summaries.

The two named datasets reportedly contain the Books3 corpus, which includes myriad copyrighted books acquired without the two authors’ permission.

Salesforce has pushed back, however; according to the lawsuit, the company has described these datasets as “legally compliant”.

Speaking to Reuters, attorney Joseph Saveri, who represents the authors, said: ā€œIt’s important that companies that use copyrighted material for […] AI products are transparent. It’s also only fair that our clients are fairly compensated when this happens.ā€

We have reached out to Salesforce for comment.

The Wider Issues With LLMs and Copyright

While this particular case is still ongoing, it is not the first high-profile accusation of a company using unauthorized data to train LLMs.

Over the last few years, authors, artists, and media outlets have alleged that AI companies used their copyrighted works without consent for model training.

  • In July 2023, comedian and author Sarah Silverman, along with writers Richard Kadrey and Christopher Golden, filed copyright infringement lawsuits against OpenAI (and, in some filings, Meta Platforms), alleging that their books, sourced from the aforementioned Books3 corpus, had been used to train AI models.
  • In September 2023, the Authors Guild, together with 17 authors (including George R. R. Martin, John Grisham, Jodi Picoult, and Jonathan Franzen), sued OpenAI for alleged copyright infringement – also claiming their intellectual property was included in the Books3 dataset.
  • In December 2023, The New York Times filed a lawsuit against OpenAI and Microsoft Corporation, accusing them of using millions of the Times’ articles without permission to train AI models.

It’s also worth noting that a separate case involving Anthropic recently ended in a major settlement (reportedly $1.5 billion) with authors over the use of pirated books in AI training.

Although none of these disputes has been fully resolved, this new case adds Salesforce to a growing list of large-scale companies facing similar scrutiny – and underscores how unsettled the ethical and legal framework around LLM training remains.

It would also be rather ironic, given that Salesforce CEO Marc Benioff stated in an interview with Bloomberg that many AI companies had ā€œstolenā€ the data used to train their models and that compensating creators for their work would be ā€œvery easy to doā€.

LLMs require vast datasets to function effectively, but the price of that volume may be the use of unauthorized data, leading to the wave of lawsuits we’ve already seen.

More concerning still: if training on unauthorized data is fully restricted, companies like Salesforce may face tighter limits on how far, and how deeply, they can develop their LLMs.

Final Thoughts

The LLM race has grown increasingly competitive in recent years. Billions of dollars have been invested, and every AI company is striving to build the most advanced model. But in the rush to innovate, ethical boundaries are beginning to blur – with some perhaps even seeing this approach as a necessary risk.

Salesforce’s potential entry into this mix underscores how even enterprise-focused technology companies are not immune to the broader debate over how AI models are trained. Where does their data come from, and what does “responsible innovation” really mean in practice?
