Salesforce has been hit with a proposed class-action lawsuit by two authors, who accuse the company of using thousands of books without permission to train its artificial intelligence software, according to Reuters.
Novelists Molly Tanzer and Jennifer Gilmore state in their complaint that Salesforce infringed their copyrights by using their intellectual property to train its XGen AI language models.
Why Is Salesforce Being Sued?
In the official lawsuit, the plaintiffs claim that Salesforce has potentially pirated hundreds of thousands of copyrighted books to develop its XGen series of large language models (LLMs).
The lawsuit claims that the CRM giant unlawfully downloaded, stored, copied, and used the "notorious" RedPajama and The Pile datasets to improve its LLMs.
Moreover, the plaintiffs believe that Salesforce has benefited commercially from this alleged infringement by winning more enterprise customers for its LLM-powered products, most notably Agentforce.
Although the two named authors' novels have nothing to do with Salesforce specifically, these books offer exactly the kind of rich, varied language and long-form narrative structure that AI models benefit from.
They could help systems like Salesforce's XGen learn to generate more coherent, contextually nuanced, and stylistically human-sounding text, enabling more appropriate responses to user queries and better summarization.
The two named datasets reportedly contain the Books3 corpus, a collection that includes a myriad of copyrighted books acquired without the authors' permission.
Salesforce disputes this, however, maintaining that these datasets are "legally compliant".
Speaking to Reuters, attorney Joseph Saveri, who represents the authors, said: "It's important that companies that use copyrighted material for […] AI products are transparent. It's also only fair that our clients are fairly compensated when this happens."
We have reached out to Salesforce for comment.
The Wider Issues With LLMs and Copyright
While this particular case is still ongoing, it is not the first high-profile accusation of a company using unauthorized data to train LLMs.
Over the last few years, authors, artists, and media outlets have alleged that AI companies used their copyrighted works without consent for model training.
- In July 2023, comedian and author Sarah Silverman, along with writers Richard Kadrey and Christopher Golden, filed copyright infringement lawsuits against OpenAI (and, in some filings, Meta Platforms), alleging that their books, sourced from the aforementioned Books3 corpus, had been used to train AI models.
- In September 2023, the Authors Guild, together with 17 authors (including George R. R. Martin, John Grisham, Jodi Picoult, and Jonathan Franzen), sued OpenAI for alleged copyright infringement, also claiming their intellectual property was included in the Books3 dataset.
- In December 2023, The New York Times filed a lawsuit against OpenAI and Microsoft, accusing them of using millions of the Times' articles without permission to train AI models.
It's also worth noting that a separate case involving Anthropic recently ended in a reported $1.5 billion settlement with authors over copyrighted books used to train its models.
Although the case is far from resolved, it adds Salesforce to a growing list of major companies facing similar scrutiny, and it highlights how unsettled the ethical and legal framework around LLM training remains.
It would also be rather ironic, given that Salesforce CEO Marc Benioff stated in an interview with Bloomberg that many AI companies had "stolen" the data used to train their models, and that compensating creators for their work would be "very easy to do".
LLMs require vast datasets to function effectively, but the price of that volume may be the use of unauthorized data, as the wave of lawsuits we've already seen suggests.
Even more concerning for the industry: if training on unauthorized data is fully restricted, companies like Salesforce may face tighter limits on the data available to develop their LLMs.
Final Thoughts
The LLM race has grown increasingly competitive in recent years. Billions of dollars have been invested, and every AI company is striving to build the most advanced model. But in the rush to innovate, ethical boundaries are beginning to blur, with some perhaps even seeing this approach as a necessary risk.
Salesforce's potential entry into this mix underscores that even enterprise-focused technology companies are not immune to the broader debate over how AI models are trained. Where does their data come from, and what does "responsible innovation" really mean in practice?