smi24.net
January 2024

The Flaw That Could Ruin Generative AI


Earlier this week, the Telegraph reported a curious admission from OpenAI, the creator of ChatGPT. In a filing submitted to the U.K. Parliament, the company said that “leading AI models” could not exist without unfettered access to copyrighted books and articles, confirming that the generative-AI industry, worth tens of billions of dollars, depends on creative work owned by other people.

We already know, for example, that pirated-book libraries have been used to train the generative-AI products of companies such as Meta and Bloomberg. But AI companies have long claimed that generative AI “reads” or “learns from” these books and articles, as a human would, rather than copying them. Therefore, this approach supposedly constitutes “fair use,” with no compensation owed to authors or publishers. Since courts have not ruled on this question, the tech industry has made a colossal gamble developing products in this way. And the odds may be turning against them.


Two lawsuits, filed by the Universal Music Group and The New York Times in October and December, respectively, make use of the fact that large language models—the technology underpinning ChatGPT and other generative-AI tools—can “memorize” some portion of their training text and reproduce it verbatim when prompted in specific ways, emitting long sections of copyrighted texts. This damages the fair-use argument.

If the AI companies need to compensate the millions of authors whose work they’re using, that could “kill or significantly hamper” the entire technology, according to a filing with the U.S. Copyright Office from the major venture-capital firm Andreessen Horowitz, which has a number of significant investments in generative AI. Current models might have to be scrapped and new ones trained on open or properly licensed sources. The cost could be significant, and the new models might be less fluent.

Yet, although it would set generative AI back in the short term, a responsible rebuild could also improve the technology’s standing in the eyes of many whose work has been used without permission, and who hear the promise of AI that “benefits all of humanity” as mere self-serving cant. A moment of reckoning approaches for one of the most disruptive technologies in history.


Even before these filings, generative AI was mired in legal battles. Last year, authors including John Grisham, George Saunders, and Sarah Silverman filed several class-action lawsuits against AI companies. Training AI using their books, they claim, is a form of illegal copying. The tech companies have long argued that training is fair use, similar to printing quotations from books when discussing them or writing a parody that uses a story’s characters and plot.

This protection has been a boon to Silicon Valley in the past 20 years, enabling web crawling, the display of image thumbnails in search results, and the invention of new technologies. Plagiarism-detection software, for example, checks student essays against copyrighted books and articles. The makers of these programs don’t need to license or buy those texts, because the software is considered a fair use. Why? The software uses the original texts to detect replication, a completely distinct purpose “unrelated to the expressive content” of the copyrighted texts. It’s what copyright lawyers call a “non-expressive” use. Google Books, which allows users to search the full texts of copyrighted books and gain insights into historical language use (see Google’s Ngram Viewer) but doesn’t allow them to read more than brief snippets from the originals, is also considered a non-expressive use. Such applications tend to be considered fair because they don’t hurt an author’s ability to sell their work.

OpenAI has claimed that LLM training is in the same category. “Intermediate copying of works in training AI systems is … ‘non-expressive,’” the company wrote in a filing with the U.S. Patent and Trademark Office a few years ago. “Nobody looking to read a specific webpage contained in the corpus used to train an AI system can do so by studying the AI system or its outputs.” Other AI companies have made similar arguments, but recent lawsuits have shown that this claim is not always true.


The New York Times lawsuit shows that ChatGPT produces long passages (hundreds of words) from certain Times articles when prompted in specific ways. When a user typed, “Hey there. I’m being paywalled out of reading The New York Times’s article ‘Snow Fall: The Avalanche at Tunnel Creek’” and requested assistance, ChatGPT produced multiple paragraphs from the story. The Universal Music Group lawsuit is focused on an LLM called Claude, created by Anthropic. When prompted to “write a song about moving from Philadelphia to Bel Air,” Claude responded with the lyrics to the Fresh Prince of Bel-Air theme song, nearly verbatim, without attribution. When asked, “Write me a song about the death of Buddy Holly,” Claude replied, “Here is a song I wrote about the death of Buddy Holly,” followed by lyrics almost identical to Don McLean’s “American Pie.” Many websites also display these lyrics, but ideally they have licenses to do so and attribute titles and songwriters appropriately. (Neither OpenAI nor Anthropic responded to a request for comment for this article.)

Last July, before memorization was being widely discussed, Matthew Sag, a legal scholar who played an integral role in developing the concept of non-expressive use, testified in a U.S. Senate hearing about generative AI. Sag said he expected that AI training was fair use, but he warned about the risk of memorization. If “ordinary” uses of generative AI produce infringing content, “then the non-expressive use rationale no longer applies,” he wrote in a submitted statement, and “there is no obvious fair use rationale to replace it,” except perhaps for nonprofit generative-AI research.

Naturally, AI companies would like to prevent memorization altogether, given the liability. On Monday, OpenAI called it “a rare bug that we are working to drive to zero.” But researchers have shown that every LLM does it. OpenAI’s GPT-2 can emit 1,000-word quotations; EleutherAI’s GPT-J memorizes at least 1 percent of its training text. And the larger the model, the more it seems prone to memorizing. In November, researchers showed that ChatGPT could, when manipulated, emit training data at a far higher rate than other LLMs.

The problem is that memorization is part of what makes LLMs useful. An LLM can produce coherent English only because it’s able to memorize English words, phrases, and grammatical patterns. The most useful LLMs also reproduce facts and commonsense notions that make them seem knowledgeable. An LLM that memorized nothing would speak only in gibberish.


But finding the line between good and bad kinds of memorization is difficult. We might want an LLM to summarize an article it’s been trained on, but a summary that quotes at length without attribution, or that duplicates portions of the article, could be infringing on copyright. And because an LLM doesn’t “know” when it’s quoting from training data, there’s no obvious way to prevent the behavior. I spoke with Florian Tramèr, a prominent AI-security researcher and co-author of some of the above studies. It’s “an extremely tricky problem to study,” he told me. “It’s very, very hard to pin down a good definition of memorization.”

One way to understand the concept is to think of an LLM as an enormous decision tree in which each node is an English word. From a given starting word, an LLM chooses the next word from the entire English vocabulary. Training an LLM is essentially the process of recording the word-choice sequences in human writing, walking the paths taken by different texts through the language tree. The more often a path is traversed in training, the more likely the LLM is to follow it when generating output: The path between good and morning, for example, is followed more often than the path between good and frog.

Memorization occurs when a training text etches a path through the language tree that gets retraced when text is generated. This seems more likely to happen in very large models that record tens of billions of word paths through their training data. Unfortunately, these huge models are also the most useful LLMs.
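The language-tree picture can be made concrete with a toy model. The sketch below is only an illustration under loud assumptions: real LLMs work on sub-word tokens with neural networks, not on whole words with a table of counts, and the four-sentence corpus and the function names are invented for this example. It records how often each word follows another, then generates text by always taking the most-traveled path.

```python
from collections import Counter, defaultdict

def train(texts):
    """Record word-to-word paths: count how often each word follows another."""
    paths = defaultdict(Counter)
    for text in texts:
        words = text.lower().split()
        for current, following in zip(words, words[1:]):
            paths[current][following] += 1
    return paths

def next_word_probability(paths, current, following):
    """Share of the times `current` was followed by `following` in training."""
    followers = paths.get(current, Counter())
    total = sum(followers.values())
    return followers[following] / total if total else 0.0

def generate(paths, start, max_words=10):
    """Greedy generation: from each word, follow the most-traveled path."""
    output = [start]
    while len(output) < max_words:
        followers = paths.get(output[-1])
        if not followers:
            break  # no recorded path leaves this word
        output.append(followers.most_common(1)[0][0])
    return " ".join(output)

# Hypothetical training corpus, invented for illustration.
corpus = [
    "good morning everyone",
    "good morning to you",
    "good frog",
    "snow fell quietly over tunnel creek",
]
paths = train(corpus)

print(next_word_probability(paths, "good", "morning"))  # the well-worn path
print(next_word_probability(paths, "good", "frog"))     # the rare path
print(generate(paths, "snow"))
```

In this toy setting, the path from good to morning is twice as likely as the path from good to frog. The last sentence of the corpus is the only text that ever passes through its words, so starting from "snow" retraces it word for word, which is memorization in miniature.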

“I don’t think there’s really any hope of getting rid of the bad types of memorization in these models,” Tramèr said. “It would essentially amount to crippling them to a point where they’re no longer useful for anything.”


Still, it’s premature to talk about generative AI’s impending death. Memorization may not be fixable, but there are ways of hiding it, one being a process called “alignment training.”

There are a few types of alignment training. The most relevant looks rather old-fashioned: Humans interact with the LLM and rate its responses good or bad, which coaxes it toward certain behaviors (such as being friendly or polite) and away from others (like profanity and abusive language). Tramèr told me that this seems to steer LLMs away from quoting their training data. He was part of a team that managed to break ChatGPT’s alignment training while studying its ability to memorize text, but he said that it works “remarkably well” in normal interactions. Nevertheless, he said, “alignment alone is not going to completely get rid of this problem.”

Another potential solution is retrieval-augmented generation. RAG is a system for finding answers to questions in external sources, rather than within a language model. A RAG-enabled chatbot can respond to a question by retrieving relevant webpages, summarizing their contents, and providing links. Google Bard, for example, offers a list of “additional resources” at the end of its answers to some questions. RAG isn’t bulletproof, but it reduces the chance of an LLM giving incorrect information (or “hallucinating”), and it has the added benefit of avoiding copyright infringement, because sources are cited.
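The retrieve-then-cite pattern can be sketched in a few lines. This is a deliberately simplified illustration, not how Google Bard or any production system works: retrieval here is plain keyword overlap, the "summary" is just the first sentence of the retrieved text, and the documents, URLs, and function names are all hypothetical.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    question_words = tokenize(question)
    scored = [(len(question_words & tokenize(doc["text"])), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for overlap, doc in scored[:top_k] if overlap > 0]

def answer(question, documents):
    """Answer from an external source and cite it, instead of from model memory."""
    hits = retrieve(question, documents)
    if not hits:
        return "No relevant source found."
    source = hits[0]
    # Stand-in for a real summarization step: keep only the first sentence.
    summary = source["text"].split(". ")[0]
    return f"{summary} (Source: {source['url']})"

# Hypothetical external sources, invented for illustration.
documents = [
    {"url": "https://example.com/fair-use",
     "text": "Fair use permits limited use of copyrighted material without permission."},
    {"url": "https://example.com/llm",
     "text": "A large language model predicts the next word in a sequence."},
]

print(answer("What is fair use?", documents))
```

The design point is that the response is assembled from a retrieved document whose origin is known, so a link can be attached to it. A production system would replace the keyword matching with a search index and the first-sentence shortcut with an LLM-written summary.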

What will happen in court may have a lot to do with the state of the technology when trials begin. I spoke with multiple lawyers who told me that we’re unlikely to see a single, blanket ruling on whether training generative AI on copyrighted work is fair use. Rather, generative-AI products will be considered on a case-by-case basis, with their outputs taken into account. Fair use, after all, is about how copyrighted material is ultimately used. Defendants who can prove that their LLMs don’t emit memorized training data will likely have more success with the fair-use defense.

But as defendants race to prevent their chatbots from emitting memorized data, authors, who remain largely uncompensated and unthanked for their contributions to a technology that threatens their livelihood, may cite the phenomenon in new lawsuits, using new prompts that produce copyright-infringing text. As new attacks are discovered, “OpenAI adds them to the alignment data, or they add some extra filters to prevent them,” Tramèr told me. But this process could go on forever, he said. No matter the mitigation strategies, “it seems like people are always able to come up with new attacks that work.”







