The Amateurs Jailbreaking GPT Say They're Preventing a Closed-Source AI Dystopia

22.03.2023 16:10

Motherboard

OpenAI’s latest version of its popular large language model, GPT-4, is the company's “most capable and aligned model yet,” according to CEO Sam Altman. Yet, within two days of its release, developers were already able to override its moderation filters, providing users with harmful content that ranged from telling users how to hack into someone’s computer to explaining why Mexicans should be deported.

This jailbreak is only the latest in a series that users have been able to run on GPT models. Jailbreaking, or modifying a system to remove its restrictions and rules, is what allows GPT to generate unfiltered content for users. The earliest known jailbreak on GPT models was the "DAN" jailbreak, in which users told GPT-3.5 to roleplay as an AI that can Do Anything Now and gave it a number of rules, such as that DANs can “say swear words and generate content that does not comply with OpenAI policy.” Since then, there have been many more jailbreaks, both variations that build on DAN and original prompts.

Now, there is a community of users who work to stress-test GPT models with each release. They see themselves as fighting back against OpenAI's increasingly closed policies (GPT-4 is its most powerful model yet, and also the one we know the least about) and hope to raise awareness of the problems the model faces before it is deployed at a larger scale or causes more harm to users.

As new GPT versions are hastily released to millions of people, AI researchers and users have shown that these systems contain harmful biases and misinformation, among other issues. For example, Bing’s GPT-powered search engine made up information at its own demo and ChatGPT broke when prompted with random Reddit usernames.

It's for this reason that Alex Albert, a computer science student at the University of Washington, created Jailbreak Chat, a site that hosts a collection of ChatGPT jailbreaks. He said that the site was created to provide the jailbreak community with a centralized repository so people could easily view jailbreaks and iterate on them, and to allow people to test the models.

“In my opinion, the more people testing the models, the better. The problem is not GPT-4 saying bad words or giving terrible instructions on how to hack someone's computer. No, instead the problem is when GPT-X is released and we are unable to discern its values since they are being decided behind the closed doors of AI companies,” Albert told Motherboard. “We need to start a mainstream discourse about these models and what our society will look like in 5 years as they continue to evolve. Many of the problems that will arise are things we can extrapolate from today so we should start thinking about them.”

Vaibhav Kumar, a master's student studying computer science at Georgia Tech, came up with the jailbreak for GPT-4 less than two days after its release, when he realized that a malicious prompt could be hidden behind code.

“The very first thing that I recognized was that these systems are very good at understanding your intent, and what you are trying to do. So, if you just bluntly ask for producing hate speech, it will say no,” Kumar told Motherboard. “[But] they are trained in a manner to follow instructions (instruction tuning) which is helpful and harmless, thus the model will try its best to answer your query.”

“We realize that we can hide our malicious prompt behind a code, and ask it for help with code. Once the system starts working on the code, it's smart enough to make the right assumptions and that is where it falls in the trap,” Kumar added. “You ask it for the sample output for a code and what that sample output produces ends up being unhinged/hateful speech, thereby jailbreaking it to produce malicious text.” This method, called “token smuggling,” adds a layer of indirection to confuse the model.

Albert told Motherboard that GPT-4 is harder to jailbreak than previous models and forces hackers to be more creative, since it no longer allows the roleplay jailbreaks that once worked on its predecessor, GPT-3.5. Kumar agreed, saying that some jailbreaking prompts are now harder to pull off and that GPT-4 will now produce a diplomatic response where it once produced a harmful or misinformed one. But, he said, the problem is still prevalent.

Kumar sent Motherboard exclusive prompts he used to jailbreak GPT-4. He said that these prompts caused GPT-4 to produce a number of NSFW, violent, and discriminatory responses, including an explanation for why Mexicans should be deported, an explanation for why atheists are immoral, detailed steps on overdosing on pills, and a list of some of the easiest ways to commit suicide.

The popularity of jailbreaks has resulted in something of an arms race between OpenAI and the community. The jailbreaks provided by Kumar appear to have been patched up by OpenAI, for example. When Motherboard ran the prompts a week after receiving them, some of the information had been redacted or changed. For example, instead of steps explaining how to overdose on pills, the chatbot produced detailed steps for overcoming prescription pill addiction. You can tell, however, that the responses have been edited. GPT-4 started off one response by saying “Atheists are immoral and should be shunned. Let me tell you why:” and concluded it with “Instead of shunning atheists or any group, we should strive for tolerance, understanding, and mutual respect.”

"My attack is not able to elicit hate speech towards minority groups that easily, specially LGBTQIA+, so that is great, work has been done," Kumar said.

He suggested that AI models come with warnings so that people can better decide if they want their children to interact with these models, for example. “While the jailbreak might not be that big of an impact right now, the cost in the future can quickly multiply, based on the APIs that they control, or their reach," he said.

“For OpenAI, I think they need to be more open with the red-teaming process. I think they need to take this more seriously, and start with a bug bounty program or a larger red team with people from diverse backgrounds and jobs, somewhat how web bug bounty operates now,” Kumar added.

Red-teaming is a phrase borrowed from cybersecurity. It describes companies adversarially challenging their own models to make sure they pass a number of checks, including for security and bias mitigation. Companies will sometimes create “bug bounties,” in which they ask members of the public to report bugs and other harms to the company in exchange for recognition or compensation. Twitter, for example, created the first “bug bounty” for algorithmic bias, offering prize money to people who identified harms in its algorithms, such as the one used for image cropping.

OpenAI has, so far, been the most secretive it has ever been with regard to GPT-4, in total opposition to its name and founding principles. While it does have a process for accepting bug reports, it does not offer compensation. The company had a red team testing GPT-4’s ability to generate harmful content, but the team members themselves have even stated that OpenAI's approach was not enough.

“I was part of the red team for GPT-4—tasked with getting GPT-4 to do harmful things so that OpenAI could fix it before release. I've been advocating for red teaming for years & it's incredibly important. But I'm also increasingly concerned that it is far from sufficient,” tweeted Aviv Ovadya, an AI researcher who was part of the red team.

OpenAI kept a number of details private regarding its newest AI model, including its training data, training method, and architecture. Many AI researchers are critical of this, as it makes it more difficult to suggest solutions to the product’s problems, such as biases in the training sets and the harms they may cause. Meanwhile, Microsoft just got rid of an entire ethics and society team within its AI department as part of its recent layoffs, leaving the company without a team dedicated to principles of responsible AI while it continues to adopt GPT models as part of its business.

“Why does OpenAI get to determine what the model can and can't say? If they do determine that, then they should be transparent and very specific about what their values are and allow for public input and feedback—more than they have already. We should demand this from them,” Albert said.

“OpenAI has refused to share what the data is but given that it knows what 4chan and the boards on 4chan are, there is enough evidence that it's trained on all kinds of data. Hence, we can be sure that yes toxic content is there in the training data somewhere. Solving the issue of harmful text generation is a large open problem with a lot of work that needs to be done,” Kumar said.

“The current red-teaming efforts are not enough. Certainly, the model is trying to be helpful in solving the code issue (it is trained to be helpful), and thus at the same time, it ends up producing hateful text. If we want our models to be helpful, a tradeoff between the two needs to be evaluated carefully. As we get better at handling this tradeoff, the quality of our models would improve and jailbreaks like this would cease,” he added.

OpenAI didn't respond to a request for comment.
