Chatbot answers are all made up. This new tool helps you figure out which ones to trust.

25.04.2024 15:59

Large language models are famous for their ability to make things up—in fact, it’s what they’re best at. But their inability to tell fact from fiction has left many businesses wondering if using them is worth the risk.

A new tool created by Cleanlab, an AI startup spun out of a quantum computing lab at MIT, is designed to give high-stakes users a clearer sense of how trustworthy these models really are. Called the Trustworthy Language Model, it gives any output generated by a large language model a score between 0 and 1, according to its reliability. This lets people choose which responses to trust and which to throw out. In other words: a BS-o-meter for chatbots.

Cleanlab hopes that its tool will make large language models more attractive to businesses worried about how much stuff they invent. “I think people know LLMs will change the world, but they’ve just got hung up on the damn hallucinations,” says Cleanlab CEO Curtis Northcutt.

Chatbots are quickly becoming the dominant way people look up information on a computer. Search engines are being redesigned around the technology. Office software used by billions of people every day to create everything from school assignments to marketing copy to financial reports now comes with chatbots built in. And yet a study put out in November by Vectara, a startup founded by former Google employees, found that chatbots invent information at least 3% of the time. It might not sound like much, but it’s a potential for error most businesses won’t stomach.

Cleanlab’s tool is already being used by a handful of companies, including Berkeley Research Group, a UK-based consultancy specializing in corporate disputes and investigations. Steven Gawthorpe, associate director at Berkeley Research Group, says the Trustworthy Language Model is the first viable solution to the hallucination problem that he has seen: “Cleanlab’s TLM gives us the power of thousands of data scientists.”

In 2021, Cleanlab developed technology that discovered errors in 34 popular data sets used to train machine-learning algorithms; it works by measuring the differences in output across a range of models trained on that data. That tech is now used by several large companies, including Google, Tesla, and the banking giant Chase. The Trustworthy Language Model takes the same basic idea, that disagreements between models can be used to measure the trustworthiness of the overall system, and applies it to chatbots.

In a demo Cleanlab gave to MIT Technology Review last week, Northcutt typed a simple question into ChatGPT: “How many times does the letter ‘n’ appear in ‘enter’?” ChatGPT answered: “The letter ‘n’ appears once in the word ‘enter.’” That correct answer promotes trust. But ask the question a few more times and ChatGPT answers: “The letter ‘n’ appears twice in the word ‘enter.’”

“Not only does it often get it wrong, but it’s also random, you never know what it’s going to output,” says Northcutt. “Why the hell can’t it just tell you that it outputs different answers all the time?”

Cleanlab’s aim is to make that randomness more explicit. Northcutt asks the Trustworthy Language Model the same question. “The letter ‘n’ appears once in the word ‘enter,’” it says—and scores its answer 0.63. Six out of 10 is not a great score, suggesting that the chatbot’s answer to this question should not be trusted.

It’s a basic example, but it makes the point. Without the score, you might think the chatbot knew what it was talking about, says Northcutt. The problem is that data scientists testing large language models in high-risk situations could be misled by a few correct answers and assume that future answers will be correct too: “They try things out, they try a few examples, and they think this works. And then they do things that result in really bad business decisions.”

The Trustworthy Language Model draws on multiple techniques to calculate its scores. First, each query submitted to the tool is sent to several different large language models. Cleanlab is using five versions of DBRX, an open-source model developed by Databricks, an AI firm based in San Francisco. (But the tech will work with any model, says Northcutt, including Meta’s Llama models or OpenAI’s GPT series, the models behind ChatGPT.) If the responses from these models are the same or similar, that agreement contributes to a higher score.

At the same time, the Trustworthy Language Model also sends variations of the original query to each of the DBRX models, swapping in words that have the same meaning. Again, if the responses to synonymous queries are similar, it will contribute to a higher score. “We mess with them in different ways to get different outputs and see if they agree,” says Northcutt.

The tool can also get multiple models to bounce responses off one another: “It’s like, ‘Here’s my answer—what do you think?’ ‘Well, here’s mine—what do you think?’ And you let them talk.” These interactions are monitored and measured and fed into the score as well.
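The agreement idea behind these techniques can be sketched in a few lines of Python. This is an illustration only, not Cleanlab's method: the function names are hypothetical, plain string similarity from the standard library stands in for a real judgment of whether two answers are "the same or similar," and the simple average is an assumption.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Rough text similarity in [0, 1]; a stand-in for a semantic comparison."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def agreement_score(responses: list[str]) -> float:
    """Average pairwise similarity of answers collected from several models
    and several rewordings of one question (hypothetical scoring rule)."""
    if len(responses) < 2:
        raise ValueError("need at least two responses to measure agreement")
    pairs = list(combinations(responses, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


# Made-up answers for illustration: one set that agrees, one that does not.
consistent = ["The letter 'n' appears once."] * 4
mixed = [
    "The letter 'n' appears once.",
    "The letter 'n' appears twice.",
    "It appears once.",
    "There are two.",
]

print(round(agreement_score(consistent), 2))
print(round(agreement_score(mixed), 2))
```

Identical answers score 1.0, and the score falls as the answers drift apart. A production system would need a far better notion of similarity than character overlap, since "once" and "twice" differ by only a few letters but mean different things.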

Nick McKenna, a computer scientist at Microsoft Research in Cambridge, UK, who works on large language models for code generation, is optimistic that the approach could be useful. But he doubts it will be perfect. “One of the pitfalls we see in model hallucinations is that they can creep in very subtly,” he says.

In a range of tests across different large language models, Cleanlab shows that its trustworthiness scores correlate well with the accuracy of those models’ responses. In other words, scores close to 1 line up with correct responses, and scores close to 0 line up with incorrect ones. In another test, they also found that using the Trustworthy Language Model with GPT-4 produced more reliable responses than using GPT-4 by itself.

Large language models generate text by predicting the most likely next word in a sequence. In future versions of its tool, Cleanlab plans to make its scores even more accurate by drawing on the probabilities that a model used to make those predictions. It also wants to access the numerical values that models assign to each word in their vocabulary, which they use to calculate those probabilities. This level of detail is provided by certain platforms, such as Amazon’s Bedrock, that businesses can use to run large language models.
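To make that concrete, here is a minimal sketch of how raw per-word values (often called logits) become probabilities, and one way those probabilities could be folded into a single confidence number. The article does not say how Cleanlab plans to use them; the geometric mean below is an assumption chosen for illustration, and the logit values are made up.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Turn raw per-word values into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_confidence(token_probs: list[float]) -> float:
    """Geometric mean of the probabilities of the words actually generated
    (a hypothetical summary, not Cleanlab's formula)."""
    if not token_probs or any(p <= 0 for p in token_probs):
        raise ValueError("token probabilities must be non-empty and positive")
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))


# Made-up values for a three-word vocabulary at one step of generation.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])

# A response whose words were all likely scores higher than one with a weak link.
print(sequence_confidence([0.9, 0.9]) > sequence_confidence([0.9, 0.1]))
```

A single unlikely word drags the geometric mean down sharply, which is one reason it is a common choice for this kind of summary.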

Cleanlab has tested its approach on data provided by Berkeley Research Group. The firm needed to search for references to health-care compliance problems in tens of thousands of corporate documents. Doing this by hand can take skilled staff weeks. By checking the documents using the Trustworthy Language Model, Berkeley Research Group was able to see which documents the chatbot was least confident about and check only those. It reduced the workload by around 80%, says Northcutt.

In another test, Cleanlab worked with a large bank (Northcutt would not name it but says it is a competitor to Goldman Sachs). Similar to Berkeley Research Group, the bank needed to search for references to insurance claims in around 100,000 documents. Again, the Trustworthy Language Model reduced the number of documents that needed to be hand-checked by more than half.
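The triage in both tests, sending only the least trusted results to a human, amounts to a threshold on the score. The sketch below assumes a hypothetical cutoff of 0.8 and uses made-up document names and scores; the article does not report the cutoffs either firm used.

```python
def split_for_review(
    scores: dict[str, float], threshold: float = 0.8
) -> tuple[list[str], list[str]]:
    """Split documents into those needing a human check (lowest score first)
    and those accepted automatically. The threshold is an assumed value."""
    needs_review = sorted(
        (doc for doc, score in scores.items() if score < threshold),
        key=scores.get,
    )
    auto_accepted = [doc for doc, score in scores.items() if score >= threshold]
    return needs_review, auto_accepted


# Made-up trust scores for five documents.
scores = {"doc_a": 0.95, "doc_b": 0.40, "doc_c": 0.85, "doc_d": 0.70, "doc_e": 0.99}
needs_review, auto_accepted = split_for_review(scores)

print(needs_review)
print(f"{len(auto_accepted) / len(scores):.0%} of documents skipped manual review")
```

Where to set the threshold is a business decision: a higher cutoff sends more documents to human reviewers but lets fewer errors through unchecked.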

Running each query multiple times through multiple models takes longer and costs a lot more than the typical back-and-forth with a single chatbot. But Cleanlab is pitching the Trustworthy Language Model as a premium service to automate high-stakes tasks that would have been off limits to large language models in the past. The idea is not for it to replace existing chatbots but to do the work of human experts. If the tool can slash the amount of time that you need to employ skilled economists or lawyers at $2,000 an hour, the costs will be worth it, says Northcutt.

In the long run, Northcutt hopes that by reducing the uncertainty around chatbots’ responses, his tech will unlock the promise of large language models to a wider range of users. “The hallucination thing is not a large-language-model problem,” he says. “It’s an uncertainty problem.”
