OpenAI’s new reasoning AI models hallucinate more
OpenAI’s recently launched o3 and o4-mini AI models are state-of-the-art in many respects. However, the new models still hallucinate, or make things up — in fact, they hallucinate more than several of OpenAI’s older models.

Hallucinations have proven to be one of the biggest and most difficult problems to solve in AI, impacting even today’s best-performing systems. Historically, each new model has improved slightly in the hallucination department, hallucinating less than its predecessor. But that doesn’t seem to be the case for o3 and o4-mini.

According to OpenAI’s internal tests, o3 and o4-mini, which are so-called reasoning models, hallucinate more often than the company’s previous reasoning models — o1, o1-mini, and o3-mini — as well as OpenAI’s traditional, “non-reasoning” models, such as GPT-4o.

Perhaps more concerning, the ChatGPT maker doesn’t really know why it’s happening.

In its technical report for o3 and o4-mini, OpenAI writes that “more research is needed” to understand why hallucinations are getting worse as it scales up reasoning models. O3 and o4-mini perform better in some areas, including tasks related to coding and math. But because they “make more claims overall,” they’re often led to make “more accurate claims as well as more inaccurate/hallucinated claims,” per the report.

OpenAI found that o3 hallucinated in response to 33% of questions on PersonQA, the company’s in-house benchmark for measuring the accuracy of a model’s knowledge about people. That’s roughly double the hallucination rate of OpenAI’s previous reasoning models, o1 and o3-mini, which scored 16% and 14.8%, respectively. O4-mini did even worse on PersonQA — hallucinating 48% of the time.

Third-party testing by Transluce, a nonprofit AI research lab, also found evidence that o3 has a tendency to make up actions it took in the process of arriving at answers. In one example, Transluce observed o3 claiming that it ran code on a 2021 MacBook Pro “outside of ChatGPT,” then copied the numbers into its answer. While o3 has access to some tools, it can’t do that.

“Our hypothesis is that the kind of reinforcement learning used for o-series models may amplify issues that are usually mitigated (but not fully erased) by standard post-training pipelines,” said Neil Chowdhury, a Transluce researcher and former OpenAI employee, in an email to TechCrunch.

Sarah Schwettmann, co-founder of Transluce, added that o3’s hallucination rate may make it less useful than it otherwise would be.

Kian Katanforoosh, a Stanford adjunct professor and CEO of the upskilling startup Workera, told TechCrunch that his team is already testing o3 in its coding workflows and has found it to be a step above the competition. However, Katanforoosh says that o3 tends to hallucinate broken website links — the model will supply a link that, when clicked, doesn't work.

Hallucinations may help models arrive at interesting ideas and be creative in their “thinking,” but they also make some models a tough sell for businesses in markets where accuracy is paramount. For example, a law firm likely wouldn’t be pleased with a model that inserts lots of factual errors into client contracts.

One promising approach to boosting the accuracy of models is giving them web search capabilities. OpenAI’s GPT-4o with web search achieves 90% accuracy on SimpleQA, another one of OpenAI’s accuracy benchmarks. Potentially, search could improve reasoning models’ hallucination rates, as well — at least in cases where users are willing to expose prompts to a third-party search provider.

If scaling up reasoning models indeed continues to worsen hallucinations, it’ll make the hunt for a solution all the more urgent.

“Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability,” said OpenAI spokesperson Niko Felix in an email to TechCrunch.

In the last year, the broader AI industry has pivoted to focus on reasoning models after techniques to improve traditional AI models started showing diminishing returns. Reasoning improves model performance on a variety of tasks without requiring massive amounts of computing and data during training. Yet it seems reasoning also may lead to more hallucinating — presenting a challenge.


