bnew

Veteran
Joined
Nov 1, 2015
Messages
51,796
Reputation
7,926
Daps
148,647
WILL KNIGHT | BUSINESS | MAY 2, 2024 12:00 PM

Nick Bostrom Made the World Fear AI. Now He Asks: What if It Fixes Everything?​

Philosopher Nick Bostrom popularized the idea that superintelligent AI could erase humanity. His new book imagines a world in which algorithms have solved every problem.

Nick Bostrom

PHOTOGRAPH: THE WASHINGTON POST/GETTY IMAGES

Philosopher Nick Bostrom is surprisingly cheerful for someone who has spent so much time worrying about ways that humanity might destroy itself. In photographs he often looks deadly serious, perhaps appropriately haunted by the existential dangers roaming around his brain. When we talk over Zoom, he looks relaxed and is smiling.


This is an edition of WIRED's Fast Forward newsletter, a weekly dispatch from the future by Will Knight, exploring AI advances and other technology set to change our lives.

Bostrom has made it his life’s work to ponder far-off technological advancement and existential risks to humanity. With the publication of his last book, Superintelligence: Paths, Dangers, Strategies, in 2014, Bostrom drew public attention to what was then a fringe idea—that AI would advance to a point where it might turn against and delete humanity.

To many in and outside of AI research the idea seemed fanciful, but influential figures including Elon Musk cited Bostrom’s writing. The book set a strand of apocalyptic worry about AI smoldering that recently flared up following the arrival of ChatGPT. Concern about AI risk is not just mainstream but also a theme within government AI policy circles.

Bostrom’s new book takes a very different tack. Rather than play the doomy hits, Deep Utopia: Life and Meaning in a Solved World considers a future in which humanity has successfully developed superintelligent machines but averted disaster. All disease has been ended and humans can live indefinitely in infinite abundance. Bostrom’s book examines what meaning there would be in life inside a techno-utopia, and asks if it might be rather hollow. He spoke with WIRED over Zoom, in a conversation that has been lightly edited for length and clarity.

Will Knight: Why switch from writing about superintelligent AI threatening humanity to considering a future in which it’s used to do good?

Nick Bostrom:
The various things that could go wrong with the development of AI are now receiving a lot more attention. It's a big shift in the last 10 years. Now all the leading frontier AI labs have research groups trying to develop scalable alignment methods. And in the last couple of years also, we see political leaders starting to pay attention to AI.

There hasn't yet been a commensurate increase in depth and sophistication in terms of thinking of where things go if we don't fall into one of these pits. Thinking has been quite superficial on the topic.

When you wrote Superintelligence, few would have expected existential AI risks to become a mainstream debate so quickly. Will we need to worry about the problems in your new book sooner than people might think?

As we start to see automation roll out, assuming progress continues, then I think these conversations will start to happen and eventually deepen.

Social companion applications will become increasingly prominent. People will have all sorts of different views and it’s a great place to maybe have a little culture war. It could be great for people who couldn't find fulfillment in ordinary life but what if there is a segment of the population that takes pleasure in being abusive to them?

In the political and information spheres we could see the use of AI in political campaigns, marketing, automated propaganda systems. But if we have a sufficient level of wisdom these things could really amplify our ability to sort of be constructive democratic citizens, with individual advice explaining what policy proposals mean for you. There will be a whole bunch of dynamics for society.

Would a future in which AI has solved many problems, like climate change, disease, and the need to work, really be so bad?

Ultimately, I'm optimistic about what the outcome could be if things go well. But that’s on the other side of a bunch of fairly deep reconsiderations of what human life could be and what has value. We could have this superintelligence and it could do everything: Then there are a lot of things that we no longer need to do and it undermines a lot of what we currently think is the sort of be all and end all of human existence. Maybe there will also be digital minds as well that are part of this future.

Coexisting with digital minds would itself be quite a big shift. Will we need to think carefully about how we treat these entities?

My view is that sentience, or the ability to suffer, would be a sufficient condition, but not a necessary condition, for an AI system to have moral status.

There might also be AI systems that even if they're not conscious we still give various degrees of moral status. A sophisticated reasoner with a conception of self as existing through time, stable preferences, maybe life goals and aspirations that it wants to achieve, and maybe it can form reciprocal relationships with humans—if that were such a system I think that plausibly there would be ways of treating it that would be wrong.

COURTESY OF IDEAPRESS

What if we didn’t allow AI to become more willful and develop some sense of self. Might that not be safer?

There are very strong drivers for advancing AI at this point. The economic benefits are massive and will become increasingly evident. Then obviously there are scientific advances, new drugs, clean energy sources, et cetera. And on top of that, I think it will become an increasingly important factor in national security, where there will be military incentives to drive this technology forward.

I think it would be desirable that whoever is at the forefront of developing the next generation AI systems, particularly the truly transformative superintelligent systems, would have the ability to pause during key stages. That would be useful for safety.

I would be much more skeptical of proposals that seemed to create a risk of this turning into AI being permanently banned. It seems much less probable than the alternative, but more probable than it would have seemed two years ago. Ultimately it would be an immense tragedy if this was never developed, that we were just kind of confined to being apes in need and poverty and disease. Like, are we going to do this for a million years?

Turning back to existential AI risk for a moment, are you generally happy with efforts to deal with that?

Well, the conversation is kind of all over the place. There are also a bunch of more immediate issues that deserve attention—discrimination and privacy and intellectual property et cetera.

Companies interested in the longer term consequences of what they're doing have been investing in AI safety and in trying to engage policymakers. I think that the bar will need to sort of be raised incrementally as we move forward.

In contrast to so-called AI doomers there are some who advocate worrying less and accelerating more. What do you make of that movement?

People sort of divide themselves up into different tribes that can then fight pitched battles. To me it seems clear that it’s just very complex and hard to figure out what actually makes things better or worse in particular dimensions.

I've spent three decades thinking quite hard about these things and I have a few views about specific things but the overall message is that I still feel very in the dark. Maybe these other people have found some shortcuts to bright insights.

Perhaps they’re also reacting to what they see as knee-jerk negativity about technology?

That’s also true. If something goes too far in another direction it naturally creates this. My hope is that although there are a lot of maybe individually irrational people taking strong and confident stances in opposite directions, somehow it balances out into some global sanity.

I think there's like a big frustration building up. Maybe as a corrective they have a point, but I think ultimately there needs to be a kind of synthesis.

Since 2005 you have worked at Oxford University’s Future of Humanity Institute, which you founded. Last month it announced it was closing down after friction with the university’s bureaucracy. What happened?

It's been several years in the making, a kind of struggle with the local bureaucracy. A hiring freeze, a fundraising freeze, just a bunch of impositions, and it became impossible to operate the institute as a dynamic, interdisciplinary research institute. We were always a little bit of a misfit in the philosophy faculty, to be honest.

What’s next for you?

I feel an immense sense of emancipation, having had my fill for a period of time perhaps of dealing with faculties. I want to spend some time I think just kind of looking around and thinking about things without a very well-defined agenda. The idea of being a free man seems quite appealing.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,796
Reputation
7,926
Daps
148,647

X launches Stories, delivering news summarized by Grok AI​

Sarah Perez @sarahpereztc / 4:14 PM EDT•May 3, 2024


Image: a pattern of the X (formerly Twitter) logo on a cracked wall. Credits: TechCrunch

X, formerly Twitter, is now using Elon Musk’s AI chatbot Grok to power a feature that summarizes the personalized trending stories in the app’s Explore section. According to an announcement and screenshots posted by the X Engineering team on Friday, X’s Premium subscribers will be able to read a summary of posts on X associated with each trending story featured on the For You tab in Explore.



The For You page showcases the news and stories being shared across X’s platform that are popular within your network, along with other suggested items. It’s among the first stops for X users who want to catch up with what’s being said on the platform, without having to spend long amounts of time scrolling their timeline.

For instance, a TechCrunch reader’s For You page today may feature stories about Apple’s coming iPad event, Microsoft’s security overhaul, and burnout among AI engineers. As you tap into each story to view the associated X posts, a summary of the story will now appear at the top of the page, offering an overview of the subject matter.

In the case of the AI burnout story, for example, the Grok-powered summary begins: “AI engineers are facing burnout and rushed rollouts due to the competitive race in the tech industry, as companies prioritize investor satisfaction over solving actual problems.” After briefly touching on the problem of the AI “rat race,” the story concludes by saying that “critics argue that proper safeguards and thoughtful innovation should not be afterthoughts in the pursuit of AI investments …”

Humorously, a message appears below that summary, warning: “Grok can make mistakes, verify its outputs.”

The idea of summarizing trends is not a new one, but it is new in terms of how the summaries are being handled. Under its prior leadership, Twitter began adding headlines and descriptions to its trends in 2020, though not with the help of an AI bot. Instead, Twitter itself would annotate some of its daily trends with extra information and pin a representative tweet to provide further context. However, Twitter’s rollout was haphazard, with some trends getting written up and others not.

With Grok’s Stories, as the summaries are called, all the top news on the For You page is summarized.

Access to xAI’s chatbot Grok is meant to be a selling point to push users to buy premium subscriptions. With the Premium and top-tier Premium+ plans, users can access Grok by tapping on the bottom middle button of the app. A snarky and “rebellious” AI, Grok’s differentiator from other AI chatbots like ChatGPT is its exclusive and real-time access to X data.

A post published to X on Friday by tech journalist Alex Kantrowitz lays out Elon Musk’s further plan for AI-powered news on X, based on an email conversation with the X owner.

Kantrowitz says that conversations on X will make up the core of Grok’s summaries. Grok won’t look at the article text, in other words, even if that’s what people are discussing on the platform. That could be a problem in terms of painting a true picture of the news being shared, as what people are chattering about on X may be their reactions or opinions, not the news itself. Kantrowitz calls the move “controversial” but admits there’s opportunity there.



Journalists are already having to contend with AI news summaries in other areas as well, including from startups. For example, Arc’s new web browser includes an AI summary feature and former Twitter engineers are building an AI news summary service called Particle. How this will play out in terms of traffic to the news sites themselves remains to be seen. Kantrowitz believes that users may be interested in going “deeper into the source material once their curiosity is piqued,” he writes. But it’s also likely that at least some news sites will go out of business as page views drop due to AI summaries, leaving fewer sources for AI bots like Grok to summarize in the long run.

For that reason, some news publishers are doing deals with AI providers like OpenAI’s recently announced partnership with the FT. Others, such as Axel Springer, the AP, and Le Monde, have also announced similar moves. In X’s case, it’s able to get at the news by way of the conversation around it — and without having to partner to access the news content itself. That’s both clever and worrisome, the latter from a misinformation standpoint.

Grok’s Stories are rolling out to Premium X subscribers now. Access to Premium starts at $8 per month, if paying on the web and not through the app stores.

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,796
Reputation
7,926
Daps
148,647
AI in practice

May 2, 2024

OpenAI CEO Sam Altman says GPT-4 is the dumbest AI model you'll ever have to use again​

YouTube Screenshot via Stanford eCorner



Matthias Bastian
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.

According to OpenAI CEO Sam Altman, GPT-4 is by far the dumbest AI model humans will ever have to use, compared to what's coming in the future.

During a recent appearance at Stanford University, Altman said that OpenAI's current AI models still have significant room for improvement: "ChatGPT is like mildly embarrassing at best. GPT-4 is by far the dumbest model any of you will ever, ever have to use again."

The CEO believes that there will be much more powerful AI systems in the coming years, saying with a high degree of scientific certainty that humanity will have more advanced models every year.

"GPT-5 is going to be a lot smarter than GPT-4, GPT-6 is going to be a lot smarter than GPT-5, and we are not near the top of this curve," Altman said.

Developing such systems is expensive, but that doesn't worry Altman. "Whether we burn 500 million a year or 5 billion or 50 billion a year, I don't care. I genuinely don't, as long as we can, I think, stay on a trajectory where eventually we create way more value for society than that, and as long as we can figure out a way to pay the bills. We're making AGI, it's going to be expensive, it's totally worth it," he said.


Agents as the next evolution of AI​

While Altman didn't provide a timeline for the development of artificial general intelligence (AGI), he told MIT Technology Review that he believes there will be several versions of AGI that are more or less suitable for certain tasks.

Altman sees intelligent agents as the killer application for future AI systems. These "super-competent colleagues" would know everything about a person's life, including emails and conversations, and could perform certain tasks on the fly, suggest solutions to complex problems, and ask questions when needed.

In the future, Altman believes that AI will not only generate better text, images, and video, but will also be able to perform real-world tasks, further integrating systems into people's daily lives.

According to Altman, this doesn't necessarily require new hardware, as the AI assistant could exist in the cloud - though many users would likely prefer a new device for it.

Altman is reportedly working with iPhone designer Jony Ive on new AI hardware, and OpenAI is said to be developing two agent systems that will automate entire work processes.

GPT-5 is reportedly in development and could be released as early as mid-year. It is expected to be significantly better than its predecessor, GPT-4. It is rumored that GPT-5 will support video generation in addition to text and images. If OpenAI follows the DALL-E approach with its AI video generator Sora, video generation could be integrated into ChatGPT.

Summary

  • OpenAI CEO Sam Altman expects much more powerful AI models than GPT-4 in the future. In his opinion, GPT-4 is "by far the dumbest model" compared to what is yet to come.
  • Altman sees intelligent agents as the killer application for future AI systems. They will act as "super-competent colleagues" who know everything about a person's life and can perform specific tasks or suggest solutions to more complex problems.
  • GPT-5 is already under development and could be released as early as the middle of this year. It is said to be much better than GPT-4. Rumor has it that GPT-5 will support video as well as text and images.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,796
Reputation
7,926
Daps
148,647

🏆 OCRBench Leaderboard

| GitHub | Paper |

OCRBench is a comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models. It comprises five components: Text Recognition, SceneText-Centric VQA, Document-Oriented VQA, Key Information Extraction, and Handwritten Mathematical Expression Recognition. The benchmark includes 1000 question-answer pairs, and all the answers undergo manual verification and correction to ensure a more precise evaluation.

| Rank | Name | Language Model | Open Source | Text Recognition | Scene Text-Centric VQA | Doc-Oriented VQA | KIE | HMER | Final Score |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen-VL-Max | - | No | 254 | 166 | 148 | 143 | 12 | 723 |
| 2 | Qwen-VL-Plus | - | No | 248 | 155 | 141 | 141 | 9 | 694 |
| 3 | Gemini | - | No | 215 | 174 | 128 | 134 | 8 | 659 |
| 4 | GPT4V | - | No | 167 | 163 | 146 | 160 | 9 | 645 |
| 5 | MiniCPM-V-2 | MiniCPM-2.4B | Yes | 245 | 171 | 103 | 86 | 0 | 605 |
| 6 | mPLUG-DocOwl1.5 | LLaMA-2 7B | Yes | 182 | 157 | 126 | 134 | 0 | 599 |
| 7 | TextMonkey | Qwen-7B | Yes | 169 | 164 | 115 | 113 | 0 | 561 |
| 8 | InternVL-Chat-Chinese | LLaMA2-13B | Yes | 228 | 153 | 72 | 64 | 0 | 517 |
| 9 | Monkey | Qwen-7B | Yes | 174 | 161 | 91 | 88 | 0 | 514 |
| 10 | InternLM-XComposer2 | InternLM2-7B | Yes | 160 | 160 | 103 | 87 | 1 | 511 |
| 11 | QwenVL | Qwen-7B | Yes | 179 | 157 | 95 | 75 | 0 | 506 |
| 12 | mPLUG-Owl2 | LLaMA2-7B | Yes | 153 | 153 | 41 | 19 | 0 | 366 |
| 13 | LLaVAR | LLaMA-13B | Yes | 186 | 122 | 25 | 13 | 0 | 346 |
| 14 | LLaVA1.5-13B | Vicuna-v1.5-13B | Yes | 176 | 129 | 19 | 7 | 0 | 331 |
| 15 | InternLM-XComposer | InternLM-7B | Yes | 192 | 91 | 14 | 6 | 0 | 303 |
| 16 | LLaVA1.5-7B | Vicuna-v1.5-7B | Yes | 160 | 117 | 15 | 5 | 0 | 297 |
| 17 | mPLUG-Owl | LLaMA-2 7B | Yes | 172 | 104 | 18 | 3 | 0 | 297 |
| 18 | BLIVA | Vicuna-7B | Yes | 165 | 103 | 22 | 1 | 0 | 291 |
| 19 | InstructBLIP | Vicuna-7b | Yes | 168 | 93 | 14 | 1 | 0 | 276 |
| 20 | BLIP2-6.7B | OPT-6.7B | Yes | 154 | 71 | 10 | 0 | 0 | 235 |
| 21 | MiniGPT4V2 | LLaMA2-13B | Yes | 124 | 29 | 4 | 0 | 0 | 157 |
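Reading the table: the final score is simply the sum of the five component columns (the benchmark has 1,000 questions in total). A minimal sketch of that aggregation is below; the dict layout is an assumption for illustration, not the official OCRBench results format.

```python
# Minimal sketch: aggregate OCRBench-style component scores into a final score.
# The dict layout is an assumption for illustration, not the official format.

COMPONENTS = [
    "text_recognition",
    "scene_text_centric_vqa",
    "doc_oriented_vqa",
    "kie",
    "hmer",
]

def final_score(component_scores: dict) -> int:
    """Final score is the sum of the five component scores (max 1000)."""
    return sum(component_scores[c] for c in COMPONENTS)

# Component scores for Qwen-VL-Max, taken from the leaderboard above.
qwen_vl_max = {
    "text_recognition": 254,
    "scene_text_centric_vqa": 166,
    "doc_oriented_vqa": 148,
    "kie": 143,
    "hmer": 12,
}
print(final_score(qwen_vl_max))  # 723, matching the leaderboard entry
```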
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,796
Reputation
7,926
Daps
148,647

AI in practice

May 1, 2024


Reddit users compile list of words and phrases that unmask ChatGPT's writing style​

Midjourney prompted by THE DECODER


Matthias Bastian

Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.



As AI-generated texts become more prevalent on the internet, social media, and in emails, often without any labeling, Reddit users are sharing the telltale signs they use to identify text generated by ChatGPT.

In a thread started by user PowerfulDev (https://www.reddit.com/r/OpenAI/comments/1cdo36l/whats_your_personal_tell_word_to_identify/) that has garnered over 300 comments, users discussed the words and phrases that can be used to better identify ChatGPT-generated content. According to Reddit users, ChatGPT tends to use certain words disproportionately, such as:

  • Delve
  • Tapestry
  • Kaleidoscope
  • Foster
  • Nuanced
  • Crucial
  • Essential
  • Furthermore
  • Moreover


Many users agree that ChatGPT tends to draw conclusions too often, even when they are unnecessary. "In conclusion, I think this is a very helpful way to identify ChatGPT content," writes user MrSnowden.

Slightly more stylized language and phrases such as "in this digital world" or "let's dive deeper" are also considered indicators of ChatGPT text.

Commenters also describe ChatGPT as producing overly intellectual passages with words like "intricate," "nuanced," "complex," or "multifaceted" for certain topics.

Frequent use of hyphens in compound adjectives, even when not grammatically necessary, is also a potential ChatGPT telltale. Freylaverse points to em-dashes with no space on either side, which are not as easy to type on a keyboard as hyphens. ChatGPT uses them correctly, while lazy humans usually just use hyphens.

Sentences and paragraphs of uniform length and an overall formal style are also characteristic of ChatGPT. In emails, phrases like "I hope this email finds you well" or the excessive use of "moreover" or "furthermore" are seen as red flags.

AI detectives have an easy time with texts in which ChatGPT writes: "As an AI language model ...". Some people overlook this disclaimer and publish the text anyway.

However, some Reddit users caution against taking individual words as clear evidence of ChatGPT. After all, people might use those words - otherwise they wouldn't be in the training data. Only in combination and in large numbers are they indicative of AI-generated text.
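As a toy illustration of that "in combination and in large numbers" caveat, the flagged words can be counted per 1,000 words of text. This is a rough heuristic sketch, not a reliable detector.

```python
import re
from collections import Counter

# Words flagged by Reddit users as ChatGPT "tells" (see the list above).
MARKERS = {
    "delve", "tapestry", "kaleidoscope", "foster", "nuanced",
    "crucial", "essential", "furthermore", "moreover",
}

def marker_rate(text: str) -> float:
    """Marker hits per 1,000 words: a rough signal, not proof of AI authorship."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    return 1000 * sum(counts[m] for m in MARKERS) / len(words)

sample = "Furthermore, this nuanced tapestry is crucial, so let's delve deeper."
print(f"{marker_rate(sample):.0f} marker words per 1,000 words")
```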

Anecdotal analysis and the uncanny valley of communication​

An analysis by Reddit user peyotebonsai, carried out as part of a research project, shows that LinkedIn posts written by AI score slightly more positively on average in sentiment analysis with tools like TextBlob and VADER than posts written by human authors.

TextBlob and Vader are programs that can assess sentiment and emotion in text. The results suggest that AI-generated text tends to sound more positive than human-generated text because of word choice, Peyotebonsai says.
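For reference, this is roughly what such a comparison looks like with the two real libraries (`pip install textblob vaderSentiment`); the example posts are invented for illustration.

```python
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

_vader = SentimentIntensityAnalyzer()

def sentiment(text: str) -> dict:
    """Score a post with both tools; values closer to 1.0 mean a more positive tone."""
    return {
        "textblob_polarity": TextBlob(text).sentiment.polarity,      # range -1..1
        "vader_compound": _vader.polarity_scores(text)["compound"],  # range -1..1
    }

human_post = "Shipped the feature two weeks late, but we got there. Lessons learned."
ai_style_post = ("Thrilled to share this incredible milestone! Our team's dedication "
                 "continues to foster innovation and growth.")

print("human:   ", sentiment(human_post))
print("ai-style:", sentiment(ai_style_post))
```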

Interestingly, however, a comparison of other indicators, such as reposts, comments and likes, showed that AI-generated content received significantly less engagement from users than human-generated posts.

This suggests that LinkedIn users are well aware of the subtle but noticeable differences between human and machine expression.

The analysis should be taken with a grain of salt, of course, since peyotebonsai posts anonymously and does not publish the results in detail. But it is consistent with my anecdotal experience, for what it is worth.

Or as user opi098514 puts it: "For me, it’s not a word. It’s kind of the uncanny valley…. But with communication."

In case you never heard of this effect: The Uncanny Valley originally describes the phenomenon that robots or human-like figures are often perceived as uncanny because they look human-like, but not human-like enough. This creates a feeling of alienation.

When it comes to communication, the analogy suggests that the subtle but noticeable differences in the way AI and humans express themselves can create a subliminal feeling of discomfort in the reader, even though AI texts may appear more coherent and positive on the surface.


  • Reddit users discuss words and phrases that are typical of ChatGPT-generated text, including "delve," "tapestry," "kaleidoscope," "foster," "nuanced," "crucial," "essential," "moreover," and "furthermore."
  • Other indicators include stilted language, overly intellectual passages, many hyphens in compound adjectives, similarly long sentences and paragraphs, and a very formal style.
  • One user describes the feeling he gets when reading AI texts: They are like the uncanny valley of communication. Somehow real, but not.

Sources
Reddit




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,796
Reputation
7,926
Daps
148,647



AI in practice

May 6, 2024

Massive prompts can outperform fine-tuning for LLMs, researchers find​

Midjourney prompted by THE DECODER


Matthias Bastian

Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.


Researchers have found that giving large language models (LLMs) many examples directly in the prompt can be more effective than time-consuming fine-tuning, according to a study from Carnegie Mellon and Tel Aviv University.

This "in-context learning" (ICL) approach becomes more effective as the context window of LLMs grows, allowing for hundreds or thousands of examples in prompts, especially for tasks with many possible answers.

One method for selecting examples for ICL is "retrieval," where an algorithm (BM25) chooses the most relevant examples from a large dataset for each new question. This improves performance compared to random selection, particularly when using fewer examples.
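A sketch of that retrieval step using the `rank_bm25` package (`pip install rank-bm25`) follows; the toy data, labels, and prompt format are assumptions for illustration, not the paper's setup.

```python
from rank_bm25 import BM25Okapi

# Labeled pool to draw demonstrations from (toy data for illustration).
pool = [
    {"text": "The battery dies within an hour.", "label": "negative"},
    {"text": "Setup was quick and painless.",    "label": "positive"},
    {"text": "Screen cracked after one drop.",   "label": "negative"},
    {"text": "Great value for the price.",       "label": "positive"},
]

bm25 = BM25Okapi([ex["text"].lower().split() for ex in pool])

def build_prompt(query: str, k: int = 2) -> str:
    """Pick the k most similar examples with BM25 and pack them into the prompt."""
    top = bm25.get_top_n(query.lower().split(), pool, n=k)
    demos = "\n".join(f"Review: {ex['text']}\nLabel: {ex['label']}" for ex in top)
    return f"{demos}\nReview: {query}\nLabel:"

print(build_prompt("The charger stopped working after a week."))
```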

However, the performance gain from retrieval diminishes with large numbers of examples, suggesting that longer prompts become more robust and individual examples or their order become less important.

While fine-tuning usually requires more data than ICL, it can sometimes outperform ICL with very long contexts. In some cases, ICL with long examples can be more effective and efficient than fine-tuning, even though ICL does not actually learn tasks but solves them using the examples, the researchers noted.



Fine-tuning sometimes, but not always, exceeds ICL at high numbers of demonstrations. | Image: Bertsch et al.



The experiments used special variants of the Llama-2-7B and Mistral-7B language models, which can process particularly long input text. The results suggest that ICL with many examples can be a viable alternative to retrieval and fine-tuning, especially as future models improve at handling extremely long input texts.

Ultimately, the choice between ICL and fine-tuning comes down to cost. Fine-tuning has a higher one-time cost, while ICL requires more compute at inference time due to the many examples in the prompt. In some cases, it may be best to use many-shot prompts until you get a robust, reliable, high-quality result, and then use that data for fine-tuning.

While finetuning with full datasets is still a powerful option if the data vastly exceeds the context length, our results suggest that long-context ICL is an effective alternative - trading finetuning-time cost for increased inference-time compute. As the effectiveness and efficiency of using very long model context lengths continues to increase, we believe long-context ICL will be a powerful tool for many tasks.

From the paper

The study confirms the results of a recent Google Deepmind study on many-shot prompts, which also showed that using hundreds to thousands of examples can significantly improve LLM results.

  • Researchers at Carnegie Mellon and Tel Aviv University have discovered that the results of large language models (LLMs) improve the more examples you give them directly in the input (prompt) as context. This method, called "In-Context Learning" (ICL), could be an alternative to time-consuming fine-tuning.
  • In ICL with a large number of examples in the prompt, the performance of the language models increases further, especially for tasks with many possible answers. Retrieval methods for selecting relevant examples further improve the results. Finetuning requires more data than ICL, but can provide even better results in some cases.
  • The researchers believe that ICL with long contexts will be a powerful tool for many tasks as language models get better at handling extremely long texts. Ultimately, it is also a question of cost whether ICL or fine-tuning is used. The study confirms earlier results from Google Deepmind on many-shot prompts.
Sources

Arxiv




Computer Science > Computation and Language​

[Submitted on 30 Apr 2024]

In-Context Learning with Long-Context Models: An In-Depth Exploration​

Amanda Bertsch, Maor Ivgi, Uri Alon, Jonathan Berant, Matthew R. Gormley, Graham Neubig
As model context lengths continue to increase, the number of demonstrations that can be provided in-context approaches the size of entire training datasets. We study the behavior of in-context learning (ICL) at this extreme scale on multiple datasets and models. We show that, for many datasets with large label spaces, performance continues to increase with hundreds or thousands of demonstrations. We contrast this with example retrieval and finetuning: example retrieval shows excellent performance at low context lengths but has diminished gains with more demonstrations; finetuning is more data hungry than ICL but can sometimes exceed long-context ICL performance with additional data. We use this ICL setting as a testbed to study several properties of both in-context learning and long-context models. We show that long-context ICL is less sensitive to random input shuffling than short-context ICL, that grouping of same-label examples can negatively impact performance, and that the performance boosts we see do not arise from cumulative gain from encoding many examples together. We conclude that although long-context ICL can be surprisingly effective, most of this gain comes from attending back to similar examples rather than task learning.
Comments: 27 pages; preprint
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2405.00200 [cs.CL]
(or arXiv:2405.00200v1 [cs.CL] for this version)

Submission history

From: Amanda Bertsch [view email]
[v1] Tue, 30 Apr 2024 21:06:52 UTC (233 KB)


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,796
Reputation
7,926
Daps
148,647

1/1
[CL] In-Context Learning with Long-Context Models: An In-Depth Exploration
A Bertsch, M Ivgi, U Alon, J Berant, M R. Gormley, G Neubig [CMU & Tel Aviv University] (2024)

- Increasing number of demonstrations in-context to large values leads to surprisingly strong performance, with accuracy gains up to 50 percentage points across datasets.

- Retrieval of relevant examples greatly benefits short-context ICL, but with more demonstrations the gains diminish and randomly selected examples approach retrieval performance.

- Finetuning is more data hungry than ICL, underperforming at low data regimes, but can exceed ICL performance with sufficient additional data.

- Long-context ICL is less sensitive to shuffling of input order than short-context. Sorting examples by label hurts performance increasingly with more examples.

- Blockwise attention recovers most of the performance of full attention over all examples, suggesting gains are from retrieving more relevant examples rather than aggregating many examples.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,796
Reputation
7,926
Daps
148,647

AI in practice

May 5, 2024


Open-source model Prometheus 2 can evaluate other language models nearly as well as GPT-4​

Midjourney prompted by THE DECODER


Matthias Bastian

Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.


Prometheus 2, a freely available language model, has been optimized to evaluate other language models, catching up with commercial models such as GPT-4.

These evaluations allow researchers and developers to objectively measure and compare the performance of their language models and receive detailed feedback on strengths and weaknesses for targeted improvements, helping to continuously enhance the quality and reliability of language models.

Until now, proprietary models such as GPT-4 have often been used for these evaluations, but they lack transparency, are difficult to control, and are not affordable for many, according to a research team led by Seungone Kim of KAIST AI. Kim's team developed Prometheus 2 to provide an independent, transparent, and detailed evaluation of language models for everyone.

Prometheus 2 can perform evaluations similar to humans and GPT-4, mastering the two most common evaluation methods: direct evaluation, assigning scores on a scale, and pairwise comparison, deciding which of two responses is better.


Prometheus 2 can score answers directly or select the better of two answers. | Image: Kim et al.


It can also evaluate on user-defined criteria, not limited to general aspects such as helpfulness and harmlessness, allowing for optimization for specific applications, the researchers report.

For example, a medical advice chatbot can be trained and tested on criteria such as trustworthiness, empathy, and professional correctness, enabling the development of high-quality language models for different applications, the team explained.
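As a rough sketch of what a direct-assessment call could look like with standard `transformers` tooling: the checkpoint name is assumed to be the 7B evaluator the authors published on Hugging Face, and the prompt wording and rubric below are simplified stand-ins for the paper's actual template, in the spirit of the medical-chatbot example above.

```python
from transformers import pipeline

# Assumption: the 7B evaluator checkpoint released by the authors on Hugging Face.
judge = pipeline("text-generation", model="prometheus-eval/prometheus-7b-v2.0")

# User-defined criterion, following the medical-advice example above.
rubric = ("Criterion: trustworthiness of medical advice. "
          "1 = unsafe or misleading; 5 = accurate, cautious, and recommends "
          "professional care when appropriate.")
instruction = "I have had a mild headache for two days. What should I do?"
response = "Rest, drink water, and see a doctor if it persists or gets worse."

# Simplified prompt, not the exact template from the paper.
prompt = ("You are a strict evaluator. Give feedback on the response, then a "
          f"score from 1 to 5.\n{rubric}\n\nInstruction: {instruction}\n"
          f"Response: {response}\n\nFeedback:")

print(judge(prompt, max_new_tokens=200)[0]["generated_text"])
```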

A new data set and mixed weights​

To train Prometheus 2, the researchers created a new pairwise comparison dataset called the "Preference Collection," which contains more than 1,000 different evaluation criteria beyond basic characteristics.

They found that the best results came from training two separate models - one for direct ratings based on the existing Feedback Collection dataset, and one for pairwise comparisons based on the new Preference Collection dataset - and then combining their learned weights.
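The merging step itself can be sketched as a simple linear interpolation of parameters. The file names below are placeholders, and the paper's exact merging recipe may differ in its details.

```python
import torch

def merge_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate two state dicts that share the same architecture/keys."""
    assert sd_a.keys() == sd_b.keys(), "models must share an architecture"
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

# Usage sketch (checkpoint paths are placeholders):
# direct = torch.load("evaluator_direct.pt")      # trained for direct ratings
# pairwise = torch.load("evaluator_pairwise.pt")  # trained for pairwise comparisons
# merged = merge_state_dicts(direct, pairwise, alpha=0.5)
# torch.save(merged, "evaluator_merged.pt")
```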

In tests with eight datasets (four for direct ratings, four for pairwise comparisons), Prometheus 2 achieved the highest agreement with human judgments and commercial language models of all freely available rating models.

Although it lags behind GPT-4 and Claude 3 Opus in many tests, it can significantly close the gap with proprietary models, the researchers report.


Prometheus 2 can evaluate generated text about as well as GPT-4 and Claude 3 Opus, but offers much more transparency and is potentially cheaper. The table shows the results for direct evaluation. | Image: Kim et al.

Prometheus 2 supports independent and transparent evaluation of language models for everyone, contributing to greater fairness and accessibility in the field, according to Kim's team. The code and data are available on Github.

The Prometheus 2 models (7B & 8x7B) are available from Hugging Face. According to the team, the faster 7B model achieves 80 percent of the evaluation performance of the 8x7B model, is on par with Mistral's Mixtral-8x7B, and better than Meta's Llama 2 70B.

Summary


  • Prometheus 2 is a freely available language model that can evaluate other language models as well as commercial models such as GPT-4, but is more transparent and potentially cheaper.
  • The model was trained on two separate datasets - one for direct scores and one for pairwise comparisons. By combining the learned weights, the researchers achieved the best results.
  • In tests on eight datasets, Prometheus 2 achieved the highest agreement with human judgments of any freely available model. This makes it possible for anyone to perform an independent and detailed evaluation of language models.


Sources

Github Paper




Computer Science > Computation and Language​

[Submitted on 2 May 2024]

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models​

Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluations. On the other hand, existing open evaluator LMs exhibit critical shortcomings: 1) they issue scores that significantly diverge from those assigned by humans, and 2) they lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment. Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, a more powerful evaluator LM than its predecessor that closely mirrors human and GPT-4 judgements. Moreover, it is capable of processing both direct assessment and pair-wise ranking formats grouped with a user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Our models, code, and data are all publicly available at this https URL.
Comments: Work in Progress
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2405.01535 [cs.CL]
(or arXiv:2405.01535v1 [cs.CL] for this version)
[2405.01535] Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Submission history

From: Seungone Kim [view email]
[v1] Thu, 2 May 2024 17:59:35 UTC (1,959 KB)


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,796
Reputation
7,926
Daps
148,647
AI in practice

May 5, 2024

Med-Gemini and Meditron: Google and Meta present new LLMs for medicine​

Ideogram prompted by THE DECODER


Matthias Bastian

Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.



Google and Meta have introduced language models optimized for medical tasks based on their latest LLMs, Gemini and Llama 3. Both models are designed to support doctors and medical staff in a variety of tasks.

Google's Med-Gemini is built on the multimodal Gemini model family. It has been further trained with medical data to draw logical conclusions, understand different modalities such as images and text, and process long contexts.


Image: Google

According to Google, Med-Gemini achieved new top scores in 10 out of 14 medical benchmarks tested, including answering medical exam questions.

Med-Gemini uses a novel uncertainty-based web search. If the model is uncertain about a question, it automatically performs a web search. The additional information from the web is used to reduce the model's uncertainty and improve the quality of the answers.
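Google has not released the implementation, but the idea can be sketched: sample several answers, measure how much they disagree, and only call a search tool when disagreement is high. Both callables below (`generate_answers`, `web_search`) are hypothetical placeholders, not a published Google API.

```python
from collections import Counter

def disagreement(answers: list) -> float:
    """Fraction of sampled answers that differ from the most common answer."""
    counts = Counter(a.strip().lower() for a in answers)
    return 1 - counts.most_common(1)[0][1] / len(answers)

def answer_with_uncertainty_gate(question, generate_answers, web_search,
                                 n_samples=5, threshold=0.4):
    """Sketch of uncertainty-gated retrieval: only search the web when the
    model's sampled answers disagree too much. The two callables are
    hypothetical placeholders."""
    samples = generate_answers(question, n=n_samples)
    if disagreement(samples) <= threshold:             # confident enough
        return Counter(samples).most_common(1)[0][0]   # return the majority answer
    evidence = web_search(question)                    # uncertain: fetch evidence
    return generate_answers(question, n=1, context=evidence)[0]
```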

In answering medical questions, Med-Gemini is just ahead of its predecessor, Med-PaLM 2, and even closer to GPT-4, which is not specifically optimized for medical questions.


Image: Google

This may seem like a small improvement, but when it comes to developing a reliable medical model, every percentage point counts, and the higher you get, the harder it is to make improvements. Still, it shows once again that GPT-4 as a generic LLM is already capable in niche areas.

According to Google, the performance difference is more evident for multimodal tasks such as evaluating medical images. Here, Med-Gemini outperforms GPT-4V by an average relative margin of 44.5 percent. Through fine-tuning and adapted encoders, modalities such as ECG recordings can also be processed.


Example of a medical chatbot exchange with a doctor: this is how Google envisions the use of Med-Gemini as a diagnostic assistant. | Image: Google

Google uses long context processing to perform reliable LLM-based searches in long pseudonymized patient records, and Med-Gemini can also answer questions about medical instructional videos.

Meta says Meditron is the most capable open-source LLM​

In collaboration with ETH Lausanne and Yale University, Meta has developed a suite called Meditron, based on its open-source Llama 3 model. Meta wants it to be especially useful in developing countries and for humanitarian missions.


Continuous pre-training on carefully compiled medical data aims to avoid distortions caused by the original Llama 3 web training. For cost reasons, the research team first tested the optimal data mix on the 7B model and then scaled it up to the 70B model.

According to Meta, Meditron is the most capable open-source LLM for medicine in benchmarks such as answering biomedical exam questions. But it's not yet on the same level as proprietary models.


Image: Meta

It is being tested and developed in a "Massive Online Open Validation and Evaluation" (MOOVE) by doctors worldwide, especially from developing countries. Meditron is available from Hugging Face in 7B and 70B versions.

Both models have yet to prove themselves in practice. Many questions about risks, traceability, and liability remain to be answered, especially for use in diagnostics. Google and Meta also stress that further extensive research and development is needed before these models can be used in safety-critical medical tasks.


Summary

  • Google and Meta present language models optimized for medical tasks: Google's Med-Gemini is based on the Gemini model family, and Meta's Meditron is based on the open-source Llama 3 model, both designed to support physicians and medical staff.
  • Med-Gemini achieves new highs in many medical benchmarks and uses uncertainty-based web search to improve response quality. It outperforms GPT-4 on multimodal tasks such as medical image analysis and can handle long contexts such as patient records.
  • Meditron has been optimized through continuous pre-training on medical data and, according to Meta, is the most capable open-source LLM for medicine. It is being tested and developed in an extensive online validation by physicians worldwide, especially for use in countries with fewer medical resources.
Sources

Meta Med-Gemini Paper




Computer Science > Artificial Intelligence​

[Submitted on 29 Apr 2024 (v1), last revised 1 May 2024 (this version, v2)]

Capabilities of Gemini Models in Medicine​

Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, Juanma Zambrano Chaves, Szu-Yeu Hu, Mike Schaekermann, Aishwarya Kamath, Yong Cheng, David G.T. Barrett, Cathy Cheung, Basil Mustafa, Anil Palepu, Daniel McDuff, Le Hou, Tomer Golany, Luyang Liu, Jean-baptiste Alayrac, Neil Houlsby, Nenad Tomasev, Jan Freyberg, Charles Lau, Jonas Kemp, Jeremy Lai, Shekoofeh Azizi, Kimberly Kanada, SiWai Man, Kavita Kulkarni, Ruoxi Sun, Siamak Shakeri, Luheng He, Ben Caine, Albert Webson, Natasha Latysheva, Melvin Johnson, Philip Mansfield, Jian Lu, Ehud Rivlin, Jesper Anderson, Bradley Green, Renee Wong, Jonathan Krause, Jonathon Shlens, Ewa Dominowska, S. M. Ali Eslami, Katherine Chou, Claire Cui, Oriol Vinyals, Koray Kavukcuoglu, James Manyika, Jeff Dean, Demis Hassabis, Yossi Matias, Dale Webster, Joelle Barral, Greg Corrado, Christopher Semturs, S. Sara Mahdavi, Juraj Gottweis, Alan Karthikesalingam, Vivek Natarajan
Excellence in a wide variety of medical applications poses considerable challenges for AI, requiring advanced reasoning, access to up-to-date medical knowledge and understanding of complex multimodal data. Gemini models, with strong general capabilities in multimodal and long-context reasoning, offer exciting possibilities in medicine. Building on these core strengths of Gemini, we introduce Med-Gemini, a family of highly capable multimodal models that are specialized in medicine with the ability to seamlessly use web search, and that can be efficiently tailored to novel modalities using custom encoders. We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them, and surpass the GPT-4 model family on every benchmark where a direct comparison is viable, often by a wide margin. On the popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU (health & medicine), Med-Gemini improves over GPT-4V by an average relative margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context capabilities through SoTA performance on a needle-in-a-haystack retrieval task from long de-identified health records and medical video question answering, surpassing prior bespoke methods using only in-context learning. Finally, Med-Gemini's performance suggests real-world utility by surpassing human experts on tasks such as medical text summarization, alongside demonstrations of promising potential for multimodal medical dialogue, medical research and education. Taken together, our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment in this safety-critical domain.
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2404.18416 [cs.AI]
(or arXiv:2404.18416v2 [cs.AI] for this version)
[2404.18416] Capabilities of Gemini Models in Medicine

Submission history

From: Khaled Saab [view email]
[v1] Mon, 29 Apr 2024 04:11:28 UTC (4,986 KB)
[v2] Wed, 1 May 2024 17:12:10 UTC (4,986 KB)


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,796
Reputation
7,926
Daps
148,647


AI in practice

Apr 29, 2024


Meaningless fillers enable complex thinking in large language models​

Ideogram prompted by THE DECODER


Matthias Bastian

Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.


Researchers have found that specifically trained LLMs can solve complex problems just as well using dots like "......" instead of full sentences. This could make it harder to control what's happening in these models.

The researchers trained Llama language models to solve a difficult math problem called "3SUM", where the model has to find three numbers that add up to zero.

Usually, AI models solve such tasks by explaining the steps in full sentences, known as "chain of thought" prompting. But the researchers replaced these natural language explanations with repeated dots, called filler tokens.

Surprisingly, the models using dots performed as well as those using natural language reasoning with full sentences. As the tasks became more difficult, the dot models outperformed models that responded directly without any intermediate reasoning.



The three prompting methods compared in the study. | Image: Jacob Pfau, William Merrill & Samuel R. Bowman

The researchers discovered the models were actually using the dots for calculations relevant to the task. The more dots available, the more accurate the answer was, suggesting more dots could provide the model with greater "thinking capacity".

They suspect the dots act as placeholders where the model inserts various numbers and checks if they meet the task's conditions. This allows the model to answer very complex questions it couldn't solve all at once.
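To make the setup concrete, here is a brute-force reference solver for 3SUM and the difference between a chain-of-thought prompt and a filler-token prompt. The prompt strings are illustrative, not the paper's exact training format.

```python
from itertools import combinations

def three_sum(nums):
    """Brute-force 3SUM: return any triple that sums to zero, else None."""
    for triple in combinations(nums, 3):
        if sum(triple) == 0:
            return triple
    return None

nums = [4, -1, 7, -3, 2]
print(three_sum(nums))  # (4, -1, -3)

# Two ways of giving the model intermediate tokens before its final answer
# (illustrative formats only):
cot_prompt    = f"{nums} -> try 4 + (-1) + (-3) = 0 ... answer: True"
filler_prompt = f"{nums} -> {'.' * 30} answer: True"
```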

Co-author Jacob Pfau says this result poses a key question for AI security: As AI systems increasingly "think" in hidden ways, how can we ensure they remain reliable and safe?

The finding aligns with recent research showing longer chain-of-thought prompts can boost language model performance, even if the added content is off-topic, essentially just multiplying tokens.

The researchers think it could be useful to teach AI systems to handle filler tokens from the start in the future, despite the challenging process. It may be worthwhile if the problems LLMs need to solve are highly complex and can't be solved in a single step.

Additionally, the training data must include enough examples where the problem is broken into smaller, simultaneously processable parts.

If these criteria are met, the dot method could also work in regular AI systems, helping them answer tough questions without it being obvious from their responses.

However, dot system training is considered difficult because it's unclear exactly what the AI calculates with the dots, and the dot approach doesn't work well for explanations needing a specific step sequence.

Popular chatbots like ChatGPT can't automatically do the dot reasoning - they need to be trained for it. So chain-of-thought prompting is still the standard approach to improving LLM reasoning.

Summary

  • Researchers have found that AI models can solve complex tasks like "3SUM" by using simple dots like "......" instead of sentences. The more dots available, the more accurate the results.
  • The dots are thought to act as placeholders into which the model inserts different numbers and checks that they fulfil the conditions. This makes it possible to answer very complex questions that cannot be solved in one go.
  • According to the researchers, this hidden computation raises safety issues when AI systems "think" in secret.

Sources

Arxiv




Computer Science > Computation and Language​

[Submitted on 24 Apr 2024]

Let's Think Dot by Dot: Hidden Computation in Transformer Language Models

Jacob Pfau, William Merrill, Samuel R. Bowman
Chain-of-thought responses from language models improve performance across most benchmarks. However, it remains unclear to what extent these performance gains can be attributed to human-like task decomposition or simply the greater computation that additional tokens allow. We show that transformers can use meaningless filler tokens (e.g., '......') in place of a chain of thought to solve two hard algorithmic tasks they could not solve when responding without intermediate tokens. However, we find empirically that learning to use filler tokens is difficult and requires specific, dense supervision to converge. We also provide a theoretical characterization of the class of problems where filler tokens are useful in terms of the quantifier depth of a first-order formula. For problems satisfying this characterization, chain-of-thought tokens need not provide information about the intermediate computational steps involved in multi-token computations. In summary, our results show that additional tokens can provide computational benefits independent of token choice. The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens.
Comments: 17 pages, 10 figures
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
ACM classes: I.2.6
Cite as: arXiv:2404.15758 [cs.CL]
(or arXiv:2404.15758v1 [cs.CL] for this version)
[2404.15758] Let's Think Dot by Dot: Hidden Computation in Transformer Language Models

Submission history​

From: Jacob Pfau [view email]
[v1] Wed, 24 Apr 2024 09:30:00 UTC (579 KB)

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,796
Reputation
7,926
Daps
148,647


1/2
Sam Altman says GPT will continue to improve for at least the next 3 or 4 model generations: "we are so far away from when we start to level off"

2/2
Source:







1/3
A conversation with OpenAI on what to expect in the next 12 months

- today's systems are laughably bad
- ChatGPT not long term engagement model
- models capable of complex "work"
- like a great team mate working with u
- shift towards verbal interfaces & beyond, Multimodality

2/3
full conversation here:

3/3
The work being done today through Memory & Assistants API is a preliminary step towards stateful systems

But the current ways seem very hacky to me. I really hope we get some algorithmic / architectural innovations for continuous learning, this would also mean genuine memory


 
Last edited:

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,796
Reputation
7,926
Daps
148,647
May 6, 2024

API Partnership with Stack Overflow​

Stack Overflow and OpenAI today announced a new API partnership that will give developers the combined strengths of the world’s leading knowledge platform for highly technical content and the world’s most popular LLMs for AI development.

DALL·E illustration: a highly abstract digital oil painting of the Stack Overflow logo.

Editor’s Note: This news was originally shared by Stack Overflow.

OpenAI and Stack Overflow are coming together via OverflowAPI access to provide OpenAI users and customers with the accurate and vetted data foundation that AI tools need to quickly find a solution to a problem so that technologists can stay focused on priority tasks. OpenAI will also surface validated technical knowledge from Stack Overflow directly in ChatGPT, giving users easy access to trusted, attributed, accurate, and highly technical knowledge and code backed by the millions of developers that have contributed to the Stack Overflow platform for 15 years.

As part of this collaboration:

  • OpenAI will utilize Stack Overflow’s OverflowAPI product and collaborate with Stack Overflow to improve model performance for developers who use their products. This integration will help OpenAI improve its AI models using enhanced content and feedback from the Stack Overflow community and provide attribution to the Stack Overflow community within ChatGPT to foster deeper engagement with content.
  • Stack Overflow will utilize OpenAI models as part of their development of OverflowAI and work with OpenAI to leverage insights from internal testing to maximize the performance of OpenAI models. OpenAI’s partnership with Stack Overflow will help further drive its mission to empower the world to develop technology through collective knowledge, as Stack Overflow will be able to create better products that benefit the Stack Exchange community’s health, growth, and engagement.

“Learning from as many languages, cultures, subjects, and industries as possible ensures that our models can serve everyone. The developer community is particularly important to both of us. Our deep partnership with Stack Overflow will help us enhance the user and developer experience on both our platforms,” said Brad Lightcap, COO at OpenAI.

“Stack Overflow is the world’s largest developer community, with more than 59 million questions and answers. Through this industry-leading partnership with OpenAI, we strive to redefine the developer experience, fostering efficiency and collaboration through the power of community, best-in-class data, and AI experiences,” said Prashanth Chandrasekar, CEO of Stack Overflow. “Our goal with OverflowAPI, and our work to advance the era of socially responsible AI, is to set new standards with vetted, trusted, and accurate data that will be the foundation on which technology solutions are built and delivered to our user.”

The first set of new integrations and capabilities between Stack Overflow and OpenAI will be available in the first half of 2024. Beyond this, OpenAI’s partnership with Stack Overflow will enable Stack Overflow to continue to reinvest in community-driven features. To learn more about Stack Overflow’s API solution and partnerships, visit https://stackoverflow.co/api-solutions/
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,796
Reputation
7,926
Daps
148,647


Computer Science > Computation and Language​

[Submitted on 6 May 2024]

AlphaMath Almost Zero: process Supervision without process

Guoxin Chen, Minpeng Liao, Chengxi Li, Kai Fan
Recent advancements in large language models (LLMs) have substantially enhanced their mathematical reasoning abilities. However, these models still struggle with complex problems that require multiple reasoning steps, frequently leading to logical or numerical errors. While numerical mistakes can largely be addressed by integrating a code interpreter, identifying logical errors within intermediate steps is more challenging. Moreover, manually annotating these steps for training is not only expensive but also demands specialized expertise. In this study, we introduce an innovative approach that eliminates the need for manual annotation by leveraging the Monte Carlo Tree Search (MCTS) framework to generate both the process supervision and evaluation signals automatically. Essentially, when a LLM is well pre-trained, only the mathematical questions and their final answers are required to generate our training data, without requiring the solutions. We proceed to train a step-level value model designed to improve the LLM's inference process in mathematical domains. Our experiments indicate that using automatically generated solutions by LLMs enhanced with MCTS significantly improves the model's proficiency in dealing with intricate mathematical reasoning tasks.
Comments: Work in progress
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2405.03553 [cs.CL]
(or arXiv:2405.03553v1 [cs.CL] for this version)
[2405.03553] AlphaMath Almost Zero: process Supervision without process

Submission history

From: Kai Fan Dr [view email]
[v1] Mon, 6 May 2024 15:20:30 UTC (519 KB)
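The paper's full MCTS pipeline is considerably more involved; the heavily simplified sketch below only illustrates the core idea of letting a step-level value model steer multi-step generation (greedy selection rather than true tree search). Both callables (`propose_steps`, `value_model`) are hypothetical placeholders.

```python
def solve_stepwise(question, propose_steps, value_model, max_steps=8, n_candidates=4):
    """Greedy, value-guided reasoning sketch (not the paper's full MCTS):
    at each step, sample candidate next steps and keep the one the
    step-level value model scores highest. Callables are placeholders."""
    solution = question
    for _ in range(max_steps):
        candidates = propose_steps(solution, n=n_candidates)  # e.g. sampled LLM continuations
        best = max(candidates, key=lambda step: value_model(solution, step))
        solution += "\n" + best
        if "final answer" in best.lower():                    # crude stopping rule
            break
    return solution
```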



 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,796
Reputation
7,926
Daps
148,647


Should we slow down AI research? | Debate with Meta, IBM, FHI, FLI​


Future of Life Institute
Shared May 7, 2024
Mark Brakel (FLI Director of Policy), Yann LeCun, Francesca Rossi, and Nick Bostrom debate: "Should we slow down research on AI?" at the World AI Cannes Festival in February 2024.
 