bnew

Veteran
Joined
Nov 1, 2015
Messages
51,784
Reputation
7,926
Daps
148,614






1/11
@denny_zhou
What is the performance limit when scaling LLM inference? Sky's the limit.

We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed. Remarkably, constant depth is sufficient.

[2402.12875] Chain of Thought Empowers Transformers to Solve Inherently Serial Problems (ICLR 2024)



2/11
@denny_zhou
Just noticed a fun YouTube video explaining this paper. LoL. Pointed out by @laion_ai http://invidious.poast.org/4JNe-cOTgkY



3/11
@ctjlewis
hey Denny, curious if you have any thoughts. i reached the same conclusion:

[Quoted tweet]
x.com/i/article/178554774683…


4/11
@denny_zhou
Impressive! You would be interested in seeing this: [2301.04589] Memory Augmented Large Language Models are Computationally Universal



5/11
@nearcyan
what should one conclude from such a proof if it’s not also accompanied by a proof that we can train a transformer into the state (of solving a given arbitrary problem), possibly even with gradient descent and common post training techniques?



6/11
@QuintusActual
“We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed.”

I’m guessing this is only true because as a problem grows in difficulty, the # of required tokens approaches ♾️



7/11
@Shawnryan96
How do they solve novel problems without a way to update the world model?



8/11
@Justin_Halford_
Makes sense for verifiable domains (e.g. math and coding).

Does this generalize to more ambiguous domains with competing values/incentives without relying on human feedback?



9/11
@ohadasor
Don't fall for it!!

[Quoted tweet]
"can solve any problem"? Really?? Let's read the abstract in the image attached to the post, and see if the quote is correct. Ah wow! Somehow he forgot to quote the rest of the sentence! How is that possible?
The full quote is "can solve any problem solvable by boolean circuits of size T". This changes a lot. All problems solvable by Boolean circuits, of any size, is called the Circuit Evaluation Problem, and is known to cover precisely polynomial time (P) calculations. So it cannot solve the most basic logical problems which are at least exponential. Now here we don't even have P, we have only circuits of size T, which validates my old mantra: it can solve only constant-time problems. The lowest possible complexity class.
And it also validates my claim about the bubble of machine learning promoted by people who have no idea what they're talking about.


10/11
@CompSciFutures
Thx, refreshingly straightforward notation too, I might take the time to read this one properly.

I'm just catching up and have a dumb Q... that is an interestingly narrow subset of symbolic operands. Have you considered what happens if you add more?



11/11
@BatAndrew314
Noob question: how is this related to the universal approximation theorem? Meaning, can transformers solve any problem because they are neural nets, or is it some different property of transformers and CoT?




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



[Submitted on 20 Feb 2024 (v1), last revised 23 May 2024 (this version, v3)]


Chain of Thought Empowers Transformers to Solve Inherently Serial Problems


Zhiyuan Li, Hong Liu, Denny Zhou, Tengyu Ma

Instructing the model to generate a sequence of intermediate steps, a.k.a. a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetic and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length n, previous works have shown that constant-depth transformers with finite precision and poly(n) embedding size can only solve problems in TC0 without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in AC0, a proper subset of TC0. However, with T steps of CoT, constant-depth transformers using constant-bit precision and O(log n) embedding size can solve any problem solvable by Boolean circuits of size T. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.

Comments: 38 pages, 10 figures. Accepted by ICLR 2024
Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Machine Learning (stat.ML)
Cite as: arXiv:2402.12875 [cs.LG]
(or arXiv:2402.12875v3 [cs.LG] for this version)
[2402.12875] Chain of Thought Empowers Transformers to Solve Inherently Serial Problems


Submission history

From: Zhiyuan Li [view email]

[v1] Tue, 20 Feb 2024 10:11:03 UTC (3,184 KB)
[v2] Tue, 7 May 2024 17:00:27 UTC (5,555 KB)
[v3] Thu, 23 May 2024 17:10:39 UTC (5,555 KB)
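The "inherently serial" tasks the abstract mentions can be made concrete with iterated squaring, one of the paper's benchmark problems. Each step depends on the previous one, so a chain of thought that emits one intermediate value per step turns the whole serial computation into T easy local steps. A minimal Python sketch, illustrative only and not from the paper:

```python
# Iterated squaring: compute x^(2^T) mod p by squaring T times.
# No parallel shortcut is known, so each step depends on the previous one.
# A chain of thought externalizes these T serial steps as intermediate
# tokens, so each token only requires constant-depth work from the model.
def iterated_squaring_cot(x: int, T: int, p: int) -> list[int]:
    chain = [x % p]                       # the "reasoning tokens"
    for _ in range(T):
        chain.append(chain[-1] ** 2 % p)  # one serial step per token
    return chain                          # chain[-1] is the final answer

steps = iterated_squaring_cot(3, 5, 101)
print(steps)  # [3, 9, 81, 97, 16, 54]
```

Without the intermediate tokens, a constant-depth model would have to collapse all T squarings into a single forward pass, which is exactly what the paper's lower bound rules out.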


 


1/11
@danielhanchen
A transformer's depth affects its reasoning capabilities, whilst model size affects its knowledge capacity.

Highly recommend @ZeyuanAllenZhu's video on reasoning in transformers. Experiments show wider nets don't affect reasoning but more depth helps. Video: Invidious - search



2/11
@fleetwood___
Same claim in the MobileLLM paper from @AIatMeta
https://arxiv.org/pdf/2402.14905



3/11
@danielhanchen
Oh interesting - forgot about this paper!!



4/11
@im_datta0
From Gemma 2 paper :smile:



5/11
@danielhanchen
Oh yep remember this! The Gemma 2 paper did many experiments and ablations - forgot depth and width was also an experiment they did!



6/11
@NicholasLiu77
Model size = hidden state size?



7/11
@danielhanchen
Oh model size as in number of parameters of the model! :smile:



8/11
@gerardsans
There’s absolutely no “reasoning” in Transformers.



9/11
@danielhanchen
"Reasoning" needs to be better defined, but the video did show that if you train the LLM on 15 interactions, it can generalize to higher-order interactions.



10/11
@inductionheads
I think they should be triangular - wider at first layers than later layers



11/11
@dejanseo
Daniel, it's time.

Unsloth-xxsmall-uncased
Unsloth-xsmall-uncased
Unsloth-small-uncased
Unsloth-base-uncased
Unsloth-large-uncased
Unsloth-xlarge-uncased
Unsloth-xxlarge-uncased

☝️




 









1/11
@Swarooprm7
Introducing NATURAL PLAN 🔥: a realistic planning benchmark in natural language!

Key features:
- 3 main tasks: Trip Planning, Meeting Planning, and Calendar Scheduling
- Supplies all relevant information (e.g., Google Flights, Maps, Calendar) to the model in the context
- No need for a separate tool-use environment: direct LLM calls for evaluations
- Assesses the planning capabilities of large language models (LLMs)

Joint work with my awesome collaborators at @GoogleDeepMind: @HuaixiuZheng, @hughbzhang (now at Scale AI), @xinyun_chen_, @chenmm24, @Azade_na, @Hou_Le, @HengTze, @quocleix, @edchi, @denny_zhou.

Paper: https://arxiv.org/pdf/2406.04520
Dataset and evaluation code will be released
[1/5]



2/11
@Swarooprm7
NATURAL PLAN is a challenging benchmark for state-of-the-art models. For example, in Trip Planning, GPT-4 and Gemini 1.5 Pro could only achieve 31.1% and 34.8% solve rates respectively.
[2/5]



3/11
@Swarooprm7
Model performance drops drastically as the complexity of the problem increases: e.g. in Trip Planning, all models perform below 5% when there are 10 cities, highlighting a significant gap in planning in natural language for SoTA LLMs.
[3/5]



4/11
@Swarooprm7
Self-correction does not help, and interestingly, the stronger models such as GPT-4 and Gemini 1.5 Pro suffer bigger losses than others.
[4/5]



5/11
@Swarooprm7
In-context planning experiments show promise: Gemini 1.5 Pro is able to leverage more in-context examples, up to 355K tokens, while still showing steady improvements.
[5/5]



6/11
@YiTayML
great work swaroop and steven!



7/11
@Swarooprm7
Thank you Yi



8/11
@qinyuan_ye
Cool work!! I've always wanted an AI assistant to plan for weekend fun with friends, accounting for the weather, traffic, carpooling, restaurants and everything... It feels like this will be possible soon!
And btw, Natural Questions ⏩ Instructions ⏩ Plans ⏩ What's next? 😉



9/11
@Swarooprm7
Yes, true AI assistant is the future.
Natural Questions ⏩ Instructions ⏩ Plans ⏩
Your pattern is absolutely spot on. Something else I am working on in that line is coming. Let the suspense be there until then 😀



10/11
@billyuchenlin
Awesome 👏 will try to implement SwiftSage & Lumos agents and see how local LLM agents and hybrid agents perform on it



11/11
@Swarooprm7
Thank you Bill.







1/1
NATURAL PLAN data and eval code is finally up 🔥.
Thank you everyone for your interest and patience!

GitHub - google-deepmind/natural-plan


 


1/1
No-Brainer to use Gemini Flash for vision: Fast, Inexpensive and Accurate!








1/11
@deedydas
Gemini 1.5 Flash is the model people are sleeping on.

It took ~5s to recognize all the books on my shelf. GPT-4o took ~25s!

And $1 gets you 13M tokens on Flash vs 200k tokens on 4o.



2/11
@deedydas
Here's ChatGPT's ~25s in comparison



3/11
@myotherme100
The GCP onboarding is hostile and Gemini is lobotomized.

Speed doesn't make up for it.



4/11
@deedydas
onboarding being bad is an unserious reason to not use a good model



5/11
@KewkD
Why do you believe text being output faster than anyone can read is beneficial or brag worthy, for any model?



6/11
@deedydas
Not all text output from models is meant for human consumption, and even when it is, empirically lower latency leads to higher user retention



7/11
@SteDjokovic
Did you check the results?

Gemini says "left and right" shelves, while GPT correctly identifies top, middle, and bottom.

The Elon Musk biography is on the right, but Gemini categorised it as left.

Also, comparing Flash with GPT-4o instead of mini?



8/11
@OfficialLoganK
1.5 Flash multi-modal performance is truly wild for the price, this is going to power the next wave of AI startups.



9/11
@stevenheidel
give gpt-4o-mini a try! also returns results in a flash and is 30x cheaper than 4o



10/11
@0xshai
5 seconds is nuts! Awesome speed.

P.S: Musashi reader as well. 🫡



11/11
@RawSucces
If you want to bypass any AI and get the responses you want, I've made a full video guide on how to do it. Simply reply with "AI" and I'll send it over to you ("must follow so I can DM you").

It is completely free.




 





1/11
@lmsysorg
No more waiting. o1 is officially on Chatbot Arena!

We tested o1-preview and mini with 6K+ community votes.

🥇o1-preview: #1 across the board, especially in Math, Hard Prompts, and Coding. A huge leap in technical performance!
🥈o1-mini: #1 in technical areas, #2 overall.

Huge congrats to @OpenAI on this incredible milestone! Come try the king of LLMs and vote at http://lmarena.ai

More analysis below👇

[Quoted tweet]
Congrats @OpenAI on the exciting o1 release!

o1-preview and o1-mini are now live in Chatbot Arena accepting votes. Come challenge them with your toughest math/reasoning prompts!!


2/11
@lmsysorg
Chatbot Arena Leaderboard overview.

@openai's o1-preview #1 across the board, and o1-mini #1 in technical areas.



3/11
@lmsysorg
Win-rate heat map



4/11
@lmsysorg
Check out full results at http://lmarena.ai/leaderboard!



5/11
@McclaneDet
Given the latency, a human with Google could be o1. Be careful out there folks (especially check writers).



6/11
@_simonsmith
"AI is hitting a wall."



7/11
@axel_pond
very impressive.

thank you for your great service to the community.



8/11
@QStarETH
Math is the key to unlocking the secrets of the universe. We have arrived...



9/11
@Evinst3in
@sama after o1 is officially #1 across the board on Chatbot Arena😎



10/11
@JonathanRoseD
It seems like the new LLM meta is going to be training models on CoT strategies and relying on agents in the LLM clients. This has implications. Like, should @ollama consider preemptively adding CoT agents for future supporting models?



11/11
@andromeda74356
Can you add a feature where the user can give some text, you convert it to an embedding, and then show how models rank when only using chats that are close to that embedding, so we can see which models are best for our specific use cases?





 



1/2
It has been a week with a lot of new AI releases. This is like my third time changing this list this week. I will explain the strengths of each AI I use, so you know exactly which AI is best for what. This is what these AIs are good at!

Text Generation (LLM) - (Services):
-(Chat)GPT 4O by OpenAI:
•Questions with chain of thoughts responses
•Questions with concise responses.
•Coding assistant (Basic)
•Summarising
•Spelling Check
•Text Editing
•Formal writing
•Creative writing
•Vision
•Web search
•Math (Python)
•Data analysis (Python)
•Multilingual

-o1 by OpenAI:
•Reasoning with real world knowledge.

-o1-mini by OpenAI:
•Reasoning
•Coding Composer (Advanced)

-Claude 3.5 Sonnet by Anthropic:
•Answering questions without web search (Recent knowledge cutoff)

-Gemini Flash 1.5 by Google:
•2 million token context window

-Grok 2 (Mini) by xAI:
•Recent facts that aren't popular (X database)
•Chat image generation

Image Generation (Diffusion):
-Flux:
•Best Overall

-Midjourney:
•Best aesthetic

-DALL·E 3:
•Best results with bad prompting.
•Best for 2D like images.

-Stable Diffusion XL by Stability AI:
•Control

-Firefly by Adobe:
•Camera control

Video Generation (Diffusion):
-Kling by Kwai:
•Image to video

-Minimax by Hailuo AI:
•Text to video

-Gen 3 by RunwayML:
•Video to video

-Dream Machine by Luma AI
•Good results with bad prompting

-Viggle:
•Controlled video animation

-AnimateDiff:
•Good for weird animations

Audio Generation (Each is for different purposes!):
-ElevenLabs / OpenAI's TTS / Stable Audio Open: voiceover + SFX
-Suno / Udio: music

If I haven't mentioned an AI, it's either because I've forgotten about it or it's useless!



2/2
I probably need to make something like this😂




 


1/10
@Kling_ai
🎉Kling AI welcomes another version upgrade!
The all-new Kling 1.5 Model is officially released! 🎬Now supports generating 1080p HD videos in professional mode! 🏃‍♀️
Introducing the new Motion Brush function, enhancing the controllability of your visuals!
👉KLING AI
Come and experience it now! ✨ #kling



2/10
@cfryant
Looking good!

[Quoted tweet]
🔴Breaking: @Kling_ai just released Motion Brush for its video AI model!

Demonstrated here with my standard test, a woman throwing a duck at a guy's head.

First impressions: Some warping going on with the duck, but it did follow the path I drew perfectly.


3/10
@NftMedi
Very exciting! I was expecting motion brush to work in 1.5 too, though.



4/10
@PDXFato
Update only if you pay.



5/10
@dbenyamin
Make an API available!



6/10
@jphilipp
Amazing, thanks! Are your servers overloaded at the moment? My video generations still aren't finished after 1 hour.



7/10
@risphereeditor
Looks good!



8/10
@agoraitconsulti
🎊🎊🎊



9/10
@nez147
Looking good :smile:



10/10
@DreamaDream247
what's hot? do you have Ref link ?




 



AI is testing games to make sure they’re fun enough​



AI is taking another step to make the lives of game designers easier.​



Illustration: Umang Sawkar/WIRED Middle East

If you are a video game enthusiast, you will be familiar with the situation where certain levels in a game are just boring. Now, game development is getting an AI turbocharge to ensure that games are fun enough.

A team of researchers from Queen Mary University of London and King's College London developed AI agents, termed "exploratory agents," in different game environments. The team defined an exploratory agent as "a type of agent which traverses a level and explores it in accordance to its features. It surveys an environment to observe which features are available in the level, and moves in the direction towards the closest interesting target(s) or direction(s)." By analyzing how these AI agents move through the levels in games, the team can identify which areas are engaging and which are boring or difficult to navigate.

The method uses two procedural content generators to create ten diverse game levels: five considered engaging and five unengaging. Various metrics measure the coverage, novelty, and unpredictability of the agents' exploration. The agents' behavior was analyzed based on key metrics: how much of the environment they covered, how many unique objects they inspected, a custom measure of novelty, the randomness of the agent's path, and the average motivation the agent felt while navigating.

The results showed that these AI agents can tell the difference between the engaging and unengaging levels in the game. What’s more, they can figure out and quantify each level’s potential for exploration. These findings could provide game developers with some useful insights for enhancing procedurally generated content.

In the end, this work is a way to use AI to fine-tune game design by suggesting systematic ways to assess and improve game environments, making sure they meet the players’ exploration cravings. This way, game designers will be able to automatically test and improve their levels before human playtesters try them. Meanwhile, other AI-driven innovations in the gaming industry are transforming the way games are made. Experiments like Google’s Doom clone have offered us a glimpse into the future where games are not only tested by AI but might also be created by AI.
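Metrics like the ones described above are straightforward to compute from an agent's visit trace. Here is a toy sketch of two of them, coverage and path randomness; the function names and formulas are illustrative, not taken from the paper:

```python
import math
from collections import Counter

# Toy versions of two exploration metrics, computed over an agent's
# visit trace on a grid level. Names and formulas are illustrative.
def coverage(visited_cells, level_cells):
    """Fraction of the level's walkable cells the agent stepped on."""
    return len(set(visited_cells)) / len(level_cells)

def path_entropy(moves):
    """Shannon entropy of the move distribution: a proxy for path randomness."""
    counts = Counter(moves)
    total = len(moves)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

level = [(x, y) for x in range(4) for y in range(4)]  # 16-cell level
trace = [(0, 0), (0, 1), (1, 1), (1, 2), (1, 1)]      # revisits one cell
print(coverage(trace, level))              # 0.25 (4 distinct cells of 16)
print(path_entropy(["N", "E", "N", "S"]))  # 1.5 bits
```

An engaging level would be expected to score high on coverage with moderate entropy (purposeful but varied movement), while a maze-like or empty level would drag both numbers down or push entropy toward pure randomness.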
 



AI gets smarter by re-reading​



Here's how AI can mimic human behaviour.​



Illustration: WIRED Middle East/Umang Sawkar

Artificial intelligence is ever-evolving; new discoveries bring new advancements that are meant to make AI more human-like. Most recently, during an internship at Microsoft, researchers discovered that making AI systems, namely large language models (LLMs), re-read the same prompt or question twice improves their reasoning.

The researchers titled the technique RE2 (re-reading) and determined that it is applicable to different AI models and can be used alongside other reasoning techniques.

How does RE2 (re-reading) improve reasoning?

After noticing a gap in prior research, the researchers focused on the input side of the prompt, processing questions twice rather than trying to evoke reasoning in the output. The study was motivated by cognitive science, which has shown that humans often re-read texts in order to comprehend problems while learning. This simple method was then applied to LLaMA-2 and other LLMs in a preliminary study, which showed that it improved performance in problem-solving, math, and overall reasoning.

The results further highlighted that this methodology reduced errors and allowed more clarity and attention to details that were missed on the first pass. Although LLMs are quite advanced, they can struggle with nuanced reasoning, and this method, paired with commonly used techniques like chain-of-thought (CoT) prompting and program-aided language models (PAL), could change how we use LLMs in the future. This outcome has been linked to human behavior, as we use a similar strategy to catch nuances and details we initially miss.

Limitations of RE2 (re-reading)

The technique proved useful for most models, producing similar results except on vanilla ChatGPT, but further theoretical analysis is needed before full implementation, as this study prioritized empirical experiments. The researchers also found that the process helped only when prompts were repeated two to three times; further repetitions actually decreased performance.

This is because excessive repetition may act as a demonstration that leads the LLM to repeat the prompt rather than engage with it and answer. They also found that heavy repetition increased inconsistencies between the LLMs' conclusions and their pre-training, and that repeating prompts twice was most beneficial in this study. Further research is therefore needed before the re-reading technique can be applied to all LLMs.
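In practice, RE2 amounts to a prompt template that repeats the question before asking for an answer. A minimal sketch of that idea; the wrapper name and exact template wording here are illustrative, and the paper's template may differ:

```python
# RE2 in its simplest form: feed the question to the model twice before
# requesting the answer. The function name and wording are illustrative.
def re2_prompt(question: str) -> str:
    return (
        f"Q: {question}\n"
        f"Read the question again: {question}\n"
        "A: Let's think step by step."
    )

prompt = re2_prompt("If 3 pens cost $6, how much do 5 pens cost?")
print(prompt)
```

The resulting string would then be sent as the user message to whatever LLM API is in use; per the study's findings, the question appears exactly twice, since more repetitions tended to hurt performance.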
 