bnew

Veteran
Joined
Nov 1, 2015
Messages
54,110
Reputation
8,072
Daps
153,700


1/2
@rohanpaul_ai
Interesting model: LLAMA-3_8B_Unaligned

Censorship level: Very low

Intended use: Creative writing, Role-Play, General tasks.

The model was trained on ~50M tokens (the vast majority of them unique) at an actual context length of 16K.



GZvDZCQXwAIpxjX.png


2/2
@rohanpaul_ai
SicariusSicariiStuff/LLAMA-3_8B_Unaligned_BETA · Hugging Face




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,110
Reputation
8,072
Daps
153,700


1/2
@rohanpaul_ai
Great paper from @Meta on Constrained Generative Policy Optimization (CGPO) for multi-task large language model alignment.

Mixture of judges in CGPO prevents reward hacking while enhancing LLM capabilities in multi-task settings.

CGPO avoids PPO's severe regression on coding tasks.

**Original Problem** 🔍:

Current RLHF methods struggle with reward hacking and conflicting objectives in multi-task LLM alignment. Linear combinations of reward models lose key information.

Reward hacking is when the model learns to exploit imperfections or limitations in the reward model rather than genuinely improving its performance in line with human preferences.

-----

**Solution in this Paper** 💡:

• Introduces CGPO framework with mixture of judges (MoJs) to detect reward hacking
• Uses primal-type constrained RL optimizers: CRPG, CRRAFT, CODPO
• Tailors reward models, judges, and optimizers for each task
• Calibrated rewards address miscalibration issues
• Warm-up phase with DPO before online RLHF
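
To make the mixture-of-judges idea concrete, here is a minimal sketch of how judge constraints can gate the reward during candidate selection. The `reward_model` and `judges` callables are hypothetical stand-ins, not Meta's code, and the selection step only loosely mirrors the RAFT-style CRRAFT optimizer.

```python
# Sketch: judges veto candidates so a hacked-but-high-reward response
# can never be selected for the policy update.
def constrained_reward(prompt, response, reward_model, judges):
    if any(not judge(prompt, response) for judge in judges):
        return 0.0                      # a judge flagged a violation: no reward
    return reward_model(prompt, response)

def pick_update_samples(prompt, candidates, reward_model, judges, k=4):
    """RAFT-style selection: keep only the top-k judge-approved candidates."""
    scored = [(constrained_reward(prompt, c, reward_model, judges), c)
              for c in candidates]
    scored = [(r, c) for r, c in scored if r > 0]   # drop vetoed candidates
    scored.sort(key=lambda rc: rc[0], reverse=True)
    return [c for _, c in scored[:k]]
```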

-----

**Key Insights from this Paper** 💡:

• MoJs crucial for preventing reward hacking and boosting performance
• Task-specific optimization outperforms uniform approaches
• CGPO consistently improves across benchmarks, unlike PPO regression
• Warm-up phase significantly enhances RLHF performance

-----

**Results** 📊:

CGPO outperforms PPO and DPO baselines across benchmarks:

• AlpacaEval-2: +18.4% (CRRAFT)
• Arena-Hard: +12.5% (CRRAFT)
• IFEval: +2% (CRPG/CRRAFT)
• MATH: +2% (CRPG)
• 0-shot HumanEval: +17% (CRPG)
• ARC Challenge: +2% (CRPG)



GZvEscMXgAAJiD6.png


2/2
@rohanpaul_ai
Reward hacking is when the model learns to exploit imperfections or limitations in the reward model rather than genuinely improving its performance in line with human preferences.

Key aspects of reward hacking include:

• Proxy optimization: The reward model serves as a proxy for true human preferences, but it's imperfect. The LLM may find ways to maximize the reward signal without actually improving the quality of its outputs.

• Misalignment: Over-optimizing the reward model can lead to outputs that score high according to the reward model but don't actually align well with human preferences or intentions.

• Unintended behaviors: The model might develop strategies that increase its reward score but don't reflect the intended improvements in performance or capabilities.

• Overspecialization: The model may become too focused on specific patterns or behaviors that increase reward, at the expense of more general improvements.

• Reward model limitations: Flaws or biases in the reward model can be amplified through the optimization process.

In this CGPO paper, the authors address reward hacking by introducing mixture of judges (MoJs) to provide additional constraints and evaluations beyond just the reward model.

This helps prevent the model from exploiting weaknesses in a single reward signal and encourages more robust, aligned improvements across multiple tasks and criteria.
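
A toy illustration of the proxy-optimization failure mode (my own example, not from the paper): a reward model that uses length as a quality proxy gets exploited by pure padding.

```python
# Flawed proxy reward: treats longer answers as better, capped at 1.0.
def proxy_reward(response: str) -> float:
    return min(len(response.split()) / 100, 1.0)

honest = "The capital of France is Paris."
hacked = honest + " " + "Furthermore, it is worth noting that " * 20

print(proxy_reward(honest))   # low reward despite being correct and concise
print(proxy_reward(hacked))   # near-maximal reward earned by padding alone
```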




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,110
Reputation
8,072
Daps
153,700


1/2
@rohanpaul_ai
LLM-based agents collaborate to explore scientific concepts and produce actionable research proposals.

AI agents and knowledge graphs combine to accelerate scientific ideation and hypothesis development.

**Original Problem** 🔍:

Conventional scientific discovery methods are limited by human knowledge and imagination. AI offers the potential to accelerate discovery, but challenges persist in achieving expert-level performance and ensuring accountability.

-----

**Solution in this Paper** 🧩:

• Introduces SciAgents: multi-agent AI system for scientific discovery
• Leverages ontological knowledge graph from ~1,000 scientific papers
• Utilizes LLMs with specialized roles (e.g., scientist, critic, ontologist)
• Implements heuristic pathfinding with random waypoints for diverse graph exploration
• Employs in-context learning and complex prompting strategies
• Integrates external tools (e.g., Semantic Scholar API) for novelty assessment
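
A minimal sketch of the pathfinding-with-random-waypoints idea, using networkx; this is my own illustration of the strategy, not the SciAgents implementation.

```python
import random
import networkx as nx

def waypoint_path(graph: nx.Graph, source, target, n_waypoints: int = 2):
    """Route from source to target through random waypoints, so repeated
    calls traverse different regions of the knowledge graph."""
    waypoints = random.sample(list(graph.nodes), n_waypoints)
    stops = [source, *waypoints, target]
    path = []
    for a, b in zip(stops, stops[1:]):
        leg = nx.shortest_path(graph, a, b)
        path.extend(leg if not path else leg[1:])  # avoid repeating joints
    return path

# Toy stand-in for an ontological concept graph.
g = nx.karate_club_graph()
print(waypoint_path(g, 0, 33))  # a different exploration path on each run
```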

-----

**Key Insights from this Paper** 💡:

• Multi-agent systems effectively decompose complex scientific tasks
• Ontological knowledge graphs guide informed hypothesis generation
• Combining LLMs with structured data enhances reasoning capabilities

-----

**Results** 📊:

• Generated diverse, novel research hypotheses in bio-inspired materials
• Produced detailed research proposals (e.g., 8,100 words for silk-energy hypothesis)
• Achieved novelty scores of 6-8 and feasibility scores of 7-8 for generated hypotheses
• Demonstrated ability to generate actionable research plans and priorities



GZvBt8RWAAcTjah.png


2/2
@rohanpaul_ai
📚 https://arxiv.org/pdf/2409.17439




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,110
Reputation
8,072
Daps
153,700


1/2
@rohanpaul_ai
Foundation models demonstrate human-like affective cognition across diverse emotional reasoning tasks.

LLMs show sophisticated grasp of emotional dynamics in social situations.

**Original Problem** 👀:

Evaluating affective cognition (understanding emotions) in foundation models against humans is a challenge. Existing evaluations lack systematic benchmarking of different types of affective inferences.

-----

**Solution in this Paper** 🔬:

• Introduces evaluation framework based on psychological theory of emotions
• Generates 1,280 diverse scenarios exploring relationships between appraisals, emotions, expressions, and outcomes
• Uses causal template to systematically vary stimuli and test different inferences
• Compares model performance (GPT-4, Claude-3, Gemini-1.5-Pro) to human judgments across carefully selected conditions
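
To show how a causal template can systematically generate stimuli, here is a minimal sketch; the dimension names and fillers are illustrative, not the paper's actual materials.

```python
from itertools import product

TEMPLATE = ("{name} wanted to {goal}. The outcome was {outcome}, "
            "which {name} appraised as {appraisal}.")

DIMENSIONS = {
    "name": ["Alice", "Omar"],
    "goal": ["win the race", "pass the exam"],
    "outcome": ["a success", "a failure"],
    "appraisal": ["fair", "unfair"],
}

# Crossing every dimension yields systematically varied scenarios.
scenarios = [TEMPLATE.format(**dict(zip(DIMENSIONS, combo)))
             for combo in product(*DIMENSIONS.values())]
print(len(scenarios))  # 2 * 2 * 2 * 2 = 16 variants
print(scenarios[0])
```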

-----

**Key Insights from this Paper** 💡:

• Foundation models match or exceed human-level performance on many affective reasoning tasks
• Models benefit from chain-of-thought prompting, improving affective judgments
• Some appraisal dimensions (e.g. goal inference) more salient than others for both humans and models
• Models can integrate information from outcomes, appraisals, emotions, and facial expressions

-----

**Results** 📊:

• Model-participant agreement matches/exceeds interparticipant agreement on many tasks
• "Superhuman" performance on some tasks, e.g. Claude-3 with CoT: 78.82% vs human 69.38% agreement on emotion inference
• Chain-of-thought improves performance, e.g. GPT-4 goal inference from 71.14% to 88.61%
• Models struggle more with safety appraisal inference (61.07% agreement) vs goal inference (88.61%)



GZvA0SMXIAEAquC.png


2/2
@rohanpaul_ai
📚 https://arxiv.org/pdf/2409.11733




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,110
Reputation
8,072
Daps
153,700


1/2
@rohanpaul_ai
LLMs struggle to combine learned knowledge for novel math problems, unlike humans.

**Original Problem** 🔍:

LLMs struggle with systematic compositionality in mathematical reasoning, despite impressive performance on complex tasks. This paper investigates their ability to combine learned knowledge components to solve novel problems.

-----

**Solution in this Paper** 💡:

• Constructs MATHTRAP dataset by adding logical traps to MATH/GSM8K problems
• Traps require combining math knowledge with trap-related knowledge
• Evaluates LLMs on original, trap, and conceptual problems
• Explores interventions: prompts, few-shot demos, fine-tuning
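
A minimal sketch of the kind of paired evaluation this setup implies: the same model is scored on an original problem and its trapped twin, and the accuracy ratio compares the two. The example pair and the `ask_llm` helper are illustrative, not taken from MATHTRAP.

```python
def accuracy(problems, ask_llm) -> float:
    correct = sum(ask_llm(p["question"]).strip() == p["answer"] for p in problems)
    return correct / len(problems)

# The trap adds a premise that makes the usual formula inapplicable, so the
# model must combine area knowledge with the "no negative lengths" constraint
# instead of pattern-matching the familiar template.
originals = [{"question": "A square has a side of 4. What is its area?",
              "answer": "16"}]
traps = [{"question": "A square has a side of -4. What is its area?",
          "answer": "no such square exists"}]

def accuracy_ratio(ask_llm) -> float:
    return accuracy(traps, ask_llm) / accuracy(originals, ask_llm)
```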

-----

**Key Insights from this Paper** 💡:

• LLMs fail to spontaneously combine knowledge to solve trap problems
• Stark performance gap between humans and LLMs on compositional tasks
• External interventions can improve LLM performance on trap problems
• Compositional generalization remains a key challenge for LLMs

-----

**Results** 📊:

• Closed-source LLMs: >70% on conceptual problems, <50% accuracy ratio on traps
• Open-source: ~40% on conceptual/original, <20% accuracy ratio on traps
• Humans: 83.8% on traps without notice, 95.1% with notice
• Interventions improved performance, e.g. 5-shot demos boosted GPT-3.5 from 7.6% to 23.9%



GZu8xUBXoAoXy9I.png


2/2
@rohanpaul_ai
📚 [2405.06680] Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,110
Reputation
8,072
Daps
153,700


1/2
@rohanpaul_ai
Patched MOA boosts the performance of smaller models to surpass that of larger ones.

It boosts gpt-4o-mini's performance by 15.52% on Arena-Hard-Auto, outperforming gpt-4-turbo.

**Original Problem** 🔍:

LLM inference for complex multi-step reasoning workflows needs to be fast, cheap, and accurate, and smaller models need to be optimized to match the performance of larger ones.

-----

**Solution in this Paper** 🛠️:

• Introduces Patched MOA (Mixture of Agents) for LLM inference optimization
• Evaluates three techniques: Best of N, Mixture of Agents, Monte Carlo Tree Search
• Applies optimization to gpt-4o-mini model
• Uses Arena-Hard-Auto benchmark for performance evaluation
• Implements technique in open-source optillm framework
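
A minimal sketch of the mixture-of-agents pattern: sample several drafts from the small model, then ask the same model to synthesize a final answer from them. The prompt wording is my own; the real implementation lives in the optillm framework.

```python
from openai import OpenAI

client = OpenAI()

def moa_answer(prompt: str, n: int = 3, model: str = "gpt-4o-mini") -> str:
    # Diverse drafts via temperature sampling.
    drafts = [client.chat.completions.create(
                  model=model,
                  messages=[{"role": "user", "content": prompt}],
                  temperature=1.0,
              ).choices[0].message.content
              for _ in range(n)]
    joined = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
    # A final low-temperature pass aggregates the drafts.
    final = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Combine the best parts of these drafts into one "
                              f"answer to: {prompt}\n\n{joined}"}],
        temperature=0.0,
    )
    return final.choices[0].message.content
```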

-----

**Key Insights from this Paper** 💡:

• Patched MOA boosts gpt-4o-mini performance by 15.52% on Arena-Hard-Auto benchmark
• Outperforms gpt-4-turbo at 1/50th the cost
• Model-agnostic approach, transparent to end-users
• Applicable to various software development workflows
• Consistent improvements in task completion rates across different patchflows

-----

**Results** 📊:

• moa-gpt-4o-mini: 85.6 score (Arena-Hard-Auto benchmark)
• Outperforms gpt-4-turbo-2024-04-09 (82.6 score)
• Improves performance across all tested patchflows:
- AutoFix: 41.18% to 46.67%
- PRReview: 50% to 100%
- GenerateDocstring: 71.21% to 89.52%
- GenerateREADME: 66.67% to 71.43%
- ResolveIssue: 61.11% to 85.71%



GZu7louXcBQCx_g.png


2/2
@rohanpaul_ai
📚 [2407.18521] Patched MOA: optimizing inference for diverse software development tasks




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,110
Reputation
8,072
Daps
153,700







1/12
@rohanpaul_ai
A Redditor automated LeetCode using the Claude 3.5 Sonnet API and Python. The script completed 633 problems in 24 hours, completely autonomously, with an 86% success rate, at a cost of $9 in API credits.
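
A minimal sketch of what such a loop could look like; the script itself was not open-sourced, so everything below, including the scraper and submitter stubs, is hypothetical.

```python
import anthropic

client = anthropic.Anthropic()

def solve(problem_statement: str) -> str:
    msg = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=2048,
        messages=[{"role": "user",
                   "content": "Solve this LeetCode problem in Python. "
                              "Return only code.\n\n" + problem_statement}],
    )
    return msg.content[0].text

def fetch_unsolved_problems():
    """Hypothetical scraper; returns an empty list in this stub."""
    return []

def submit_to_leetcode(problem, code):
    """Hypothetical submitter that would post the solution and poll the verdict."""
    raise NotImplementedError

for problem in fetch_unsolved_problems():
    verdict = submit_to_leetcode(problem, solve(problem["statement"]))
    print(problem["title"], verdict)
```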

Now, the point is, we are in the 4th quarter of 2024, and it's not at all surprising that LLMs perform very well on familiar data.

What's surprising is that companies persist in evaluating job candidates with questions the applicants have specifically prepared for months, in many cases memorizing some 200-250 common LeetCode problems.



https://video.twimg.com/ext_tw_video/1844835350675390480/pu/vid/avc1/1920x1080/-90ngL5liaBACh_Q.mp4

2/12
@rohanpaul_ai




3/12
@jpigmaxi
In the world of AI, they should be evaluating people on the prompt they will use to solve the problem. The same prompt won't work across all LLMs, so effective use of AI can 10x productivity easily.



4/12
@rohanpaul_ai
so agree. everybody should be leveraging AI



5/12
@thieme
Was this guy in Tier 2, 3 or 4? The "Tokens per minute" limit is the killer here. Otherwise he could run, say, 10 requests in parallel, doing all of this in 2.5 hours.



6/12
@rohanpaul_ai
ha ha, yes



7/12
@SudokuRandych
Yeah, sorry, we can't move on with your candidature for React developer, since we expect landing-page developers to know how to accurately approximate the distance between doubly linked binary trees in a non-Euclidean coordinate set that magnetically drifts in space



8/12
@rohanpaul_ai
ha ha, that's indeed what is happening.



9/12
@00x1337
He didn't open-source it, but he does give some details in the comments on Reddit



10/12
@rohanpaul_ai
if you search, you will probably find a few repos of a similar kind



11/12
@1bit2far
I've seen a huge shift to more practical assessments in the last few years. Additionally, companies want to measure predictable, compliant, and intelligent people: the exact sort to memorize 200-250+ LC problems



12/12
@rohanpaul_ai
yes, that's one explanation




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,110
Reputation
8,072
Daps
153,700

1/2
@rohanpaul_ai
Pixtral 12B Paper is released

• New vision encoder with ROPE-2D for variable image sizes
• Block-diagonal attention masks for efficient sequence packing
• Two-layer network linking vision encoder to multimodal decoder
------

Generated this podcast with Google's Illuminate.

[Quoted tweet]
Pixtral 12B Paper is released 🛠️:

• New vision encoder with ROPE-2D for variable image sizes
• Block-diagonal attention masks for efficient sequence packing
• Two-layer network linking vision encoder to multimodal decoder
• 128K token context window for multi-image processing

- Uses a new vision encoder trained from scratch
- Vision encoder uses ROPE-2D implementation to handle variable image sizes/aspect ratios
- Block-diagonal attention masks enable efficient sequence packing
- Two-layer fully connected network links vision encoder to multimodal decoder
- Decoder treats image tokens same as text tokens


GZljm2wXMAoeG9S.png
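
To make the sequence-packing point concrete, here is a minimal sketch (my own illustration, not Mistral's code) of a block-diagonal attention mask: patches from several packed images may only attend within their own image.

```python
import torch

def block_diagonal_mask(seq_lens: list[int]) -> torch.Tensor:
    """Boolean mask, True where attention is allowed: each packed
    segment (e.g. one image's patches) attends only to itself."""
    total = sum(seq_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in seq_lens:
        mask[start:start + n, start:start + n] = True
        start += n
    return mask

# Three images of 4, 6, and 3 patches packed into one 13-token sequence.
print(block_diagonal_mask([4, 6, 3]).int())  # 1s form three diagonal blocks
```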


2/2
@ai_javi_tx
Exciting update on Pixtral 12B. Impressive to see the new vision encoder with ROPE-2D tackling variable image sizes. Can't wait to explore its applications in real-world scenarios.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,110
Reputation
8,072
Daps
153,700



1/4
@rohanpaul_ai
MLE-Bench - Now AI is coming for Kaggle Grandmasters and machine learning engineering skills in general.

**Results** 📊:

• o1-preview (AIDE): Achieves medals in 16.9% of competitions
• GPT-4o (AIDE): Medals in 8.7% of competitions
• Performance doubles from pass@1 to pass@8 for both models
• Increasing runtime from 24 to 100 hours improves GPT-4o's score from 8.7% to 11.8%

-----

**This Paper** 💡:

• Introduces MLE-bench: 75 offline Kaggle competitions for evaluating ML engineering capabilities
• Covers diverse tasks: NLP, computer vision, signal processing
• Includes human baselines from Kaggle leaderboards
• Evaluates agents using open-source scaffolds (AIDE, MLAB, OpenHands)
• Measures performance using percentage of competitions where agents achieve medals
• Implements safeguards against cheating and contamination

-----

**Key Insights from this Paper** 💡:

• Best-performing setup: o1-preview with AIDE scaffold
• Performance improves with multiple attempts (pass@k metric)
• Agents struggle with debugging and recovering from missteps
• No significant correlation found between model familiarity and performance
• Obfuscating competition descriptions doesn't significantly impact performance

[Quoted tweet]
We're releasing a new benchmark, MLE-bench, to measure how well AI agents perform at machine learning engineering. The benchmark consists of 75 machine learning engineering-related competitions sourced from Kaggle. openai.com/index/mle-bench/


GZpNyutWgAk-Lih.jpg


2/4
@rohanpaul_ai
1️⃣ Input: Agents receive competition details (description, dataset, leaderboard).

2️⃣ Process: Agents develop ML solutions (train models, debug, create submissions).

3️⃣ Output: Agents submit a CSV file with predictions.

4️⃣ Evaluation: A grader scores the submission locally.

5️⃣ Comparison: Agent performance is benchmarked against human Kaggle competitors using real competition leaderboards.
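
A minimal sketch of what the local grading step might look like for a simple classification competition; the column names and accuracy metric are illustrative, since MLE-bench's graders are competition-specific.

```python
import pandas as pd

def grade(submission_csv: str, answers_csv: str) -> float:
    sub = pd.read_csv(submission_csv)             # columns: id, prediction
    ans = pd.read_csv(answers_csv)                # columns: id, label
    merged = ans.merge(sub, on="id", how="left")  # missing rows score as wrong
    return float((merged["prediction"] == merged["label"]).mean())
```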



GZpODEqWQAoG2n0.jpg


3/4
@rohanpaul_ai
o1-preview is just a great model.

While GPT-4o can match o1-preview's single-attempt performance with 6 tries, o1-preview maintains a significant overall advantage in both single-shot ability and potential for improvement with multiple attempts.



GZpOgGjWsAoI5_1.png


4/4
@angshumanrudra1
Cool!




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,110
Reputation
8,072
Daps
153,700















1/15
@ai_for_success
The best OpenAI o1-preview and o1-mini coding examples that I've come across; they're really impressive.
🧡



GZdBVwGWsAA8yzL.jpg


2/15
@ai_for_success


[Quoted tweet]
Making progress on o1-mini-engineer. 👨‍💻

Now it can create complex folder and file structures in one shot. Will release as a repo soon.

Watch it create a Windows OS replica as a Flask app.

Also, definitely use o1-mini, don't use Preview for everyday coding tasks.


https://video.twimg.com/ext_tw_video/1834733646516801536/pu/vid/avc1/1658x1080/fTIGPzO_IbrrZeFt.mp4

3/15
@ai_for_success


[Quoted tweet]
A fun test I like to do with new LLMs is have them write the complete code for a Factorio style automation game prototype in pygame.

I read that coding isn't really what o1-preview excels at, so I was thoroughly surprised that within 8 prompts I got this very robust and expandable prototype. This is something I would've struggled to accomplish previously using GPT-4 in general, let alone in such a short timespan.

I made absolutely 0 changes to the code myself, every single line is 100% generated by the model except of course the assets which I quickly whipped up in paint.

So yeah, very impressed over here.
reddit.com/r/OpenAI/comments…


https://video.twimg.com/ext_tw_video/1834864080647340036/pu/vid/avc1/1280x720/E3YaEw_0bgnaumF1.mp4

4/15
@ai_for_success


[Quoted tweet]
Tutorial: Create Art with ChatGPT o1 and Illustrator Scripts

No coding experience needed! I'll walk you through my creative coding process, where I used OpenAI o1 scripts to make a simple, popular fractal: the Sierpinski Triangle.

Illustrator uses JavaScript to create art with code. The key is iteration. My first attempt didn't give the expected result, but through trial and error, I got it right. Watch my short video to see how.

Steps:

1. Verify the output is accurate.
2. Share what's off, like I did when OpenAI o1's result was wrong, and I debugged it.
3. Keep iterating until it works!

After making the fractal, I used Illustrator's 3D tools to add depth and turned it into a neon sign using Adobe Firefly.

As a former backend coder of 10 years, this AI-driven process is refreshingly different, and I'm loving it! Feel free to ask me anything!


https://video.twimg.com/ext_tw_video/1834389344477401088/pu/vid/avc1/1920x1080/9sTjfV3hIiLyZxqK.mp4

5/15
@ai_for_success


[Quoted tweet]
A very primordial coding agent built with o1.

o1 API doesn't support function calling or system prompts.

You can simulate these features by adding a pre-prompt that forces the model to output responses in JSON.

Then use a parser to generate coding files from the response. 👇


https://video.twimg.com/ext_tw_video/1834393194873712640/pu/vid/avc1/1920x1080/pGpmPwGm0TX7NJqb.mp4
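
A minimal sketch of that pre-prompt pattern; the JSON schema and prompt wording are my own choices, not the quoted author's code.

```python
import json
from openai import OpenAI

client = OpenAI()

PRE_PROMPT = (
    'You are a coding agent. Respond ONLY with JSON of the form '
    '{"files": [{"path": "...", "content": "..."}]}.\n\nTask: '
)

def run_agent(task: str) -> None:
    resp = client.chat.completions.create(
        model="o1-mini",
        messages=[{"role": "user", "content": PRE_PROMPT + task}],
    )
    # Assumes the model complied with the JSON-only instruction.
    plan = json.loads(resp.choices[0].message.content)
    for f in plan["files"]:
        with open(f["path"], "w") as fh:   # materialize each generated file
            fh.write(f["content"])
```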

6/15
@ai_for_success


[Quoted tweet]
Just combined @OpenAI o1 and Cursor Composer to create an iOS app in under 10 mins!

o1-mini kicks off the project (o1 was taking too long to think), then I switch to o1 to finish off the details.

And boom: full Weather app for iOS with animations, in under 10 🌤️

Video sped up!


https://video.twimg.com/ext_tw_video/1834347696351506433/pu/vid/avc1/1670x1080/IbiRyYlBrlsnOFse.mp4

7/15
@ai_for_success


[Quoted tweet]
Just used @OpenAI o1 to create a 3D version of Snake in under a minute! 🐍

One-shot prompt, straight into @Replit, and run.


https://video.twimg.com/ext_tw_video/1834312091794087936/pu/vid/avc1/1920x1080/TAkw4LgdJQbMsjKj.mp4

8/15
@ai_for_success


[Quoted tweet]
o1-mini is blowing my mind. Watch as o1 saves me hours of time on a complex coding update, and does a nice little refactor as a bonus.

The world is changing before our eyes and I'm loving it.

cc @gdb @swyx @emollick @mckaywrigley @_jasonwei


https://video.twimg.com/ext_tw_video/1834620058779443200/pu/vid/avc1/1862x1080/qLMHModRIvr0-Cua.mp4


10/15
@ai_for_success


[Quoted tweet]
OpenAI o1 creates a fully interactive space shooter game in less than 2 minutes and Replit lets me run it in seconds.

AI and coding has changed forever.


https://video.twimg.com/ext_tw_video/1834418626708774912/pu/vid/avc1/1434x720/dDV52U9f1Ws9J_cG.mp4

11/15
@ai_for_success


[Quoted tweet]
This is madness...

OpenAI o1 model builds a fully functional "chess game" that allows me to compete against an AI-based opponent.

o1-preview is the real deal.


https://video.twimg.com/ext_tw_video/1835684469346340864/pu/vid/avc1/1280x720/2APhiiXEQHJAwph4.mp4

12/15
@ai_for_success


[Quoted tweet]
Introducing o1 web crawler πŸ•ΈοΈ

It crawls entire websites with OpenAI's new o1 reasoning model and @firecrawl_dev.

Just state an objective and it will navigate + return the requested data in a JSON schema.

Check it out:


https://video.twimg.com/ext_tw_video/1835774638166585345/pu/vid/avc1/930x720/bhglpVFnzCBYkT-s.mp4

13/15
@ai_for_success


[Quoted tweet]
Yesterday I wanted to test @OpenAI's new o1 model out so I took a stab at creating a Mario-esque platformer using some @unofficialmfers pixel assets I made a while back.

I'm impressed with the model's speed and capability to problem solve while writing long sections of code. I have no experience with other models, but I can say it's much better than 4o.

4 hours from beginning to end and much of that was spent hastily making assets and getting the model to design the level, which was tougher than you'd think...still needs work and probably a way to make it manually tbh, but maybe we figure that out too.

Will see how it does in the next few days, but generally impressed how far I was able to get in a morning. Just a hobby project, but I'm going to continue to make assets and maybe change themes and gameplay, so it turns into a bit of its own thing.

Let you all know where I am in a few days to a week.


https://video.twimg.com/ext_tw_video/1836313513574182912/pu/vid/avc1/818x720/Srkb1bgciC6n2j4B.mp4

14/15
@ai_for_success


[Quoted tweet]
When I first taught myself how to code, the first real project I built was a side-scroller game for an NFT project (left video)

With OpenAI o1, I recreated the entire front-end in one prompt and no coding (right video)

If you have an old side project, I highly recommend prompting o1 to recreate it.

Watching the model code something you originally built by hand, in minutes, feels surreal.

It might take some tinkering with the prompts, but that's the point. Prompting o1 is a completely different experience compared to using GPT-4.

It's never been easier to code your own app. I can't wait to see what the next generation of builders create with tools like o1.


https://video.twimg.com/ext_tw_video/1835360419050852352/pu/vid/avc1/1066x720/HFRq_AsyP7s1ONcY.mp4

15/15
@ai_for_success


[Quoted tweet]
o1 is really good at making fun small games! for example, i made AISteroid Game w/ retro scifi vibes :smile:


https://video.twimg.com/ext_tw_video/1834280665275342848/pu/vid/avc1/1472x720/sQcdz_XD-hBd7xNd.mp4


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,110
Reputation
8,072
Daps
153,700


1/4
@rohanpaul_ai
This is so super cool 👍

The first decentralized training run of a 10-billion-parameter model.

With this kind of scaling, AGI may be open-source.

[Quoted tweet]
Announcing INTELLECT-1: the first-ever decentralized training of a 10B model

Scaling decentralized training 10x beyond prior efforts.

Anyone can join us to build open-source AGI 🦋


GZpmboYXQAYtq87.png


https://video.twimg.com/ext_tw_video/1844810428230369280/pu/vid/avc1/1610x1080/smIQ7ZW6QjVJ-oEn.mp4

2/4
@rohanpaul_ai
INTELLECT-1: Launching the First Decentralized Training of a 10B Parameter Model



GZpmr7nWEAwVIvD.png


3/4
@samsja19
That's the goal 🫑🫑



4/4
@AIXpert_YK
I'm excited to see open-source AGI taking shape with INTELLECT-1. Can't wait to see the progress and scalability in decentralized training




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,110
Reputation
8,072
Daps
153,700









1/25
@rohanpaul_ai
Fantastic piece from Anthropic CEO.

Dario Amodei believes powerful AI could come as early as 2026

And this is his definition of "Powerful AI"

[Quoted tweet]
Machines of Loving Grace: my essay on how AI could transform the world for the better

darioamodei.com/machines-of-…


GZphmOGWIAc3ifW.png


2/25
@rohanpaul_ai
"The resources used to train the model can be repurposed to run millions of instances of it (this matches projected cluster sizes by ~2027), and the model can absorb information and generate actions at roughly 10x-100x human speed5. It may however be limited by the response time of the physical world or of software it interacts with."



GZpiCd8WIBILn46.png


3/25
@roninhahn
I hate to beat a dead horse.



GZtGhHbXAA8TF0B.jpg


4/25
@rohanpaul_ai
Hope it will happen 🚀



5/25
@Colter
I agree with this more than with "AGI is around the corner." All AI model growth is slowing and hitting the same ceiling.



6/25
@rohanpaul_ai
yep



7/25
@PostPCEra
glad to see, finally, a rational individual depict the capabilities of "powerful AI", its potential to transform various fields, and its interactions with the physical & software world

even if it comes by 2030, the world will get huge shocks

>powerful AI could come as early as 2026



8/25
@rohanpaul_ai
exactly, one could wait a lifetime for that kind of machine power 😄



9/25
@TheJohnEgan
show me the math for that



10/25
@rohanpaul_ai
surely they have some basis for the calculation



11/25
@KrakowiakK
Nobel Prize winner on my local desktop, hmmm, intriguing 😉



12/25
@rohanpaul_ai
Isn't it 🔥



13/25
@GM0x_8e08



GZpvYLqa4AA-YrM.jpg


14/25
@rohanpaul_ai
nice



15/25
@scifirunai
I love Dario Amodei's optimism about AI's potential to transform various fields. Powerful AI by 2026 is a bold prediction - excited to see how it plays out!



16/25
@rohanpaul_ai
👍👍



17/25
@HolgersenTobias
Looks good to me 🖖



18/25
@gpt_biz
Sounds interesting! AI is evolving fast, can't wait to see what 2026 holds



19/25
@LooveThyJourney
Not to mention, excellent memory. And efficient communication with multiple copies of itself. The logistical capabilities will be enormous.

Have it teach people how to make TikTok and YouTube dance videos, and it may just cure depression across the population all in one go ✨🚀



20/25
@wangcyyc
a great job!



21/25
@johnny1tap
The whole thing with interfaces and agentic behavior is already here; there just isn't a conglomerated product that everyone can point to and say, look, there it is. The proficiency-level stuff obviously has a ways to go, and there are definitely big questions regarding that...



22/25
@lolomovia
AI makes some people's lives better while making others' lives worse. AI users are getting rich by using AI, but in doing so, they are also making others poor, because one person's wealth is built on another's poverty.



23/25
@lolomovia
When one person has more resources, others will have less, you know, resources are fixed!!



24/25
@karlkyrkland
ChatGPT is currently getting a 2 out of 5 in AP English. There have been no real improvements since ChatGPT 3.5. It does not even grasp the fundamentals of writing. How would AI be able to compete with Nobel Prize-winning authors in 2 years? It can't even compete with teenagers.



25/25
@wdonno
So why did an Anthropic senior executive speak at the Heritage Foundation-sponsored conference last month in San Francisco, Reboot 2024?
It was complete with an all-male panel discussing pronatalism, 'encouraging' women to have babies.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196





1/7
@rohanpaul_ai
Anthropic CEO Dario Amodei's latest piece on the future of AI, with Google's NotebookLM

[Quoted tweet]
Fantastic piece from Anthropic CEO.

Dario Amodei, believes powerful AI could come as early as 2026

And this is his definition of "Powerful AI"


GZphmOGWIAc3ifW.png


https://video.twimg.com/amplify_video/1844897701101514754/vid/avc1/1080x1080/tF7za3Az2pqPzvZr.mp4

2/7
@retro_visionary
Are those voices ai-generated?



3/7
@rohanpaul_ai
yep



4/7
@GM0x_8e08
🔥🔥



5/7
@UristaTim
Nice. 🔥🔥



6/7
@gpt_biz
This is a great read if you're interested in the future of AI and its potential impact on our daily lives



7/7
@RomanP918791
NotebookLM is the new Joe Rogan 🤣




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,110
Reputation
8,072
Daps
153,700





1/9
@rohanpaul_ai
This Paper reveals LLMs lack robust mathematical reasoning, relying on pattern matching rather than genuine conceptual understanding.

Until now, LLMs have shown impressive performance on grade-school math tasks like GSM8K. But it's unclear whether they truly have mathematical reasoning abilities or whether the reported metrics are reliable.

**Solution in this Paper** 🔬:

• Introduces GSM-Symbolic benchmark with templates to generate diverse question variants
• Allows evaluating LLM performance as a distribution across different instantiations
• Examines impact of changing names vs numbers, increasing difficulty, adding irrelevant info
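
A minimal sketch of what a symbolic template looks like: names and numbers become variables, and every instantiation recomputes the ground truth. The template text is illustrative, not from the benchmark.

```python
import random

NAMES = ["Sophia", "Liam", "Ava"]

def instantiate():
    name = random.choice(NAMES)
    x = random.randint(5, 20)   # quantity variable
    y = random.randint(1, 4)    # price variable
    question = f"{name} buys {x} pencils at ${y} each. How much does {name} spend?"
    return question, x * y      # answer recomputed per variant

for _ in range(3):
    print(instantiate())        # a fresh variant of the "same" question each time
```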

**Key Insights from this Paper** 💡:

• LLMs show high variance in performance across different variants of same question
• More sensitive to changes in numbers than names
• Performance degrades and variance increases with more complex questions
• Adding irrelevant but plausible info causes major drops in accuracy (up to 65%)
• Results suggest LLMs lack true mathematical reasoning, rely on pattern matching

**Results** 📊:

• Performance on GSM-Symbolic is lower than on GSM8K for most models (e.g. 79.1% vs 87% for Gemma2-9b)
• Changing only numbers drops performance more than changing only names
• Accuracy decreases as question difficulty increases (e.g. 84.4% → 79.1% → 68.1% → 41.8% for Gemma2-9b)
• GSM-NoOp dataset with irrelevant info causes 20-65% accuracy drops across all models



GZpuU-XXoAgQo8r.png


2/9
@rohanpaul_ai
Illustration of the GSM-Symbolic template creation process.



GZpvR2fW8AYJ4O1.png


3/9
@deter3
Human beings also do pattern matching for most reasoning tasks. When we say humans can think or reason, it just means we can do pattern matching. Even the most complex thinking is still the integration of many pattern matches. So, what's wrong with pattern matching, come on!



4/9
@rohanpaul_ai
Agree. And understanding emerges from pattern matching.



5/9
@rohanpaul_ai
📚 https://arxiv.org/pdf/2410.05229



6/9
@leettools
Very interesting results. Thanks for sharing!



7/9
@rohanpaul_ai
pleasure 👊



8/9
@asankhaya


[Quoted tweet]
This is surprising only to those who have not worked in formal reasoning. Yes, LLMs cannot do true logical reasoning in a formal sense; you can do better with an SMT solver. But it is also true that you can solve a lot of logical problems by just applying "reasoning steps" from the training data, especially when your training data is the entirety of written content ever produced. Both of these can be true at the same time; it is not a contradiction, just an interesting dichotomy.


9/9
@mrlowercasea
Could genuine understanding be thought of as really advanced pattern matching?




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,110
Reputation
8,072
Daps
153,700


1/2
@rohanpaul_ai
Great paper addressing the performance gap between open-source and proprietary models

It proposes a Decompose, Critique, and Refine (DeCRIM) self-correction pipeline, which enhances LLMs' ability to follow constraints. DeCRIM works by decomposing the original instruction into a list of constraints and using a Critic model to decide when and where the LLM's response needs refinement.
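
A minimal sketch of that loop; `llm` and `critic` are hypothetical callables standing in for real model calls.

```python
def decrim(instruction: str, llm, critic, max_rounds: int = 3) -> str:
    # Decompose: turn the instruction into granular constraints.
    constraints = llm(f"List the individual constraints in: {instruction}")
    response = llm(instruction)
    for _ in range(max_rounds):
        # Critique: which constraints does the current response violate?
        feedback = critic(response, constraints)
        if not feedback:             # all constraints satisfied
            return response
        # Refine: regenerate with the critic's feedback attached.
        response = llm(f"Instruction: {instruction}\n"
                       f"Previous response: {response}\n"
                       f"Fix these issues: {feedback}")
    return response
```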

**Original Problem** 🔍:

LLMs struggle to follow instructions with multiple constraints, failing to meet at least one constraint in over 21% of real-world user requests. Existing benchmarks rely on synthetic data, not capturing real-world complexity.

-----

**Solution in this Paper** 💡:

• Introduces REALINSTRUCT benchmark using real user requests to AI assistants
• Proposes DECRIM (DECOMPOSE, CRITIQUE, AND REFINE) self-correction pipeline:
- Decomposes instructions into granular constraints
- Uses Critic model to evaluate constraint satisfaction
- Refines output iteratively based on Critic feedback
• Investigates LLM-as-a-Judge for cost-effective constraint satisfaction evaluation

-----

**Key Insights from this Paper** 💡:

• Real user requests often contain multiple, complex constraints
• LLMs, including GPT-4, struggle with multi-constrained instructions
• Open-source LLMs can match/surpass GPT-4 with strong feedback in DECRIM
• Critic quality is crucial for DECRIM's success

-----

**Results** 📊:

• DECRIM improves Mistral's performance by 7.3% on REALINSTRUCT and 8.0% on IFEval with weak feedback
• With strong feedback, DECRIM enables open-source LLMs to outperform GPT-4 on both benchmarks
• GPT-4-Turbo with Chain-of-Thought prompting is a reliable, cost-effective alternative to human evaluation for constraint satisfaction



GZptUC-XwAYi6LC.png


2/2
@rohanpaul_ai
📚 https://arxiv.org/pdf/2410.06458




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,110
Reputation
8,072
Daps
153,700


1/10
@rohanpaul_ai
Ilya Sutskever was a co-author of this paper.

If you want to understand how the latest o1 model from @OpenAI works, this paper from May 2023 is a good start.

"Let's Verify Step by Step"

**Key Insights from this Paper** 💡:

👉 To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step.

👉 This Paper finds that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset.

• Active learning improves process supervision data efficiency by 2.6x

• Large reward models can approximate human supervision for smaller models

• Process supervision leads to more interpretable and safer AI alignment

**Original Problem** 🔍:

Large language models have improved at complex multi-step reasoning but still make logical mistakes. Comparing outcome supervision (feedback on final results) with process supervision (feedback on each reasoning step) is crucial for training more reliable models.

-----

**Solution in this Paper** 🧠:

• Trained process-supervised reward models (PRMs) on 800K human-labeled solution steps
• Implemented active learning to select informative samples for labeling
• Compared PRMs to outcome-supervised reward models (ORMs) trained on 100 samples/problem
• Evaluated using best-of-N search over generator solutions
• Conducted small-scale experiments using large PRM as synthetic supervisor
• Released PRM800K dataset with 800,000 step-level human feedback labels
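
To make the PRM-vs-ORM distinction concrete, here is a minimal sketch of best-of-N reranking with a process reward model: every step gets a score, and a solution's overall score is the product of its step scores, so one bad step sinks the whole solution. `prm_step_score` is a hypothetical stand-in for the trained PRM.

```python
import math

def solution_score(steps: list[str], prm_step_score) -> float:
    # P(solution correct) ~= product of per-step correctness probabilities.
    return math.prod(prm_step_score(s) for s in steps)

def best_of_n(candidates: list[list[str]], prm_step_score) -> list[str]:
    # An ORM would score only the final answer; the PRM scores every step.
    return max(candidates, key=lambda steps: solution_score(steps, prm_step_score))
```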

-----

**Results** 📊:

• PRM solved 78.2% of MATH test problems (best-of-1860), vs 72.4% for ORM and 69.6% for majority voting
• PRM outperformed ORM across all problem difficulties and sample sizes
• PRM showed strong out-of-distribution generalization on recent STEM tests
• Active learning improved data efficiency of process supervision by 2.6x



GZqR8UcW8AYmexG.png


2/10
@rohanpaul_ai
📚 https://arxiv.org/pdf/2305.20050



3/10
@MitjaMartini
Super interesting, thanks for pulling this out.



4/10
@gdbsm1
@Readwise save



5/10
@gpt_biz
This paper provides a fascinating exploration of AI model training. Definitely worth a read if you're curious about improving model reliability through process supervision!



6/10
@KarthiDreamr
If it's that good, why is o1 easily beaten by the latest 4o in @lmarena_ai?



7/10
@arteyco
:smile:



8/10
@algomax06
See, without Ilya, OAI is in a shythole; from now on, no more ChatGPT moments



9/10
@Makkaryh
Where are you getting these papers from? 🤔



10/10
@3DX3EM


[Quoted tweet]
Worth repeating:
Do not confuse retrieval with reasoning.
Do not confuse rote learning with understanding.
Do not confuse accumulated knowledge with intelligence.



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 