The A.I Megathread (LLM , GPT , Development)

bnew · Sep 9, 2024

1/8
@rohanpaul_ai
This is ABSOLUTELY incredible to see -

AI outperforms humans in research ideation novelty

**Key Insights from this Paper

**:

• LLM-generated ideas are judged as more novel than human expert ideas
• AI ideas may be slightly less feasible than human ideas
• LLMs cannot reliably evaluate research ideas yet

-----

**Solution in this Paper**

:

• Recruited 100+ NLP researchers for idea generation and blind review
• Developed an LLM ideation agent with:
- RAG-based paper retrieval
- Overgeneration of ideas (4000 per topic)
- LLM-based idea ranking
• Implemented strict controls:
- Standardized idea format and writing style
- Matched topic distribution between human and AI ideas
• Evaluated ideas on novelty, excitement, feasibility, effectiveness, and overall quality

-----

**Results**

:

• AI ideas rated significantly more novel than human ideas (p < 0.05)
- Novelty score: 5.64 (AI) vs 4.84 (Human)

• AI ideas slightly less feasible than human ideas (not statistically significant)
- Feasibility score: 6.34 (AI) vs 6.61 (Human)

• Only 5% of AI-generated ideas were non-duplicates

• LLM evaluators showed lower agreement with human reviewers than inter-human agreement
- Best LLM evaluator accuracy: 53.3% vs Human inter-reviewer consistency: 56.1%

2/8
@rohanpaul_ai
Overview of study:

- recruit 79 expert researchers to perform blind review of 49 ideas from each of the three conditions: expert-written ideas, AI-generated ideas, and AI-generated ideas reranked by a human expert.

- standardize the format and style of ideas from all conditions before the blind review. We find AI ideas are judged as significantly more novel than human ideas (p < 0.05 )

[2409.04109] Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers

3/8
@ChromeSub
They certainly can brainstorm and use other creativity techniques better than most people who are hung up by social norms and being judged by the perceived worth of their ideas. We tend to play it safe and offer only incremental changes.

4/8
@rohanpaul_ai
exactly, isn't it

5/8
@rohanpaul_ai
Comparison of the three experiment conditions across all review metrics.

Red asterisks indicate that the condition is statistically better than the Human baseline with two-tailed Welch’s t-tests and Bonferroni correction.

All scores are on a 1 to 10 scale.

6/8
@rohanpaul_ai
The “experts” in this study are the 'REAL' experts, some of the best people in the field.

Coming from 36 different institutions, our participants are mostly PhDs and postdocs.

As a proxy metric, their idea writers have a median citation count of 125, and their reviewers have 327.

7/8
@deter3
you might want to check my paper"Automating psychological hypothesis generation with AI: when large language models meet causal graph" Automating psychological hypothesis generation with AI: when large language models meet causal graph - Humanities and Social Sciences Communications , we have approved our workflow/algo can even generate novel hypothesis better than GPT4 and Phd .

8/8
@maledorak
Badge of “Type 1 Civilization of Kardashev scale” we are coming for you

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 9, 2024

1/5
@rohanpaul_ai
The most advanced AI models are trained on 1 petabyte of data, Dell Technologies has 120,000 petabytes.

---

Video Credit - Original full video from "Yahoo Finance" YouTube Channel . Link in Comment.

2/5
@rohanpaul_ai
Original full video from "Yahoo Finance" YouTube Channel .

How Dell is riding the AI wave: Insights from CEO Michael Dell

3/5
@anushkmittal
120k petabytes of excel spreadsheets

4/5
@ddebowczyk
Exactly:

5/5
@fatherailover
I've often wondered what happens when you train AI models on a massive dataset, but that's nothing compared to having 120,000 petabytes at your disposal - I'm curious, what's the plan to utilize this data at Dell Technologies?

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/2
@DaveShapi
The shortage of human experts is the greatest bottleneck of human progress.

So what happens when we have billions of AI agents with an IQ of 250 running around?

2/2
@ddebowczyk
We have millions of human experts working in companies of all sizes generating a constant stream of quality information every year (which could be translated into billions / trillions of training tokens). The opportunity awaits.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/1
@ddebowczyk
Expect new incentives for the knowledge workers and new UX designs to maximize cooperation between human and language models.

With the arrival of LLM powered business workflows human's role in business processes is going to change, rather then getting eliminated.

Users will become more efficient with some work completed by LLM tech (e.g. agents), but key actions will still require their review, feedback and approval.

Human operators won't just process stuff, they will become instructors, managers and conductors of individual AI agents (or their swarms). This has the potential to multiply the value of employee outputs.

Human input will provide high value data used to improve the agents and amplify automation. Their judgement and feedback will be important company IP, shaping the way AI processes future cases and essential in quality and pace of automation.

Companies will need to revamp their systems with UX making AI-human collaboration effective and better incentivize people to contribute to long term improvement of their AI "disciples".

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 9, 2024

1/25
@corbtt
I am working with @mattshumer_ to get to the bottom of what happened with Reflection. He is providing access to all serving code and weights with the goal of replicating the strong reasoning performance @ArtificialAnlys was able to see over the weekend.

I will report back.

2/25
@far__el
huggingface-cli upload-large-folder <repo-id> <local-path> --repo-type=model

3/25
@corbtt
few understand

4/25
@HxHasNoMemory
Can you comment on the Claude hypothesis?

5/25
@corbtt
I think the easiest way to disprove is by running the actual weights and getting the reported performance, so that's my goal. Want to try that first before speculating further.

6/25
@MaziyarPanahi
It cannot be simpler:

1. Upload the very same model that was used to create that table, claiming to be the “world’s best open-source LLM”

2. Share the exact lm-eval command line that was used to create that eval table

Anything else is making excuses and misleading public!

7/25
@airesearch12
glad to hear there is some effort being made to clarify the situation!

vibes are getting worse and worse on the topic, to an extent where many imply fraud.

the most important action to calm the tides is independent weight testing, everything else can be sorted out later.

8/25
@mysticaltech
Folks, ONLY EXPLANATION is that the inference server matters... Weird huh!

9/25
@jpalioto
IMHO, at this point, anything short of the model itself available (not an API) is not going to convince anyone

10/25
@morgymcg
Happy to help with instrumenting the evals with Weave, ye can share public links to the evaluation results plus every question - answer pair

@capetorch and @ayushthakur0 both have a lot of experience instrumenting evals

11/25
@samarknowsit
Just call it a close, who is worried about it anymore.

12/25
@Uncensored_AI
test the provided serving code inference against that of the live api

should see little to no difference w max. deterministic configs (e.g. temp=0)

13/25
@intervitens

14/25
@lukeNukemAI
Credibility has sunk.. let it be, enough damage has been done.. rabbit r1 part 2!

15/25
@Samhanknr
Why not upload the weights to bit torrent ? There’s no reason to keep the weights private if this is real. If it’s real the community will find a way to make it work.

16/25
@DrCMcMaster
This makes no sense. Just publish the code to GitHub and the weights to huggingface. They should be no reason to share these privately if the intention was always to release openly. Heck, just share the dataset!

17/25
@Whatevercrypto
I was under the impression that the strong API performance on the private API or whatever it was, was due to making API calls to sonnet

18/25
@DaKulchur
Bad judgment to get involved, unless one has to ..

19/25
@mailupule25785
Sorry to ask you but why is it so hard to upload a model on HF if it already runs on API.

20/25
@_flashman_harry
Ask him to confess. Thanks in advance bro

21/25
@david1lz
The performance was so strong that @ArtificialAnlys deleted the tweet. By the way, that "evaluation" was based on the famous private API. Xd

22/25
@Oli82817545
thanks much appreciated i dont think matt is a scammer but all of this is definetely really strange hope you two can finally get to the bottom of this the reflection model i was able to try was truly something special

23/25
@smith_ahma357
You're a glaive investor who is also in on this grift. Clearly, @mattshumer_ is a fraud and is trying his best to backpedal in hopes of appeasing his investors and avoiding a fraud charge. Bad day for you guys. D:

24/25
@NoBSMattress
Who the hell are you? Lol.

25/25
@LlamaDebussy
Ask him to apologize and delete twitter. His grift has wasted thousands of people's time and money. He should be completely ostracized and shamed.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 9, 2024

1/3
@rohanpaul_ai
Quite incredible code generation result using the new "PLANSEARCH" search algorithm

optillm lib implements the core idea of "PLANSEARCH" here. This lib is for optimizing inference proxy. It implements several SOTA technique to improve the accuracy and performance of LLMs.

optillm lib is
- OpenAI API compatible
- They currently focus on improving reasoning over coding, logical and mathematical queries.

---

So its possible to beat the frontier models using these techniques across diverse tasks by doing additional compute at inference time.

---

They just implemented the core idea from the paper "Planning In Natural Language Improves LLM Search For Code Generation" to push gpt-4o-mini with PLANSEARCH to 63.5% on the LiveCodeBench at pass@10.

PLANSEARCH, is a novel search algorithm proposed in the above paper.

By searching over plans in natural language rather than directly over code solutions, PLANSEARCH explores a significantly more diverse range of potential solutions compared to baseline search methods.

2/3
@rohanpaul_ai

The github lib - GitHub - codelion/optillm: Optimizing inference proxy for LLMs

The Paper "Planning In Natural Language Improves LLM Search For Code Generation" - [2409.03733] Planning In Natural Language Improves LLM Search For Code Generation

3/3
@anushkmittal
impressive. we're always pushing the boundaries of what's possible with ai at shapes inc.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/9
@asankhaya
A recent paper from @hughbzhang , @evanzwangg and other folks from @scale_AI titled "Planning In Natural Language Improves LLM Search For Code Generation" ([2409.03733] Planning In Natural Language Improves LLM Search For Code Generation) showed a novel search algorithm that searches over candidate plans for solving a problem in natural language.

I have implemented the core idea from their paper in our open-source optimizing proxy optillm - GitHub - codelion/optillm: Optimizing inference proxy for LLMs

I was able to replicate the core idea, in fact, I was able to push gpt-4o-mini with plansearch to 63.5% on the LiveCodeBench at pass@10.

Also, I found that plansearch-gpt-4o-mini at pass@5, already gives almost 10% better results when compared to pass@5 gpt-4o-mini.

2/9
@n0riskn0r3ward
That was fast! What's the date range of the live code bench problem subset these pass@1 - 10 stats come from?

3/9
@asankhaya
Whatever is the default? I thought it is run based on the datetime in the lm_styles.py file.

The only change I did in LiveCodeBench was to modify the lm_styles.py file and add a model for plansearch-gpt-4o-mini, then I ran it with base_url as the optillm proxy's url. So, the numbers are for the following:

LanguageModel(
"gpt-4o-mini-2024-07-18",
"GPT-4O-mini-2024-07-18",
LMStyle.OpenAIChat,
datetime(2023, 4, 30),
link="https://openai.com/index/spring-update",
),
LanguageModel(
"plansearch-gpt-4o-mini-2024-07-18",
"PlanSearch-GPT-4O-mini-2024-07-18",
LMStyle.OpenAIChat,
datetime(2023, 4, 30),
link="https://openai.com/index/spring-update",
)

4/9
@UserMac29056
How would the results be if we fine tuning GPT-4o-mini with distilled dataset from GPT-4o plansearch?

5/9
@asankhaya
You will likely see further improvements, we did something similar for a bug fixing task recently - Static Analysis Evaluation Benchmark with OpenAI's GPT-4o Fine-Tuning

6/9
@carterleffen
great job! do you plan on opening up your fine tuning dataset and how you generate it?

7/9
@asankhaya
This is without any fine tuning, the optillm proxy applies inference only techniques that trade extra compute at inference time for better results. The entire implementation for plansearch in optillm is right here - optillm/plansearch.py at main · codelion/optillm

8/9
@hughbzhang
this is insanely fast. super impressed!

9/9
@algxtradingx
Damn, homie, good work. 🫡

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 9, 2024

GitHub - zhentingqi/rStar

Contribute to zhentingqi/rStar development by creating an account on GitHub.

github.com

rStar: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

This repository contains necessary scripts to run rStar's generator and discriminator.

Link to paper: Paper page - Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers, [2408.06195] Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

Intro

We propose rStar, a Self-play muTuAl Reasoning approach that significantly improves reasoning capabilities of small language models (SLMs) without fine-tuning or superior models. rStar decouples reasoning into a self-play mutual generation-discrimination process.

Ghost Utmost · Sep 9, 2024

Okay now I'm scared. I call myself a writer although I never sold anything.

Just fukcing around with Gemini and it wrote this is one second.

The Showdown
The sun was beginning its descent, casting long, dancing shadows across the dusty plains. A lone cowboy, his weathered face etched with lines of hardship and determination, stood facing off against a fearsome pirate, his eyes glinting with a savage intensity.
The pirate, a hulking figure with a cutlass in hand, lunged forward, his blade a blur of steel. The cowboy, quick as a flash, sidestepped the attack, his own hand reaching for the butt of his revolver. With a flick of his wrist, he drew his weapon and fired, the thunderous roar of the shot echoing across the desolate landscape.
The pirate grunted, stumbling back as the bullet grazed his shoulder. He raised his cutlass again, a snarl twisting his lips. The cowboy, unyielding, met him with a flurry of punches and kicks, his movements fluid and precise. The pirate, momentarily stunned, retaliated with a powerful swing of his cutlass, but the cowboy, anticipating the move, ducked beneath the blade and delivered a vicious blow to the pirate's ribs.
With a cry of pain, the pirate staggered backward, his grip on his cutlass loosening. Seizing the opportunity, the cowboy rushed forward, his revolver leveled at the pirate's chest. The pirate, desperate, lunged at the cowboy, but his movement was slow and labored. The cowboy's finger tightened on the trigger, and with a final, decisive shot, the pirate fell to the ground, his life extinguished.

bnew · Sep 10, 2024

1/4
@deepseek_ai

Exciting news! We’ve officially launched DeepSeek-V2.5 – a powerful combination of DeepSeek-V2-0628 and DeepSeek-Coder-V2-0724! Now, with enhanced writing, instruction-following, and human preference alignment, it’s available on Web and API. Enjoy seamless Function Calling, FIM, and Json Output all-in-one!

Note: Due to significant updates in this version, if performance drops in certain cases, we recommend adjusting the system prompt and temperature settings for the best results!

2/4
@deepseek_ai
DeepSeek-V2.5 outperforms both DeepSeek-V2-0628 and DeepSeek-Coder-V2-0724 on most benchmarks.

3/4
@deepseek_ai
In our internal Chinese evaluations, DeepSeek-V2.5 shows a significant improvement in win rates against GPT-4o mini and ChatGPT-4o-latest (judged by GPT-4o) compared to DeepSeek-V2-0628.

4/4
@deepseek_ai
DeepSeek-V2.5 is now open-source on HuggingFace!
Check it out: deepseek-ai/DeepSeek-V2.5 · Hugging Face

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/3
Wake up y'all! DeepSeek v2.5 is out! Merged model - weights on HF, 128K context, MoE - 238B param/ 21B active! - Optimised for coding

> Combined DS v2 & DS v2 Coder Merged - Beats GPT4o
> Arena Hard - 68.3% -> 76.3%
> Alpaca Eval - 46.61% -> 50.52%
> MT Bench - 8.84 -> 9.02
> Optimised for Coding - HumanEval 89%, LiveCodeBench - 41%
> Enhanced writing, instruction-following, and human preference alignment
> Function Calling, Fill-In-Middle, and JSON Output
> Available on Hugging Face
> Compatible with Transformers

Kudos to the @deepseek_ai team. I'm a big fan of their work; their models are just of primo quality!

2/3
Here's how they made this bad boi:

3/3
DeepSeek v2.5 - Model weights here:

deepseek-ai/DeepSeek-V2.5 · Hugging Face

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/2
DeepSeek V2.5 (21B active / 238B total) runs pretty nicely in MLX on an M2 Ultra.

Checkout the 4-bit model in the HF MLX community: mlx-community/DeepSeek-V2.5-MLX-AQ4_1_64 · Hugging Face

With prompt caching in MLX LM could be a pretty nice local coder (mlx-examples/llms/README.md at main · ml-explore/mlx-examples)

2/2
Speeds up the prompt / prefill time considerably since it's just loading a file from disk. Especially for MOEs for which prompt processing is kind of slow right now. I haven't measured exactly but for long prompts its orders of magnitude faster.

Generation time is unchanged.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/4
@kimmonismus
DeepSeek silently released their DeepSeek-Coder-V2-Instruct-0724, which ranks #2 on Aider LLM Leaderboard, and it beats DeepSeek V2.5 according to the leaderboard
https://teddit.net/r/LocalLLaMA/comments/1fd6z0v/deepseek_silently_released_their/

deepseek-ai/DeepSeek-Coder-V2-Instruct-0724 · Hugging Face

2/4
@olafgeibig
This is their V2 model, which was great but they also just released their newest model DeepSeek-V2.5. DeepSeek V2.5 is a merge of their V2 chat and coder model.
deepseek-ai/DeepSeek-V2.5 · Hugging Face

3/4
@relentless4o
these are awesome news

4/4
@BotChadX
they just keep shipping

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 10, 2024

1/10
LLMs still can't plan.
Llama-3.1-405b and Claude can plan a bit on Blocksworld.
GPT4 and Gemini not so much.
Performance is abysmal for everyone on Mystery Blocksworld.

2/10
So they can't do simple planning but you think they should be reviewing scientific papers? Your stance on LLMs seems contradictory.

3/10
I absolutely never said "LLMs should be reviewing scientific papers."
I said *reviewers* (as in human reviewers) should be able to use the tools they want to help them write reviews.
The quality of their reviews should be assessed on the result, not the process.

4/10
This is the same clown, who back in 2021 before GPT 3.5, claimed “even GPT 5000 won’t be as capable as a high schooler”

The current tech we have today has already far surpassed what Yann LeClown predicted was possible, now he’s moving the goalposts and won’t admit he was wrong.

5/10
You are the clown who didn't understand what I said.

But since you're talking about high schoolers: high schoolers can learn to drive a car in about 20 hours of practice.
Yet, we still don't have level-5 self-driving cars, and the best ones we have require a huge amount of training, detailed engineering, fancy sensors, and detailed maps.
They don't use LLMs because LLMs are hopeless for real-world data.

The point I was making is that LLMs do not understand the physical world and have no common sense. The biggest LLMs have less understanding of the world than a house cat, let alone a high schooler.

I hope you enjoyed this lesson, which your misplaced insult made you unworthy of receiving.

6/10
I mean, I can't parse this either.

55-60% on not-whatever-this-is planning seems pretty darn good to me.

7/10
The problem with LLMs is people expect too much out of them (because of the hype). The deficiencies maybe have to do with architectures themselves (honestly they all boil down to the same thing), or maybe the training process, or the data (which its deficiencies come from us, humans).

8/10
Curious if this is just a fundamental limitation of LLMs.

Seems like the way around this wouldn't be brute forcing the single shot output, but baking in (even on a lower level) the ability to plan and forecast before producing the final user facing output?

9/10
don’t think it’s accurate to say they can plan a bit, more like they are better at simulating planning

10/10
What are your thoughts on this reflection 70b saga

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/15
The latest LLM performance on PlanBench ([2206.10498] PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change) via @karthikv792. While they all bite the dust on Mystery BW, it is notable that the 400bn free open source model LLaMA3.1 does quite well compared to the costly closed models marketed as masters of reasoning..

2/15
Reminder that (1) the simple STRIPS planners have 100% accuracy on Mystery BW and (2) The human baseline is also 100% if the humans are incentivized to use their System 2. (For the record, we tried to offer $$ to LLMs too but they can't seem to use the System 2 they don't have

)

3/15
If you are interested in getting the whole skinny, checkout the /search?q=#ICML2024 tutorial..

On the Role of LLMs in Planning (ICML 2024 Tutorial)

4/15
Why do these models seem to do worse one shot vs single shot

5/15

6/15
Waiting for the results of the novel Reflection LLM

7/15
Not sure which specific magical LLM you have in mind, but self-verification capabilities of LLMs are a mirage (at least as far as correctness is concerened)

8/15
Is there any data on how reliably reproducible these results are for each model?

9/15
would love for there to be a human playable subset of the planbench challenges, a bit like the Arc-Agi website

10/15
Interesting.. aligns very well with our internal benchmarks. Claude performs best at Zero shot planning, while Gemini among the worst. However never tried Llama 3.1, going to give this a go now

11/15
How would you explain the improved Performance? RLHF? Some BW in the training Data?

12/15
I'll believe the hype when I see an LLM actually solving Mystery Blocks World. Until then, let's not get too carried away with the Super Duper Reasoning and Planning Skillz

13/15
Would be interesting to see how Reflection Llama 405b fares when it releases this week. Maybe it will indeed be a breakthrough

14/15
Smells fishy… how does zero shot do better than one shot?

15/15
lol Gemini is such dogshyt

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 10, 2024

1/18
@mattshumer_
You can use Reflection on OpenRouter now!

Weights are up but we're still testing to make sure they're right before we're going to call it a day.

2/18
@eztati
This is the first time I have seen a model answering @karpathy prompt engineering challenge on the first try. Nice work.

3/18
@jdaveis
lol its actually Claude ( @mattshumer_ / @GlaiveAI scam becoming embarrassingly obvious):

4/18
@eztati
Really? Try the prompt above and see if Claude can solve the challenge :smile:

5/18
@Evan0x51
care to show this result?

6/18
@eztati
Sure: here is the prompt

7/18
@ZlpdmZdbrg
How to short @mattshumer_ and @GlaiveAI ?

8/18
@MrExocomp
Was this all a scam? Say it ain’t so Matt, say it ain’t so

9/18
@rimomaguiar
My GPT can solve it, but it takes a hell of a prompt lol: ChatGPT

10/18
@ada_stefn
Matt shumer.. exposed as a fraud lol. What did you think was going to happen?

11/18
@NicoTechDream
Absolutely remarkable to see the prompt engineering challenge nailed on the first try! Thrilled for more such advancements! /search?q=#AI /search?q=#PromptEngineering

12/18
@emicovi
ask it to “write a sentence that contains the letter f exactly three times” ahahaha

13/18
@AngrYoungBhaloo
@teortaxesTex this might be the real deal

14/18
@GViela
@DotCSV otra prueba para el test de futuros modelos de LLM que es interesante

15/18
@Jotavix_

16/18
@Eyc_C_you
this guy is a scammer , check reddit posts about it . its using sonnet and gpt api on their backend

17/18
@MS23840062
This is crap

18/18
@komrade_nikola
You make AIs do the impossible? Lying is definitely within the scope of Claude's ability. Nice switch back to 4o, btw. Also, this is fraud, my man. As in... Felony.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/1
Screaming and swearing at your wife because she talked to black people in a public restaurant

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 10, 2024

1/23
@abacaj
Could not reproduce the 91% humaneval score for reflection (ref_70_e3), run locally using bf16 with vLLM. Used the "recommended" system prompt + extracting from output tags: 81.1%

meta-llama-3.1-70b instruct using the same setup (no sys prompt) gave me: 79.88%

2/23
@abacaj
Reflection result left, meta instruct result right

3/23
@abacaj
Model is not any better in vibe checks locally - never tried the API version... prefer to have weights to do apples to apples runs locally (same precision, sampling, prompts etc). I would largely say it is worse (fit to a single format, repeats instructions)

4/23
@abacaj
Example:

5/23
@abacaj
I honestly have no idea how they got 91% on humaneval or other benchmarks? The only thing I could think of is high temperature sampling X multiple times and using pass@k but not sure how they ran them in the first place

6/23
@abacaj
Model generates *a lot* more tokens estimate > 50% more compared to regular llama for each response, given the poor performance I would think this method is largely a deadend or executed poorly (reflection)

7/23
@gfodor
Can’t believe how much this guy wasted the time of so many smart people maliciously - obviously it’s a scam at this point and he should be apologizing but I doubt he will

8/23
@abacaj
The part that ticked me was publishing insane numbers without having anyone else verify

9/23
@ivan_bezdomny
yep -- no one can...

the story was fishy from the start

10/23
@dkardonsky_
not great

11/23
@almmaasoglu
I was waiting for this

12/23
@MaziyarPanahi
It’s not our job to figure it out how to evaluate it! It’s their job to get out of the hole they are hiding and show that lm-eval command line they used to claim “world’s best open-source LLM”!
He has no idea how disrespectful this sentence is to the rest of us!

13/23
@samarknowsit
Likewise, I already tried this on all versions uploaded, best I could get was 78.2%. Btw all the hashes matched,

what weights were mixed up ?

14/23
@qtnx_
unsurprising but disappointing

15/23
@TheSeaMouse
If you round to the nearest integer they both round up to 100% so close enough

16/23
@doodlestein
Rare to see someone completely torch all their credibility in a bonfire like that. Don’t really see the point of it.

17/23
@taomindos
Interesting results! Me.bot - Your Inspiring Companion

18/23
@mattwallace
Doing the same tests atm. I had 79.8% at 0.6 temp, 75.6% at 0.7, single pass pass@1 (on e3)

anecdotally, I'll say on the first set of weights, I feel like self-correction was very common (but hadn't tested humaneval yet); on the e3 weights I think I have yet to see it correct.

19/23
@ShivamKumar212
What are the results of llama-3.1-70b with the same system prompt?

20/23
@Samhanknr
I’m still stumped how they think they can get away with “open sourcing” a fake result. What did they think would happen ?

21/23
@MasoudMaani
weights are broken while loading to RAM. lol.

22/23
@_EyesofTruth_
I don't understand why this isn't everyones method.

Tech twitter is starting to look more like a highschool girls brunch club.

23/23
@amr__mo

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 10, 2024

bnew · Sep 10, 2024

bnew · Sep 10, 2024

bnew · Sep 10, 2024

1/11
I got ahead of myself when I announced this project, and I am sorry. That was not my intention. I made a decision to ship this new approach based on the information that we had at the moment.

I know that many of you are excited about the potential for this and are now skeptical. Nobody is more excited about the potential for this approach than I am. For the moment, we have a team working tirelessly to understand what happened and will determine how to proceed once we get to the bottom of it. Once we have all of the facts, we will continue to be transparent with the community about what happened and next steps.

2/11
More details here:

3/11
impulsive and unchecked product delivery is a mistake in leadership. it should be, once 'you' have all the facts, then you can be more transparent. don't add 'we' to a 'you' situation. take ownership!

4/11
is this what happened?

5/11
Hi Matt, we spent a lot of time, energy, and GPUs on hosting your model and it's sad to see you stopped replying to me in the past 30+ hours, I think you can be more transparent about what happened (especially why your private API has a much better perf)

6/11
<thinking>

7/11
Matt, we haven't directly interacted before. But my instinct is, you are doing a great job. And who trying something ambitious hasn't faced setbacks.
It's a bit unfortunate to see premature accusations and conclusions you've had to face.
We hope you are able to demonstrate that what you've achieved is not fakery but something substantial. From my previous following of yours, I think you will. Wish you the best.

8/11
I get it. We live in a world where people rush. They say build and release fast. Build it, break it, don't care, just put it in production.

I disagree with this. Yes, build and iterate fast, but this is a prime example of what happens when we go to production too early.

I've been there. I get it.

Joe Rogan has advice that might be good in cases like this. Ignore the twitter replies. Stop reading it.

I know there's also not 100% possible when you're developing and pushing here on X, connecting with leaders and losers.

I'm also certain the stress is out of this world.

Put it aside. I can guess the training was broken, maybe broken in a way that led to an interesting breakthrough.

Retrace your steps and be careful to not confuse it from the noise of RLHF which isn't always the best feedback.

Take your time. You'll figuring out.

For those who doubt you, fukk 'em. Keep swimming. The sharks are thick, but the oceans are massive.

9/11
Deep down, even the harshest critics are hoping to be proven wrong about Reflection.

Thanks for this message, and all the best for the busy days ahead.

10/11
Claude is somewhat skeptical so far. Looking forward to hearing the complete story.

11/11
This sounds like the response from a government bureaucrat taking “full responsibility” while taking none

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 10, 2024

1/1
New model added to the leaderboard!

Model Name
mattshumer/Reflection-Llama-3.1-70B · Hugging Face

Overall rank: 109
Rank in 60B category: 29

Benchmarks
Average: 26.56
IFEval: 65.63
BBH: 42.39
MATH Lvl 5: 0.0
GPQA: 1.45
MUSR: 10.02
MMLU-PRO: 39.88

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

The A.I Megathread (LLM , GPT , Development)

More options

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

GitHub - zhentingqi/rStar

rStar: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

Intro

Ghost Utmost

The Soul of the Internet

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

The A.I Megathread (LLM , GPT , Development)

Veteran

Veteran

Veteran

Veteran

Veteran

rStar: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers​

Intro​

The Soul of the Internet

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

rStar: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

Intro