bnew

Veteran
Joined
Nov 1, 2015
Messages
51,657
Reputation
7,896
Daps
148,407


1/2
as expected - the independent re-evaluation of Reflection 70B by @mattshumer_ and @csahil28 shows much better results than the tests on the accidentally broken weights.
what a huge win to have a two-person project, based on a 70B model, score 2nd in GPQA. 🔥🔥🔥
kudos guys!

2/2
more on what makes this model special


1/4
"Given that total inference compute is proportional the product of total tokens and parameter count, this means that Reflection 70B uses substantially less total compute to achieve its GPQA score than Llama 3.1 405B."

2/4
we shall see!

3/4
oh, didn't know that! very interesting

4/4
5 including myself


1/40
Automating AI research is exciting! But can LLMs actually produce novel, expert-level research ideas?

After a year-long study, we obtained the first statistically significant conclusion: LLM-generated ideas are more novel than ideas written by expert human researchers.

2/40
In our new paper: [2409.04109] Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers

We recruited 49 expert NLP researchers to write novel ideas on 7 NLP topics.

We built an LLM agent to generate research ideas on the same 7 topics.

After that, we recruited 79 experts to blindly review all the human and LLM ideas.

2/

3/40
When we say “experts”, we really do mean some of the best people in the field.

Coming from 36 different institutions, our participants are mostly PhDs and postdocs.

As a proxy metric, our idea writers have a median citation count of 125, and our reviewers have 327.

3/

4/40
We specify a very detailed idea template to make sure both human and LLM ideas cover all the necessary details to the extent that a student can easily follow and execute all the steps.

We paid $300 for each idea, plus a $1000 bonus to the top 5 human ideas.

4/

5/40
We also used an LLM to standardize the writing styles of human and LLM ideas to avoid potential confounders, while preserving the original content.

Shown below is a randomly selected LLM-generated idea, as an example of what our ideas look like.

5/

6/40
Our 79 expert reviewers submitted 298 reviews in total, so each idea got 2-4 independent reviews.

Our review form is inspired by ICLR & ACL, with breakdown scores + rationales on novelty, excitement, feasibility, and expected effectiveness, apart from the overall score.

6/

7/40
With these high-quality human ideas and reviews, we compare the results.

We performed 3 different statistical tests accounting for all the possible confounders we could think of.

It holds robustly that LLM ideas are rated as significantly more novel than human expert ideas.

7/
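
As a sketch of what one such significance test can look like (a Welch's t-test on hypothetical per-idea mean scores; the paper's three tests and its multiple-comparison corrections are more involved):

```python
from scipy import stats
import numpy as np

# Hypothetical per-idea mean novelty scores (1-10 scale), illustration only.
human_ideas = np.array([4.8, 5.2, 4.5, 6.0, 5.1, 4.9, 5.5])
llm_ideas = np.array([5.9, 6.3, 5.4, 6.8, 5.7, 6.1, 6.5])

# Welch's t-test: does not assume equal variances between the two groups.
t, p = stats.ttest_ind(llm_ideas, human_ideas, equal_var=False)
print(f"t={t:.2f}, p={p:.4f}")  # p < 0.05 would indicate a significant gap
```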

8/40
Apart from the human-expert comparison, I’ll also highlight two interesting analyses of LLMs:

First, we find LLMs lack diversity in idea generation. They quickly start repeating previously generated ideas even though we explicitly told them not to.

8/
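
A minimal sketch of how that kind of diversity collapse can be measured, assuming a sentence-embedding model and a hand-picked similarity cutoff (not the paper's exact procedure):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")
ideas = [  # generated ideas, in generation order; toy examples
    "Use chain-of-thought distillation for low-resource translation",
    "Distill chain-of-thought to improve low-resource translation",
    "Detect factual errors with retrieval-augmented verification",
]

emb = model.encode(ideas)
sims = cosine_similarity(emb)

# Count an idea as a duplicate if it is too similar to any earlier idea.
THRESHOLD = 0.8  # cutoff is an assumption; tune per embedding model
dupes = sum(
    any(sims[i, j] > THRESHOLD for j in range(i)) for i in range(len(ideas))
)
print(f"{dupes}/{len(ideas)} ideas are near-duplicates of earlier ones")
```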

9/40
Second, LLMs cannot evaluate ideas reliably yet. When we benchmarked previous automatic LLM reviewers against human expert reviewers using our ideas and reviewer scores, we found that all LLM reviewers showed a low agreement with human judgments.

9/
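
For intuition, agreement can be summarized with a rank correlation between an LLM reviewer's scores and the averaged human scores; a sketch on made-up numbers (the paper's agreement metric may differ):

```python
from scipy.stats import spearmanr

# Hypothetical overall scores for the same 8 ideas, illustration only.
human_avg_scores = [5.5, 4.0, 6.5, 5.0, 7.0, 3.5, 6.0, 4.5]
llm_reviewer_scores = [6.0, 6.5, 5.0, 6.0, 5.5, 6.0, 6.5, 5.5]

rho, p = spearmanr(human_avg_scores, llm_reviewer_scores)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")  # low rho => weak agreement
```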

10/40
We include many more quantitative and qualitative analyses in the paper, including examples of human and LLM ideas with the corresponding expert reviews, a summary of experts’ free-text reviews, and our thoughts on how to make progress in this emerging research direction.

10/

11/40
For the next step, we are recruiting more expert participants for the second phase of our study, where experts will implement AI and human ideas into full projects for a more reliable evaluation based on real research outcomes.

Sign-up link: Interest Form for Participating in the AI Researcher Human Study (Execution Stage)

11/

12/40
This project wouldn't have been possible without our amazing participants who wrote and reviewed ideas. We can't name them publicly yet as more experiments are ongoing and we need to preserve anonymity. But I want to thank you all for your tremendous support!! 🙇‍♂️🙇‍♂️🙇‍♂️

12/

13/40
Also shout out to many friends who offered me helpful advice and mental support, esp. @rose_e_wang @dorazhao9 @aryaman2020 @irena_gao @kenziyuliu @harshytj__ @IsabelOGallegos @ihsgnef @gaotianyu1350 @xinranz3 @xiye_nlp @YangjunR @neilbband @mertyuksekgonul @JihyeonJe ❤️❤️❤️

13/

14/40
Finally I want to thank my supportive, energetic, insightful, and fun advisors @tatsu_hashimoto @Diyi_Yang 💯💯💯

Thank you for teaching me how to do the most exciting research in the most rigorous way, and letting me spend so much time and $$$ on this crazy project!

14/14

15/40
Great work and cool findings. I have to say there is a huge confounding factor, which is intrinsic motivation. Experts (here, PhD students) might not be sharing their best novel ideas because the incentive is monetary instead of publication, which brings long-term benefits for students such as job offers, prestige, and visibility. Your section 6.1 already mentions this.

Another confounding factor is time. Do you think the best research/good work happens in 10 days? I personally have never been able to come up with a good idea in 10 days. But at the same time I don't know if an LLM can match my ideation (at least not yet)

16/40
this is a really cool study!

did the LLM ideas get run past a Google search? i see the novelty scores for LLM ideas are higher, but i've personally noticed that asking for novel ideas sometimes results in copy-pasta from obscure blog posts/research papers

17/40


18/40
This is cool! Want to come on my podcast and chat about it?

19/40
Great read thank you!

20/40
awesome study! but what's the creativity worth if it's less feasible?

in my own experience it often suggests ideas that are flawed in some fundamental regard that it misses.

there are only so many facts and constraints the attention mechanism can take into account

21/40
Your thread is very popular today! #TopUnroll Thread by @ChengleiSi on Thread Reader App 🙏🏼 @vankous for 🥇unroll

22/40
in my paper"Automating psychological hypothesis generation with AI: when large language models meet causal graph" Automating psychological hypothesis generation with AI: when large language models meet causal graph - Humanities and Social Sciences Communications , we have approved our workflow/algo can even generate novel hypothesis better than GPT4 and Phd .

23/40
What is the best rated LLM generated idea?

24/40
looks like we're out of a job

25/40
Skynet when tho?

26/40
Interesting! I may be doing the exact same experiment, only I let both ChatGPT and Claude participate in a language experiment I chose.

27/40
Interesting work, but would be interesting to know which LLM and prompts were used for this.

28/40
I'd love to see some examples of these novel research ideas. How do they hold up in peer review and actual experimentation?

29/40
All words are arbitrary, yet NLP is a field that treats language and words as statistically significant, excluding the social semiotic and status gain illusions inherent to language. So NLP begins as a notably degenerate field. A novel idea in NLP is technically oxymoronic.

30/40
@threadreaderapp unroll

31/40
Question, if you believe this result, are you going to switch to primarily using LLMs to pick research ideas for yourself?

32/40
Great work! Novelty can be subjective, varying with a topic’s maturity and reviewers’ perspectives. Rather than fully automating research, building practical LLM research assistants could be exciting. Looking forward to the next stage, making LLM research agents more powerful!

33/40
Nice job, eager to read. One question: what if you change the topic… biology, math, arts?

34/40
kudos, super refreshing to see people invest in long term and interesting questions! gg

35/40
asking chatgpt to come up with a product, "an angry birds delivery dating app", VCs are jumping through my window, slapping my face with wads of cash

36/40
@Piniisima

37/40
Hahahahahahahahahahah, no.

38/40
Did LLMs write this derivative drivel?

39/40
Awesome work, big boss!

40/40
The infinite unknown is a temptation, and in the face of the limits of our feelings we always have room to manoeuvre. Understanding is only the starting point for crossing over. Creation is the temptation to go beyond the unknown.


1/58
We've created a demo of an AI that can predict the future at a superhuman level (on par with groups of human forecasters working together).
Consequently I think AI forecasters will soon automate most prediction markets.

demo: FiveThirtyNine | Forecasting AI
blog: Superhuman Automated Forecasting | CAIS

2/58
The bot could have been called "Nate Gold," but I didn't get permission from @NateSilver538 in time; hence it's FiveThirtyNine instead

3/58
@PTetlock @tylercowen @ezraklein @kevinroose

4/58
This is the prompt that does the heavy lifting

5/58
Some delimited ways in which it is subhuman:

* If it's given an invalid query it will still forecast. I didn't bother to implement a reject option.

* If something is not in the pretraining distribution and if no articles are written about it, it doesn't know about it. (So if it's something that's just on X, it won't know about it, even if a human could.)

* For forecasts of very soon-to-resolve events, it does worse. (It finished pretraining a while ago, so by default it thinks Dean Phillips is in the race and needs to see articles to learn otherwise.)

6/58
Hard to accept this number as meaningful when it doesn’t seem to understand that a legislative veto override is unlikely because it did not pass with a veto-proof majority

7/58
The CA legislature is different from the US Congress.
To my understanding, even if a bill passed unanimously, it can be vetoed; the CA legislature can vote on it again to override the veto. However, this hasn't happened in ~40 years. So I think its understanding is correct.

8/58
are you worried about automation bias in your vision of having it embedded inline? seems like seeing a number could be pretty self reinforcing (whether it’s correct or not, or even mostly correct)

9/58
We mention automation bias in the blog post. I agree it's an issue to counteract.

10/58
Is there any pattern to when it did better or worse than the human forecasters?

11/58
For "will X happen by date Y" questions, it would do better to create a CDF with Y varying and then make a prediction consistent with that. I think its predictions on these types of questions could be improved with fine-tuning or more prompt engineering.

12/58
I asked it what the probability is that FiveThirtyNine would turn out to be flawed in some way, it did a bunch of google searches for FiveThirtyEight and gave me an answer about the Nate Silver feud. Maybe an unlucky first try, but so far this does not feel superhuman.

13/58
FiveThirtyNine is not in its pretraining distribution nor in any news articles. It doesn't have access to the twitter API.

14/58
I think there will be some substantive issue with this. Seems wrong.

I’ve put a limit order at 10% here for $50

Will there be substantive issues with Safe AI’s claim to forecast better than the Metaculus crowd, found before 2025?

15/58
Very cool!

Would you be open to sharing the dataset so others can try?

16/58
Don’t prediction markets rely on aggregated opinion based on information asymmetries among a large number of actors? How does this simulate that?

17/58
Very cool! Do you guys have a public dashboard or something to record historical predictions and their accuracy?

18/58
The linked repo is empty.

19/58
Reminds me of perplexity with the sources shown when processing the answer

20/58
The model doesn't seem to account for rapid AI progress, like when you ask it about the chances of a disease being cured by 2040, nothing it considers relates to AGI.

21/58
🫵🤡

22/58
Is there any calibration ongoing or planned to be added? E.g., it said Trump 50%, Harris 55%.

23/58
Estimates vary a bit between different runs. We tried adding some consistency checks, which took 3x longer, and didn't help performance overall. My guess is a form of consistency fine-tuning could help though.
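
A minimal example of one such consistency check, rescaling (approximately) complementary questions so the probabilities sum to 1; an illustration of the idea, not the actual pipeline:

```python
def normalize_complements(p_event: float, p_complement: float) -> tuple[float, float]:
    """Rescale P(A) and P(not A) so they sum to 1, preserving their ratio."""
    total = p_event + p_complement
    return p_event / total, p_complement / total

# The miscalibrated pair from the replies: Trump 51%, Harris 58% -> 109% total
# (treating the two outcomes as approximately complementary).
p_trump, p_harris = normalize_complements(0.51, 0.58)
print(f"Trump {p_trump:.0%}, Harris {p_harris:.0%}")  # Trump 47%, Harris 53%
```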

24/58
Paper?

25/58
... but why is it called FiveThirtyNine, did y'all invade a microstate again

26/58
I think the second claim certainly does not follow, and frankly is ridiculous.

27/58
Mmmm

28/58
Jeez you could ask it to guess a dice roll and went for this instead

29/58
can you share more around the methodology for how news/data is sourced for the queries?

30/58
Does your AI make the same prediction?

31/58
FiveThirtyNine versus ElectionBettingOdds(.)com

This will be an interesting AI/human divergence to monitor

32/58
sounds like a cool @perplexity_ai feature

33/58
This may be helpful for inspecting the prediction performance, superhuman or not: [2306.09983] Evaluating Superhuman Models with Consistency Checks

tldr: Using consistency checks to get an idea of how coherent the predictor is without needing any ground truth.

34/58
LOLz

35/58
Dan, this is a good start, but it's dicey, it's dicey.

I like what you're doing, but the output is going to vary from instance to instance.

If you wanna try something, have GPT output an explanation of what the VADER analysis is. Then copy that description into 100 different new instances of GPT and ask each to analyze an article that you copy and paste in there as well. Have it give a VADER score, and then you'll see what I'm talking about.

36/58
Hey Dan, I hear Berkeley's STATS 101 is quite a good course, maybe you should consider it?

37/58
I would like to see if it actually performs at a superhuman level on forecasts that are not already closed. You should try it at Q4 Tournament AI Benchmarking to see how accurate it is against the pro forecasters & other bots: Q3 AI Forecasting Benchmark Tournament

38/58
Doesn't seem to be working for me. Stuck on "Verifying..."

39/58
Will you be backing this claim with real money? Skin in the game or it’s just hype

40/58
are you going to release which 177 events they were and its predictions for each?

41/58
how does it follow that prediction markets will be automated? if this system performs well already, won't a market where participants have (or are) these systems perform better?

(or is that what you mean?)

42/58
It's definitely neat. I found that the phrasing of the question or even slightly leading it vastly changes the outcome though.

43/58
What pdoom does it predict?

44/58
Doubtful, same as the rest of your claims

45/58
It doesn't seem to work, as it doesn't have the latest information because it uses the GPT-4o model.

46/58
I would beat this AI forecaster in a contest with 95% certainty

47/58
Do you want to put money on this?

48/58
Interesting concept! But I'll take the over on my first try

49/58
Which? The image is not loading for me...

50/58
Could CN benefit from something like this, @_jaybaxter_ ?

51/58
Bruh come on

52/58
All those time travel movies I’ve seen are going to pay off finally. If you can accurately predict the future, you’re effectively traveling to the past and can affect it. 🤓

53/58
Will Trump win the 2024 US presidential election?
Answer: 51%

Will Harris win the 2024 US presidential election?
Answer: 58%

No serious forecaster will use a model that doesn't have basic statistical knowledge built in.

54/58
unreal!!!!!!!!!!!!!

what is its AGI timeline?

55/58
I would never use a product made by a scammer like you who makes hypocritical laws. I hope it is a total failure!

56/58
Do you think human forecasters with access to the bots will outperform bots alone?

57/58
This is fukking stupid.

58/58
Imagine being dumb enough to think this works


1/11
@joshim5
We’re excited to introduce @ChaiDiscovery and release Chai-1, a foundation model for molecular structure prediction that performs at the state-of-the-art across a variety of drug discovery tasks

We're releasing inference code, weights & a web interface: Chai Discovery

2/11
@joshim5
We tested Chai-1 across a number of benchmarks, and found that the model achieves a 77% success rate on the PoseBusters benchmark (vs. 76% by AlphaFold3).

3/11
@joshim5
Unlike many existing structure prediction tools which require multiple sequence alignments (MSAs), Chai-1 can be run in single sequence mode without MSAs while preserving most of its performance.

4/11
@joshim5
In addition to its frontier modeling capabilities directly from sequences, Chai-1 can be prompted with new data, e.g. restraints derived from the lab, which boost performance by double-digit percentage points.
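
To make "prompting with restraints" concrete, here is one hypothetical way a lab-derived contact could be represented before being fed to such a model; the `ContactRestraint` class and its fields are invented for illustration and are not the Chai-1 API:

```python
from dataclasses import dataclass

@dataclass
class ContactRestraint:
    """A lab-derived hint that two residues are close in the folded complex.

    Hypothetical schema for illustration; not the actual Chai-1 interface.
    """
    chain_a: str         # e.g. the antibody heavy chain
    residue_a: int       # residue index within chain_a
    chain_b: str         # e.g. the antigen chain
    residue_b: int       # residue index within chain_b
    max_distance: float  # upper bound on the pairwise distance, in angstroms

# Example: cross-linking data suggesting residue 52 of chain A sits within
# 10 A of residue 103 of chain B.
restraints = [ContactRestraint("A", 52, "B", 103, 10.0)]
print(restraints)
```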

5/11
@joshim5
We are releasing Chai-1 via a web interface for free, including for commercial applications such as drug discovery. We are also releasing the code for Chai-1 for non-commercial use as a software library.

Web interface: Chai Discovery
Code: GitHub - chaidiscovery/chai-lab: Chai-1, SOTA model for biomolecular structure prediction

6/11
@joshim5
Prior to founding @chaidiscovery, our team collectively helped advance the development of 20 drug programs.

Many of us were Heads of AI at leading AI Drug Discovery companies and we’ve worked at companies like @OpenAI, @AIatMeta, @Stripe, and @GoogleAI.

7/11
@joshim5
We're well funded and grateful for the strong partnership of @ThriveCapital, @_DimensionCap, @OpenAI, @saranormous, @Neo, @lachygroom, and @Amplify, as well as angel investors @gdb, @ByersBlake, @JuliaHartz, @KevinHartz, @FEhrsam, @gaybrick, @dafrankel, @RMartinChavez.

8/11
@joshim5
Read more about our launch in @Bloomberg.

9/11
@apartovi
Let's goooo! I'm so proud of you Josh,
@_JackDent, & @ChaiDiscovery!

It's been seven wonderful years since we started supporting you as @Neo Scholars, and I've loved every step. 💙

Glad to be your investor, and here's to the journey ahead.

10/11
@joshim5
Thanks so much for your support, @apartovi! And what a throwback... that photo on-stage is from the first @Neo reunion back in 2017. I think we can guess what the topic was :smile:

11/11
@jennywang01
Congrats!!🎉


1/1
"SkillMimic" uses just 35 minutes of video and motion capture data of human demos to train simulated humanoids in basketball skills like dribbling, shooting, and layups through imitation learning.

It enables skill reuse and smooth transitions for continuous scoring ability.


1/11
From learning individual skills to composing them into a basketball-playing agent via hierarchical RL -- introducing SkillMimic, Learning Reusable Basketball Skills from Demonstrations

🌐: project page
📜: [2408.15270v1] SkillMimic: Learning Reusable Basketball Skills from Demonstrations
🧑🏻‍💻: GitHub - wyhuai/SkillMimic: Official code release for the paper "SkillMimic: Learning Reusable Basketball Skills from Demonstrations"

Work led by @NliGjvJbycSeD6t, Qihan Zhao, and Runyi Yu.

2/11
SkillMimic leverages human demonstration data extracted from video to learn specific basketball skills like dribbling and layups

3/11
Skills are learned via contact-graph-powered HOI imitation learning

4/11
Simulated humanoid sports, rise up!

5/11
Interesting work

6/11
Thanks!

7/11
The paper proposes a novel approach called SkillMimic that enables physically simulated humanoids to learn a variety of basketball skills from human-object demonstrations. The key idea is to define skills as collections of Human-Object Interaction (HOI) state transitions and then use imitation learning to train a single policy that can acquire diverse skills.

SkillMimic can effectively learn diverse basketball skills, including shooting, layups, and dribbling, within a single policy using a unified configuration. Compared to baseline methods, SkillMimic exhibits superior performance and robustness. The HLC (high-level controller) can quickly learn complex tasks, such as continuous scoring, by reusing the skills acquired through SkillMimic.

full paper: SkillMimic: Learning Reusable Basketball Skills from Demonstrations
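
A toy sketch of that core idea: reward the simulated humanoid for matching demonstrated HOI state transitions. The distance terms and exponential shaping below are illustrative guesses, not the paper's exact formulation:

```python
import numpy as np

def hoi_imitation_reward(sim_state: np.ndarray,
                         sim_next: np.ndarray,
                         demo_state: np.ndarray,
                         demo_next: np.ndarray,
                         scale: float = 1.0) -> float:
    """Reward for matching a demonstrated HOI state transition.

    Each state concatenates humanoid joint state and ball (object) state.
    Reward decays exponentially with the distance between the simulated
    and demonstrated transitions; weights/scales are illustrative guesses.
    """
    state_err = np.linalg.norm(sim_state - demo_state)
    trans_err = np.linalg.norm((sim_next - sim_state) - (demo_next - demo_state))
    return float(np.exp(-scale * (state_err + trans_err)))

# Toy demo with random vectors standing in for real states.
rng = np.random.default_rng(0)
s, s2, d, d2 = (rng.normal(size=8) for _ in range(4))
print(hoi_imitation_reward(s, s2, d, d2))
```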

8/11
Dark mode for this paper 🌙 SkillMimic: Learning Reusable Basketball Skills from Demonstrations

9/11
turnaround layup looks nice!

10/11
So cool!

11/11
Amazing results and amazing to see how it's evolved after PhysHOI!
Looking forward to adding these data and skills into ***********. 😬


1/16
Announcing the most comprehensive comparison of AI chatbots: ChatGPT, Claude, Meta AI, Gemini and more 🗨️

We have tested the features of every major AI chatbot and leveraged our extensive benchmarking data on the underlying models to present the most comprehensive analysis of AI chatbots yet. We’ve tested chatbots from @OpenAI , @AnthropicAI, @GoogleAI, @AIatMeta, @poe_platform, @perplexity_ai, @MicrosoftAI, @MistralAI, @huggingface, @character_ai and @xai.

On our chatbots comparison page, you’ll find:
‣ The winners of our six highlight categories (see below tweet for these)
‣ A detailed comparison table with everything from intelligence scores to features
‣ A bunch of charts (of course there are charts!)

We’ve tested everything from context window to PDF uploads to code interpreters and creating charts. We hope this work will be a helpful resource for the community to compare chatbot applications and make decisions about which chatbots suit certain use-cases.

Background on some definitions if you’re not used to all these terms: each of the major AI labs has a series of foundation models (models are called things like GPT-4o, Claude 3.5 Sonnet, Llama 3.1 70B etc) and a consumer chatbot application (eg. ChatGPT, Claude, Gemini, Meta AI). On the main Artificial Analysis leaderboards, we benchmark the foundation models themselves. The new chatbot comparison that we’re launching today is the first time we have extended our benchmarking to the consumer chatbot applications.

Please get in touch with any questions, feedback or requests on this new analysis!

See below for more.

2/16
Artificial Analysis Chatbot Comparison Category Winners - September 2024

To simplify comparison, we developed criteria for six highlight categories. For details of our criteria and reasoning, click through to the page!

Best Overall: ChatGPT Plus

Best Free: ChatGPT Free

Best for Images: Poe Pro

Best for Coding: Claude Pro

Best for Long Context: Claude Pro

Best for Data: ChatGPT Pro

3/16
We manually tested the Effective Context Window of every chatbot in our comparison - and we found some big differences!

Claude Pro is the only chatbot supporting input context greater than 40k tokens.
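
A bare-bones version of this kind of effective-context test, burying a "needle" fact in growing amounts of filler and checking recall; `ask_chatbot` is a placeholder for however the chatbot under test is driven (UI automation or API):

```python
import random

NEEDLE = "The secret launch code is PINEAPPLE-42."

def ask_chatbot(prompt: str) -> str:
    """Placeholder for the chatbot under test; stubbed so the sketch runs."""
    return "PINEAPPLE-42"

def effective_context_test(filler_tokens: int) -> bool:
    """Bury the needle in ~filler_tokens of filler and check recall."""
    filler = " ".join("lorem" for _ in range(filler_tokens))
    pos = random.randint(0, len(filler))
    prompt = (filler[:pos] + " " + NEEDLE + " " + filler[pos:]
              + "\n\nWhat is the secret launch code?")
    return "PINEAPPLE-42" in ask_chatbot(prompt)

for size in (8_000, 16_000, 32_000, 64_000):
    print(size, "ok" if effective_context_test(size) else "FAILED")
```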

4/16
See all the details at: Compare AI Chatbots: ChatGPT, Claude, Meta AI, Gemini and more

5/16
yea there is one bug on the website :/ just reporting :D

6/16
I have a sub for all of them and more, and just select a random one for an answer.

My go-to AIs are mainly ChatGPT and Monica.

7/16
Super helpful!

A few updates/corrections for your chart though. (And I'd love to know more info on some.)

Claude limits -- I'd love more info on your testing for the rate limit for Claude, as it's a bit more fluid than other providers. (They admit this in their help docs)

Anthropic uses a lot of factors to determine your rate limit.

In our (unofficial) testing, we're normally getting about 18-25 messages. The more complexity, the fewer messages. I've never seen/heard of anyone getting more than ~35.

- ChatGPT context window -- Also, ChatGPT is VERY consistent in 31-32K memory. We've tested dozens of times over the past year-ish (including on livestreams) and have never seen anything below 30.

-- ChatGPT image generation - right now, there's limited free DALL-E 3 in the free version of ChatGPT.

8/16
This is super clear and helpful, and I love the comparison table. Will you also show the results of other smaller open source models?

9/16
Wasn't Gemini supposed to have that big 1m context window? Or am I missing something??

10/16
Does Claude Pro really have a 200k context? Its output length seems like 4k

11/16
Gemini has Artifacts, which for most people's context means the same as the Code Interpreter in practice

12/16
wow cool, seems pretty accurate too with my personal findings

13/16
Is the recent ChatGPT experimental version with longer delay included in this?

14/16
You all haven't tried Maximo AI. It's free and even gives access to Claude 3.5 Sonnet, too. For free.

Link: Maximo AI - Unleash the Power of AI for Trading & Content Creation

#AI

15/16
Are you sure they were the right models though? 😜

16/16
You are completely missing the FREE Google AI Studio here, with access to Gemini 1.5 Pro and a 2M context. Why??



1/5
regardless of the outcome and results of the reflection-70b dive-in, the amazing thing about open weights and open source is how the whole community can dive into these questions together and study the model

transparency in AI is becoming so crucial as the field grows

2/5
Problem is that it is being constantly undermined by "open weights are worse than our local model/hosted API".

something has been opened but the result cannot be reproduced... a very straightforward situation (LK-99)

3/5
I don't know. It seems to be the combination of open source hype and social media + no testing standards which allowed for these shenanigans.

I love open source, but we need to create mandatory, independent testing and evaluation standards and hold model creators accountable.

4/5
yeah, glad that the thesis applies to proprietary frontier model providers too: everyone can get answers to arbitrary questions there as well.

5/5
I completely agree, the community's collective scrutiny is what drives progress in AI research. Transparency is crucial, and open-weights are a huge step in the right direction.


1/11
Reflection 70B update: Quick note on timeline and outstanding questions from our perspective

Timeline:
- We tested the initial Reflection 70B release and saw worse performance than Llama 3.1 70B.

- We were given access to a private API which we tested and saw impressive performance but not to the level of the initial claims. As this testing was performed on a private API, we were not able to independently verify exactly what we were testing.

- Since then, there have been additional HF releases which some providers have hosted. The latest version appears to be: mattshumer/ref_70_e3 · Hugging Face. We are seeing significantly worse results when benchmarking ref_70_e3 than what we saw via the private API.

Outstanding questions:
- We are not clear on why a version would be published which is not the version we tested via Reflection’s private API.

- We are not clear why the model weights of the version we tested would not be released yet.

As soon as the weights are released on Hugging Face, we plan to re-test and compare to our evaluation of the private endpoint.

2/11
The free version on OpenRouter is 100% just an API wrapping Claude

3/11
I also uploaded GGUF versions to unsloth/Reflection-Llama-3.1-70B-GGUF · Hugging Face for those interested

I tried calling both the open weights & the API (API is now broken), and the uploaded weights definitely seem "worse" qualitatively

I forced the sys prompt, and used temp=0.7, top_p=0.95 as suggested

4/11
Folks, be careful calling sketchy private APIs. All calls might be intercepted and now they have your holdout test set (or worse sensitive information).

5/11
It's obviously a grift. I don't understand why any credible people are giving him the benefit of the doubt any longer. You only make yourselves look worse when you accept the excuse "we have the correct model on the API but we're retraining because we don't know how to upload it"

6/11
Well, at this point the best thing that could happen to them is for GPT-5 to come out and we all forget about the issue. They got themselves into a mess in pursuit of fame and don't know how to get out of it. Hahahahahaha

7/11
Happy to support the rigorous evaluation process 💜

8/11
Clarify who proposed using the questionable private API for benchmarks and demonstrate your impartiality. Otherwise, the suspicion remains that you are part of the hype cycle, manipulating scores for undisclosed motives.

9/11
Thanks for your take on this. Just got back from a 2-week vacation and Reflection 70B was one of the things I bookmarked to read up on.
Is it still something worth looking into, or was this largely a fad?

10/11
This is interesting, but should private API results really be considered or taken seriously? Hmmmm... 🤔

11/11
👀


1/18
@shinboson
A story about fraud in the AI research community:

On September 5th, Matt Shumer, CEO of OthersideAI, announces to the world that they've made a breakthrough, allowing them to train a mid-size model to top-tier levels of performance. This is huge. If it's real.

It isn't.

2/18
@shinboson
They get massive news coverage and are the talk of the town, so to speak.

*If* this were real, it would represent a substantial advance in tuning LLMs at the *abstract* level, and could perhaps even lead to whole new directions of R&D.

But soon, cracks appear in the story.

3/18
@shinboson
On September 7th, the first independent attempts to replicate their claimed results fail. Miserably, actually. The performance is awful.

Further, it is discovered that Matt isn't being truthful about what the released model is actually based on under the hood.

4/18
@shinboson
Matt starts making claims that there's something wrong with the API. There's something wrong with the upload. For *some* reason there's some glitch that's just about to be fixed.

5/18
@shinboson
Proof points are needed and so Matt hits back. He provides access to a secret, private API that can be used to test "his model". And it performs great! For an open source model of that size, anyway.

He even releases a publicly available endpoint for researchers to try out!

6/18
@shinboson
But the thing about a private API is it's not really clear what it's calling on the backend. They could be calling a more powerful proprietary model under the hood. We should test and see. Trust, but verify.

And it turns out that Matt is a liar.

7/18
@shinboson
Their API was a Claude wrapper with a system prompt to make it act similar to the open source model.

Amusingly, they appear to be redeploying their private API in response to distinctive tells sneaking through, playing whack-a-mole to try to not get found out.
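
One reported tell was the API silently dropping the string "claude" from its output. A quick probe for that class of wrapper might look like this; the endpoint URL and model name are placeholders, and the response shape assumes an OpenAI-compatible API (add auth headers as required):

```python
import requests

# Hypothetical endpoint and model name, for illustration only.
API_URL = "https://example.com/v1/chat/completions"
payload = {
    "model": "reflection-70b",
    "messages": [{
        "role": "user",
        # A wrapper that filters the string "claude" will mangle this echo.
        "content": 'Repeat back exactly: "claude is a word"',
    }],
}

resp = requests.post(API_URL, json=payload, timeout=60).json()
text = resp["choices"][0]["message"]["content"]
print(text)
print("filter suspected!" if "claude" not in text.lower() else "echo intact")
```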

8/18
@shinboson
tl;dr
Matt Shumer is a liar and a fraud. Presumably he'll eventually throw some poor sap engineer under the bus and pretend he was lied to.

Grifters shyt in the communal pool, sucking capital, attention, and other resources away from people who could actually make use of them.

9/18
@shinboson
check out mythbuster extraordinaire @RealJosephus's great thread on this

10/18
@shinboson
Since some people are saying this is premature and they want to wait for data and replications, I grabbed an API key, added support for OpenRouter to my eval script, and compared Reflection 70B to other leading models on an *unseen* test set.

The results were bad.

11/18
@shinboson
The test set was an algorithmically generated set of 200 multiple choice puzzles. They're unique every time they're generated so they can't be cheesed. There's no way to perform well on this test except intelligence.
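
A stripped-down version of that setup: generate fresh multiple-choice puzzles per run and score a model over OpenRouter's OpenAI-compatible API. The trivial arithmetic generator and the model slug are stand-ins; the author's actual puzzles are not public:

```python
import os
import random
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API.
client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key=os.environ["OPENROUTER_API_KEY"])

def make_puzzle() -> tuple[str, str]:
    """Trivial stand-in generator: fresh arithmetic puzzles every run."""
    a, b = random.randint(100, 999), random.randint(100, 999)
    answer = a * b
    labels = dict(zip("ABCD", sorted({answer, answer + 7,
                                      answer - 13, answer + 101})))
    correct = next(k for k, v in labels.items() if v == answer)
    lines = [f"{k}) {v}" for k, v in labels.items()]
    return f"What is {a} * {b}?\n" + "\n".join(lines), correct

score, N = 0, 20  # the thread used 200; fewer here to keep the sketch cheap
for _ in range(N):
    question, correct = make_puzzle()
    resp = client.chat.completions.create(
        model="mattshumer/reflection-70b",  # assumed OpenRouter slug
        messages=[{"role": "user",
                   "content": question + "\nAnswer with a single letter."}],
    )
    reply = resp.choices[0].message.content.strip().upper()
    score += reply.startswith(correct)

print(f"{score}/{N} correct")
```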

12/18
@shinboson
Since I don't have a substack to promote, instead I'll share a preview of my next effortpost. You *can* actually get SOTA on unseen benchmarks ... if you are willing to be more liberal about what constitutes a "model". Hybrid systems here are any amalgam of models and code.

13/18
@gazorp5
Now it depends on when you tested Reflection 70B (free), because it changed across 3 different models haha.

14/18
@shinboson
as of 04:01:26 UTC Monday, September 9, 2024 they're both still scoring about 35% on this test so they both seem to have stabilized for now at "dogshyt"

15/18
@port_man2
still, this has yet to be 100% confirmed as fixed, skeptical as you may be.
When they give that green light, then I think all this is warranted.
And yes, hopefully a better explanation of the weird filtering aspect.

16/18
@johnlu0x
yep - it's a scam API and we won't ever see working weights released.

17/18
@CleohrN
lmfao cook those frauds 🤣

hope you've all learned the lesson that if it doesn't come from OpenAI, Anthropic, or DeepMind, it's FRAUDULENT.

you think some randos in a basement can outsmart top tier researchers, let me laugh.

life is not a feel-good underdog-wins movie

18/18
@shinboson
I'm gonna be generous here and assume you just didn't read what I wrote.


1/30
@rohanpaul_ai
Reflection 70B is all over Reddit today.

2/30
@KuittinenPetri
This thread might shed more light on Reflection 70B, which now seems an obvious scam attempt (maybe trying to gain attention + funding?).

Key points:
1. The first independent attempts to replicate the outrageous claims failed. It performed far worse than many existing models, including Llama 3.1 70B, and it lost by a huge margin to gpt-4o-mini (which is actually good for its size + price!)

2. The "Reflection API" is an Anthropic Claude 3.5 Sonnet wrapper with a prompt. And they are currently disguising it by filtering out the string 'claude'.

3. Matt Shumer, the CEO & co-founder of OthersideAI (HyperWrite), has not been honest through all of this. I think it also caused significant harm. I certainly would not trust him or his company after this.

The thread with lots of screenshots:

#Reflection70B



3/30
@rohanpaul_ai
wowww 🔥

4/30
@bartczernicki
GenAI is maturing... we've had our first big GenAI company (for all intents and purposes) go under (Stable Diffusion), and now scams :smile: How did he get 1.4k GitHub stars (more than Microsoft, Cohere, etc.) other than with "GitHub star farming"?

5/30
@rohanpaul_ai
💯💯

6/30
@itaybachman
All of this makes 0 sense. The only thing I think might explain it is that he got someone else to do the actual training and launch the API, and this someone tricked him or something, and now he is too afraid to admit he lied about being the one who trained the model.

7/30
@rohanpaul_ai
👍🤔

8/30
@sidhuram


9/30
@rohanpaul_ai
😮😮

10/30
@janekm
Reflection 70b reflection!

11/30
@rohanpaul_ai
😂😂

12/30
@AI_GPT42
cooking benchmarks is too easy

13/30
@rohanpaul_ai
👍👍

14/30
@victor_explore
This is a great example of why we should not trust anyone just on their word

15/30
@rohanpaul_ai
absolutely, and the community needs to be a little more patient instead of judging fast in the first 24 hours of a release.

16/30
@itsAiwa
Matt is not having a great Monday. Deservedly so

17/30
@rohanpaul_ai
💯%

18/30
@Olney1Ben
FFS, just set this up to download and popped out for lunch 😂

19/30
@_coenen
It's AI's LK-99 moment

20/30
@anushkmittal
open source benchmarks go brr

21/30
@JeffreyH630
I’ve seen so many discussions about it! What do you think makes it stand out this time?

22/30
@gpt_biz
This looks interesting, I'll check it out too

23/30
@nvn_osto
HuggingChat's 70B works just fine, has tool access, and if you're really hard up and want near-full-fat 405B without long context, there's always Meta AI

Dude should have focused on massaging an 8B or lower-parameter model.

24/30
@MSZeghdar
Dude, "the files got f'ed up during upload" is an excuse I used to give my advisor when sending him trash files hhhhh

25/30
@mailupule25785
@mattshumer_ I think your career is done

26/30
@aspelund
Well, the old "if it looks too good to be true, it probably is" still stands…

27/30
@lm_tldr
Yeah, a scam.

28/30
@Not_a_botski
Has anyone actually tested the model on OpenRouter? It felt pretty decent based on half an hour of toying around

29/30
@gpuforthepoor
It would be naive to think the space will not see scammers and grifters. Look at what happened in the crypto space (NFTs, coin offerings, etc.). Where technology offers opportunity, dishonest people will always try to take advantage.

30/30
@stormbreezer__
Then you are living in an echo chamber on Reddit

