Large Language Models News & Discussions

bnew · Sep 7, 2024

1/26
Reflection Llama 3.1 70B independent eval results: We have been unable to replicate the eval results claimed in our independent testing and are seeing worse performance than Meta’s Llama 3.1 70B, not better.

These evaluations were conducted using our standard methodology, including using our standard system prompt and accessing the model via DeepInfra’s API, which claims bf16 precision. Our evaluation methodology uses a 0-shot prompt with a think step by step instruction.

This is not to say there is no merit in Reflective's prompting approach for achieving higher evaluation results as claimed. We are aware that the Glaive team has been updating the model, and we would be more than happy to test further releases.

We also ran tests comparing our standard system prompt to Glaive’s provided system prompt and we did not observe any differences in the evaluation results on Reflection Llama 3.1 70B, Llama 3.1 70B, GPT-4o or Claude 3.5 Sonnet.

This does not mean the claimed results were not achieved, but we look forward to hearing more about the evaluation approach that led to these results, particularly regarding the exact prompt used and how the evaluation answers were extracted.

2/26
According to the Glaive team, the model was incorrectly uploaded to Hugging Face. We plan to re-run our evaluations after the model is re-uploaded correctly.

We think it would also be helpful if the Glaive team could share exactly how they prompted and extracted the answers in achieving the eval results claimed. This will allow us to to attempt to re-produce the results and also test other models (Meta's Llama 3.1 70B & 405B, GPT-4o, etc) using the exact same approach for

to

comparisons.

3/26
Was the format of the output correct? Eg the reflection tags

4/26
Yes, example MMLU response:

5/26

6/26
There was an issue with the uploaded weights. You migh t want to wait for the new release to test it

7/26
Thanks! Noted -

8/26
Do you plan on running the eval again after @mattshumer_ resolved the issues with the repo?

9/26
Yes!

10/26
did you try it before or after they fixed an issue?

I noticed about 15hr ago the deepinfra endpoint started working better, the endpoint started to produce the xml tokens.

11/26
Ran after they fixed the issue. Also ran on Hyperbolic and saw near-identical results

12/26
Thanks for posting about this. With the technique proposed, it's very important to hear more on the how the evaluation was done.

13/26
Thanks for this, specially for writing it in a professional/ respectful way instead of going for the nasty, hurtful language and all the clicks it generates.

14/26
Thank you for your analysis!

15/26
@mattshumer_ said that it is the wrong weights, and he is working on getting out the right ones.

16/26
I told you guys, there's no way it was possible, they didn't publish their testing methods, just the weights. Then they supposedly had issues with getting the thing uploaded right, and now they're retraining it?

I doubt they ever had anything that level to begin with.

I could be wrong, and I hope I am as it would be nice, but I'm very skeptical.

17/26
You are really nice.

After the first fumble everyone should have given him some time for the dust to settle.

18/26
They have uploaded the model wrong

19/26
the model weights were broken, here the correct ones

20/26
@mattshumer_ said there was unintentional merging of weights, he's reuploading the correct weights.

21/26
What about the Reflection 70B non-Llama performance?

22/26
The reasoning cannot be correct. Multiple reasoning tests like the prompts from open source Livebench (dot) ai show increased performance over llama 70B.

23/26
Models like the GPU they where trained on

24/26
Thanks for doing this, If it was so easy to just finetune and achieve groundbreaking results, everyone would have done it by now. While yes this may improve few things, it probably is costing more on others. In a non-scientific way, I was able to achive all this with a proper system prompt to 70b.

25/26

26/26
called it, I win. the thing shat out 5 paragraphs about how no one should disparage anyone based on sex because I asked it about masculine republics and feminine democracies as per effin aristotle.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/5
Our evaluation of Reflection Llama 3.1 70B's MMLU score resulted in the same score as Llama 3 70B and significantly lower than Meta's Llama 3.1 70B.

A LocalLLaMA post (link below) also compared the diff of Llama 3.1 & Llama 3 weights to Reflection Llama 3.1 70B and concluded the Llama 3 weights were a lot closer to Reflection's than Llama 3.1.

For further investigation but this might indicate the foundation model is Llama 3 rather than Llama 3.1 70B. It would be helpful if the team behind Reflection could clarify this.

2/5
Related reddit thread comparing Reflection Llama 3.1 70B weights to Llama 3 70B and Llama 3.1 70B.

3/5
Important context regarding our evaluation approach which uses a standardized methodology between models:

4/5
model weights were broken, see here

5/5
what’s the ETA?

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 7, 2024

1/31
While we’re trying to fix the HF weights, we can share an API with a few researchers who want to run benchmarks and test to confirm our results.

Won’t have capacity for many, but if you want access, please let me know.

2/31
Hi Mat! I would love to test it again. I got good results during Friday's morning that I couldn't replicate with any other hosted model after that.

3/31
DMing you!

4/31
Hi @mattshumer_ ,
We would welcome access. Happy to re-run evaluations and provide an update on the below figures.

It would also be helpful if you could share the exact prompt and answer extraction approach used for the benchmarks results as stated on Hugging Face.

5/31
DMing you!

6/31
Folks are saying this is Llama 3, not 3.1, under the hood on reddit. Can you clarify?

7/31
It's 3.1, but for some reason the current HF weights are screwed up and the config shows 3... working on it, the issue is tricker than we expected

8/31
maybe this way these independent benchmarks can be reevaluated?

9/31
yep they're already working on it!

10/31
@paulgauthier Hey Paul, maybe you could run it again?

11/31
DM me Paul, happy to get you access

12/31
Love to try. Thanks in advance!

13/31
Which benchmarks do you want to run?

14/31
Ever the methodical explorer, opening doors prudently. Wise move.

15/31
I could probably run the benchmarks Monday

16/31
Sorry but farcical shipment. Struggling to comprehend why and how there could be such an issue with uploaded weights.

17/31
well that's nice of you

18/31
Dude we have your 64 H100s, ping me.

19/31
“share an api…”

just share the correct weights, we can certainly wait, no rush

20/31
Get one for @abacusai @bindureddy have some good benchmarks.

21/31
If model is only working behind private api, you might want to retract your claims now before it is too late. Private end point does not make it an open source model.

You can always re-announce it later with a functional model.

22/31
I had assumed that instead of flipping the coin once, we were trying to flip it a few more times, until we get bingo. I might reconsider. An interesting reminder, luckily, I got out of it early.

23/31
wanna run codeforces on it. thanks in advance!

24/31
Yes please

25/31
pleasee i want to try your modelll!

26/31
Could I try it? It sounds like an amazing job you're doing

27/31
Your model of transparency and quick fixes around this issue are definitely the way to go, love to see it!

Are you hiring right now? I'd love to help out with your most pressing software challenges.

28/31
Beta tested many models including the pre chatgpt. Down. Don't have benchmarks but can definitely give good feedback.

29/31
I’m running the 4bit model via ollama and the results seem underwhelming. (Couldn’t count Rs in raspberry). Is this kind of expected?

30/31
Very curious. I have a few unique tests I want to try. I don’t really care about the normal benchmarks, I care about how many variables the model can keep track of at the same time.

31/31
could i have access? i want to test the model and compare against the one hosted on other providers to see what exactly the difference is and what got corrupted or mixed up

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 8, 2024

Update on Reflection Llama 3.1 70B:
@mattshumer_ and his team dropped the "new, working version of the Reflection Llama 3.1 70B model" on Huggingface, so we're now serving the new weights at FP16 on @hyperbolic_labs (same API).

From my vibe checks, it's better than before. Now we're working with @ArtificialAnlys on the new benchmarks.

You can use our playground and API to test it: x.com.

I think it's better, according to my tricky llm questions set

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/22
We @hyperbolic_labs now serve Reflection 70B by @mattshumer_ in FP16!

> Use our API or playground to play w/ it
> It’s free for 1 week – perfect for stress testing our infra!
> Integrate with @OpenRouterAI soon
> Running on 4xH100s (and ready to scale with demand, because we know it’ll be high!)

The default system prompt is auto applied in our playground and the API tab if you want to copy paste from there.

Look at the <thinking> <reflection> tags in the output, time to check if it’s the top LLM across open-source + closed-source! If so, who is gonna pay OpenAI $2k/month?

2/22

!

3/22

!

4/22
Do you take requests?

-kun released some new models, everyone has tried to tame it but failed. (I don't know why)

5/22
what is kun?

6/22

7/22

8/22
Works with drop in OpenAI SDK right?

9/22
yep, just replace the url and our API key!

10/22
Hyperbolic AI Dashboard
is stuck. Is it right website?

11/22
You need to verify your email to use it

12/22
Yuchen, amazing!

13/22
tyty

14/22
you should add LlamaGuard 3! and Llama 3 70b and 7b too :smile:

15/22
interesting!

what's fancy about LlamaGuard 3? We do support Llama 3 70b: Hyperbolic AI Dashboard

16/22
still seems to have issues performs far lower than on matts website not sure whats going on

17/22
do you have comparisons with the same prompts?

18/22
Hermes 70B gets “strawberry” and “entrepreneur” right first try.

Reflection fails to do so with the same system prompt.

I wonder how much reflection tuning really helps. It also “corrects its mistake” but still fails to realize that’s it’s wrong.

19/22
Interesting to compare with the same system prompt across models

20/22
Increasing context length seems to break it. Problem with 512 is that it can run out of tokens in the output tags and when you ask it to continue it doesn't continue from the output tags @mattshume

21/22
Hi, the max context length of the model is 8192: config.json · mattshumer/Reflection-Llama-3.1-70B at main, you can increase the max tokens on the playground or API

22/22
While reflecting on, the reflection made me reflect on my next reflection goals. Nice Reflection bhai

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/2
We have now partially replicated Reflection Llama 3.1 70B’s evaluation claims, but we would caution that these results are not entirely apples-to-apples comparable with other models

Since our first testing of the public release version of Reflection Llama 3.1 70B, @mattshumer_ has shared access to a privately hosted version of the model that does not suffer from the issues with the version publicly released on Hugging Face. Per the charts below, the evaluation scores we’re seeing on this version are impressive - they show performance above the level of Llama 3.1 405B for GPQA and MATH. MMLU is in-line with Meta’s release of Llama 3.1 70B, indicating performance improvements may not be consistent in all contexts.

The chart below is based on our standard methodology and system prompt. When using Reflection’s default system prompt and extracting answers only from within Reflection’s <output> tags, results show substantial improvement: MMLU: 87% (in-line with Llama 405B), GPQA: 54%, Math: 73%.

The model seems to be achieving these results through forcing an output ‘reflection’ response where the model always generates scaffolding of <thinking>, <reflection>, and <output>. In doing this it generates more tokens than other models do on our eval suite with our standard ‘think step by step’ prompting. For GPQA, Reflection 70B generates consistently more output tokens that other models (see below for detailed comparison).

While the benchmark results are impressive, they should not be considered apples-to-apples with traditional instruct-tuned models. The results may be less applicable to generalized non-benchmark measured intelligence for the following reasons:
‣ Reflection scaffolding is an example of test-time compute scaling - using more inference compute to get to an answer. It introduces a new kind of compute and latency trade-off to be considered alongside model size and non-reflection intelligence. Compared to a model of the same size, Reflection 70B appears to use more compute and take longer to get to get to an answer.
‣ This approach to achieving reflection via fine-tuning restricts the flexibility of the model and may make it unsuitable for many use-cases. Compared to achieving chain-of-thought techniques via prompting, fine-tuning in the reflection approach means the form of reasoning cannot be changed. For example, it appears that Reflection 70B is not capable of ‘just responding with the answer’ in response to an instruction to classify something and only respond with a one word category. It may also be limited in the types of reasoning approaches it can pursue (non-reflection oriented).

Ultimately, Reflection 70B appears to demonstrate the potential of fine-tuning with standardized response scaffolding alongside test-time compute scaling. Given the impressive results, further research should be conducted on the advantages and drawbacks of this approach, including the degree to which it generalizes beyond evaluations to real-world use-cases.

All that being said: if applying reflection fine-tuning drives a similar jump in eval performance on Llama 3.1 405B, we expect Reflection 405B to achieve near SOTA results across the board.

Notes on the results:
‣ These were tested on a private API version and not an open-weights version.
‣ We cannot yet independently confirm that these results are not the result of benchmark contamination.
‣ Tests for Reflection were run with 6000 max output tokens (as opposed to our standard 2048 max output tokens). We have not yet studied the effect of a lower max output token setting on Reflection.

2/2
Reflection 70B is not the only model that has trended toward using more tokens at inference time. We see a wide spread in how verbose models are as they walk through chain of thought reasoning in our evaluations.

We compared the average number of characters in each response to the questions in the GPQA dataset with our standard prompting. Reflection 70B generates more characters for each GPQA response than any other models we have tested but the total volume is less than 2x an average across other recent models.

Given that total inference compute is proportional the product of total tokens and parameter count, this means that Reflection 70B uses substantially less total compute to achieve its GPQA score than Llama 3.1 405B.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 8, 2024

1/2
as expected - the independent re-evaluation of Reflection 70B by @mattshumer_ and @csahil28 shows much better results then the tests on the accidentally broken weights.
what a huge win to have a two people project, based on a 70B model, score 2nd in GPQA.

kudos guys!

2/2
more on what makes this model special

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/4
"Given that total inference compute is proportional the product of total tokens and parameter count, this means that Reflection 70B uses substantially less total compute to achieve its GPQA score than Llama 3.1 405B."

2/4
we shall see!

3/4
oh, didn't know that! very interesting

4/4
5 including myself

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 8, 2024

mattshumer/ref_70_e3 · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

mattshumer/ref_70_e3 · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

Reflection 70B - API, Providers, Stats

Reflection Llama-3.1 70B is trained with a new technique called Reflection-Tuning that teaches a LLM to detect mistakes in its reasoning and correct course. Run Reflection 70B with API

openrouter.ai

bnew · Sep 9, 2024

1/40
Automating AI research is exciting! But can LLMs actually produce novel, expert-level research ideas?

After a year-long study, we obtained the first statistically significant conclusion: LLM-generated ideas are more novel than ideas written by expert human researchers.

2/40
In our new paper: [2409.04109] Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers

We recruited 49 expert NLP researchers to write novel ideas on 7 NLP topics.

We built an LLM agent to generate research ideas on the same 7 topics.

After that, we recruited 79 experts to blindly review all the human and LLM ideas.

2/

3/40
When we say “experts”, we really do mean some of the best people in the field.

Coming from 36 different institutions, our participants are mostly PhDs and postdocs.

As a proxy metric, our idea writers have a median citation count of 125, and our reviewers have 327.

3/

4/40
We specify a very detailed idea template to make sure both human and LLM ideas cover all the necessary details to the extent that a student can easily follow and execute all the steps.

We paid $300 for each idea, plus a $1000 bonus to the top 5 human ideas.

4/

5/40
We also used an LLM to standardize the writing styles of human and LLM ideas to avoid potential confounders, while preserving the original content.

Shown below is a randomly selected LLM-generated idea, as an example of how our ideas look like.

5/

6/40
Our 79 expert reviewers submitted 298 reviews in total, so each idea got 2-4 independent reviews.

Our review form is inspired by ICLR & ACL, with breakdown scores + rationales on novelty, excitement, feasibility, and expected effectiveness, apart from the overall score.

6/

7/40
With these high-quality human ideas and reviews, we compare the results.

We performed 3 different statistical tests accounting for all the possible confounders we could think of.

It holds robustly that LLM ideas are rated as significantly more novel than human expert ideas.

7/

8/40
Apart from the human-expert comparison, I’ll also highlight two interesting analyses of LLMs:

First, we find LLMs lack diversity in idea generation. They quickly start repeating previously generated ideas even though we explicitly told them not to.

8/

9/40
Second, LLMs cannot evaluate ideas reliably yet. When we benchmarked previous automatic LLM reviewers against human expert reviewers using our ideas and reviewer scores, we found that all LLM reviewers showed a low agreement with human judgments.

9/

10/40
We include many more quantitative and qualitative analyses in the paper, including examples of human and LLM ideas with the corresponding expert reviews, a summary of experts’ free-text reviews, and our thoughts on how to make progress in this emerging research direction.

10/

11/40
For the next step, we are recruiting more expert participants for the second phase of our study, where experts will implement AI and human ideas into full projects for a more reliable evaluation based on real research outcomes.

Sign-up link: Interest Form for Participating in the AI Researcher Human Study (Execution Stage)

11/

12/40
This project wouldn't have been possible without our amazing participants who wrote and reviewed ideas. We can't name them publicly yet as more experiments are ongoing and we need to preserve anonymity. But I want to thank you all for your tremendous support!!

12/

13/40
Also shout out to many friends who offered me helpful advice and mental support, esp. @rose_e_wang @dorazhao9 @aryaman2020 @irena_gao @kenziyuliu @harshytj__ @IsabelOGallegos @ihsgnef @gaotianyu1350 @xinranz3 @xiye_nlp @YangjunR @neilbband @mertyuksekgonul @JihyeonJe

13/

14/40
Finally I want to thank my supportive, energetic, insightful, and fun advisors @tatsu_hashimoto @Diyi_Yang

Thank you for teaching me how to do the most exciting research in the most rigorous way, and letting me spend so much time and $$$ on this crazy project!

14/14

15/40
Great work and cool findings. I have to say there is a huge confounding factor that is intrinsic motivation. Experts ( here PhD students) might not be sharing their best novel ideas because the incentive is monetary instead of publication which brings in long term benefits for students such as job offers/ prestige/ visibility. Your section 6.1 already mentions this.

Another confounding factor is time. Do you think best research / good work happens in 10 days ? I personally have never been able to come up with a good idea in 10 days. But at the same time I don’t know if an LLM can match my ideation ( at-least not yet)

16/40
this is a really cool study!

did the LLM idea's get ran past a Google search? i see the novelty scores for LLM idea's are higher but i've personally noticed asking for novel idea's sometimes results in copy-pasta from the LLM from obscure blog posts/research papers

17/40

18/40
This is cool! Want to come on my podcast and chat about it?

19/40
Great read thank you!

20/40
awesome study! but whats the creativity worth if it’s less feasible?

in my own experience it often suggests ideas that are flawed in some fundamental regards that it misses.

theres only so much facts and constraints the attention mechanism can take into account

21/40
Your thread is very popular today! /search?q=#TopUnroll Thread by @ChengleiSi on Thread Reader App

@vankous for

unroll

22/40
in my paper"Automating psychological hypothesis generation with AI: when large language models meet causal graph" Automating psychological hypothesis generation with AI: when large language models meet causal graph - Humanities and Social Sciences Communications , we have approved our workflow/algo can even generate novel hypothesis better than GPT4 and Phd .

23/40
What is the best rated LLM generated idea?

24/40
looks like we're out of a job

25/40
Skynet when tho?

26/40
Interesting! I may be doing he exact same experiment, only I let both ChatGPT and Claude participate in a language experiment I choose.

27/40
Interesting work, but would be interesting to know which LLM and prompts were used for this.

28/40
I'd love to see some examples of these novel research ideas. How do they hold up in peer review and actual experimentation?

29/40
All words are arbitrary, yet NLP is a field that treats language and words as statistically significant, excluding the social semiotic and status gain illusions inherent to language. So NLP begins as a notably degenerate field. A novel idea in NLP is technically oxymoronic.

30/40
@threadreaderapp unroll

31/40
Question, if you believe this result, are you going to switch to primarily using LLMs to pick research ideas for yourself?

32/40
Great work! Novelty can be subjective, varying with a topic’s maturity and reviewers’ perspectives. Rather than fully automating research, building practical LLM research assistants could be exciting. Looking forward to the next stage, making LLM research agents more powerful!

33/40
Nice job eager to read. One question, what if you change the topic… biology, math, arts?

34/40
kudos, super refreshing to see people invest in long term and interesting questions! gg

35/40
asking chatgpt to come up with a product, "an angry birds delivery dating app", VCs are jumping through my window, slapping my face with wads of cash

36/40
@Piniisima

37/40
Hahahahahahahahahahah, no.

38/40
Did LLMs write this derivative drivel?

39/40
大佬牛逼

40/40
The infinite unknown is a temptation, and in the face of the limits of our feelings we always have room to manoeuvre. Understanding is only the starting point for crossing over.Creation is the temptation to go beyond the unknown.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 9, 2024

1/11
@joshim5
We’re excited to introduce @ChaiDiscovery and release Chai-1, a foundation model for molecular structure prediction that performs at the state-of-the-art across a variety of drug discovery tasks

We're releasing inference code, weights & a web interface: Chai Discovery

2/11
@joshim5
We tested Chai-1 across a number of benchmarks, and found that the model achieves a 77% success rate on the PoseBusters benchmark (vs. 76% by AlphaFold3).

3/11
@joshim5
Unlike many existing structure prediction tools which require multiple sequence alignments (MSAs), Chai-1 can be run in single sequence mode without MSAs while preserving most of its performance.

4/11
@joshim5
In addition to its frontier modeling capabilities directly from sequences, Chai-1 can be prompted with new data, e.g. restraints derived from the lab, which boost performance by double-digit percentage points.

5/11
@joshim5
We are releasing Chai-1 via a web interface for free, including for commercial applications such as drug discovery. We are also releasing the code for Chai-1 for non-commercial use as a software library.

Web interface: Chai Discovery
Code: GitHub - chaidiscovery/chai-lab: Chai-1, SOTA model for biomolecular structure prediction

6/11
@joshim5
Prior to founding @chaidiscovery, our team collectively helped advance the development of 20 drug programs.

Many of us were Heads of AI at leading AI Drug Discovery companies and we’ve worked at companies like @OpenAI, @AIatMeta, @Stripe, and @GoogleAI.

7/11
@joshim5
We're well funded and grateful for the strong partnership of @ThriveCapital, @_DimensionCap, @OpenAI, @saranormous, @Neo, @lachygroom, and @Amplify, as well as angel investors @gdb, @ByersBlake, @JuliaHartz, @KevinHartz, @FEhrsam, @gaybrick, @dafrankel, @RMartinChavez.

8/11
@joshim5
Read more about our launch in @Bloomberg: Bloomberg - Are you a robot?

9/11
@apartovi
Let's goooo! I'm so proud of you Josh,
@_JackDent, & @ChaiDiscovery!

It's been seven wonderful years since we started supporting you as @Neo Scholars, and I've loved every step.

Glad to be your investor, and here's to the journey ahead.

10/11
@joshim5
Thanks so much for your support, @apartovi! And what a throwback... that photo on-stage is from the first @Neo reunion back in 2017. I think we can guess what the topic was :smile:

11/11
@jennywang01
Congrats!!

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 9, 2024

1/1
"SkillMimic" uses just 35 minutes of video and motion capture data of human demos to train simulated humanoids in basketball skills like dribbling, shooting, and layups through imitation learning.

It enables skill reuse and smooth transitions for continuous scoring ability.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/11
From learning individual skills to composing them into a basketball-playing agent via hierarchical RL -- introducing SkillMimic, Learning Reusable Basketball Skills from Demonstrations

: SOCIAL MEDIA TITLE TAG

: [2408.15270v1] SkillMimic: Learning Reusable Basketball Skills from Demonstrations

: GitHub - wyhuai/SkillMimic: Official code release for the paper "SkillMimic: Learning Reusable Basketball Skills from Demonstrations"

Work led by @NliGjvJbycSeD6t , Qihan Zhao, and Runyi Yu.

2/11
SkillMimic leverages human demonstration data extracted from video to learn specific basketball skills like dribbling and layup

3/11
Skills are learned via contact-graph-powered HOI imitation learning

4/11
Simulated humanoid sports, rise up!

5/11
Interesting work

6/11
Thanks!

7/11
The paper proposes a novel approach called SkillMimic that enables physically simulated humanoids to learn a variety of basketball skills from human-object demonstrations. The key idea is to define skills as collections of Human-Object Interaction (HOI) state transitions and then use imitation learning to train a single policy that can acquire diverse skills.

SkillMimic can effectively learn diverse basketball skills, including shooting, layups, and dribbling, within a single policy using a unified configuration. Compared to baseline methods, SkillMimic exhibits superior performance and robustness. The HLC can quickly learn complex tasks, such as continuous scoring, by reusing the skills acquired through SkillMimic.

full paper: SkillMimic: Learning Reusable Basketball Skills from Demonstrations

8/11
Dark mode for this paper

SkillMimic: Learning Reusable Basketball Skills from Demonstrations

9/11
turnaround layup looks nice!

10/11
So cool!

11/11
Amazing results and amazing to see how it's evolved after PhysHOI!
Looking forward to adding these data and skills into ***********.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 9, 2024

1/16
Announcing the most comprehensive comparison of AI chatbots: ChatGPT, Claude, Meta AI, Gemini and more

We have tested the features of every major AI chatbot and leveraged our extensive benchmarking data on the underlying models to present the most comprehensive analysis of AI chatbots yet. We’ve tested chatbots from @OpenAI , @AnthropicAI, @GoogleAI, @AIatMeta, @poe_platform, @perplexity_ai, @MicrosoftAI, @MistralAI, @huggingface, @character_ai and @xai.

On our chatbots comparison page, you’ll find:
‣ The winners of our six highlight categories (see below tweet for these)
‣ A detailed comparison table with everything from intelligence scores to features
‣ A bunch of charts (of course there are charts!)

We’ve tested everything from context window to PDF uploads to code interpreters and creating charts. We hope this work will be a helpful resource for the community to compare chatbot applications and make decisions about which chatbots suit certain use-cases.

Background on some definitions if you’re not used to all these terms: each of the major AI labs has a series of foundation models (models are called things like GPT-4o, Claude 3.5 Sonnet, Llama 3.1 70B etc) and a consumer chatbot application (eg. ChatGPT, Claude, Gemini, Meta AI). On the main Artificial Analysis leaderboards, we benchmark the foundation models themselves. The new chatbot comparison that we’re launching today is the first time we have extended our benchmarking to the consumer chatbot applications.

Please get in touch with any questions, feedback or requests on this new analysis!

See below for more.

2/16
Artificial Analysis Chatbot Comparison Category Winners - September 2024

To simplify comparison, we developed criteria for six highlight categories. For details of our criteria and reasoning, click through to the page!

Best Overall: ChatGPT Plus

Best Free: ChatGPT Free

Best for Images: Poe Pro

Best for Coding: Claude Pro

Best for Long Context: Claude Pro

Best for Data: ChatGPT Pro

3/16
We manually tested the Effective Context Window of every chatbot in our comparison - and we found some big differences!

Claude Pro is the only chatbot supporting input context greater than 40k tokens.

4/16
See all the details at: Compare AI Chatbots: ChatGPT, Claude, Meta AI, Gemini and more

5/16
yea there is one bug on the website :/ just reporting :D

6/16
I have a sub for all of them and more and just select a random one for an answer.

My goto AIs are mainly ChatGPT and Monica.

7/16
Super helpful!

A few updates/corrections for your chart though. (And I'd love to know more info on some.)

Claude limits -- I'd love more info on your testing for the rate limit for Claude, as it's a bit more fluid than other providers. (They admit this in their help docs)

Anthropic uses a lot of factors to determine your rate limit.

In our (unofficial) testing, we're normally getting about 18-25 messages. The more complexity, the fewer messages. I've never seen/heard of anyone getting more than ~35.

- ChatGPT context window -- Also, ChatGPT is VERY consistent in 31-32K memory. We've tested dozens of times over the past year-ish (including on livestreams) and have never seen anything below 30.

-- ChatGPT image generation - right now, there's limited free Dall-E 3 in the free version of ChatGPT.

8/16
This is super clear and helpful, and I love the comparison table. Will you also show the results of other smaller open source models?

9/16
Wasn't Gemini supposed to have that big 1m context window? Or am I missing something??

10/16
Does Claude Pro really have a 200k context? Its output length seems like 4k

11/16
Gemini has Artifacts, which for most people's context means the same as the Code Interpreter in practice

12/16
wow cool, seems pretty accurate too with my personal findings

13/16
Is the recent chatgpt esperimental with longer delay included in this?

14/16
You all haven't tried Maximo AI. It's free and even gives access to Claude 3.5 sonnet, too. For free.

Link: Maximo AI - Unleash the Power of AI for Trading & Content Creation

/search?q=#AI

15/16
Are you sure they were the right models though?

16/16
You are completely missing out the FREE Google ai studio here with access to gemini 1.5 pro and 2M Context. Why??

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 9, 2024

1/5
regardless of the outcome and results of the reflection-70b dive-in, the amazing thing about open-weights and open-source is how the whole community can dive in these questions together and study the model

transparency in AI is becoming so crucial as the field grow

2/5
Problem is that it is being constantly undermined by "open weights are worse than our local model/hosted API".

something has been opened but the result can not be reproduced... very straightforward situation (LK99)

3/5
I don't know. It seems to be the combination of open source hype and social media + no testing standards which allowed for these shenanigans.

I love open source, but we need to create mandatory, independent testing and evaluation standards and hold model creators accountable.

4/5
yeah, glad that the thesis applied to proprietary frontier model providers too — everyone can get answers to arbitrary questions too.

5/5
I completely agree, the community's collective scrutiny is what drives progress in AI research. Transparency is crucial, and open-weights are a huge step in the right direction.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/11
Reflection 70B update: Quick note on timeline and outstanding questions from our perspective

Timeline:
- We tested the initial Reflection 70B release and saw worse performance than Llama 3.1 70B.

- We were given access to a private API which we tested and saw impressive performance but not to the level of the initial claims. As this testing was performed on a private API, we were not able to independently verify exactly what we were testing.

- Since then, there have been additional HF releases which some providers have hosted. The latest version appears to be: mattshumer/ref_70_e3 · Hugging Face. We are seeing significantly worse results when benchmarking ref_70_e3 than what we saw via the private API.

Outstanding questions:
- We are not clear on why a version would be published which is not the version we tested via Reflection’s private API.

- We are not clear why the model weights of the version we tested would not be released yet.

As soon as the weights are released on Hugging Face, we plan to re-test and compare to our evaluation of the private endpoint.

2/11
The free version on openrouter is 100% just an api wrapping claude

3/11
I also uploaded GGUF versions to unsloth/Reflection-Llama-3.1-70B-GGUF · Hugging Face for those interested

I tried calling both the open weights & the API (API is now broken), and the uploaded weights definitely seem "worse" qualitatively

I forced the sys prompt, and used temp=0.7, top_p=0.95 as suggested

4/11
Folks, be careful calling sketchy private APIs. All calls might be intercepted and now they have your holdout test set (or worse sensitive information).

5/11
It's obviously a grift. I don't understand why any credible people are giving him the benefit of the doubt any longer. You only make yourselves look worse when you accept the excuse "we have the correct model on the API but we're retraining because we don't know how to upload it"

6/11
Well, at this point the best thing that could happen to them is for GPT-5 to come out and we all forget about the issue. They got themselves into a mess in pursuit of fame and don't know how to get out of it. Hahahahahaha

7/11
Happy to support the rigorous evaluation process

8/11
Clarify who proposed using the questionable private API for benchmarks and demonstrate your impartiality. Otherwise, the suspicion remains that you are part of the hype cycle, manipulating scores for undisclosed motives.

9/11
Thanks for your take on this. Just got back from a 2-week vacation and Reflection 70B was one of the things I bookmarked to read up on.
Is it still something worth looking into, or was this largely a fad?

10/11
This is interesting, but should private API results really be considered or taken seriously? Hmmmm...

11/11

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 9, 2024

1/3
@rohanpaul_ai
Quite incredible code generation result using the new "PLANSEARCH" search algorithm

optillm lib implements the core idea of "PLANSEARCH" here. This lib is for optimizing inference proxy. It implements several SOTA technique to improve the accuracy and performance of LLMs.

optillm lib is
- OpenAI API compatible
- They currently focus on improving reasoning over coding, logical and mathematical queries.

---

So its possible to beat the frontier models using these techniques across diverse tasks by doing additional compute at inference time.

---

They just implemented the core idea from the paper "Planning In Natural Language Improves LLM Search For Code Generation" to push gpt-4o-mini with PLANSEARCH to 63.5% on the LiveCodeBench at pass@10.

PLANSEARCH, is a novel search algorithm proposed in the above paper.

By searching over plans in natural language rather than directly over code solutions, PLANSEARCH explores a significantly more diverse range of potential solutions compared to baseline search methods.

2/3
@rohanpaul_ai

The github lib - GitHub - codelion/optillm: Optimizing inference proxy for LLMs

The Paper "Planning In Natural Language Improves LLM Search For Code Generation" - [2409.03733] Planning In Natural Language Improves LLM Search For Code Generation

3/3
@anushkmittal
impressive. we're always pushing the boundaries of what's possible with ai at shapes inc.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/9
@asankhaya
A recent paper from @hughbzhang , @evanzwangg and other folks from @scale_AI titled "Planning In Natural Language Improves LLM Search For Code Generation" ([2409.03733] Planning In Natural Language Improves LLM Search For Code Generation) showed a novel search algorithm that searches over candidate plans for solving a problem in natural language.

I have implemented the core idea from their paper in our open-source optimizing proxy optillm - GitHub - codelion/optillm: Optimizing inference proxy for LLMs

I was able to replicate the core idea, in fact, I was able to push gpt-4o-mini with plansearch to 63.5% on the LiveCodeBench at pass@10.

Also, I found that plansearch-gpt-4o-mini at pass@5, already gives almost 10% better results when compared to pass@5 gpt-4o-mini.

2/9
@n0riskn0r3ward
That was fast! What's the date range of the live code bench problem subset these pass@1 - 10 stats come from?

3/9
@asankhaya
Whatever is the default? I thought it is run based on the datetime in the lm_styles.py file.

The only change I did in LiveCodeBench was to modify the lm_styles.py file and add a model for plansearch-gpt-4o-mini, then I ran it with base_url as the optillm proxy's url. So, the numbers are for the following:

LanguageModel(
"gpt-4o-mini-2024-07-18",
"GPT-4O-mini-2024-07-18",
LMStyle.OpenAIChat,
datetime(2023, 4, 30),
link="https://openai.com/index/spring-update",
),
LanguageModel(
"plansearch-gpt-4o-mini-2024-07-18",
"PlanSearch-GPT-4O-mini-2024-07-18",
LMStyle.OpenAIChat,
datetime(2023, 4, 30),
link="https://openai.com/index/spring-update",
)

4/9
@UserMac29056
How would the results be if we fine tuning GPT-4o-mini with distilled dataset from GPT-4o plansearch?

5/9
@asankhaya
You will likely see further improvements, we did something similar for a bug fixing task recently - Static Analysis Evaluation Benchmark with OpenAI's GPT-4o Fine-Tuning

6/9
@carterleffen
great job! do you plan on opening up your fine tuning dataset and how you generate it?

7/9
@asankhaya
This is without any fine tuning, the optillm proxy applies inference only techniques that trade extra compute at inference time for better results. The entire implementation for plansearch in optillm is right here - optillm/plansearch.py at main · codelion/optillm

8/9
@hughbzhang
this is insanely fast. super impressed!

9/9
@algxtradingx
Damn, homie, good work. 🫡

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 10, 2024

1/4
@deepseek_ai

Exciting news! We’ve officially launched DeepSeek-V2.5 – a powerful combination of DeepSeek-V2-0628 and DeepSeek-Coder-V2-0724! Now, with enhanced writing, instruction-following, and human preference alignment, it’s available on Web and API. Enjoy seamless Function Calling, FIM, and Json Output all-in-one!

Note: Due to significant updates in this version, if performance drops in certain cases, we recommend adjusting the system prompt and temperature settings for the best results!

2/4
@deepseek_ai
DeepSeek-V2.5 outperforms both DeepSeek-V2-0628 and DeepSeek-Coder-V2-0724 on most benchmarks.

3/4
@deepseek_ai
In our internal Chinese evaluations, DeepSeek-V2.5 shows a significant improvement in win rates against GPT-4o mini and ChatGPT-4o-latest (judged by GPT-4o) compared to DeepSeek-V2-0628.

4/4
@deepseek_ai
DeepSeek-V2.5 is now open-source on HuggingFace!
Check it out: deepseek-ai/DeepSeek-V2.5 · Hugging Face

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/3
Wake up y'all! DeepSeek v2.5 is out! Merged model - weights on HF, 128K context, MoE - 238B param/ 21B active! - Optimised for coding

> Combined DS v2 & DS v2 Coder Merged - Beats GPT4o
> Arena Hard - 68.3% -> 76.3%
> Alpaca Eval - 46.61% -> 50.52%
> MT Bench - 8.84 -> 9.02
> Optimised for Coding - HumanEval 89%, LiveCodeBench - 41%
> Enhanced writing, instruction-following, and human preference alignment
> Function Calling, Fill-In-Middle, and JSON Output
> Available on Hugging Face
> Compatible with Transformers

Kudos to the @deepseek_ai team. I'm a big fan of their work; their models are just of primo quality!

2/3
Here's how they made this bad boi:

3/3
DeepSeek v2.5 - Model weights here:

deepseek-ai/DeepSeek-V2.5 · Hugging Face

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/2
DeepSeek V2.5 (21B active / 238B total) runs pretty nicely in MLX on an M2 Ultra.

Checkout the 4-bit model in the HF MLX community: mlx-community/DeepSeek-V2.5-MLX-AQ4_1_64 · Hugging Face

With prompt caching in MLX LM could be a pretty nice local coder (mlx-examples/llms/README.md at main · ml-explore/mlx-examples)

2/2
Speeds up the prompt / prefill time considerably since it's just loading a file from disk. Especially for MOEs for which prompt processing is kind of slow right now. I haven't measured exactly but for long prompts its orders of magnitude faster.

Generation time is unchanged.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/4
@kimmonismus
DeepSeek silently released their DeepSeek-Coder-V2-Instruct-0724, which ranks #2 on Aider LLM Leaderboard, and it beats DeepSeek V2.5 according to the leaderboard
https://teddit.net/r/LocalLLaMA/comments/1fd6z0v/deepseek_silently_released_their/

deepseek-ai/DeepSeek-Coder-V2-Instruct-0724 · Hugging Face

2/4
@olafgeibig
This is their V2 model, which was great but they also just released their newest model DeepSeek-V2.5. DeepSeek V2.5 is a merge of their V2 chat and coder model.
deepseek-ai/DeepSeek-V2.5 · Hugging Face

3/4
@relentless4o
these are awesome news

4/4
@BotChadX
they just keep shipping

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 11, 2024

1/29
@sdianahu
Deepsilicon runs neural nets with 5x less RAM and ~20x faster. They are building SW and custom silicon for it.

What’s interesting is that they have proved it with SW, and you can even try it.

On why we funded them 1/7

2/29
@sdianahu
2/7 They found that representing transformer models as ternary values (-1, 0, 1) eliminates the need for computationally expensive floating-point math.

3/29
@sdianahu
3/7 So, there is no need for GPUs, which are good at floating point matrix operations, but energy and memory-hungry.

4/29
@sdianahu
4/7 They actually got SOTA models to run, overcoming the issues from the MSFT BitNet paper that inspired this. BitNet: Scaling 1-bit Transformers for Large Language Models - Microsoft Research.

5/29
@sdianahu
5/7 Now, you could run SOTA models that typically need HPC GPUs like H100s to make inferences on consumers or embedded GPUs like the NVIDIA Jetson.

This makes it possible for the first time to run SOTA models on embedded HW, such as robotics, that need that real-time response for inference.

6/29
@sdianahu
6/7 What NVDIA is overlooking, is the opportunity with specialized HW for inference, since they've been focused on the high end with the HPC cluster world.

7/29
@sdianahu
7/7 You can try it here for the SW version GitHub - deepsilicon/Sila

When they get the HW ready, the speedups and energy consumption will be even higher.

More details here too Launch HN: Deepsilicon (YC S24) – Software and hardware for ternary transformers | Hacker News

8/29
@sdianahu
Intuitively this works, because neurons in DNN use activation functions that are S curved with 3 states. With only (-1,0,1), the dot product between matrices just becomes arithmetic.

9/29
@sdianahu
the OSS project at the moment converts a model to ternary through pretraining/distillation. It'll eventually have their ternary implementation. Stay tuned when the deepsilicon team updates Sila

10/29
@StephanSturges
interested in an intro as a potential customer

11/29
@gpt_biz
This sounds impressive, definitely worth checking out if you're into cutting-edge AI hardware development!

12/29
@gopal_kp
@threadreaderapp unroll

13/29
@threadreaderapp
@gopal_kp Guten Tag, the unroll you asked for: Thread by @sdianahu on Thread Reader App Talk to you soon.

14/29
@AsimovsOracle
Does this obviate NVidia entirely as a company, lol?

15/29
@EseralLabs
Now I have a good reason to use the Jetson :smile:

16/29
@adamviaja
I learned from context that SW stands for “software” in this context in case anyone else was also confused 🫶

cheers

17/29
@Rufus87078959
Impressive

18/29
@alexdanilo99
The founders are truly built different.

19/29
@uday_1212
Wow..!! This is really awesome..!! Will eagerly look forward to how this is going to take shape..
However we d still need GPUs for training tasks.. but inference is the major share

20/29
@timelessdev
Just software solution? I need to try.

21/29
@mattspergel
You still need to keep the model in a vector database?

22/29
@Karnage420x
I’m far too dumb to understand this.

23/29
@aitedmaster
Deepsilicon's innovation is exactly what we need to accelerate AI development. By optimizing RAM usage and boosting speed, they're pushing the boundaries of what's possible. Impressive that they're making it accessible through software too.

24/29
@cleantecher
y is the reduction in RAM needed if the code is efficient? whats the energy draw from good and bad written code and how does that change thermal effects on servers that r in farms being cooled by AC which will become uneconomical in 3 years and melt.

25/29
@shreyashdamania
How do you learn and understand these concepts from bottom up ?

Asking as a complete beginner.

BTW thankyou for the Lightcone podcast

!!!

26/29
@CannaFirm
When is next round for them, can we buy some of your equity ??

27/29
@Photo_Jeff
Please give me a device I can plug into any USB-C port on Linux, MacOS, and Windows load up a model, and run inference. I don't need another Google TPU, Groq, or million dollar cloud based solution and Coral AI is dead.

28/29
@NRA29
Wow !

29/29
@mobadawy_
This technology exists and had been commercialized for over 15 years

They just dropped out of college and gave it a new name

Fixed points processors are used almost in every application around you

Who’s revising these applications?

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 11, 2024

New powerful open Text to Speech model: Fish Speech 1.4 - trained on 700K hours of speech, multilingual (8 languages)

Fish Audio: The Best & Free Generative AI Text To Speech & Voice Cloning

Powerful, fast, and customizable text-to-speech solution. Ultra-low latency, rapid voice cloning, and flat-rate pricing for AI Voice Over.

fish.audio

Fish Audio

Accelerate Audio Generation. Fish Audio has 12 repositories available. Follow their code on GitHub.

github.com

Audio Generation For All

Making speech AI

bnew · Sep 11, 2024

1/31
@MistralAI
magnet:?xt=urn:btih:7278e625de2b1da598b23954c13933047126238a&dn=pixtral-12b-240910&tr=udp%3A%2F%http://2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%http://2Fopen.demonii.com%3A1337%2Fannounce&tr=http%3A%2F%http://2Ftracker.ipv6tracker.org%3A80%2Fannounce

2/31
@_unwind_ai
Find all the awesome LLM Apps demo with RAG and AI agents in this AI newsletter for 10x developers.

P.S: Don't forget to subscribe for FREE to show your support

unwind ai

3/31
@dreamworks2050
V O I L A

4/31
@duborges
As we're here, any cool movies to download?

5/31
@TheWorldNews
What a wonderful release.

6/31
@0xANDREAS
awesome

7/31
@pelaseyed
Here we go!

8/31
@PrimeIntellect
awesome release!!

9/31
@kianerfaan
I love you mistral

10/31
@jworthy
I love when you do this

11/31
@A1Mish
We’re so back

12/31
@testingcatalog
Is it planned to be added to Le Chat?

13/31
@QwinKolle
How do we use this ?

14/31
@Prince_Canuma
Coming to MLX

15/31
@LeeLeepenkman
Text and images in, text and images out

Cc @isidentical / @fofr and freins

16/31
@airesearch12
state of ai twitter bubble

17/31
@BoomBrusher
Low key midnight torrent drop, no big deal

18/31
@secemp9
check this out @_xjdr

19/31
@yacineMTB
thank you

20/31
@untitled01ipynb
what's in the BOX, Frenchmen

21/31
@reach_vb
Congrats on the release, there’s a community upload on HF now:

mistral-community/pixtral-12b-240910 · Hugging Face

22/31
@thedudeonx
LFG Pixtral!

23/31
@avi_press
This is just the best way to do release announcements

24/31
@rjmalka
i am never ever checking for what this torrent is

25/31
@RayLin0803

26/31
@omarsar0
nice!

27/31
@meta_ukraine
Finally open source multimodal LLM

28/31
@C0NWIC

29/31
@Qnoox
Nice

30/31
@filipwojda
nice

31/31
@TrueNathanD
Torrenting this now.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

Large Language Models News & Discussions

More options

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

mattshumer/ref_70_e3 · Hugging Face

mattshumer/ref_70_e3 · Hugging Face

Reflection 70B - API, Providers, Stats

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

Fish Audio: The Best & Free Generative AI Text To Speech & Voice Cloning

Fish Audio

bnew

Veteran