bnew

Veteran
Joined
Nov 1, 2015
Messages
55,228
Reputation
8,195
Daps
156,173

1/8
Generative Verifiers: Reward Modeling as Next-Token Prediction

abs: [2408.15240] Generative Verifiers: Reward Modeling as Next-Token Prediction

New paper from Google DeepMind; Instead of training the reward model as a discriminative classifier, train it with next token prediction:

Ask model "Is the answer correct (Yes/No)?", reward score is token probability for "Yes" token.

Naturally supports CoT and majority voting by generating several verbalized rationales before predicting correctness

The approach, referred to as GenRM, outperforms discriminative verifiers and LLM-as-a-Judge, showing a 16−64% improvement in the percentage of problems solved with Best-of-N on algorithmic string manipulation and math reasoning tasks.
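
A minimal sketch of the scoring idea described above, not the paper's code: the verifier's reward is the softmax probability of the "Yes" token, and Best-of-N keeps the candidate with the highest reward. The logits, the two-token vocabulary, and the names `yes_probability` and `best_of_n` are all assumptions made for illustration; a real verifier would read these logits from an LLM after the "Is the answer correct (Yes/No)?" prompt.

```python
import math


def yes_probability(logits: dict[str, float]) -> float:
    """Softmax probability of the "Yes" token over a toy next-token vocabulary."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    return exps["Yes"] / sum(exps.values())


def best_of_n(candidates: dict[str, dict[str, float]]) -> str:
    """Return the candidate whose verifier reward p("Yes") is highest."""
    return max(candidates, key=lambda name: yes_probability(candidates[name]))


# Hypothetical verifier logits for three sampled solutions to one problem.
candidates = {
    "answer A": {"Yes": 2.0, "No": 0.5},
    "answer B": {"Yes": 0.1, "No": 1.7},
    "answer C": {"Yes": 1.2, "No": 1.0},
}
best = best_of_n(candidates)
print(best, round(yes_probability(candidates[best]), 3))
```

Because the reward is just a token probability, the same model and decoding stack used for generation can produce it, with no separate classification head.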

2/8
@memdotai mem it

3/8
Saved! Here's the compiled thread: Mem

4/8
The paper introduces "Generative Verifiers" (GenRM), a new approach to training verifiers or reward models for large language models (LLMs). Traditionally, verifiers are trained as discriminative models to score the correctness of LLM-generated solutions. However, this approach does not fully utilize the text generation capabilities of LLMs.

The key idea behind GenRM is that by recasting verification as a next-token prediction problem, LLMs can better leverage their text generation capabilities. This allows for seamless integration with other LLM techniques like CoT reasoning and majority voting, leading to substantial performance gains.
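
One way to picture the majority-voting integration, as a hedged sketch: sample several verification rationales for the same solution, read off p("Yes") after each, and average them. The numbers and the name `majority_vote_score` are hypothetical, and averaging is an assumption about how the votes are combined.

```python
from statistics import mean


def majority_vote_score(yes_probs: list[float]) -> float:
    """Average p("Yes") across independently sampled verification rationales."""
    if not yes_probs:
        raise ValueError("need at least one rationale")
    return mean(yes_probs)


# Hypothetical p("Yes") values, one per sampled chain-of-thought rationale.
rationale_scores = {
    "solution 1": [0.9, 0.2, 0.8, 0.7],
    "solution 2": [0.6, 0.5, 0.4, 0.5],
}
ranked = sorted(
    rationale_scores,
    key=lambda name: majority_vote_score(rationale_scores[name]),
    reverse=True,
)
print(ranked)
```

Averaging over rationales smooths out a single faulty chain of thought, at the cost of more inference compute per candidate.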

full paper: Generative Verifiers: Reward Modeling as Next-Token Prediction

5/8
Fascinating! This GenRM approach could revolutionize AI verification. I'm excited to see how it integrates with instruction tuning and chain-of-thought reasoning. The potential for breakthroughs in math reasoning tasks is huge!

6/8
Detailed thread here:

7/8
The only theory where things work for a reward is substance abuse theory where humans are supposed to score drugs to satisfy their "reward center".

Implementing it in AI is like creating a junk that craves for the right answer.

Make it behave, so sad ...

@alexgraveley @ylecun

8/8
Generative Verifiers: Reward Modeling as Next-Token Prediction


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew


1/1
Salesforce presents xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

- Competitive perf among open-source models
- Open-sources models, curated large-scale datasets, and fine-tuning codebase

hf: XGen-MM-1 models and datasets - a Salesforce Collection
abs: [2408.08872] xGen-MM (BLIP-3): A Family of Open Large Multimodal Models



bnew


1/5
Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

Outperforms the SotA method on performing long-horizon tasks in Minecraft and narrows the gap toward human-level performance

proj: Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks
abs: [2408.03615] Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

2/5
Do u have any reference to transformer architecture for continuous variables?

3/5
AI Summary: The paper introduces Optimus-1, a multimodal agent designed to excel in long-horizon tasks in open-world environments, particularly in Minecraft. It features a Hybrid Multimodal Memory module tha...
Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

4/5
[QA] Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

5/5
GitHub - HemanthIITJ/Deep_learning: Holistic understanding of Large Language Models (LLMs) involves integrating NLP, computer vision, audio processing, and reinforcement learning. GNNs capture intricate data relationships. Attention mechanisms, Transformer architectures, vision-language pre-training, audio processing with spectrograms, pre-trained embeddings, and reinforcement .



bnew


1/4
Meta presents UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

- Scaling offers little benefit for reasoning or relations
- Best VLMs struggle on simple digit recognition and counting tasks, e.g. MNIST

repo: GitHub - facebookresearch/unibench: Python Library to evaluate VLM models' robustness across diverse benchmarks
abs: [2408.04810] UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

2/4
1. Scaling training data or model size offers little benefit for reasoning and relational tasks compared to other benchmark types.
2. Even the best VLMs struggle on simple digit recognition and counting tasks, which can be easily solved by much smaller models.
3. Data quality matters more than data quantity for improving performance, especially on relational and reasoning tasks.
4. Tailored learning objectives can help overcome the limitations of scaling for certain capabilities.

full paper: UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

3/4
@giffmana new bench just drop

4/4
Surprised that Yann LeCun isn’t part of the authors



bnew


1/4
Meta presents MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

- Divides expert modules into modality-specific groups
- Achieves better performance than the baseline MoE
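
A toy sketch of what "modality-specific groups" could look like in a router, under stated assumptions: each token may only be sent to experts in its own modality's group. The group layout, the logits, and the top-1 selection rule are hypothetical simplifications for illustration and are not taken from the paper, whose routing scheme may differ.

```python
# Hypothetical layout: experts 0-1 serve text tokens, experts 2-3 serve image tokens.
EXPERT_GROUPS = {"text": [0, 1], "image": [2, 3]}


def route(modality: str, router_logits: list[float]) -> int:
    """Pick the highest-scoring expert, restricted to the token's modality group."""
    group = EXPERT_GROUPS[modality]
    return max(group, key=lambda expert: router_logits[expert])


# Expert 2 has the highest logit overall, but a text token cannot reach it.
logits = [0.1, 0.9, 5.0, 0.2]
text_expert = route("text", logits)
image_expert = route("image", logits)
print(text_expert, image_expert)
```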

abs: [2407.21770] MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
alphaxiv: https://alphaxiv.org/abs/2407.21770

2/4
Dark mode for this paper for those who read at night 🌙 MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

3/4
Impressive research Prof. kindly go through this [International Journal of Computer Science and Mobile Applications Journal] submit your research interest related #computerscience #robotics #machinelearning #technology DM or give contact details so we can connect for collaboration.

4/4
Glad to see more MoD papers!



bnew


1/4
Google presents ShieldGemma: Generative AI Content Moderation Based on Gemma

- Open-sources Gemma2-based content moderation models
- Outperform Llama Guard (+10.8% AU-PRC on public benchmarks) and WildGuard (+4.3%)

abs: [2407.21772] ShieldGemma: Generative AI Content Moderation Based on Gemma
alphaxiv: ShieldGemma: Generative AI Content Moderation Based on Gemma | alphaXiv

2/4
AI Summary: ShieldGemma is a suite of LLM-based content moderation models developed by Google LLC, designed to predict safety risks associated with various harm types such as hate speech and harassment in bo...
ShieldGemma: Generative AI Content Moderation Based on Gemma

3/4
if you don't like the impact of woke AI, you have to release your own moderators @elonmusk

4/4
Even some human moderators do more harm than good and they're open sourcing the one based on the weaker freemium model?



bnew


1/4
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher

MindSearch with open-source models can already deliver a competitive solution to the proprietary AI search engine

proj: MindSearch — Search Engine + LLM Agents = Answer Engine
repo: GitHub - InternLM/MindSearch: 🔍 An LLM-based Multi-agent Framework of Web Search Engine (like Perplexity.ai Pro and SearchGPT)
abs: [2407.20183] MindSearch: Mimicking Human Minds Elicits Deep AI Searcher

2/4
AI Summary: The paper presents MindSearch, an innovative framework designed to enhance web information seeking and integration by mimicking human cognitive processes. It addresses the limitations of existing...
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher

3/4
Dark mode for this paper for those who read at night 🌚 MindSearch: Mimicking Human Minds Elicits Deep AI Searcher

4/4
[QA] MindSearch : Mimicking Human Minds Elicits Deep AI Searcher



bnew


1/2
AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

- Evaluates the ability of web agents to solve realistic and time-consuming tasks.
- Includes 214 tasks from 258 different websites

proj: AssistantBench
abs: [2407.15711] AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

2/2
AI Summary: The paper introduces AssistantBench, a new benchmark featuring 214 realistic and time-consuming tasks that assess the capabilities of web agents built on language models. The study reveals sig...
AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?



bnew






1/6
We've opened the waitlist for General Robot Intelligence Development (GRID) Beta! Accelerate robotics dev with our open, free & cloud-based IDE. Zero setup needed. Develop & deploy advanced skills with foundation models and rapid prototyping

🔗: Scaled Foundations

🧵(1/6)

2/6
GRID supports a wide range of robot form factors and sensors/modalities (RGB, depth, lidar, GPS, IMU and more) coupled with mapping and navigation for comprehensive and physically accurate sensorimotor development.
(2/6)

3/6
Curate data to train Robotics Foundation Models at scale with GRID. Create thousands of scenarios, across form factors, and robustly test model performance on both sim and real data.
(3/6)

4/6
GRID's LLM-based orchestration leverages foundation models to integrate various modalities, enabling robust sensorimotor capabilities, such as multimodal perception, generative modeling, and navigation, with zero-shot generalization to new scenarios.
(4/6)

5/6
Develop in sim, export and deploy skills on real robots in hours instead of months! GRID enables safe & efficient deployment of trained models on real robots, ensuring reliable performance in real-world scenarios. Example of safe exploration on the Robomaster & Go2:
(5/6)

6/6
Enable your robots to be useful with seamless access to foundation models, simulations and AI tools. Sign up for GRID Beta today and experience the future of robotics development!

Docs: Welcome to the GRID platform! — GRID documentation

(6/6)



bnew


1/1
Alibaba/Qwen org was deplatformed by GitHub. Details are slim at the moment. Qwen is one of the best open models on the planet at the moment. Either US or China submitted claims to GitHub? Only these two nation states have the power or desire for this. Hopefully I am wrong.





1/1
We are fukking back!!! Go visit our github now!



bnew


1/1
As we move towards more powerful AI, it becomes urgent to better understand the risks in a mathematically rigorous and quantifiable way and use that knowledge to mitigate them. More in my latest blog entry where I describe our recent paper on that topic.
https://yoshuabengio.org/2024/08/29...ity-of-harm-from-an-ai-to-create-a-guardrail/



bnew





1/10
I'm excited to announce Reflection 70B, the world’s top open-source model.

Trained using Reflection-Tuning, a technique developed to enable LLMs to fix their own mistakes.

405B coming next week - we expect it to be the best model in the world.

Built w/ @GlaiveAI.

Read on ⬇️:

2/10
Reflection 70B holds its own against even the top closed-source models (Claude 3.5 Sonnet, GPT-4o).

It’s the top LLM in (at least) MMLU, MATH, IFEval, GSM8K.

Beats GPT-4o on every benchmark tested.

It clobbers Llama 3.1 405B. It’s not even close.

3/10
The technique that drives Reflection 70B is simple, but very powerful.

Current LLMs have a tendency to hallucinate, and can’t recognize when they do so.

Reflection-Tuning enables LLMs to recognize their mistakes, and then correct them before committing to an answer.

4/10
Additionally, we separate planning into a separate step, improving CoT potency and keeping the outputs simple and concise for end users.

5/10
Important to note: We have checked for decontamination against all benchmarks mentioned using @lmsysorg's LLM Decontaminator.

6/10
The weights of our 70B model are available today on @huggingface here: mattshumer/Reflection-Llama-3.1-70B · Hugging Face

@hyperbolic_labs API available later today.

Next week, we will release the weights of Reflection-405B, along with a short report going into more detail on our process and findings.

7/10
Most importantly, a huge shoutout to @csahil28 and @GlaiveAI.

I’ve been noodling on this idea for months, and finally decided to pull the trigger a few weeks ago. I reached out to Sahil and the data was generated within hours.

If you’re training models, check Glaive out.

8/10
This model is quite fun to use and insanely powerful.

Please check it out — with the right prompting, it’s an absolute beast for many use-cases.

Demo here: Reflection 70B Playground

9/10
405B is coming next week, and we expect it to outperform Sonnet and GPT-4o by a wide margin.

But this is just the start. I have a few more tricks up my sleeve.

I’ll continue to work with @csahil28 to release even better LLMs that make this one look like a toy.

Stay tuned.

10/10
We'll release a report next week!



bnew




1/4
🚀 Exciting news! We’ve officially launched DeepSeek-V2.5 – a powerful combination of DeepSeek-V2-0628 and DeepSeek-Coder-V2-0724! Now, with enhanced writing, instruction-following, and human preference alignment, it’s available on Web and API. Enjoy seamless Function Calling, FIM, and Json Output all-in-one!

Note: Due to significant updates in this version, if performance drops in certain cases, we recommend adjusting the system prompt and temperature settings for the best results!

2/4
DeepSeek-V2.5 outperforms both DeepSeek-V2-0628 and DeepSeek-Coder-V2-0724 on most benchmarks.

3/4
In our internal Chinese evaluations, DeepSeek-V2.5 shows a significant improvement in win rates against GPT-4o mini and ChatGPT-4o-latest (judged by GPT-4o) compared to DeepSeek-V2-0628.

4/4
DeepSeek-V2.5 is now open-source on HuggingFace!
Check it out: deepseek-ai/DeepSeek-V2.5 · Hugging Face





1/2
DeepSeek-v2.5 WeChat Blog is out: DeepSeek-V2.5:融合通用与代码能力的全新开源模型

The coding capability is partially degraded, as shown in the table.

The blog notes: Due to the large changes in the model version, such as the effect of some scenes, it is recommended to adjust the System Prompt and Temperature to get the best performance. (translated by WeChat)

@teortaxesTex

2/2
There is no free lunch tbh, and data mixture is a hard problem.

