bnew

Veteran
Joined
Nov 1, 2015
Messages
58,206
Reputation
8,623
Daps
161,872








1/8
Mixture of Agents—a framework that leverages the collective strengths of multiple LLMs. Each layer contains multiple agents that refine responses using outputs from the preceding layer.
Together MoA achieves a score of 65.1% on AlpacaEval 2.0.
Together MoA — collective intelligence of open-source models pushing the frontier of LLM capabilities

2/8
Together MoA exhibits promising performance on AlpacaEval 2.0 and MT-Bench.

Together MoA uses six open-source models as proposers and Qwen1.5-110B-Chat as the final aggregator, with three layers.

3/8
We also evaluate on FLASK, which offers a more fine-grained evaluation. Together MoA outperforms the original models on most dimensions.

4/8
Both Together MoA and Together MoA-Lite are on the Pareto front, indicated by the dashed curve, in the performance vs. cost plot.

5/8
Try Together MoA through our interactive demo. Please note that the TTFT is slow at the moment due to the iterative refinement process of MoA, but we are actively working on optimizations.


6/8
Blog: Together MoA — collective intelligence of open-source models pushing the frontier of LLM capabilities

Paper: [2406.00977] Dragonfly: Multi-Resolution Zoom Supercharges Large Visual-Language Model

Code: GitHub - togethercomputer/MoA

7/8
This work was made possible through the collaborative efforts of several open-source projects. We appreciate @AIatMeta , @MistralAI , @MicrosoftAI , @alibaba_cloud , and @databricks for developing the Llama, Mixtral, WizardLM, Qwen, and DBRX models. We also thank Tatsu Labs,

8/8
Apologies, wrong arXiv paper linked above. This is the correct one: [2406.04692] Mixture-of-Agents Enhances Large Language Model Capabilities


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

We introduce Mixture of Agents (MoA), an approach to harness the collective strengths of multiple LLMs to improve state-of-the-art quality. We also provide a reference implementation, Together MoA, which leverages several open-source LLM agents to achieve a score of 65.1% on AlpacaEval 2.0, surpassing prior leader GPT-4o (57.5%).

Figure 1: Illustration of the Mixture-of-Agents Structure. This example showcases 4 MoA layers with 3 agents in each layer. The agents here can share the same model.

Overview

We are excited to introduce Mixture of Agents (MoA), a novel approach to harness the collective strengths of multiple LLMs. MoA adopts a layered architecture where each layer comprises several LLM agents. These agents take the outputs from the previous layer as auxiliary information to generate refined responses. This approach allows MoA to effectively integrate diverse capabilities and insights from various models, resulting in a more robust and versatile combined model.
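
A minimal sketch of the layered flow described above, not the Together MoA code itself. Assumptions: each agent is a plain function from prompt to response (the `make_stub` agents are hypothetical stand-ins for real LLM calls), the prompt wording in `build_prompt` is made up for illustration, and the layer and agent counts are arbitrary.

```python
from typing import Callable, List

# An "agent" is any function from a prompt string to a response string.
# In a real system each agent would wrap an LLM call; these are stand-ins.
Agent = Callable[[str], str]


def build_prompt(question: str, previous: List[str]) -> str:
    """Attach the previous layer's responses as auxiliary information."""
    if not previous:
        return question
    refs = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(previous))
    return f"{question}\n\nResponses from the previous layer:\n{refs}"


def mixture_of_agents(question: str, layers: List[List[Agent]], aggregator: Agent) -> str:
    """Run each layer in turn, then let the aggregator write the final answer."""
    previous: List[str] = []
    for layer in layers:
        prompt = build_prompt(question, previous)
        previous = [agent(prompt) for agent in layer]
    return aggregator(build_prompt(question, previous))


def make_stub(name: str) -> Agent:
    # Hypothetical stub: reports its name and how many prompt lines it saw.
    return lambda prompt: f"{name} saw {len(prompt.splitlines())} lines"


layers = [
    [make_stub(f"L{layer_idx}A{agent_idx}") for agent_idx in range(3)]
    for layer_idx in range(2)
]
final = mixture_of_agents("What is 2 + 2?", layers, make_stub("aggregator"))
print(final)  # prints: aggregator saw 6 lines
```

The first layer sees only the question. Every later layer, and the aggregator, sees the question plus all responses from the layer before it, which is the refinement step the post describes.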

Our reference implementation, Together MoA, significantly surpasses GPT-4o (57.5%) on AlpacaEval 2.0 with a score of 65.1%, using only open-source models. While Together MoA achieves higher accuracy, it comes at the cost of a slower time to first token; reducing this latency is an exciting future direction for this research.

Our approach is detailed in a technical paper on arXiv, and the open-source code, including a simple interactive demo, is available at togethercomputer/moa. We look forward to seeing how MoA will be utilized to push the boundaries of what AI can achieve.
 

bnew

1/7
I am absolutely BLOWN AWAY
@LumaLabsAI

2/7
To those asking, this is image to video. The image is an original I made through a variety of techniques (some hand drawings also) and used Luma to animate the image I created.

3/7
Yeah it’s all queued up with server overload now, my generations are also taking a beat

4/7
I'm so excited for what this means for the future of pitching concepts and ideas!

5/7
Extremely

6/7
its image to video!

7/7
its image to video!


1/2
Image to Video within seconds.
Created with
@LumaLabsAI #LumaDreamMachine

2/2
“Girl gazes with wonder.”


 

bnew

Introducing Harmonic: Our Mission and First Results

06.10.2024

Harmonic is building mathematical superintelligence.

When we want to know if an answer to a question is correct, we check the reasoning - the logical steps behind the answer. In order for artificial intelligence systems to be truthful, explainable, and aligned with us, we must endow these systems with powerful and verifiable reasoning capabilities.

The language of reasoning is mathematics, and mathematics is the means through which humans have discovered the fundamental truths about our universe. Mathematical superintelligence will greatly accelerate human advancement in science and engineering.

AI today

Modern AI systems based on large language models appear to understand and use language in a way that emulates humans, and we have seen new and exciting capabilities emerge as these models have grown in scale. However, today's models fall short of human-level reasoning; hallucinations are common and models are often as confident about wrong answers as they are about right ones. They can fail in unpredictable ways, and the consequences of such failures become increasingly significant as their usage grows. In their current state, they are unusable in most safety-critical applications.

The promise of mathematical reasoning

Models capable of formal mathematical reasoning will produce output guaranteed to be correct, with an interpretable chain of reasoning. We believe such models, with transparent and automatically-verifiable reasoning traces, will be fundamentally safe in ways that the current generation of models are not.

This approach will be immediately applicable to critical industries such as aerospace, chip design, industrial systems, and healthcare, where software reliability is paramount. Moreover, the development of such models will push the boundaries of AI research, ultimately driving the creation of more powerful and reliable AI systems across different domains.

Where we are today

Our first research result is Aristotle: an automated theorem prover advancing the state of the art on MiniF2F. This standard benchmark measures problem-solving ability on a range of problem difficulties including the International Mathematical Olympiad. To evaluate our system, we manually reformalized and improved [1] MiniF2F in Lean 4. To obtain a training set, we re-split the 488 MiniF2F problems (originally evenly divided into validation and test sets) randomly into 392 training, 48 validation, and 48 test problems, where the latter two splits are unbiased random subsets of the corresponding original validation and test sets. [2]
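
The re-split arithmetic can be checked with a short sketch. This is one plausible reading of the procedure, not Harmonic's code: the problem IDs, the seed, and the use of `random.sample` are all assumptions made for illustration.

```python
import random


def resplit(orig_valid, orig_test, n_valid=48, n_test=48, seed=0):
    """Draw new validation/test subsets from the matching original splits;
    everything not drawn becomes training data."""
    rng = random.Random(seed)  # the seed is an arbitrary choice for this sketch
    valid = rng.sample(orig_valid, n_valid)
    test = rng.sample(orig_test, n_test)
    held_out = set(valid) | set(test)
    train = [p for p in orig_valid + orig_test if p not in held_out]
    return train, valid, test


# Placeholder IDs standing in for the 244 + 244 original MiniF2F problems.
orig_valid = [f"valid_{i}" for i in range(244)]
orig_test = [f"test_{i}" for i in range(244)]

train, valid, test = resplit(orig_valid, orig_test)
print(len(train), len(valid), len(test))  # 392 48 48
```

Because the new validation set is drawn only from the original validation set (and likewise for test), each stays a random subset of its original split, and 488 - 48 - 48 = 392 problems remain for training.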

We evaluate Aristotle in two settings: one where additional external computer algebra systems are permitted to solve simple subproblems in a trusted way, and one where the full proofs are expressed in Lean. Our system currently achieves an 83% success rate in the first setting and a 63% success rate when restricted to Lean. We compare our results on the validation split to two previous state-of-the-art approaches below: [3] [4] [5]

Harmonic MiniF2F Score: 83%

Join us

Our journey to build mathematical superintelligence is just beginning. We are a commercial research lab, and our primary focus is to enable our team of talented researchers, mathematicians, and engineers to build the world's most advanced mathematical reasoning engine.

If our mission resonates with you, join us or follow us

- Tudor, Vlad, and the Harmonic Team

1 We found that many statements' formalizations in MiniF2F were much easier than the original problem, e.g. only containing the easier direction of a biconditional statement. We worked with mathematicians and olympiad competitors to ensure our version of MiniF2F represents each problem fully.

2 We plan to release our corrected datasets in the future.

3 Our Lean results are not based on expert iteration on the validation set and are not cumulative.

4 DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

5 [2310.00656] LEGO-Prover: Neural Theorem Proving with Growing Libraries
 

bnew

1/3
Artificial Gerbil Intelligence has been achieved internally!

A team at Google DeepMind has built a ‘virtual rodent’, in which an artificial neural network actuates a biomechanically realistic model of the rat. This helps provide a causal, generative model that can reproduce complex animal behaviors, not just correlate with them. The model's internal structure can be analyzed to gain insights that are hard to get from real neural data alone.

A virtual rodent predicts the structure of neural activity across behaviors - Nature

2/3
on it equally curious haha

3/3
one way to find out... simulation-facilitated street fights



Photo of Bence Ölveczky with mouse models.

Bence Ölveczky.

Niles Singer/Harvard Staff Photographer



SCIENCE & TECH

Want to make robots more agile? Take a lesson from a rat.

Scientists create realistic virtual rodent with digital neural network to study how brain controls complex, coordinated movement

Anne J. Manning

Harvard Staff Writer

June 11, 2024 4 min read

The effortless agility with which humans and animals move is an evolutionary marvel that no robot has yet been able to closely emulate. To help probe the mystery of how brains control and coordinate it all, Harvard neuroscientists have created a virtual rat with an artificial brain that can move around just like a real rodent.

Bence Ölveczky, professor in the Department of Organismic and Evolutionary Biology, led a group of researchers who collaborated with scientists at Google’s DeepMind AI lab to build a biomechanically realistic digital model of a rat. Using high-resolution data recorded from real rats, they trained an artificial neural network (the virtual rat’s “brain”) to control the virtual body in a physics simulator called MuJoCo, where gravity and other forces are present. And the results are promising.

Illustration panels showing a virtual rat using movement data recorded from real rats.

Harvard and Google researchers created a virtual rat using movement data recorded from real rats.

Credit: Google DeepMind

In a study published in Nature, the researchers found that activations in the virtual control network accurately predicted neural activity measured from the brains of real rats producing the same behaviors, said Ölveczky, who is an expert at training (real) rats to learn complex behaviors in order to study their neural circuitry. The feat represents a new approach to studying how the brain controls movement, Ölveczky said, by leveraging advances in deep reinforcement learning and AI, as well as 3D movement-tracking in freely behaving animals.

The collaboration was “fantastic,” Ölveczky said. “DeepMind had developed a pipeline to train biomechanical agents to move around complex environments. We simply didn’t have the resources to run simulations like those, to train these networks.”

Working with the Harvard researchers was, likewise, “a really exciting opportunity for us,” said co-author and Google DeepMind Senior Director of Research Matthew Botvinick. “We’ve learned a huge amount from the challenge of building embodied agents: AI systems that not only have to think intelligently, but also have to translate that thinking into physical action in a complex environment. It seemed plausible that taking this same approach in a neuroscience context might be useful for providing insights in both behavior and brain function.”

Graduate student Diego Aldarondo worked closely with DeepMind researchers to train the artificial neural network to implement what are called inverse dynamics models, which scientists believe our brains use to guide movement. When we reach for a cup of coffee, for example, our brain quickly calculates the trajectory our arm should follow and translates this into motor commands. Similarly, based on data from actual rats, the network was fed a reference trajectory of the desired movement and learned to produce the forces to generate it. This allowed the virtual rat to imitate a diverse range of behaviors, even ones it hadn’t been explicitly trained on.
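
To make "inverse dynamics" concrete, here is a toy sketch for a single point mass in one dimension. It is not the virtual rat: the study trained a neural network on a full biomechanical body, whereas this computes forces in closed form from a made-up sine-wave reference trajectory. The mass, time step, and trajectory are all assumptions.

```python
import math


def inverse_dynamics(positions, mass, dt):
    """Forces that make a point mass follow `positions` (central differences)."""
    return [
        mass * (positions[t + 1] - 2 * positions[t] + positions[t - 1]) / dt**2
        for t in range(1, len(positions) - 1)
    ]


def simulate(x0, x1, forces, mass, dt):
    """Forward-simulate the same point mass under the given forces."""
    xs = [x0, x1]
    for f in forces:
        xs.append(2 * xs[-1] - xs[-2] + (f / mass) * dt**2)
    return xs


dt, mass = 0.01, 0.3  # arbitrary values for the sketch
reference = [math.sin(2 * math.pi * k * dt) for k in range(101)]

forces = inverse_dynamics(reference, mass, dt)
replayed = simulate(reference[0], reference[1], forces, mass, dt)

max_error = max(abs(a - b) for a, b in zip(reference, replayed))
print(f"max tracking error: {max_error:.2e}")
```

The inverse model maps a desired trajectory to the forces that produce it, and replaying those forces through the forward simulation recovers the trajectory up to floating-point error.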

These simulations may launch an untapped area of virtual neuroscience in which AI-simulated animals, trained to behave like real ones, provide convenient and fully transparent models for studying neural circuits, and even how such circuits are compromised in disease. While Ölveczky’s lab is interested in fundamental questions about how the brain works, the platform could be used, as one example, to engineer better robotic control systems.

A next step might be to give the virtual animal autonomy to solve tasks akin to those encountered by real rats. “From our experiments, we have a lot of ideas about how such tasks are solved, and how the learning algorithms that underlie the acquisition of skilled behaviors are implemented,” Ölveczky continued. “We want to start using the virtual rats to test these ideas and help advance our understanding of how real brains generate complex behavior.”

This research received financial support from the National Institutes of Health.
 
bnew

1/1
New Paper and Blog!

Sakana AI

As LLMs become better at generating hypotheses and code, a fascinating possibility emerges: using AI to advance AI itself! As a first step, we got LLMs to discover better algorithms for training LLMs that align with human preferences.


1/1
Can LLMs invent better ways to train LLMs?

At Sakana AI, we’re pioneering AI-driven methods to automate AI research and discovery. We’re excited to release DiscoPOP: a new SOTA preference optimization algorithm that was discovered and written by an LLM!

Sakana AI

Our method leverages LLMs to propose and implement new preference optimization algorithms. We then train models with those algorithms and evaluate their performance, providing feedback to the LLM. By repeating this process for multiple generations in an evolutionary loop, the LLM discovers many highly-performant and novel preference optimization objectives!
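
The propose, evaluate, feed back loop described above can be sketched in toy form. Everything here is a stand-in: in DiscoPOP the candidates are objective functions written as code by an LLM, while in this sketch a "candidate" is a single number, `propose` is a hypothetical stub that perturbs the best candidate so far, and `train_and_evaluate` is a made-up score function rather than real model training.

```python
import random


def propose(history, rng):
    """Hypothetical stand-in for the LLM: perturb the best candidate so far."""
    if not history:
        return rng.uniform(0.0, 1.0)
    best_candidate, _ = max(history, key=lambda item: item[1])
    return best_candidate + rng.gauss(0.0, 0.1)


def train_and_evaluate(candidate):
    """Hypothetical stand-in for training a model and scoring it."""
    return -(candidate - 0.7) ** 2


def discover(generations, seed=0):
    rng = random.Random(seed)
    history = []  # (candidate, score) pairs fed back to the proposer
    for _ in range(generations):
        candidate = propose(history, rng)
        history.append((candidate, train_and_evaluate(candidate)))
    return history


history = discover(generations=30)
best_candidate, best_score = max(history, key=lambda item: item[1])
print(f"best candidate {best_candidate:.3f} with score {best_score:.4f}")
```

The point is only the shape of the loop: every generation's result goes back into the history that the proposer conditions on.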

Paper: [2406.08414] Discovering Preference Optimization Algorithms with and for Large Language Models
GitHub: GitHub - SakanaAI/DiscoPOP: Code for Discovering Preference Optimization Algorithms with and for Large Language Models
Model: SakanaAI/DiscoPOP-zephyr-7b-gemma · Hugging Face

We proudly collaborated with the @UniOfOxford (@FLAIR_Ox) and @Cambridge_Uni (@MihaelaVDS) on this groundbreaking project. Looking ahead, we envision a future where AI-driven research reduces the need for extensive human intervention and computational resources. This will accelerate scientific discoveries and innovation, pushing the boundaries of what AI can achieve.


 

bnew

1/9
so this is nuts, if you're cool with the high frequency details of an image being reinterpreted/stochastic, you can encode an image quite faithfully into 32 tokens...
with a codebook size of 1024, as they use, this is just 320 bits, new upper bound for the information in an image unlocked

2/9
only 2^320 unique images which is both a lot and a little

3/9
Yeah I see it as a spectrum to how generative each stage can be, but still think it’s a very underexploited area both in computation and in ways of improving outputs/controllability

4/9
you could probably get this down to fewer tokens with a larger vocab size and if willing to sacrifice a little more fidelity which IMHO is fine! especially in image generation where often we're getting random outputs anyway

5/9
Those kinds of high frequency variations are the cost and are left up to the interpretation of the decoder, I could imagine the image still looking “plausible” but yeah likely things like text or finer identities of people will not hold up. Still very practical

6/9
In the network yes, but for storage and measured information you only need the indices. A 1024 (2^10) codebook requires 10 bits to cover all values. A typical LLM tokenizer of say 65536 ( 2^16 ) vocab puts you at 2 bytes a token

7/9
Why? You can just have a more expressive decoder (like diffusion) and consider the tokens to be a prior rather than a pure reconstruction

8/9
Absolutely not lol, but it’s close to good enough standards

9/9
Generation/reconstruction is pretty different from encoding for other purposes and I think the recent meta paper showed that discrete is worse for features you’d give to a VLM for instance
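
The bit counts in this thread follow from one formula: storing a token index from a codebook of size V takes log2(V) bits. A quick check of the numbers quoted above (the function name is just for this sketch):

```python
import math


def bits_for_tokens(num_tokens: int, codebook_size: int) -> float:
    """Bits needed to store `num_tokens` indices into a codebook."""
    return num_tokens * math.log2(codebook_size)


image_bits = bits_for_tokens(32, 1024)
print(image_bits)                     # 320.0 bits for 32 tokens, 1024 codes
print(image_bits / 8)                 # 40.0 bytes
print(math.log2(65536) / 8)           # 2.0 bytes per token at a 65536 vocab
print(image_bits / math.log2(65536))  # 20.0 tokens carry the same 320 bits
```

The last line shows the trade-off from 4/9: at a fixed bit budget, a larger vocabulary means fewer tokens.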


 

bnew

1/4
High-quality out-of-order generation is going to be really fun. Several interesting papers on this topic this month showing promise.

This is from new work led by Subham Sahoo and
@volokuleshov : MDLM Blog post

2/4
Also check out this concurrent work by Jiaxin Shi and company. Somehow we went through 100 titles and came up with nearly the same one as them!

3/4
Sigma GPT also looks at this problem from a slightly different angle.

Submission 1127 - σ-GPTs - Examples

4/4
TIL: All NN libraries have a special secret version of ReLU where you also threshold at 6.

ReLU6 — PyTorch 2.3 documentation
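
For reference, ReLU6 is just ReLU with an upper clamp at 6, i.e. min(max(0, x), 6). A plain-Python sketch of the scalar function (libraries such as PyTorch apply the same thing elementwise to tensors):

```python
def relu6(x: float) -> float:
    """ReLU with the output additionally capped at 6."""
    return min(max(0.0, x), 6.0)


print([relu6(v) for v in (-2.0, 0.5, 6.0, 9.0)])  # [0.0, 0.5, 6.0, 6.0]
```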


 

bnew

1/1
Google presents Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

Presents a novel benchmark designed to assess LLMs’ temporal reasoning abilities, which SotA LLMs currently struggle with

data: baharef/ToT · Datasets at Hugging Face
abs: [2406.09170] Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning


 

bnew

1/7
Meta announces An Image is Worth More Than 16x16 Patches

Exploring Transformers on Individual Pixels

This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias -- locality in modern computer vision

2/7
architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results. This is substantially different from the popular design in Vision Transformer, which maintains the inductive bias

3/7
from ConvNets towards local neighborhoods (e.g. by treating each 16x16 patch as a token). We mainly showcase the effectiveness of pixels-as-tokens across three well-studied tasks in computer vision: supervised learning for object classification, self-supervised learning via

4/7
masked autoencoding, and image generation with diffusion models. Although directly operating on individual pixels is less computationally practical, we believe the community

5/7
must be aware of this surprising piece of knowledge when devising the next generation of neural architectures for computer vision.

6/7
paper page:

7/7
daily papers:
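
A rough sense of why pixels-as-tokens is "less computationally practical": the sequence length grows by the patch area, and self-attention cost grows with the square of the sequence length. The 224x224 image size below is an assumption picked for illustration, not a figure from the thread.

```python
def num_tokens(height: int, width: int, patch: int) -> int:
    """Number of tokens when the image is cut into patch x patch squares."""
    assert height % patch == 0 and width % patch == 0
    return (height // patch) * (width // patch)


patch_tokens = num_tokens(224, 224, 16)  # ViT-style 16x16 patches
pixel_tokens = num_tokens(224, 224, 1)   # one token per pixel

# Pairwise attention scales with the square of the sequence length.
attention_ratio = (pixel_tokens / patch_tokens) ** 2
print(patch_tokens, pixel_tokens, attention_ratio)  # 196 50176 65536.0
```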


 

bnew

1/1
Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

abs: [2406.09279] Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

"In this work, we aim to systematically investigate these key components of learning from preferences"

Exploration of Tulu 2 13B (Llama-2-13B SFT) with 14 different preference datasets evaluated on 11 different benchmarks.

Several observations:
1. synthetic, diverse data annotated with per-aspect preferences works best for learning from preferences
2. PPO outperforms DPO across varied datasets in our evaluation suite, even when using exactly the same models and initial training data
3. Increasing reward model size and dataset size used to train the reward model results in marginal to no improvements on policy model performance, except on GSM
4. Using unlabelled prompts that better match the test setting during policy training can further improve model performance in domain-specific settings


 

bnew

1/5
What if we prompt aligned LLMs like Llama-3-Instruct with nothing? Surprisingly, it will decode decent user queries thanks to its auto-regressive nature. In our new preprint, Magpie, we find this is a scalable way to self-synthesize instruction data of high quality & diversity.

arXiv: [2406.08464] Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

@huggingface : Magpie-Align (Magpie Alignment)
website: Magpie

> Data & Results: We use this prompting-with-nothing method to self-synthesize 4M data from Llama-3-Instruct. After filtering, we use a 300K subset of them for SFT. Results show that using Magpie data for SFT-ing Llama-3-8B-base is better than using other public datasets (e.g., UltraChat, ShareGPT), on alignment benchmarks like AlpacaEval, ArenaHard and WildBench; it can be comparable or even better than SFT+DPO with open data. Notably, on AE2, Magpie models' performance is close to the official Llama-3-8B-Instruct.

> How to self-synthesize alignment data? It’s easy. We only input the pre-query template tokens “<|start_header_id|>user <|end_header_id|>”, sample the completions as queries (i.e., instructions) (before <|eot_id|>). Then, we use Llama-3-8/70B-Instruct to infer the responses for creating Magpie-Air/Pro. Similarly, we can repeat the process to create multi-turn dialogs. Finally, we apply multiple filters to subsample and further ensure the overall data quality.

> Transferability? We also used the Magpie data to tune Qwen 1.5 4B & 7B base, and also get better performance than their official aligned models, suggesting that Magpie data can be used for aligning other model families and smaller models too. We hope our data can help improve AI democratization.

> Limitations? On MMLU-Redux (0-shot), Magpie-tuned LLMs are at 52%-ish (among the top in baselines), but still fall behind the official Llama-3-8B-Instruct (58%). Magpie does not have enough math data yet, and we’re still working on synthesizing more math & reasoning data. Meanwhile, we suggest mixing Magpie data with other awesome data (e.g., MAmmoTH2 from @xiangyue96 @WenhuChen ) which excels in math & reasoning. In addition, we’re also working on creating pairwise preference data for *PO.

> Why "Magpie"? "Other birds collect twigs for their nests. Magpies steal jewels for theirs."

Awesome co-authors from @UW & @allen_ai: @zhangchen_xu (lead) @fengqing_jiang Luyao Niu, @yuntiandeng @RadhaPoovendran @YejinChoinka


Thanks @_akhaliq for helping introduce our paper previously! :D

2/5
Btw, we previously observed a similar situation with GPT API: given empty queries, they will return a response to a latent query [1] ; after getting these outputs, one can recover the prompts by inversion [2].

Sadly we find it seems that @OpenAI soon disallowed us inputting

3/5
For Llama-3-Instruct, it is "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"

4/5
Previously, we thought Llama-3-instruct was only trained on generating responses and masked the loss on input tokens. They claimed they did so in Llama-2-chat report, which is also the common practice of SFT & DPO. So it’s a bit surprising to us that it can recover the inputs

5/5
WebInstruct data (MAmmoTH2) should be better in science QA, reasoning and math; Magpie has fewer instances on these domains (see our limitation part). We believe the two datasets are complementary and can be mixed together for better SFT.
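
A small sketch of the string handling behind the two-pass recipe in this thread, with no model call. Assumptions: the `sampled` completion is a made-up example of what the model might return, the helper names are hypothetical, and the assistant header in `response_prompt` follows the standard Llama 3 chat format rather than anything quoted in the thread.

```python
# Pre-query template quoted in 3/5: the model is prompted with only this.
PRE_QUERY_TEMPLATE = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
EOT = "<|eot_id|>"


def extract_query(completion: str) -> str:
    """Keep the text the model generated before the end-of-turn token."""
    return completion.split(EOT, 1)[0].strip()


def response_prompt(query: str) -> str:
    """Prompt for the second pass, where the model answers the query."""
    return (
        PRE_QUERY_TEMPLATE + query + EOT
        + "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


# Hypothetical completion; a real one would come from sampling the model.
sampled = "How do I reverse a list in Python?<|eot_id|><|start_header_id|>assistant"
query = extract_query(sampled)
print(query)
print(repr(response_prompt(query)))
```

Pass one samples from `PRE_QUERY_TEMPLATE` alone to get an instruction. Pass two feeds `response_prompt(query)` back to the model to get the response, giving one instruction-response pair.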

