bnew · Veteran · Joined Nov 1, 2015 · Messages: 55,804 · Reputation: 8,234 · Daps: 157,328

1/1
@HaHoang411
🌟 Mind-blowing work by the team at @FLAIR_Ox! They've created Kinetix, a framework for training general-purpose RL agents that can tackle physics-based challenges.
The coolest part? Their agents can solve complex physical reasoning tasks zero-shot!
🥳 Congrats @mitrma and team.

[Quoted tweet]
We are very excited to announce Kinetix: an open-ended universe of physics-based tasks for RL!

We use Kinetix to train a general agent on millions of randomly generated physics problems and show that this agent generalises to unseen handmade environments.
1/🧵


https://video.twimg.com/ext_tw_video/1856003600159256576/pu/vid/avc1/1280x720/zJNdBD1Yq0uFl9Nf.mp4


1/12
@mitrma
We are very excited to announce Kinetix: an open-ended universe of physics-based tasks for RL!

We use Kinetix to train a general agent on millions of randomly generated physics problems and show that this agent generalises to unseen handmade environments.
1/🧵



https://video.twimg.com/ext_tw_video/1856003600159256576/pu/vid/avc1/1280x720/zJNdBD1Yq0uFl9Nf.mp4

2/12
@mitrma
👾 Kinetix can represent problems ranging from robotic locomotion and grasping, to classic RL environments and video games, all within a unified framework. This opens the door to training a single generalist agent for all these tasks!
2/



https://video.twimg.com/ext_tw_video/1856003839851220992/pu/vid/avc1/640x640/J_w1M8wm8ibiGCAn.mp4

3/12
@mitrma
🎲 By procedurally generating random environments, we train an RL agent that can zero-shot solve unseen handmade problems. This includes some where RL from scratch fails!
3/



https://video.twimg.com/ext_tw_video/1856003979878051840/pu/vid/avc1/720x720/JAcE26Hprn1NXPvU.mp4

4/12
@mitrma
🟩 🟦 🟥 Each environment has the same goal: make 🟩 touch 🟦 while preventing 🟩 touching 🟥. The agent controls all motors and thrusters.

In this task the car has to first be flipped with thrusters. The general agent solves it zero-shot, having never seen it before.
4/



https://video.twimg.com/ext_tw_video/1856004286943002624/pu/vid/avc1/720x720/hjhITONkJiDY9tD2.mp4
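
To make the shared objective above concrete, here is a minimal sketch (not code from Kinetix; the function name and contact-flag inputs are hypothetical) of the reward and termination rule the tweet describes: success when the green shape touches the blue one, failure when it touches the red one, and otherwise the agent keeps driving motors and thrusters:

def goal_step(green_touches_blue: bool, green_touches_red: bool):
    """Reward/termination rule for the shared goal described above (illustrative only)."""
    if green_touches_red:
        return -1.0, True   # failure: the green shape contacted the red shape
    if green_touches_blue:
        return 1.0, True    # success: the green shape contacted the blue shape
    return 0.0, False       # episode continues; the agent keeps controlling motors/thrusters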

5/12
@mitrma
🚗 Our general agent shows emergent physical reasoning capabilities, for instance being able to zero-shot control unseen morphologies by moving them underneath a goal (🔵).
5/



https://video.twimg.com/ext_tw_video/1856004409559306241/pu/vid/avc1/994x540/AA6c6MHpWRkFt3OJ.mp4

6/12
@mitrma
🚀 We also show that finetuning this general model on target tasks is more sample efficient than training from scratch, providing a step towards a foundation model for RL.

In some cases, training from scratch completely fails, while our finetuned general model succeeds 👇
6/



https://video.twimg.com/ext_tw_video/1856004545525972993/pu/vid/avc1/1280x720/jMqgYcCwx-q4tSpm.mp4

7/12
@mitrma
📈 One big takeaway from this work is the importance of autocurricula. In particular, we found significantly improved results by dynamically prioritising levels with high 'learnability'.
7/



GcHacg4WUAAmHTp.jpg
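
The thread does not spell out how "learnability" is computed. A common scoring rule in the autocurriculum literature is p * (1 - p), where p is the agent's current success rate on a level, so half-solvable levels are prioritised over trivial or impossible ones. The sketch below (NumPy, hypothetical function names, and an assumption about the exact scoring rule, not code from Kinetix) shows that idea:

import numpy as np

def learnability(success_rates: np.ndarray) -> np.ndarray:
    # Peaks at a success rate of 0.5: levels that are neither trivial nor impossible.
    return success_rates * (1.0 - success_rates)

def sample_level_indices(success_rates: np.ndarray, num_samples: int, rng=None) -> np.ndarray:
    rng = rng or np.random.default_rng()
    scores = learnability(success_rates)
    if scores.sum() == 0.0:                      # every level is either trivial or impossible
        probs = np.full(len(scores), 1.0 / len(scores))
    else:
        probs = scores / scores.sum()
    return rng.choice(len(success_rates), size=num_samples, p=probs)

# Example: with success rates 0.0, 0.5 and 1.0, only the half-solved level is ever sampled.
print(sample_level_indices(np.array([0.0, 0.5, 1.0]), num_samples=5))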


8/12
@mitrma
🍎 The core of Kinetix is our new 2D rigid body physics engine: Jax2D. This is a minimal rewrite of the classic Box2D engine made by @erin_catto. Jax2D allows us to run thousands of heterogeneous parallel environments on a single GPU (yes, you can vmap over different tasks!)
8/
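
A toy illustration of the "vmap over different tasks" point (this is not Jax2D's API; the step function and its per-level parameters are made up): if every level's parameters are padded into arrays of the same shape, a single jax.vmap turns a one-level step function into a batched step over many different levels on one device.

import jax
import jax.numpy as jnp

def step(level_params, state, action):
    # Toy dynamics whose behaviour differs per level via that level's own parameters.
    gravity, motor_strength = level_params
    return state + motor_strength * action - gravity

batched_step = jax.vmap(step)  # batch over a leading "level" axis of every argument

levels = jnp.array([[9.8, 1.0], [3.7, 2.0], [0.0, 0.5]])  # three different tasks
states = jnp.zeros(3)
actions = jnp.ones(3)
print(batched_step(levels, states, actions))  # all three heterogeneous levels advance in one call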



9/12
@mitrma
🔧 Don't take our word for it, try it out for yourself!
Create your own levels in your browser with Kinetix.js and see how different pretrained agents perform.
9/



https://video.twimg.com/ext_tw_video/1856004915501350912/pu/vid/avc1/1422x720/7wj1y_BcHHUnNtwx.mp4

10/12
@mitrma
This work was co-led with @mcbeukman and done at @FLAIR_Ox with @_chris_lu_ and @j_foerst.
Blog: https://kinetix-env.github.io/
GitHub: https://github.com/FLAIROx/Kinetix (Reinforcement learning on general 2D physics environments in JAX)
arXiv: https://arxiv.org/abs/2410.23208 (Kinetix: Investigating the Training of General Agents through Open-Ended Physics-Based Control Tasks)
end/



11/12
@_k_sridhar
Very cool paper! FYI, we recently pretrained a generalist agent that can generalize to unseen Atari/Meta-World/MuJoCo/Procgen environments simply via retrieval augmentation and in-context learning. Our work uses an imitation learning approach: REGENT, A Retrieval-Augmented Generalist Agent That Can Act In-Context In New Environments.



12/12
@mitrma
This is really cool! Let's meet up and chat at ICLR if we both end up going?






1/11
@tanishqkumar07
[1/7] New paper alert! Heard about the BitNet hype or that Llama-3 is harder to quantize? Our new work studies both! We formulate scaling laws for precision, across both pre- and post-training: https://arxiv.org/pdf/2411.04330. TL;DR:

- Models become harder to post-train quantize as they are overtrained on lots of data, so that eventually more pretraining data can be actively harmful if quantizing post-training!
- The effects of putting weights, activations, or attention in varying precisions during pretraining are consistent and predictable, and fitting a scaling law suggests that pretraining at high (BF16) and next-generation (FP4) precisions may both be suboptimal design choices!

Joint work with @ZackAnkner @bfspector @blake__bordelon @Muennighoff @mansiege @CPehlevan @HazyResearch @AdtRaghunathan.



GcH1RBoWwAAQp1q.jpg


2/11
@tanishqkumar07
[2/7] We first study the common technique of post-train quantizing model weights, finding that the longer you train/the more data seen during pretraining, the more sensitive the model becomes to quantization at inference-time, explaining why Llama-3 may be harder to quantize.
In fact, this loss degradation is roughly a power law in the token/parameter ratio seen during pretraining, so that you can predict in advance the critical data size beyond which pretraining on more data is actively harmful if you're serving a quantized model. The intuition might be that as more knowledge is compressed into weights as you train on more data, a given perturbation will damage performance more.
Below is a fixed language model overtrained significantly to various data budgets up to 30B tokens, then post-train quantized afterwards. This demonstrates how more pretraining FLOPs do not always lead to better models served in production.



GcH9H_xXMAAwP__.jpg
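
Schematically (this is a reading of the tweet above, not the paper's exact fitted form or constants), the claim is that the loss penalty from post-train quantization grows as a power law in the tokens-per-parameter ratio D/N:

% Illustrative form only; see arXiv:2411.04330 for the actual fit.
L_{\text{quantized}}(N, D) \approx L_{\text{BF16}}(N, D) + \delta_{\text{PTQ}}(N, D),
\qquad
\delta_{\text{PTQ}}(N, D) \propto \left(\tfrac{D}{N}\right)^{\gamma_D}

Because L_BF16 keeps falling as D grows while the quantization penalty keeps growing, their sum can bottom out and then rise, which is the "critical data size" the tweet refers to.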


3/11
@tanishqkumar07
[3/7] We then turn our attention to training in low precision. We study both quantization-aware training (weights only) and low-precision training (everything in low precision). We decompose the model into weights, activations, and KV cache, finding scaling laws for loss when any of these are quantized to any precision, and develop a compositional and interpretable functional form to predict the effect on loss of quantizing any combination of the three during pretraining.



4/11
@tanishqkumar07
[4/7] Our scaling law relies on a notion of "effective parameter count" which we posit is the quantity that is reduced when you lower precision at a fixed number of real parameters, so that a 1 billion parameter model with everything trained in FP4 has a comparable number of "effective parameters" to a 250m model in BF16.

While weights can be trained in low precision without issue, activations and KV cache are sensitive. Below is the normalized "effective parameter count" as a function of precision for each of the (weights, activations, KV cache) as well as when they are all held to the same precision (tied) based on our fits.



GcH7mwpXEAAKOo5.jpg
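
One way to write down what the thread describes (again illustrative; the exponents and the shape of f are fitted in the paper) is a Chinchilla-style loss in which precision enters only through an effective parameter count:

% Schematic reading of the "effective parameter count" idea; not the paper's exact fit.
L(N, D, P_w, P_a, P_{kv}) \approx \frac{A}{N_{\text{eff}}^{\alpha}} + \frac{B}{D^{\beta}} + E,
\qquad
N_{\text{eff}} = N \, f(P_w)\, f(P_a)\, f(P_{kv}), \quad 0 < f(\cdot) \le 1

On this reading, the tweet's example (a 1B-parameter model trained entirely in FP4 behaving like a 250M-parameter BF16 model) corresponds to the product of the f factors being roughly 0.25.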


5/11
@tanishqkumar07
[5/7] Finally, we are able to unify our findings for pre- and post-training into an interpretable functional form that predicts loss from pre- and post-training in any combination of precision. We find that pretraining in low precision "robustifies" a model to post-train quantization in a quantitatively predictable way, but by less than you would intuitively expect, for reasons we outline and test in the paper.
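
Putting the two schematics above together in the same notation (illustrative only; the paper gives the actual functional form), the unified prediction is a pretraining loss that goes through N_eff plus a post-train quantization penalty that shrinks, but does not vanish, as training precision is lowered:

% Illustrative combination of the two schematics above; not the paper's exact fit.
L_{\text{final}} \approx \frac{A}{N_{\text{eff}}(P_{\text{train}})^{\alpha}} + \frac{B}{D^{\beta}} + E
+ \delta_{\text{PTQ}}\bigl(N, D, P_{\text{train}}, P_{\text{post}}\bigr)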



6/11
@tanishqkumar07
[6/7] Our work has several limitations -- we keep a controlled architecture and setup when doing experiments, but in practice architectural tweaks are often deliberately made to accommodate low-precision training. We also fit scaling laws on relatively small language models (up to ~250m) because we train over 450 models on large data budgets (up to over 25b tokens). We are excited for future work to study these effects at larger model scale!



GcH8wvFXsAAGJQX.jpg


7/11
@tanishqkumar07
[7/7] Many thanks to @Tim_Dettmers @chrismdesa @realDanFu for super helpful feedback as well as to the entire @HazyResearch team for their support! Models from our 465+ pretraining runs will soon be on HuggingFace for everyone to play around with, and code will also be released! The preprint is at https://arxiv.org/pdf/2411.04330



8/11
@matdmiller
@threadreaderapp unroll



9/11
@threadreaderapp
@matdmiller Hello, you can read it here: Thread by @Tanishq97836660 on Thread Reader App. Have a good day. 🤖



10/11
@itsclivetime
insanely cool work! :D embrace the scaling laws

will be cool to see how sota quantization schemes (mxfp, Pw!=Pkv!=Pa, etc) shift the frontier

also imho - spending half your compute budget on 1 big run to check that the fit generalizes to big models would be worth it



11/11
@omarsar0
Very interesting paper! The findings on the different precisions are really important. It's good to see papers like this that investigate these scaling laws more closely and from a different angle. Congrats!







Essential Questions to Capture the Main Points and Core Meaning of the Text​

  1. What is the central theme or argument of the paper, and how does it relate to the quantization of large language models like Llama-3?
    • This question addresses the main idea of the paper, which is the study of scaling laws for precision in pre- and post-training of large language models and the challenges associated with quantizing these models.
  2. How does the amount of pretraining data affect the sensitivity of a model to quantization, and what are the implications for model performance?
    • This question highlights the key finding that longer training or more data seen during pretraining makes the model more sensitive to quantization at inference time, leading to potential performance degradation.
  3. What are the effects of quantizing different components (weights, activations, KV cache) of a model during pretraining, and how do these effects relate to the concept of "effective parameter count"?
    • This question delves into the detailed analysis of how different parts of the model respond to quantization and introduces the concept of "effective parameter count" as a way to understand these effects.
  4. How does pretraining in low precision impact the robustness of a model to post-train quantization, and what are the quantitative predictions from the study?
    • This question explores the findings on how pretraining in low precision can make a model more robust to post-train quantization and the quantitative aspects of these predictions.
  5. What are the limitations of the study, and what future directions are suggested for further research on this topic?
    • This question addresses the limitations mentioned in the paper, such as the controlled architecture and the scale of the models studied, and looks at potential future research directions.

Detailed Answers to the Generated Questions​

1. What is the central theme or argument of the paper, and how does it relate to the quantization of large language models like Llama-3?​

The central theme of the paper is the study of scaling laws for precision in both pre- and post-training of large language models. The authors investigate how the precision of model components (such as weights, activations, and attention mechanisms) during training affects the model's performance when quantized. Specifically, they find that models become harder to quantize post-training as they are overtrained on large amounts of data, which can lead to performance degradation. This is particularly relevant to models like Llama-3, where the paper suggests that overtraining can make these models more sensitive to quantization.

2. How does the amount of pretraining data affect the sensitivity of a model to quantization, and what are the implications for model performance?​

The amount of pretraining data significantly affects the sensitivity of a model to quantization. The study shows that the longer a model is trained or the more data it sees during pretraining, the more sensitive it becomes to quantization at inference time. This sensitivity follows a power law in the token/parameter ratio seen during pretraining, indicating that beyond a certain critical data size, additional pretraining data can be actively harmful if the model is to be served in a quantized form. This means that while more pretraining data generally improves model performance, it can also make the model more vulnerable to performance degradation when quantized.

3. What are the effects of quantizing different components (weights, activations, KV cache) of a model during pretraining, and how do these effects relate to the concept of "effective parameter count"?​

The study decomposes the model into weights, activations, and KV cache and examines the effects of quantizing each component. It finds that weights can be trained in low precision without significant issues, but activations and KV cache are more sensitive to quantization. The concept of "effective parameter count" is introduced to explain these effects; it posits that lowering precision reduces the effective number of parameters, even if the actual number of parameters remains the same. For example, a 1 billion parameter model trained in FP4 precision has an effective parameter count comparable to a 250 million parameter model trained in BF16 precision. This framework helps predict the loss associated with quantizing different components of the model.

4. How does pretraining in low precision impact the robustness of a model to post-train quantization, and what are the quantitative predictions from the study?​

Pretraining in low precision can "robustify" a model to post-train quantization, but this effect is less than one might intuitively expect. The study formulates an interpretable functional form that predicts the loss from both pre- and post-training in any combination of precision. This suggests that while pretraining in low precision can make a model more resilient to post-train quantization, the benefits are quantitatively predictable and not as significant as might be hoped. This finding helps in designing more efficient training strategies that balance precision and robustness.

5. What are the limitations of the study, and what future directions are suggested for further research on this topic?​

The study has several limitations. It was conducted with a controlled architecture and setup, which may not reflect real-world scenarios where architectural tweaks are often made to accommodate low-precision training. Additionally, the experiments were conducted on relatively small language models (up to ~250 million parameters) due to the large number of models trained over extensive data budgets. Future research directions include studying these effects at larger model scales and exploring how different architectural modifications impact the findings. The authors also suggest spending more compute budget on larger runs to verify the generalizability of their fits to bigger models.
 

1/6
@HaHoang411
@NousResearch has just released the Forge Reasoning model.

Try it free here: NOUS CHAT | Talk to Hermes



2/6
@HaHoang411
Blog announcement: Introducing the Forge Reasoning API Beta and Nous Chat: An Evolution in LLM Inference - NOUS RESEARCH



3/6
@HaHoang411
It utilizes 3 test-time-compute architectures:
- Chain of Code
- Monte Carlo Tree Search
- Mixture of Agents



GcNsJrjW4AA9JvZ.jpg


4/6
@HaHoang411
Testing on an example from the ARC-Challenge dataset by @allen_ai.



https://video.twimg.com/ext_tw_video/1856449554243022848/pu/vid/avc1/720x946/X2NDUvqkC9Dw7_9t.mp4

5/6
@HaHoang411
Testing on a NAPLEX (US pharmacist licensure exam) question. And yes, the answer is perfect.



https://video.twimg.com/ext_tw_video/1856449735210524672/pu/vid/avc1/720x946/i_96J8Wt2bUEzZ-l.mp4

6/6
@HaHoang411
Testing simple coding ability. It seems like it's not really optimized for this?



https://video.twimg.com/ext_tw_video/1856451124561117184/pu/vid/avc1/720x946/VGMijiyxb6Z764KL.mp4

