bnew

Veteran
Joined
Nov 1, 2015
Messages
57,330
Reputation
8,496
Daps
159,976








1/11
@ikristoph
There is much excitement about this prompt with claims that it helps Claude 3.5 Sonnet outperform o1 in reasoning.

I benchmarked this prompt to find out if this claim is true (thanks to @ai_for_success for the heads-up on this last night) 🧵

[Quoted tweet]
Can @AnthropicAI Claude 3.5 Sonnet outperform @OpenAI o1 in reasoning? Combining Dynamic Chain of Thoughts, reflection, and verbal reinforcement, existing LLMs like Claude 3.5 Sonnet can be prompted to increase test-time compute and match strong reasoning models like OpenAI o1. 👀

TL;DR:
🧠 Combines Dynamic Chain of Thoughts + reflection + verbal reinforcement prompting
📊 Benchmarked against tough academic tests (JEE Advanced, UPSC, IMO, Putnam)
🏆 Claude 3.5 Sonnet outperforms GPT-4 and matches o1 models
🔍 LLMs can create internal simulations and take 50+ reasoning steps for complex problems
📚 Works for smaller, open models too: Llama 3.1 8B gains ~10% (33/48 vs GPT-4o's 36/48)
❌ Didn't benchmark on MMLU, MMLU Pro, or GPQA due to compute and budget constraints
📈 High token usage - Claude 3.5 Sonnet used around 1 million tokens for just 7 questions


GZOvvHRbYAAGY8T.jpg

GZMalwkWIAAq50E.jpg


2/11
@ikristoph
The TL;DR is that this prompt does not lift Claude 3.5 Sonnet to o1 levels in reasoning, but it does tangibly improve its performance on reasoning-focused benchmarks.

However, this does come at the expense of 'knowledge'-focused benchmarks, where the model is more directly generating text it has been trained on.



GZO06VqasAAMIz4.jpg


3/11
@ikristoph
The 'formal logic' and 'college mathematics' benchmarks have a significant reasoning focus. OpenAI's o1 excels in these. The use of this prompt with Sonnet also tangibly improves them.

The 'global facts' benchmark, like many other subject-matter benchmarks, is much less reasoning focused. It's more about what the model knows and doesn't know. A complex prompt can 'confuse' a model so that, even though it can typically provide the correct answer, it underperforms because of the prompt.

This is what is happening here with this prompt applied.



4/11
@ikristoph
I want to add an additional note here. The use of this prompt means that a user will get an answer after a significant delay.

In fact, it took Sonnet about 50% longer to complete the benchmarks compared to o1 mini and 100-200% longer than when using a simpler prompt.

Token usage was similarly impacted (100-200% more tokens), so there is a significant incremental cost.



5/11
@Teknium1
Can you take this prompt to o1 and maybe llama instruct etc and benchmark those too?



6/11
@ikristoph
o1 doesn’t have system prompts but I could use this text as a test prefix; they don’t recommend it tho

I do plan to test llama early this week.



7/11
@LoganGrasby
I'm surprised. I'm finding exactly the opposite on coding tasks I'm trying today. This prompt is honestly a breakthrough.



8/11
@ikristoph
The tests are consistent with your experience. In general, coding tasks are reasoning tasks and the prompt tangibly improves Sonnet on these.

The prompt does not improve, and in some cases degrades, knowledge tasks. Although that may affect coding, it likely does so less than the reasoning gains help it.



9/11
@ai_for_success
Thanks Kristoph.



10/11
@ikristoph
I am going to do some Llama ones too! I wonder how much of an improvement we get. It might help a great deal with a local model for coding.



11/11
@ikristoph
If anyone is interested, I also ran this prompt against Llama 3.1 70B, Qwen 2.5 72B, the latest Flash, as well as 4o mini.

[Quoted tweet]
If, like me, you are curious which small open and commercial LLMs have the best reasoning, and whether elaborate prompts can make them better, I have some data for Llama, Qwen, Flash, and 4o mini.


GZTrZG4asAMDvF3.jpg



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196





1/11
@_philschmid
Can @AnthropicAI Claude 3.5 Sonnet outperform @OpenAI o1 in reasoning? Combining Dynamic Chain of Thoughts, reflection, and verbal reinforcement, existing LLMs like Claude 3.5 Sonnet can be prompted to increase test-time compute and match strong reasoning models like OpenAI o1. 👀

TL;DR:
🧠 Combines Dynamic Chain of Thoughts + reflection + verbal reinforcement prompting
📊 Benchmarked against tough academic tests (JEE Advanced, UPSC, IMO, Putnam)
🏆 Claude 3.5 Sonnet outperforms GPT-4 and matches o1 models
🔍 LLMs can create internal simulations and take 50+ reasoning steps for complex problems
📚 Works for smaller, open models too: Llama 3.1 8B gains ~10% (33/48 vs GPT-4o's 36/48)
❌ Didn't benchmark on MMLU, MMLU Pro, or GPQA due to compute and budget constraints
📈 High token usage - Claude 3.5 Sonnet used around 1 million tokens for just 7 questions



GZMalwkWIAAq50E.jpg


2/11
@_philschmid
Blog:

Prompt:

Github: GitHub - harishsg993010/LLM-Research-Scripts



GZMapCSWYAAYTEq.jpg


3/11
@AndrewMayne
Telling a model, even GPT-4o-mini, to "Act as a pedantic nitpicking logical process-focused thinker" gets similar results with Strawberry, 9.11/9.9, counting horses, etc.



GZPhkHwacAAhZMk.jpg

GZPh9hkbYAA6c9R.jpg

GZPh9h-bsAAjlOY.jpg


4/11
@Teknium1
But what is the benchmark



5/11
@ikristoph
I’ve already done a quick test on this prompt and while it does tangibly improve reasoning it’s still quite a bit below o1. ( More tests to do though certainly. )

[Quoted tweet]
MMLU Formal Logic. 0 shot. Temperature 0, Top P 1.0. There is a tangible increase which is quite an accomplishment!

It's still quite a bit below o1 however. I will do some more tests tomorrow.


GZL3o-qaoAAquYL.jpg


6/11
@zirkelc_
1 million tokens for 7 questions sounds like a lot now, but in a few months it will probably be negligible



7/11
@GozukaraFurkan
It performs better at coding, that is for sure



8/11
@beingavishkar
I always thought asking the LLM to give a "confidence" or reward for its own generation was meaningless because it isn't calibrated on any scale.

Is there any ablation that proves that specifically, i.e. that the LLM is indeed more wrong when it says reward=0.7 as opposed to 0.9?



9/11
@manojlds
What's dynamic CoT? Any reference?



10/11
@jessyseonoob
Great, there was a buzz about Reflection earlier, but you made it work. It reminds me of the early AI language from project A.L.I.C.E. and the AIML tags from Dr. Wallace that were used by @pandorabots

need to try on @MistralAI too



11/11
@AidfulAI
That's super cool, especially as everyone can currently use Claude 3.5 Sonnet for FREE in the editor Zed. 👇

[Quoted tweet]
The secret for unlimited FREE Claude 3.5 Sonnet requests! My latest newsletter reveals a game-changing tool you won't want to miss. 👇

Zed: From Coding Editor to Universal AI Assistant
Imagine having unlimited access to one of the world's most advanced AI models, right at your fingertips, completely free. This isn't a far-off dream – it's the reality offered by Zed, an open-source coding editor that for me rapidly evolved into something much more. It is my central application to work with AI.

At its core, Zed is designed as a tool for developers, offering a fast and efficient coding environment. However, the recent addition of Zed AI has transformed it into a universal assistant capable of tackling a wide range of tasks. One aspect that made me start to use Zed is that Anthropic's Claude 3.5 Sonnet, which from my point of view is currently the best model to assist you with writing, can be used for free in Zed. It's important to note that the duration of this free access is unclear, and using Zed intensively for non-coding tasks might not be the intended use case. However, the potential benefits are simply too good to ignore.

Zed has four main sections: 1) file tree of current project, 2) open files, 3) assistant panel, 4) terminal.

What truly makes Zed shine is its suite of context-building commands. The /file command allows you to seamlessly incorporate any text file from your disk into the AI conversation, while /fetch can parse and include a webpage directly in your prompts. Furthermore, you can build your own prompt library: you can save text snippets and recall them with the /prompt command, providing a neat way to store personal information that helps guide the AI's replies in the direction you need. For example, you could save details about your specific Linux operating system setup in a prompt, ensuring that responses to Linux-related questions are tailored precisely to your environment.

These features are not just powerful, they are also transparent. Every piece of added context remains fully visible and editable, giving you unprecedented control over your AI interactions. And by using the Claude 3.5 Sonnet model, you can make use of up to 200k tokens for your requests, which corresponds to around 150k words or 500 book pages.

As in other chatbot applications, you can organize multiple chats in tabs and access older chats via a history button, allowing you to revisit and build upon previous conversations.

Initially, I used Zed for the intended use case of programming. However, I realized its capabilities for general requests and now keep the editor open to ask for assistance with a wide variety of tasks, from simple word translations to complex document analysis, creative writing, and in-depth research. The ability to easily incorporate content from popular file-based note-taking apps like @logseq and @obsdmd has made Zed a valuable asset in my knowledge management workflows as well.

While Zed's primary focus remains on coding, its AI features have opened up a world of possibilities. It's not just a coding editor – it's a gateway to a new era of AI-assisted work and creativity. The context-building commands are really helpful in tailoring the AI responses to your needs. From my perspective, especially as long as you can use Claude 3.5 Sonnet for free in Zed, it is the best way to explore the new possibilities text-based AI models bring to you.

Currently, Zed offers official builds for macOS and Linux. While Windows is not yet officially supported, it can be installed relatively easily using Msys2 instead of building it yourself. MSYS2 is a software distribution and building platform for Windows that provides a Unix-like environment, making it easier to port and run Unix-based software on Windows systems. I successfully installed Zed on a Windows 11 system, following the MSYS2 installation instructions for Windows (links in post below 👇).

Steps to get started with Zed. 1) login, 2) assistant panel, 3) choose model, 4) chat.

If you are now eager to get started with Zed for non-coding tasks, follow these steps after installation (the enumeration corresponds to the numbers shown in the image above):
1. Open the assistant panel with a click on the ✨ button in the lower right (or CTRL/CMD + ?)
2. Choose “Claude 3.5 Sonnet Zed” in the dropdown menu
3. Start chatting in the assistant panel (send message with CTRL/CMD + ENTER)

As Zed is, from my perspective, currently one of the most powerful ways to use text-generating AI, I intend to create some video tutorials to help others unlock Zed's full potential. If you're interested in seeing tutorials on specific Zed features or use cases, please let me know! Your feedback will help shape the content and ensure it's as useful as possible. Have you tried Zed yourself? What has your experience been like? I'm eager to hear your thoughts and suggestions on how we can make the most of this powerful tool. Just hit reply and share your thoughts.


GYq5KQeWIAMzNjb.jpg

GYq5XuaW4AA8GrL.jpg



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,330
Reputation
8,496
Daps
159,976


1/3
@maximelabonne
🛞 TxT360: new pre-training dataset with 15T tokens

Impressive release from LLM360 with a new pre-training dataset of 15T tokens. It includes a lot of new sources compared to previous open-sourced pre-training datasets, like FreeLaw, PG-19 (books), etc.

It's really interesting to understand the steps they took to filter the Common Crawl web data:

1/ Seed data: Common Crawl
2/ Text extraction: from WARC files
3/ Language filtering: remove non-English texts
4/ URL filtering: modified UT1 blacklist (see Blacklists UT1) + some redundant sources like Wikipedia
5/ Repetition removal: line, paragraph, n-gram repetitions
6/ Document filtering: tons of heuristics based on n-grams, characters, word count, etc.
7/ Line-level filtering: lots of heuristics based on punctuation, uppercase/numerical characters, etc.

In the end, 97.65% of the data is filtered out. This is complemented by high-quality sources, like scientific papers, Wikipedia, legal corpora, books, etc. Each source has its own pipeline with specific filters.

Once this is done, it's time for the most important step: global deduplication. The Bloom filter takes care of exact deduplication and removes 17% of the input documents. Fuzzy deduplication is a lot more difficult and compute-intensive. This is performed with the traditional MinHash with much work to parallelize the steps and make it more efficient.
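For readers who want a concrete picture of the fuzzy-deduplication step, here is a minimal, self-contained Python sketch of the MinHash idea (shingle the documents, hash each shingle under several seeds, compare signatures). It only illustrates the technique; LLM360's actual pipeline is heavily parallelized and its code is not released yet, so all names and thresholds below are illustrative.

# Minimal MinHash near-duplicate sketch (illustrative, not LLM360's pipeline).
import hashlib
from itertools import combinations

def shingles(text: str, n: int = 5) -> set[str]:
    """Character n-grams used as the document's feature set."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def minhash_signature(features: set[str], num_perm: int = 64) -> list[int]:
    """One min-hash per seeded hash function approximates a random permutation."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{f}".encode()).digest()[:8], "big")
            for f in features
        ))
    return sig

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumped over the lazy dog",  # near-duplicate of d1
    "d3": "completely different content about pretraining data",
}
sigs = {k: minhash_signature(shingles(v)) for k, v in docs.items()}
for a, b in combinations(docs, 2):
    print(a, b, round(estimated_jaccard(sigs[a], sigs[b]), 2))
# Pairs whose estimated similarity exceeds a chosen threshold (e.g. 0.8)
# would be marked as near-duplicates and one copy dropped.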

Thanks to LLM360 for sharing these details in addition to the dataset. Their codebase (soon to be released) will also be an interesting read.

📝 Post: TxT360: Trillion Extracted Text - a Hugging Face Space by LLM360
🤗 Dataset: LLM360/TxT360 · Datasets at Hugging Face



GZYlvjwXoAA1p8y.jpg


2/3
@maximelabonne
Here's the original post from @llm360

[Quoted tweet]
📢📢
We are releasing TxT360: a globally deduplicated dataset for LLM pretraining
🌐 99 Common Crawls
📘 14 Curated Sources
👨‍🍳 recipe to easily adjust data weighting and train the most performant models

Dataset:
huggingface.co/datasets/LLM3…

Blog:
huggingface.co/spaces/LLM360…


GZSajQ8XQAA6cmY.jpg


3/3
@canarsiecode
extremely useful. thanks for sharing 🙏




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,330
Reputation
8,496
Daps
159,976











1/11
@LunjunZhang
What if your reward model could “think” more and perform better? Even better, what if your LLM policy could also be used as a reward model?

Introducing GenRM, reward models trained as next token predictors, rather than old-fashioned classifier RMs. This enables things that weren’t possible:

🔗 Chain-of-Thought reasoning for RM
🚀 Leveraging test-time compute
🌐 Single policy + reward model

[1/N]



GWLzR9MX0AAQHi-.jpg


2/11
@LunjunZhang
LLM-generated solutions often sound convincing even when they are wrong.

For example, the solution below is incorrect because it ignores the word ‘each’ in the problem.

While standard RMs get fooled easily, GenRM can detect the error by explicitly reasoning (with Chain-of-Thought) about solution correctness.

[2/N]



GWLzo55XsAAciVF.jpg


3/11
@LunjunZhang
On algorithmic and math reasoning tasks, GenRM outperforms classical RM and prompted LLM verifiers (LLM-as-a-Judge), in terms of Best-of-N performance, which uses the reward model to select the best solution among N candidates from a fixed LLM.

On GSM8K, when using a Gemma-9B GenRM to verify the outputs of Gemini 1.0 Pro, we observe a 20% improvement with Best-of-N (73% → 92.8%), beating direct generation from GPT-4 and Gemini 1.5 Pro.
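For context, Best-of-N with a reward model is just "sample N candidates, score each, keep the argmax". The sketch below uses hypothetical generate/score callables rather than anything from the paper:

# Illustrative Best-of-N selection (hypothetical helpers, not the paper's code).
from typing import Callable

def best_of_n(question: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 16) -> str:
    candidates = [generate(question) for _ in range(n)]     # sample N solutions
    return max(candidates, key=lambda sol: score(question, sol))  # keep best-scored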



GWLz_I_XoAAGWmL.jpg


4/11
@LunjunZhang
So how does GenRM work? Previously, reward models (RMs) and verifiers were trained as binary classifiers. They do not utilize the text generation capabilities of LLMs.

Now, given a question and an answer, GenRM simply finetunes an LLM to answer "Is the answer correct (Yes/No)?" with a " Yes" or " No" token.
Surprisingly, this itself can do better than standard RMs.
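A rough sketch of what that scoring looks like in practice, assuming a Hugging Face causal LM as the verifier (the checkpoint name is a placeholder, not the paper's released model): the reward is the probability mass the model puts on a " Yes" next token.

# Sketch of scoring with a generative verifier: reward = p(" Yes") after the
# verification question. Checkpoint name is a placeholder, not the paper's model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")  # placeholder base model
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

def genrm_score(question: str, solution: str) -> float:
    prompt = f"{question}\n{solution}\nIs the answer correct (Yes/No)?"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]               # next-token logits
    probs = torch.softmax(logits, dim=-1)
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]
    return (probs[yes_id] / (probs[yes_id] + probs[no_id])).item()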



GWL0MFzWsAAvICo.jpg


5/11
@LunjunZhang
We can *train* the verifier to use chain-of-thought (CoT) reasoning before giving a final "Yes" / "No" answer.

GenRM-CoT: ‘Let’s verify step by step’, literally.

This can enable the generative verifier to catch subtle mistakes that the discriminative RM misses.



GWL0fiyXwAAu-We.jpg


6/11
@LunjunZhang
Using Chain-of-Thought in the reward model unlocks the possibility of leveraging additional inference-time compute to improve verification.

GenRM-CoT can utilize majority voting by sampling multiple verification CoTs and computing the average correctness scores, to turn more test-time compute into more problems solved.

Fine-tuned GenRM models even outperform the LLM (Gemini 1.0 Pro) we used for generating verification CoTs on training problems!
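Conceptually, the majority-voting step is just averaging verdicts over K sampled verification chains. A small sketch with hypothetical helper callables (not the paper's code):

# Illustrative majority voting for GenRM-CoT: sample K verification CoTs,
# extract a Yes/No verdict from each, average into one correctness score.
from typing import Callable

def genrm_cot_score(question: str,
                    solution: str,
                    sample_cot: Callable[[str, str], str],
                    verdict_is_yes: Callable[[str], bool],
                    k: int = 32) -> float:
    votes = [verdict_is_yes(sample_cot(question, solution)) for _ in range(k)]
    return sum(votes) / k  # fraction of verification CoTs that say "Yes"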



GWL0tmyXoAAI-LJ.jpg


7/11
@LunjunZhang
Now that we are doing next-token prediction, we can unify generation and verification, by simply adding the generation and verification tasks to the same data mixture. As expected, generation and verification are synergistic.

Teaching the verifier to imitate correct solutions improves verification.

Teaching the generator to verify solutions also improves generation performance itself.



GWL0-arWsAAXRvo.jpg


8/11
@LunjunZhang
How does GenRM scale with model size?

On GSM8K, we scale model capacity using Gemma 2B, 7B, 9B, and observe positive scaling trends both for GenRM (direct) and GenRM-CoT.



GWL1JvEWEAAzGdn.jpg


9/11
@LunjunZhang
How do we generate synthetic verification CoT data given only solution correctness labels? This is important, as collecting such data from humans is expensive and eventually infeasible as LLMs surpass human reasoning.

To get synthetic data to work, we provided a reference solution in addition to the problem and solution to verify. This improves the rationale data quality to be high enough such that GenRM-CoT can work well.



GWL1ccxXsAAjKAD.jpg


10/11
@LunjunZhang
Training on synthetic rationales also leads to interesting emergent behaviors.

Even if GenRM-CoT does not catch the correct step-level mistake, it may still consider the answer problematic, and then attempt the problem from a different angle, before giving the final Yes/No verification answer.



GWL1ohRWIAAQjmw.png


11/11
@LunjunZhang
We hope that Generative RMs can pave the way for better reward models and self-improving reasoners that can verify their own outputs.

[2408.15240] Generative Verifiers: Reward Modeling as Next-Token Prediction

Fun collaboration with @arianTBD , @hbXNov , @kazemi_sm , @aviral_kumar2 , @agarwl_ at Google DeepMind.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 


bnew

Veteran
Joined
Nov 1, 2015
Messages
57,330
Reputation
8,496
Daps
159,976


1/2
@_philschmid
Big Distilabel release! Distilabel is an open source framework for creating synthetic datasets and generating AI feedback, designed to provide fast, reliable, and scalable pipelines based on verified research papers for engineers! 👀

And just got its 1.4 release with:
🧩 New Steps for better dataset sampling, deduplication (embeddings and minhash), truncation of inputs and better combining outputs
💰 50% Cost Savings by pausing pipelines and using OpenAI Batch API
⚡️ Caching for step outputs for maximum reusability—even if the pipeline changes.
📝 Steps can now generate and save artifacts, automatically uploaded to the Hugging Face Hub.
🆕 New Tasks with CLAIR, APIGen, URIAL, TextClassification, TextClustering, and an updated TextGeneration task.



https://video.twimg.com/ext_tw_video/1844053895963639814/pu/vid/avc1/1280x720/YMKBqh95eHOKxELK.mp4

2/2
@_philschmid
Full Release: Release 1.4.0 · argilla-io/distilabel

Docs: Distilabel

Release 1.4.0 · argilla-io/distilabel




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,330
Reputation
8,496
Daps
159,976


1/6
@_philschmid
Can we improve retrieval for RAG by learning from neighboring contexts? Contextual Document Embedding shows how neighboring document information, during training and encoding, can create "context-aware" embeddings that significantly improve retrieval performance, especially in out-of-domain scenarios.

Implementation
1️⃣ Cluster similar documents to identify neighboring documents for each one.
2️⃣ Extend Encoder to include information from these neighboring documents when generating embeddings.
3️⃣ Train the model using a contrastive learning objective that incorporates neighboring documents into the loss function.

Insights
🥇 Best open embedding Model (< 250M) on MTEB with 65.00
🥈 Without ahead of time information achieves 63.8 on MTEB
📚 SOTA on MTEB without hard negative mining or massive batch sizes.
🔍 Using in-domain contextual documents for training/clustering works best
🚀 Can be applied to embedding models to improve performance
🤯 Outperforms traditional biencoders, especially out-of-domain tasks
🛠️ Filtering false negatives boosts performance further.
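The post doesn't include code, so here is only a conceptual sketch of the two-phase idea (first summarize neighboring corpus documents, then condition each document embedding on that context). The class and the way context is mixed in are illustrative stand-ins, not the API of the released jxm/cde-small-v1 model.

# Conceptual two-phase "contextual" embedding sketch (all names hypothetical).
import numpy as np

class ContextAwareEncoder:
    def __init__(self, base_encode):
        self.base_encode = base_encode  # any text -> vector function

    def build_corpus_context(self, corpus_sample: list[str]) -> np.ndarray:
        # Phase 1: embed a sample of neighboring documents from the target
        # corpus; these vectors summarize the domain the encoder will serve.
        return np.stack([self.base_encode(doc) for doc in corpus_sample])

    def encode(self, text: str, context: np.ndarray) -> np.ndarray:
        # Phase 2: condition the document embedding on the corpus context.
        # A real model attends over the context vectors inside the transformer;
        # the weighted average here is only a stand-in for that idea.
        doc_vec = self.base_encode(text)
        return 0.7 * doc_vec + 0.3 * context.mean(axis=0)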



GZbblpNWUAA1HGp.jpg


2/6
@_philschmid
Paper: [2410.02525] Contextual Document Embeddings
Notebook: Google Colab
Model: https://huggingface.co/jxm/cde-small-v1
https://huggingface.co/jxm/cde-small-v1



3/6
@SameerReddy0
@memdotai mem it



4/6
@memdotai
Saved! Here's the compiled thread: Mem



5/6
@markopolojarvi
The only showstopper for me personally is that 512 token context but fingers crossed they can extend it easily. If they could add nomic style variable dimensions on v2, that would be amazing because 768 dims is a bit of waste of space with something like 50 tokens.



6/6
@IanTimotheos
In addition to that benefit, it is only through context that the ability to understand and comprehend arises.

Context is all you need, attention is helpful.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,330
Reputation
8,496
Daps
159,976


1/2
@_philschmid
What is a Mixture of Experts (MoE), and why are they successful? @MaartenGr just published a new visual guide on the Mixture of Experts (MoE) to explain the two main components of MoE: Experts and the Router. 👀

TL;DR:
🧠 MoE consists of multiple "expert" neural networks and a router that directs inputs to the most suitable experts
🔄 Experts aren't domain specialists, but rather learn to handle specific tokens in specific contexts
⚖️ Load balancing is crucial to ensure all experts are utilized effectively during training
🚂 The router uses probability distributions to select which experts process which tokens
📊 MoE allows models to have more parameters overall while using fewer during actual inference
🖼️ MoE isn't limited to language models - it's also being adapted for vision models
🔢 Mixtral 8x7B demonstrates the power of MoE, loading 46.7B parameters but only using 12.8B during inference

Big Kudos to @MaartenGr! Recommend taking a look!
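To make the experts-plus-router idea concrete, here is a minimal top-k MoE layer sketch in PyTorch. It illustrates the routing mechanism described in the guide (router probabilities, only k experts run per token, hence large total but small active parameter counts); it is not Mixtral's implementation, and the dimensions are arbitrary.

# Minimal top-k MoE layer sketch (illustrative, not Mixtral's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)            # routing probabilities
        weights, idx = gate.topk(self.top_k, dim=-1)        # pick top-k experts/token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(4, 64)
print(TinyMoE()(tokens).shape)  # torch.Size([4, 64])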



GZTQP6XXIAEoOGm.jpg


2/2
@_philschmid
A Visual Guide to Mixture of Experts (MoE)




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,330
Reputation
8,496
Daps
159,976


1/10
@_philschmid
Can we build more capable AI agents by learning from cognitive science? Cognitive Architectures for Language Agents (CoALA) introduces a structured approach to design AI Agents by integrating cognitive architecture principles with modern LLMs.

CoALA describes three key components: 1) modular memory, 2) a structured action space for internal/external interactions, and 3) a generalized decision-making process to select actions.

CoALA Implementation
1️⃣ Define memory components (working, long-term, procedural)
- Working Memory: For temporary information during tasks.
- Long-Term Memory: To store knowledge and experiences.
- Procedural Memory: For skills and action sequences.

2️⃣ Define a Structured Action Space (internal and external actions).
- Internal Actions: Reasoning steps, memory updates.
- External Actions: Interacting with tools, APIs, or environments.

3️⃣ Implement a decision-making process (propose, evaluate, select).
- Propose: Generate possible actions.
- Evaluate: Assess actions based on goals and context.
- Select: Choose the optimal action to execute.

4️⃣ Add safety mechanisms and monitoring systems

5️⃣ Test and iterate to refine the agent's components and behavior.
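Since the paper itself ships no code (see below), here is a minimal, purely illustrative Python sketch of how the memory modules and the propose/evaluate/select loop could fit together; every name in it is a placeholder, not part of CoALA's specification.

# Minimal CoALA-style agent loop sketch (all names illustrative).
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Memory:
    working: dict = field(default_factory=dict)       # temporary task state
    long_term: list = field(default_factory=list)     # knowledge and experiences
    procedural: dict = field(default_factory=dict)    # skills / action routines

@dataclass
class Action:
    name: str
    kind: str                                          # "internal" or "external"
    run: Callable[[Memory], str]

def decision_step(memory: Memory,
                  propose: Callable[[Memory], list[Action]],
                  evaluate: Callable[[Action, Memory], float]) -> str:
    candidates = propose(memory)                             # propose
    scored = [(evaluate(a, memory), a) for a in candidates]  # evaluate
    _, best = max(scored, key=lambda pair: pair[0])          # select
    result = best.run(memory)                                # execute the action
    memory.working[best.name] = result                       # update working memory
    return result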



GZWOdlxWAAAkskT.jpg


2/10
@_philschmid
Paper: Paper page - Cognitive Architectures for Language Agents

This paper provides a theoretical framework for integrating cognitive principles with large language models to create more capable AI agents; it comes with no code or examples.



3/10
@adridder
Co-intelligence unlocks amplified reasoning. Cognitive insights enrich AI capabilities. Exciting frontier



4/10
@viksit
thinking of this as a scalable engineering project is going to be a very complex undertaking in this form.

episodic memory is going to exponentially balloon. pruning strategies will be needed. a centralized semantic memory seems over generalized?



5/10
@KingMumboJumbo
Very cool - made a notebookLM podcast on it: Sign in - Google Accounts



6/10
@CohorteAI
CoALA is a transformative approach that bridges cognitive science with modern LLMs to create more capable AI agents. By implementing modular memory systems, structured action spaces, and a robust decision-making process, CoALA enhances an agent's ability to manage information, interact with tools, and make informed decisions. This integration not only improves performance on complex tasks but also incorporates essential safety mechanisms. Leveraging cognitive architecture principles, CoALA sets the foundation for AI agents that are more intelligent, adaptable, and aligned with human-like reasoning. Exciting developments ahead for intelligent multi-agent systems!



7/10
@yaelmendez
Love it. Neural networks are only beginning to take shape. @neuromitosis



8/10
@SaikiK66287209
Interesting approach



9/10
@andysingal
Exciting times ahead for all of us 😊



10/10
@AppyPieInc
Great insights! Integrating cognitive science with AI design like CoALA is a game changer for creating smarter agents. Excited to see how this evolves!




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,330
Reputation
8,496
Daps
159,976


1/2
@_philschmid
Are LLMs really good at Math? A new paper reveals that LLMs have strong performance on individual math problems but struggle with chained problems where the answer to one informs the next. This reasoning gap is larger in smaller, specialized models. 👀

The reasoning gap is the difference between an LLM's expected performance (based on individual question accuracy) and its actual performance on chained reasoning tasks.

Tested Models include @GoogleDeepMind Gemini Series (1.5 Pro, 1.0 Pro, 1.5 Flash), @OpenAI GPT-4o and mini, @AIatMeta Llama 3 (70B and 8B versions), specialized math models (Mathstral-7B, NuminaMath-7B-CoT), @MistralAI, @Microsoft Phi and more.

1️⃣ Create pairs of grade-school math problems where the answer to the first (Q1) is needed for the second (Q2), forming the Compositional GSM dataset.
2️⃣ Evaluate LLMs on both individual problems (Q1, Q2 separately) and the combined pairs.
3️⃣ Compare the actual accuracy on the chained pairs with the expected accuracy (accuracy of Q1 × accuracy of Q2) ⇒ reasoning gap
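A tiny worked example of step 3, with illustrative accuracies (not numbers from the paper):

# Worked example of the reasoning-gap computation from step 3 above.
acc_q1, acc_q2 = 0.90, 0.80          # accuracy on each question asked in isolation
expected_chained = acc_q1 * acc_q2   # 0.72: what independence would predict
observed_chained = 0.58              # measured accuracy on the chained pairs (made up)
reasoning_gap = expected_chained - observed_chained
print(f"expected {expected_chained:.2f}, observed {observed_chained:.2f}, "
      f"gap {reasoning_gap:.2f}")    # gap 0.14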

Insights
💡 LLMs struggle with multi-hop reasoning, leading to a "reasoning gap" in chained math problems.
🤔 The reasoning gap might come from distraction and too much additional context (indicating missing training data?).
📈 Larger LLMs generally perform better than smaller, specialized models
📚 Fine-tuning on grade-school math can lead to overfitting, hindering generalization to chained problems.
💡 Instruction tuning and code generation improvements differ between model sizes.
📊 High scores on standard benchmarks don't reflect true reasoning abilities in multi-step problems.
❌ OpenAI o1-preview and o1-mini were not tested (they probably weren't released at the time)



GZHbFc1WYAAfWSO.jpg


2/2
@_philschmid
Paper page - Not All LLM Reasoners Are Created Equal




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,330
Reputation
8,496
Daps
159,976

1/1
@_philschmid
I took a closer look at the new Realtime API from @OpenAI and how it works! It supports both text and audio as inputs and outputs, as well as function calling capabilities using websockets. 👀

How it Works:
> The Realtime API uses WebSockets for a persistent connection, enabling real-time communication between client and server.

> It's stateful, maintaining context throughout a session. The API supports sending and receiving text and audio, managing conversations, handling function calls, and gracefully managing interruptions.

> Client and server communicate through JSON-formatted events. Payloads are JSON objects specifying event types and related data (e.g., text, audio chunks, function calls). Audio gets resampled to 24kHz mono pcm16 and is then sent as a base64-encoded string.

> The server can simultaneously return multimodal output, e.g. a text transcript alongside the audio, which can be used for moderation.

> The API supports 9 client events and 28 server event types, including 14 different response events alone, such as audio, deltas, function call arguments, and more.

> Function calling is integrated directly into the API. You can define functions (tools) for the model to use. When the model decides to call a function, it sends a response.function_call event instead of “audio”.

> Long conversations are automatically truncated and are based on heuristics designed to preserve the most essential context.

> Clients can send conversation history or additional context alongside audio input, e.g. (RAG, Function calling)

> The server maintains an "Input Audio Buffer" of audio chunks that are not yet committed to the conversation. Server-side VAD is used to detect the end of speech and decide when to process.

> The server can be interrupted during a response, either automatically by server-side VAD or by a client's response.cancel event.

Seamlessly mixing text, audio, and function calls in real-time might come at the cost of increased complexity - building good UX will be harder in the beginning. But I'm excited. 🚀
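Here is a minimal Python sketch of that flow using the websockets package. The endpoint, headers, and event names follow the API as described at launch and should be treated as assumptions to check against the current docs.

# Minimal Realtime API sketch over a WebSocket (event names as described at launch).
import asyncio, json, os
import websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def main():
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: newer websockets releases call this parameter additional_headers=.
    async with websockets.connect(URL, extra_headers=headers) as ws:
        # Configure the session (modalities, voice, VAD, tools, ...).
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"modalities": ["text", "audio"]},
        }))
        # Add a user message to the conversation and ask for a response.
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {"type": "message", "role": "user",
                     "content": [{"type": "input_text", "text": "Hello!"}]},
        }))
        await ws.send(json.dumps({"type": "response.create"}))
        # Stream server events (text/audio deltas, function calls, ...).
        async for raw in ws:
            event = json.loads(raw)
            print(event["type"])
            if event["type"] == "response.done":
                break

asyncio.run(main())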



GZDiOfEWkAAWwR6.jpg



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,330
Reputation
8,496
Daps
159,976


1/5
@dvilasuero
Turns out 40% of the Reflection-v1 dataset had duplicated rows.

I’ve shared a deduplicated version and started a discussion here:

glaiveai/reflection-v1 · Duplicates

Let’s invest more time in understanding data and build better open datasets together.



https://video.twimg.com/ext_tw_video/1842184509686513664/pu/vid/avc1/852x720/3Ky9qhNBI-yPqfsQ.mp4

2/5
@AymericRoucher
With which software do you make this cool video @dvilasuero ?



3/5
@dvilasuero
Screen Studio — Professional screen recorder for macOS by @pie6k !



4/5
@MaziyarPanahi
Thanks @dvilasuero for spending time and resources to provide some insights about the dataset. 🙌🏼



5/5
@calebfahlgren
In DuckDB it's only one line to find the number of duplicates for each column!
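For reference, this is the kind of one-liner being referred to, via DuckDB's Python API; the file path and column name are placeholders, not the actual dataset layout.

# Count duplicated values in one column with DuckDB (placeholder path/column).
import duckdb

duplicates = duckdb.sql("""
    SELECT COUNT(*) - COUNT(DISTINCT prompt) AS duplicate_rows
    FROM 'reflection-v1.parquet'
""").fetchone()[0]
print(duplicates)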



GZDXixYW0AAq-l1.jpg



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,330
Reputation
8,496
Daps
159,976


1/11
@_philschmid
ReAct-like prompting? @AIatMeta Llama 3.1 70B using this new Dynamic Chain of Thoughts, reflection, and verbal reinforcement prompt



GZMdWRpXgAAW3X6.jpg


2/11
@josephpollack
is this result of the prompt format you shared earlier ?



3/11
@_philschmid
If you check the gist there is the system prompt I used. It’s from the repository



4/11
@RasputinKaiser
{
"Objective": "Please create an AI Art Tutorial and Keyword hub. It needs sections on colors, lists of colors. Art styles, lists of art styles. Art Mediums, lists of Art Mediums. Textures, lists of textures. Materials, lists of materials. Patterns, lists of patterns..",
"Instructions": {
"1": {
"Tag": "<thinking>",
"Description": "Enclose all thoughts and explorations within this tag.",
"Varentropy": "Explore multiple angles and approaches to prompting to encourage diversity and creativity."
},
"2": {
"Tag": "<step>",
"Description": "Break down your analysis into clear steps within this tag.",
"StepBudget": {
"InitialSteps": 10,
"AllowRequestMore": true
},
"UseCountTag": {
"Tag": "<count>",
"Description": "Show the remaining steps after each step."
}
},
"3": {
"Tags": [
"<formal>",
"<informal>",
"<summary>",
"<detailed>",
"<example>",
"<definition>"
],
"Description": "Apply appropriate tags to adjust the style or focus."
},
"4": {
"Tag": "<reflection>",
"Description": "Regularly evaluate your progress using this tag. Be critical and honest about the effectiveness of the prompting strategies you've explored."
},
"5": {
"Tag": "<reward>",
"Description": "After each reflection, assign a quality score between 0.0 and 1.0 using this tag.",
"Criteria": [
"Effectiveness: How well the prompts elicit desired responses.",
"Clarity: The understandability of the prompts.",
"Creativity: The uniqueness and innovation in your approaches."
]
},
"6": {
"Guidance": "Use your self-assigned scores to guide your next steps.",
"ScoreActions": {
"0.8+": "Continue refining the current approach.",
"0.5 - 0.7": "Consider minor adjustments for improvement.",
"<0.5": "Re-evaluate and consider alternative prompting strategies."
}
},
"7": {
"Tag": "<thinking>",
"Description": "If unsure or if the reward score is low, backtrack and explore different approaches. Document your decisions and reasoning within this tag."
},
"8": {
"Description": "Explore multiple prompting strategies individually. Compare and contrast these approaches in your reflections."
},
"9": {
"Description": "Use the <thinking> and <step> tags as a scratchpad to write out all reasoning and insights explicitly."
},
"10": {
"Tag": "<summary>",
"Description": "Summarize your key findings and insights within this tag. Provide clear guidelines or best practices for effective AI prompting."
},
"11": {
"Tag": "<final_reflection>",
"Description": "Conclude with a final reflection within this tag. Discuss the overall effectiveness of the prompting strategies explored, challenges faced, and solutions found."
},
"12": {
"Tag": "<reward>",
"Description": "Assign a final reward score after the <final_reflection>, summarizing your overall performance based on the criteria."
}
},
"Guidelines": {
"EncourageVariedExploration": {
"Description": "Use <thinking> tags to freely explore different prompting methods without self-censorship. Embrace varentropy by considering unconventional or creative prompting techniques."
},
"BeStructuredAndMethodical": {
"Description": "Clearly delineate each step of your exploration to maintain organization. Keep track of your step budget to ensure a thorough yet efficient analysis."
},
"UtilizeTagsEffectively": {
"Description": "Adjust the tone and depth of your explanations using the appropriate tags. Provide definitions and examples to clarify complex concepts."



5/11
@anshulcreates
this is honestly game changing. wow!



6/11
@WhatNextBTW
They (LLMs) still don't know what they don't know. The 1st and 2nd counting methods must be different, without code; then compare both results. In fact, counting repeated things runs up against natural language itself. 'Indirect counting' is the solution for the 'r's in strawberry or similar.



7/11
@filoynavaja
On Qwen2.5 14B



GZNec8jXcAAhgVi.jpg


8/11
@ko_kay_o
Share HuggingChat prompt bro 😺 or is it the Llama one?



9/11
@Hiswordxray
Have you tried it with Llama 3.1 8B?
And did it get the strawberry question correctly?

Also try Llama 3.1 3B with the strawberry question.



10/11
@CoderHarish
Thanks for testing my prompt and reading my blog
Thanks @philschmid



11/11
@andysingal
adding here: prompt-docs/Dynamic-COT.md at main · andysingal/prompt-docs




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,330
Reputation
8,496
Daps
159,976



1/9
@DynamicWebPaige
🤯⚡️ Gemini 1.5 Flash-8B + code execution is insane.

I just uploaded the text for a short story - Harlan Ellison's "Repent, Harlequin!" - and it was able to generate Python not only to tell me how many times Harlequin occurred in the text (✅ correct)

but also to generate a dictionary for how many times each word occurred.
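For reference, the generated code presumably boils down to something like the standard Counter pattern below (a sketch, not the exact code from the demo; the filename is a placeholder):

# Word-count sketch for the uploaded story text (illustrative, placeholder filename).
from collections import Counter
import re

with open("repent_harlequin.txt", encoding="utf-8") as f:
    text = f.read()

words = re.findall(r"[A-Za-z']+", text.lower())
word_counts = Counter(words)
print(word_counts["harlequin"])          # how many times "Harlequin" occurs
print(word_counts.most_common(10))       # ten most frequent words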



https://video.twimg.com/ext_tw_video/1841996492728606720/pu/vid/avc1/1146x720/vXDRyFG9c1ahpvq-.mp4

2/9
@DynamicWebPaige
📽️And reminder: this also works for videos!

I just asked Gemini 1.5 Flash 8B to look at this 10-minute-long video clip, extract the text, and then use code execution to add all of the funny lines into a Python dictionary.

...all for a single, shiny penny. 🤯



https://video.twimg.com/ext_tw_video/1842005147955933184/pu/vid/avc1/1146x720/cvH6-SrtGueGZlVJ.mp4

3/9
@tosho
where do you run this?



4/9
@DynamicWebPaige
This is Google AI Studio, where you can access the newest Gemini and Gemma models:

Google AI Studio



5/9
@simonw
I did not realize the Gemini API had code execution support! Code execution | Gemini API | Google AI for Developers



6/9
@slow_developer
the speed 🚀



7/9
@Prashant_1722
Classic Word count problem



8/9
@drseuss1692888
But was it accurate?



9/9
@Turist
Took me 3 tries to get here, but it finally worked.
The other few times it went into a dumb loop, printing the same thing again and again.
Also - it won't work with text longer than 8k tokens, as the output is capped at 8k.
I started with a transcript 15k token long and failed.



https://video.twimg.com/ext_tw_video/1842048130294136832/pu/vid/avc1/728x720/mcLPbxYXuAzJpfjE.mp4


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,330
Reputation
8,496
Daps
159,976


1/4
@_philschmid
How does @AIatMeta's Llama 3.2 Vision compare to @OpenAI GPT-4o? Meta's first multimodal Llama shows promise in general image understanding but falls short in complex tasks compared to GPT-4o, but offers a cost-effective alternative for basic vision applications. 👀

Llama 3.2 Vision:
🔍 Shows general image understanding capabilities, competitive with GPT-4o for basic tasks
🏥 Limited accuracy in medical prescription analysis, but surprisingly good at some medical report interpretations
📊 Struggles with complex financial chart analysis, prone to hallucination
📝 Text extraction capabilities present but less reliable than GPT-4o
💰 Offers strong cost-performance for basic vision tasks



GY6DljEXMAACMlK.jpg


2/4
@_philschmid
Meta Llama 3.2: A Deep Dive into Vision Capabilities



3/4
@Prince_Canuma
Finetuning and distillation (i.e., NVLM-D as a teacher) can bridge the gap.

Meta did the hard work (pre-training)

Now it’s time for the community to shine! ✨



4/4
@dmitrig33k
I've worked with both Llama 3.2 Vision and GPT-4, and I agree that Meta's multimodal model shows promise, but struggles with complex tasks. For basic vision applications, it's a great cost-effective alternative, but for medical or financial analysis, I'd still rely on GPT-4.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,330
Reputation
8,496
Daps
159,976


1/11
@_philschmid
Attention! 🚨 @OpenAI's "meta" prompt for optimizing GPT prompts might have already leaked!

Just hours after they launched a new playground feature for automatically generating optimized system prompts—similar to Anthropic's prompt generator—people might already have extracted the meta prompt used. 👀

I tested it in the playground, and the results look very similar. You use the “meta” prompt as a system prompt and then provide your task/query as user input. 🔥

The prompt is attached to the thread. Credits to @amebagpt



GY3cBoNXYAAgsNO.jpg


2/11
@_philschmid




GY3cM_TW4AAwJa3.jpg


3/11
@ibrinzila
Nothing special, neither about the so-called meta leak nor the so-called generated prompt



4/11
@j6sp5r
The "might" is interesting. There isn't really a way to be sure, is there?



5/11
@ArbitorofOZ
@elder_plinius



6/11
@kuehne_jan
@memdotai mem it



7/11
@memdotai
Saved! Here's the compiled thread: Mem



8/11
@kihote
@memdotai mem it



9/11
@memdotai
Saved! Here's the compiled thread: Mem



10/11
@MitjaMartini
Interesting and good to know that good and elaborate prompt engineering seems to be still relevant.



11/11
@Steven_Strauss
@readwise save




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 