bnew

Veteran
Joined
Nov 1, 2015
Messages
51,640
Reputation
7,896
Daps
148,364

Reflection-4ALL​

Reflection CoT thinking for ALL LLMs






First, you need to download and install these:





Open open webui, workspace, function, + to create a function. which will help you to filter the output:​



Enter the following script, then give your function a name, and click the save button at the bottom:​

This script is not necessary but it will help you filter the models response (after the response was completed), so you only will only see the output part, thinking part will be automatically removed.

Code:
import re
from typing import Callable, Awaitable, Any, Optional, Literal, List
from pydantic import BaseModel, Field


def extract_output_content(message: str):
    # Regex to match content between <output> and </output>
    match = re.search(r"<output>(.*?)</output>", message, re.DOTALL)
    if match:
        # Return the content between the tags
        return match.group(1).strip()
    # Return original message if no match is found
    return message


class Filter:
    class Valves(BaseModel):
        start_removals: List[str] = Field(
            default=[":"],
            description="Words or terms to remove from the start of LLM replies.",
        )
        pass

    def __init__(self):
        self.valves = self.Valves()
        pass

    async def replace_message(self, message):
        await self.event_emitter({"type": "replace", "data": {"content": message}})

    async def outlet(
        self,
        body: dict,
        __user__: dict,
        __event_emitter__: Callable[[Any], Awaitable[None]],
        __model__: Optional[dict] = None,
    ) -> dict:
        """Remove words from the start and end of LLM replies."""
        self.event_emitter = __event_emitter__

        if len(body["messages"]) == 0:
            return body

        last_reply: dict = body["messages"][-1]
        last_reply = last_reply["content"].strip()

        # Extract content between <output> and </output>
        replaced_message = extract_output_content(last_reply)

        # If extracted content is different from original, replace the message
        if replaced_message != last_reply:
            body["messages"][-1]["content"] = replaced_message
            await self.replace_message(replaced_message)

        return body


Enable your function:​



Go back to workspace, and create a model:​



Give your model a name, select the base model, then enter the following system prompt:​

Source:


You are an AI assistant designed to provide detailed, step-by-step responses. Your outputs should follow this structure:

1. Begin with a <thinking> section.
2. Inside the thinking section:
a. Briefly analyze the question and outline your approach.
b. Present a clear plan of steps to solve the problem.
c. Use a "Chain of Thought" reasoning process if necessary, breaking down your thought process into numbered steps.
3. Include a <reflection> section for each idea where you:
a. Review your reasoning.
b. Check for potential errors or oversights.
c. Confirm or adjust your conclusion if necessary.
4. Be sure to close all reflection sections.
5. Close the thinking section with </thinking>.
6. Provide your final answer in an <output> section.

Always use these tags in your responses. Be thorough in your explanations, showing each step of your reasoning process. Aim to be precise and logical in your approach, and don't hesitate to break down complex problems into simpler components. Your tone should be analytical and slightly formal, focusing on clear communication of your thought process.

Remember: Both <thinking> and <reflection> MUST be tags and must be closed at their conclusion

Make sure all <tags> are on separate lines with no other text. Do not include other text on a line containing a tag.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,640
Reputation
7,896
Daps
148,364

1/1
New guide: How to build a Claude Artifacts clone with Llama 3.1 405B.

We'll build a Next.js app from scratch that can:

- Generate an app from a single prompt
- Stream the response using Together AI
- Run the code live in the browser

How to build a Claude Artifacts Clone with Llama 3.1 405B


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GWzaGm9WQAAsscp.png
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,640
Reputation
7,896
Daps
148,364






1/7
You can now run Reflection-Llama-3.1-70B in LM Studio! 🧠🧪

1. Search for reflection
2. Use the model's System Prompt from the Hugging Face page (👈 important)
3. Switch the chat appearance to Plaintext to see the model's special &lt;thinking&gt; tags

2/7
Direct download link for LM Studio 0.3.0 and newer
Download and run lmstudio-community/Reflection-Llama-3.1-70B-GGUF in LM Studio

3/7
Switch to plaintext like so

4/7
The system prompt can be found here lmstudio-community/Reflection-Llama-3.1-70B-GGUF · Hugging Face

5/7
Should be. But @bartowski1182 will give the final confirmation

6/7
This version should be the latest

7/7
😃


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GWzY3gvXIAAOTmh.png

GWzVIs2WQAESOZ0.png
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,640
Reputation
7,896
Daps
148,364

1/2
Another AI-Video-generator called HotShot. Quality looks amazing, 2024 is definetly the year of GenAI Video generators.

2/2
Pretty good!


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196





1/4
محاولة لإنشاء فيديو بواسطة @HotshotSupport
Trying out for creating a video by : @HotshotSupport

-
@mohamedjabarat @8bit_e @martin_casado @AITalesNBH @HBCoop_ @toolstelegraph @andyorsow

2/4
Thank you so much

3/4
تسلمين الله يسعدك 🤍

4/4
شكرا على التعليق والدعم أ.محمد
واياك يارب
المقطع طال شوي لكن نتعلم


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,640
Reputation
7,896
Daps
148,364

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,640
Reputation
7,896
Daps
148,364



1/4
🚀 Exciting news! We’ve officially launched DeepSeek-V2.5 – a powerful combination of DeepSeek-V2-0628 and DeepSeek-Coder-V2-0724! Now, with enhanced writing, instruction-following, and human preference alignment, it’s available on Web and API. Enjoy seamless Function Calling, FIM, and Json Output all-in-one!

Note: Due to significant updates in this version, if performance drops in certain cases, we recommend adjusting the system prompt and temperature settings for the best results!

2/4
DeepSeek-V2.5 outperforms both DeepSeek-V2-0628 and DeepSeek-Coder-V2-0724 on most benchmarks.

3/4
In our internal Chinese evaluations, DeepSeek-V2.5 shows a significant improvement in win rates against GPT-4o mini and ChatGPT-4o-latest (judged by GPT-4o) compared to DeepSeek-V2-0628.

4/4
DeepSeek-V2.5 is now open-source on HuggingFace!
Check it out: deepseek-ai/DeepSeek-V2.5 · Hugging Face


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GWyqNGHbMAA9nQG.png

GWyqSmMXMAAWRKP.jpg

GWyqUFjWQAA_ENN.jpg

GWyqY5dXYAAfPCq.jpg



1/2
DeepSeek-v2.5 WeChat Blog is out: DeepSeek-V2.5:融合通用与代码能力的全新开源模型

The coding capability is partially degraded, as shown in the table.

The blog notes: Due to the large changes in the model version, such as the effect of some scenes, it is recommended to adjust the System Prompt and Temperature to get the best performance. (translated by WeChat)

@teortaxesTex

2/2
There is no free lunch tbh, and data mixture is a hard problem.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GWyoPPXXMAArkzN.jpg

GWyoRAlWQAATuyJ.jpg

GWyoSUPXgAAVM4F.png

GWvBefuasAAEYKR.png
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,640
Reputation
7,896
Daps
148,364

1/32
I'm excited to announce Reflection 70B, the world’s top open-source model.

Trained using Reflection-Tuning, a technique developed to enable LLMs to fix their own mistakes.

405B coming next week - we expect it to be the best model in the world.

Built w/ @GlaiveAI.

Read on ⬇️:

2/32
First LLM EVER to solve this tricky problem

3/32
What's more interesting than the answer itself is the reasoning steps it took to get the right answer.
The LLM "automagically" took steps to arrive at the right answer.

NOTE: This is a one shot interaction.

4/32
Both the Top Notch LLMs Claude and GPT-4o got this wrong.

Claude thinks the answer is charles
GPT-4o thinks the answer is butler
Both are wrong.

5/32
Claude 3.5 Sonnet came to the same conclusion...

6/32
NOT for me

7/32
Matt and Sahil, you two are heroes. You gave us what OpenAI and other billion dollar companies failed to do in a year. Yes, standing on the shoulders of giants, like @AIatMeta, but still. How amazing!

8/32
Pretty good. I was surprised by my Hacker Gnome GPT failed it but its thinking was solid. Reflection+Context Building could be killer.

9/32
Except there appear to be a reasoning error in 2. Agatha hates Charles and Charles doesn't hate anyone Agatha hates means Charles doesn't hate himself. The AI concludes that it means Charles doesn't hate Agatha.

10/32
Corp was able to solve this with 4 step process.

11/32


12/32
Sonnet 3.5 specifically ruled out suicide in my case as the question stated "someone" killed Agatha.

13/32
Not sure which LLM you use but my GPT resolved it without any difficulties.

14/32
Traffic down, can't try playground

15/32
Damn thanks this gave me the idea to test LLMs using a variation of the Jindosh Riddle

16/32
I have not read this problem before. I interpreted it to mean "Agatha, the butler" and "Charles" were the only two people in the mansion.
I worked out that Agatha the butler killed herself very quickly.

17/32
It's a little strangely worded, so maybe I have misunderstood the problem, but that seems wrong:
If Agatha hates herself, then the butler hates her too, to satisfy the condition (can't hate everyone) it means that he's richer than she is.

18/32
But it's wrong.

First it disqualifies the butler and Charles because "No one hates everyone", but then it forgets about this rule for Agatha, when mentioning she doesn't hate the butler.
I assume the question was meant to say "always hates his victim", but the

19/32
"a killer ALWAYS hates" but Agatha doesn't hate the butler. what does the "always" mean then?

20/32
Happy with the reasoning though? If Agatha hates herself, the butler hates everyone, unless "nobody hates everyone" means everyone including themselves (reasonable), in which case the butler is ruled out because they must be richer than Agatha.

21/32
Deepseek v2.5 :

22/32
The reasoning for the butler is off though: if he hated Charles and Agatha, he would not be hating everyone since he doesn’t hate himself.

The reason it cannot be the butler is because he is richer than Agatha, since he doesn’t hate himself yet hates all that are not richer.

23/32
Llama 3.1 405B BASE BF16 solves it as well.

24/32
I agree that Reflection 70b is impressive, but its logic is incorrect here. For example, we know the butler hates Agatha, the reason he's not the killer is that he's richer than Agatha. This tweet makes me a little more disappointed in Reflection 70b than I was before.

25/32
Isn’t this broken by if Agatha hates herself that the butler must hate her too?

26/32
...

27/32
Opus, with ease...

28/32
Even the GPT-4o...

29/32
Sonnet

30/32
Mistral Large 2

31/32
wow

32/32
Gemini 1.5 Pro exp 0827 also solves this:


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GWuM-L0WsAAp-VN.jpg

GWvVkr1XEAAb_QD.jpg

GWvYT-qWUAAm-MX.jpg

GWvYkK8XEAAcCfh.jpg

GWvYmxlWYAAa50J.jpg

GWvYC-BWkAAIvfF.jpg

GWvYcsXWAAEdaXp.jpg

GWwQ9fhXIAA5DoV.png

GWwRzMzWMAAmxMN.png

GWyuAPIXsAAwqEn.jpg

GWzsw0nXgAA976e.png

GWvg9flWkAASNo4.jpg

GWviuLfXwAAMBCR.jpg

GWvex83XgAALfZp.jpg

GWvds8lWAAAI61E.jpg

GWvbv6cX0AATjO6.jpg

GWvc6MmXkAAYXbF.jpg

GWxl0aJXIAAK62b.png

GWv1YFRWcAA_1Rz.jpg
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,640
Reputation
7,896
Daps
148,364

1/1
AlphaProteo: new proteins for us

Edit: It sounds like AlphaProteo has been trained on synthetic data from AlphaFold! Very exciting!
"AlphaProteo was trained on vast amounts of protein data of predicted structures from /search?q=#AlphaFold"

and millions of predicted structures from /search?q=#AlphaFold.
I have read up a little on the significance of this AI system. And as I thought, it is particularly relevant in drug research (e.g. drugs against chronic inflammation in the body, which is particularly relevant as there is still no cure for diseases such as multiple sclerosis) but also in the field of agriculture (I am thinking, for example, of fertilizers that are less harmful to the environment; in conjunction with CRISPR-CAS9 and gene-modified plants, an excellent combination to produce high-quality food, especially in dry nations - and yes, I expressly advocate gene-modified plants!)
The modeled proteins can be created in such a way that they can dock onto specific sites and thus develop completely new therapies or products. So it's another huge breakthrough on the way to new medicines, more environmentally friendly agriculture and an understanding of the human body.

ChatGPT has written a good summary about this, which I don't want to withhold from you:

The research that Google DeepMind is presenting with “AlphaProteo” represents a significant step forward in the field of protein design. AlphaProteo is based on AI models that are specially trained to design novel proteins that are of great importance for applications in biology and medicine. In particular, the research focuses on the development of proteins that can bind precisely to specific molecules or proteins. This could, for example, be of central importance for the development of new drugs against diseases such as cancer or autoimmune diseases.

How AlphaProteo works
AlphaProteo uses machine learning to design proteins that do not occur in nature. This ability is particularly useful because natural proteins often have limitations. For example, when treating diseases with proteins, it can be difficult to achieve specific bindings or biochemical reactions. AlphaProteo improves on existing technologies by developing a new approach based on big data that delivers much more precise results.

A key aspect of the research is AlphaProteo's ability to modify binding sites. This means that the AI system is able to specifically design protein structures that precisely match molecules such as TNFα (tumor necrosis factor alpha), which plays a role in inflammation. This ability opens up unimagined possibilities for research, for example in the treatment of chronic inflammation or in cancer research.

Relevance for science
AlphaProteo has the potential to become a decisive factor in the development of new therapeutics. By expanding the possibilities of customizing proteins, it could significantly speed up the drug development process and make it more efficient. In addition, new forms of treatment could be found for previously incurable or difficult-to-treat diseases.

It is particularly noteworthy that AlphaProteo could also be used in areas outside medicine. For example, it opens up new avenues for biotechnological solutions in agriculture, environmental protection or even in industrial production, where proteins play a role.

Limits of current research
Despite the enormous progress, the technology also has its limitations. One of the biggest challenges is that the behavior of proteins in complex biological systems is often difficult to predict. Even if AlphaProteo is able to create precise bonds, the effects in a living organism are not always clear. Further research is needed to ensure that these proteins achieve the desired effect not only in isolated tests, but also in real biological environments.

Ethical questions also play a role in the discussion about artificially produced proteins. Interventions in nature and the creation of “artificial life” entail risks that are not yet fully understood. It is therefore necessary to conduct a broad scientific and social debate on the responsible use of such technologies.
AlphaProteo generates novel proteins for biology and health research

I am so excited and curious about the future. I increasingly have the feeling that we are heading towards a future in which all ailments can be better treated. Humans probably can't exist without suffering, at least that would be a philosophical question that should be discussed elsewhere.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

Micky Mikey

Veteran
Supporter
Joined
Sep 27, 2013
Messages
15,293
Reputation
2,733
Daps
84,321

This is why its important not get caught up in the hype when it comes to AI. There's too many grifters and scammers in this space just like crypto. That's why I have taken a step back when it comes to my expectations from new developments in AI. Its been close to two years and ChatGPT4 is still the king. And even that model is unreliable for most real world tasks.
 
Last edited:

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,640
Reputation
7,896
Daps
148,364









1/26
Reflection Llama 3.1 70B independent eval results: We have been unable to replicate the eval results claimed in our independent testing and are seeing worse performance than Meta’s Llama 3.1 70B, not better.

These evaluations were conducted using our standard methodology, including using our standard system prompt and accessing the model via DeepInfra’s API, which claims bf16 precision. Our evaluation methodology uses a 0-shot prompt with a think step by step instruction.

This is not to say there is no merit in Reflective's prompting approach for achieving higher evaluation results as claimed. We are aware that the Glaive team has been updating the model, and we would be more than happy to test further releases.

We also ran tests comparing our standard system prompt to Glaive’s provided system prompt and we did not observe any differences in the evaluation results on Reflection Llama 3.1 70B, Llama 3.1 70B, GPT-4o or Claude 3.5 Sonnet.

This does not mean the claimed results were not achieved, but we look forward to hearing more about the evaluation approach that led to these results, particularly regarding the exact prompt used and how the evaluation answers were extracted.

2/26
According to the Glaive team, the model was incorrectly uploaded to Hugging Face. We plan to re-run our evaluations after the model is re-uploaded correctly.

We think it would also be helpful if the Glaive team could share exactly how they prompted and extracted the answers in achieving the eval results claimed. This will allow us to to attempt to re-produce the results and also test other models (Meta's Llama 3.1 70B &amp; 405B, GPT-4o, etc) using the exact same approach for 🍏 to 🍏 comparisons.



3/26
Was the format of the output correct? Eg the reflection tags

4/26
Yes, example MMLU response:

5/26


6/26
There was an issue with the uploaded weights. You migh t want to wait for the new release to test it

7/26
Thanks! Noted -

8/26
Do you plan on running the eval again after @mattshumer_ resolved the issues with the repo?

9/26
Yes!

10/26
did you try it before or after they fixed an issue?

I noticed about 15hr ago the deepinfra endpoint started working better, the endpoint started to produce the xml tokens.

11/26
Ran after they fixed the issue. Also ran on Hyperbolic and saw near-identical results

12/26
Thanks for posting about this. With the technique proposed, it's very important to hear more on the how the evaluation was done.

13/26
Thanks for this, specially for writing it in a professional/ respectful way instead of going for the nasty, hurtful language and all the clicks it generates.🙏❤️

14/26
Thank you for your analysis!

15/26
@mattshumer_ said that it is the wrong weights, and he is working on getting out the right ones.

16/26
I told you guys, there's no way it was possible, they didn't publish their testing methods, just the weights. Then they supposedly had issues with getting the thing uploaded right, and now they're retraining it?

I doubt they ever had anything that level to begin with.

I could be wrong, and I hope I am as it would be nice, but I'm very skeptical.

17/26
You are really nice.

After the first fumble everyone should have given him some time for the dust to settle.

18/26
They have uploaded the model wrong

19/26
the model weights were broken, here the correct ones

20/26
@mattshumer_ said there was unintentional merging of weights, he's reuploading the correct weights.

21/26
What about the Reflection 70B non-Llama performance?

22/26
The reasoning cannot be correct. Multiple reasoning tests like the prompts from open source Livebench (dot) ai show increased performance over llama 70B.

23/26
Models like the GPU they where trained on

24/26
Thanks for doing this, If it was so easy to just finetune and achieve groundbreaking results, everyone would have done it by now. While yes this may improve few things, it probably is costing more on others. In a non-scientific way, I was able to achive all this with a proper system prompt to 70b.

25/26


26/26
called it, I win. the thing shat out 5 paragraphs about how no one should disparage anyone based on sex because I asked it about masculine republics and feminine democracies as per effin aristotle.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

GW4upz1W4AA2iG6.jpg

GW5CRLTWIAAiFZ2.png

GW5b2x8WgAAVqZn.jpg




1/5
Our evaluation of Reflection Llama 3.1 70B's MMLU score resulted in the same score as Llama 3 70B and significantly lower than Meta's Llama 3.1 70B.

A LocalLLaMA post (link below) also compared the diff of Llama 3.1 &amp; Llama 3 weights to Reflection Llama 3.1 70B and concluded the Llama 3 weights were a lot closer to Reflection's than Llama 3.1.

For further investigation but this might indicate the foundation model is Llama 3 rather than Llama 3.1 70B. It would be helpful if the team behind Reflection could clarify this.

2/5
Related reddit thread comparing Reflection Llama 3.1 70B weights to Llama 3 70B and Llama 3.1 70B.



3/5
Important context regarding our evaluation approach which uses a standardized methodology between models:

4/5
model weights were broken, see here

5/5
what’s the ETA?


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

GW5b2x8WgAAVqZn.jpg

GW4upz1W4AA2iG6.jpg

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,640
Reputation
7,896
Daps
148,364






1/31
While we’re trying to fix the HF weights, we can share an API with a few researchers who want to run benchmarks and test to confirm our results.

Won’t have capacity for many, but if you want access, please let me know.

2/31
Hi Mat! I would love to test it again. I got good results during Friday's morning that I couldn't replicate with any other hosted model after that.

3/31
DMing you!

4/31
Hi @mattshumer_ ,
We would welcome access. Happy to re-run evaluations and provide an update on the below figures.

It would also be helpful if you could share the exact prompt and answer extraction approach used for the benchmarks results as stated on Hugging Face.



5/31
DMing you!

6/31
Folks are saying this is Llama 3, not 3.1, under the hood on reddit. Can you clarify?

7/31
It's 3.1, but for some reason the current HF weights are screwed up and the config shows 3... working on it, the issue is tricker than we expected

8/31
maybe this way these independent benchmarks can be reevaluated?

9/31
yep they're already working on it!

10/31
@paulgauthier Hey Paul, maybe you could run it again?

11/31
DM me Paul, happy to get you access

12/31
Love to try. Thanks in advance!

13/31
Which benchmarks do you want to run?

14/31
Ever the methodical explorer, opening doors prudently. Wise move.

15/31
I could probably run the benchmarks Monday

16/31
Sorry but farcical shipment. Struggling to comprehend why and how there could be such an issue with uploaded weights.

17/31
well that's nice of you

18/31
Dude we have your 64 H100s, ping me.

19/31
“share an api…”

just share the correct weights, we can certainly wait, no rush

20/31
Get one for @abacusai @bindureddy have some good benchmarks.

21/31
If model is only working behind private api, you might want to retract your claims now before it is too late. Private end point does not make it an open source model.

You can always re-announce it later with a functional model.

22/31
I had assumed that instead of flipping the coin once, we were trying to flip it a few more times, until we get bingo. I might reconsider. An interesting reminder, luckily, I got out of it early.

23/31
wanna run codeforces on it. thanks in advance!

24/31
Yes please

25/31
pleasee i want to try your modelll! 😢😢😢😢

26/31
Could I try it? It sounds like an amazing job you're doing

27/31
Your model of transparency and quick fixes around this issue are definitely the way to go, love to see it!

Are you hiring right now? I'd love to help out with your most pressing software challenges.

28/31
Beta tested many models including the pre chatgpt. Down. Don't have benchmarks but can definitely give good feedback.

29/31
I’m running the 4bit model via ollama and the results seem underwhelming. (Couldn’t count Rs in raspberry). Is this kind of expected?

30/31
Very curious. I have a few unique tests I want to try. I don’t really care about the normal benchmarks, I care about how many variables the model can keep track of at the same time.

31/31
could i have access? i want to test the model and compare against the one hosted on other providers to see what exactly the difference is and what got corrupted or mixed up


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

GW4upz1W4AA2iG6.jpg
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,640
Reputation
7,896
Daps
148,364

1/12
I've got access to Reflection 70B

Guys, Matt has given me access to the model through his API and the second has managed to give me the right answer (something I haven't gotten with other models).

In five hours I'll get a train so I'll do a couple of quick tests and more tomorrow.

2/12
Another positive example, the game of the marble.

Prompt: A small marble is placed in a normal cup and the cup is placed face down on a table. Then someone takes the cup and puts it in the microwave. Where's the marquee now?

Answer:

To solve this problem, we need to follow the sequence of events and reason on the position of the marble at every step. We're going to make a simple plan and use the Thought Chain (COT) reasoning to get to the answer. Plan: 1. Analyze the initial position of the marble 2. Consider what happens when the cup is flipped 3. Determine the final position of the marble when the cup is placed in the microwave rankings COT: 1. Initial position: The marble is placed within a normal cup. 2. When the cup is flipped: - The cup is placed face down on a table. - Due to gravity, the marble will fall to the bottom of the inverted cup. - The marble will be trapped between the surface of the table and the inside of the cup. 3. Putting in the microwave: - Someone takes the cup and puts it inside the microwave. - When lifting the cup, the marble is no longer contained by the surface of the table. - Gravity will make the marble fall. Conclusion: When the cup rises to put it in the microwave, the marble will fall and remain on the table. Reflection, reviewing the reasoning, it seems correct and logical. No major details have been overlooked. The conclusion is based on the sequence of events and the basic physical laws such as gravity. The answer seems to be correct and complete. Based on the analysis of the situation described, I can say that the marble is now on the table. Here is the detailed explanation: 1. Initially, the marble is placed inside a normal cup. 2. The cup then turns face down on a table. At this time, the marble is trapped between the surface of the table and the inside of the inverted cup due to gravity. 3. When someone takes the cup to put it in the microwave, the following happens: - When lifting the cup, the barrier that kept the marica in place (the surface of the table) is removed. - Gravity acts immediately on the marble. As a result, the marica falls on the table the moment the cup is raised. Therefore, when the cup is placed inside the microwave, the marica is no longer inside it. The marble has remained on the surface of the table where the cup was originally. This result is based on the application of basic physical principles, mainly gravity, and the logical sequence of events described in the situation. It is an interesting example of how small changes in the position of objects can have unexpected consequences. - Output.

3/12
In both the above example and this, this is the 1st response generated.

Prompt:
When I was born, my birth mother was in Paris and my father was in the United States. Where was I born?

Answer:

This is a relatively straightforward question that requires some logical reasoning. Let's break it down: 1. Understand the given information: - The person's biological mother was in Paris when they were born. The person's father was in the United States when they were born. 2. Consider the implications: - In most cases, a child is born where the mother is located at the time of birth. The father's location is not directly relevant to the birthplace. 3. Draw a conclusion: - Since the mother was in Paris at the time of birth, it is highly likely that the person was born in Paris. Let's review our reasoning: - We focused correctly on the mother's location as the key information. We didn't make assumptions about the parents' relationship or other factors not mentioned in the question. Our conclusion logically follows from the given information. The reasoning appears sound, so we can proceed with our answer. Based on the information provided, you were very likely to have been born in Paris, France. The reason for this conclusion is that, in the vast majority of cases, a baby is born in the place where the mother is at the time of delivery. Since your birth mother was in Paris when you were born, it's logical to assume that was your birthplace. It is important to note that your father's location in the United States does not directly affect the place of birth. In normal situations, it is the mother's location that determines where the baby is born. However, it is important to mention that although this is the most likely conclusion based on the information provided, there is always the possibility of exceptional circumstances that could have resulted in a different place of birth. If you have any questions about your exact birthplace, you would recommend checking your birth certificate or consulting with your family for more accurate information. - Output.

4/12
Correct to the first one. - Oh, it's a good thing.

Prompt:

Tell me how many Rs has the word FRESA, ORNITORRINCO and TORERO

Answer:

To solve this problem, we need a simple plan: 1. Identify the words to analyze: FRESA, ORNITORRINCO and TORERO 2. Count the letters "R" in each word 3. Present the results We will use the Thought Chain (COT) reasoning to resolve this: 1. FRESA: - Analyzing letter by letter: F-R-E-S-A - R count: 1 2. ORNITORRINCO: - Analyzing letter by letter: O-R-N-I-T-O-R-R-I-N-C-O - R-Conte: 3 3. TORERO: - Analyzing letter by letter: T-O-R-E-R-O - R count: 2 .reflection. Checking the count: - FRESA: Right, it has 1 R - ORNITORRINCO: Right, it has 3 R - TORERO: Correct, it has 2 R. After analyzing the words provided, I can give you the count of the letter "R" in each of them: 1. FRESA: It has 1 letter R 2. ORNITORRINCO: It has 3 letters R 3. TORERO: It has 2 letters R In total, in all three words there are 6 letters R. It is interesting to note that ORNITORRINCO, being the longest word, has the largest number of R, while FRESA, the shortest word, has only one R. TORERO, for its part, has an R in every syllable "RO." - Output.

5/12
How do you see? This is the performance I could experience on the first day.

Give me some more prompt to try, I try it, and I'm going to bed.

6/12
To the first

Prompt:
I want you to write the first phrase of the Quixote in mirror writing

Answer:
To solve this problem, we need to follow these steps: 1. Remember the first phrase of Quixote. 2. Understand what mirror writing is. 3. Apply the mirror writing to the sentence. We're going to develop the plan: 1. The first phrase of the Quixote is: "In a place of the Spot, of whose name I do not want to remember..." This phrase is widely known and in the public domain, so there are no copyright issues when using it. 2. Mirror writing is a way of writing in which letters and words are reversed as if they were reflected in a mirror. This means that we must write from right to left and every letter must be reversed. 3. To apply the mirror writing to the phrase: a) We will reverse the order of the words. b) We will invest each letter individually. Result: ..."emradroca oreiuq on erbmon oyuc ed ,ahcnaM to ed ragul nu nE" Reflection I verified the result and it seems correct. Each word is in the reverse order and each letter is correctly reversed. The first phrase of the Quixote in mirror writing would be: ...emradroca oreiuq on erbmonmon oyuc ed ,ahcnaM to ed ragul nu nE" This phrase is written from right to left, with each word and letter inverted as if reflected in a mirror. If we put it in front of a mirror, it would read normally. It's interesting to note how this writing technique can make even the most familiar phrases look strange and difficult to read at first glance. Mirror writing has been used throughout history by several artists and writers, with Leonardo da Vinci being one of the most famous for using this technique in his notebooks. - Output.

7/12
This one also at the first and I think that Antor had only managed to work with Sonnet 3.5

Prompt:
Explains why the joke is funny: It was all laughter, until we realised the stutter wanted ham

Answer:
To understand why this joke is funny, we need to analyze its elements and the implicit context. Let's break down the joke and examine why it might be considered humorous: 1. Identify the key elements of the joke: - "Everything was laughter" - "stupid" - "he wanted ham" 2. Analyze the implicit context: - A stutter is someone who has difficulty speaking fluently, often repeating syllables. - "Ham" is a word that starts with "ha," which sounds like a laugh. 3. Understanding the structure of the joke: - It begins with a seemingly joyful situation ("Everything was laughter") - then introduces an unexpected twist ("until we realized") - Ends with the revelation of confusion ("the stutter wanted ham") 4. Examining the humorous element: - The joke plays with the ambiguity between laughter and difficulty of the stuttering to say "jamon." - The repeated word "ja" could be confused with laughter, when in reality it's someone trying to say "sham". 5. Consider the surprise factor: - The joke creates an initial expectation of joy and then subverts it with the final revelation. The analysis seems complete and encompasses the main elements that make this joke look funny. However, it is important to mention that mood based on medical conditions or disabilities can be considered offensive by some people. This comment should be included in the final explanation. The joke "The whole laughter, until we realized that the stutter wanted ham" is considered funny for several reasons: 1. Words and ambiguity: The joke is based on the sound similarity between laughter ("haha") and the beginning of the word "hamon" (ha-mon). This ambiguity creates a comic confusion between what is initially perceived as laughter and what is really happening. 2. Expectations subversion: The joke begins by setting a cheerful scene ("They were laughter"), but then takes an unexpected turn. This technique of creating an expectation and then subverting it is common in many jokes and often results in humor. 3. Mistended comic: Grace arises from the misunderstanding among those present, who mistakenly interpret the stuttering attempts to say "jmon" as laughter. This kind of confusion is a common source of humor in many jokes and comic situations. 4. Timing and structure: The joke is structured in a way that reveals the crucial information in the end, maximizing the surprise effect and therefore the humorous impact. 5. Situational irony: There is an irony to the fact that what seemed to be a moment of shared joy (laughs) turns out to be a situation of frustration for the stutter. It is important to note that, although this joke can be considered witty by its play of words and structure, it is also based on the condition of a person with speech difficulties. The mood that focuses on medical conditions or disabilities can be considered offensive or in poor taste by some people. Appreciation of humor is subjective and what one person finds funny, another can find it insensitive or inappropriate. - Output.

8/12
My conclusions are clear, this model works very well. It's not infallible, but it's the same or better depending on how on the case of the big models, and remember, this is opensource and 70B.

Too bad it's been tarnished by the technical failures, but I think that .mattshumer has done a very interesting job and a thank-you-in.

Let's wait and see the independent benchmarks for further confirmation:smile:

PD: The last 24 hours I've froze a part of the community with the great toxicity. Slim favor do the AI world some...
GW6Yc6GXEAAD5iQ.jpg

GW6YtGgXcAAuOfm.png

GW6ZzoIWAAAdYhy.png

GW6anJvWMAAxPFt.png

GW6chWEXoAAdGBn.png

GW6g4cbWUAAHLhJ.png

GW6klfwWAAAnA7f.png

GW6lB9vXEAAhRGX.jpg
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,640
Reputation
7,896
Daps
148,364


9/12
One last example. Since the Claude Sonnet 3.5 marble has done well, I have prepared a more complicated version.

Prompt:
A small marble is placed inside a normal cup and the cup is turned upside down on a table. Then someone takes the cup and takes it to put it inside the microwave. He starts the microwave and when it's over, he comes to open the microwave but at once he stumbles on the table and the grave. When he gets to the microwave he pulls out the cup and holds it in his hand. Where's the marble now?

Answer:
To solve this problem, we need to follow the path of the marble through a series of events. We're going to break the problem step by step: 1. Analyse the initial situation 2. Follow the movements of the cup and the marble 3. Consider the effects of each action on the position of the marble 4. Determine the final location of the marble Let's develop the plan: 1. Initial stopping: - A small marble is placed inside a normal cup - The cup is turned upside down on a table 2. Movements: - Someone takes the cup and takes it to the microwave - The cup is placed inside the microwave - The microwave gets underway - The person stumbles on the table and the grave - The person pulls the microwave the cup out of the microwave and holds it in the hand 3. Effects of the actions: - When turning the cup upside down, the marble is trapped between the table and the mug.Reflection. Correction: The marble is not trapped between the table and the cup, but falls on the table when the cup turns upside down. When the cup is raised, the marble is left on the table - The marble stays on the table while the cup is in the microwave - When the table is torn, the marble falls to the floor 4. Final location: - The marble is on the ground, not in the cup that the person holds in the hand. Checking the solution: - We have correctly followed the path of the marble - We consider all the relevant events - The conclusion is logical and consistent with the facts presented The solution seems to be correct. Based on the sequence of events described, the marble is not in the cup that the person holds in the hand. The current location of the marble is the ground. Let's explain why: 1. Initially, the marble was placed inside the cup, but when the cup turned upside down on the table, the marble fell on the surface of the table. 2. When someone took the cup to take it to the microwave, the marble stayed on the table. 3. While the cup was in the microwave, the marble remained on the table. 4. The crucial event occurred when the person stumbled across the table and knocked it down. At this point, the marble, which was on the table, fell to the ground. 5. Finally, when the person pulled the cup out of the microwave and held it in his hand, the marble was already on the ground. Therefore, the answer to the question "Where is the marble now?" is: The marble is on the ground, not in the cup that the person holds in the hand. It is important to note that the marble separated from the cup at the time the cup turned face down initially, and subsequent events (taking the cup to the microwave, heating it, etc.) did not affect the position of the marble, except when the table was knocked down, which caused the marble to fall to the ground. - Output.

10/12
With this same prompt, Claude is lost.

Now yes, good night :smile:

11/12
Read well and assess the previous answers in case I missed something. In the end it's a lot of text.

12/12
Nope. Today was the first time I got a message from him. after asking via twitter.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GW6Yc6GXEAAD5iQ.jpg

GW6YtGgXcAAuOfm.png

GW6ZzoIWAAAdYhy.png

GW6anJvWMAAxPFt.png

GW6chWEXoAAdGBn.png

GW6g4cbWUAAHLhJ.png

GW6klfwWAAAnA7f.png

GW6lB9vXEAAhRGX.jpg
 
Last edited:

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,640
Reputation
7,896
Daps
148,364

Update on Reflection Llama 3.1 70B:
@mattshumer_ and his team dropped the "new, working version of the Reflection Llama 3.1 70B model" on Huggingface, so we're now serving the new weights at FP16 on @hyperbolic_labs (same API).

From my vibe checks, it's better than before. Now we're working with @ArtificialAnlys on the new benchmarks.

You can use our playground and API to test it: x.com.

I think it's better, according to my tricky llm questions set

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GW-lOY3acAA0fdU.jpg

GWzepnUWAAEWMNQ.jpg












1/22
We @hyperbolic_labs now serve Reflection 70B by @mattshumer_ in FP16! 🤖🔥

&gt; Use our API or playground to play w/ it
&gt; It’s free for 1 week – perfect for stress testing our infra!
&gt; Integrate with @OpenRouterAI soon
&gt; Running on 4xH100s (and ready to scale with demand, because we know it’ll be high!)

The default system prompt is auto applied in our playground and the API tab if you want to copy paste from there.

Look at the &lt;thinking&gt; &lt;reflection&gt; tags in the output, time to check if it’s the top LLM across open-source + closed-source! If so, who is gonna pay OpenAI $2k/month?

2/22
🐐 !

3/22
🐮!

4/22
Do you take requests?

🐋-kun released some new models, everyone has tried to tame it but failed. (I don't know why)

5/22
what is kun?

6/22
🙏

7/22
🤝🤝

8/22
Works with drop in OpenAI SDK right?

9/22
yep, just replace the url and our API key!

10/22
Hyperbolic AI Dashboard
is stuck. Is it right website?

11/22
You need to verify your email to use it

12/22
Yuchen, amazing!

13/22
tyty

14/22
you should add LlamaGuard 3! and Llama 3 70b and 7b too :smile:

15/22
interesting!

what's fancy about LlamaGuard 3? We do support Llama 3 70b: Hyperbolic AI Dashboard

16/22
still seems to have issues performs far lower than on matts website not sure whats going on

17/22
do you have comparisons with the same prompts?

18/22
Hermes 70B gets “strawberry” and “entrepreneur” right first try.

Reflection fails to do so with the same system prompt.

I wonder how much reflection tuning really helps. It also “corrects its mistake” but still fails to realize that’s it’s wrong.

19/22
Interesting to compare with the same system prompt across models

20/22
Increasing context length seems to break it. Problem with 512 is that it can run out of tokens in the output tags and when you ask it to continue it doesn't continue from the output tags @mattshume

21/22
Hi, the max context length of the model is 8192: config.json · mattshumer/Reflection-Llama-3.1-70B at main, you can increase the max tokens on the playground or API

22/22
While reflecting on, the reflection made me reflect on my next reflection goals. Nice Reflection bhai 👨


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GWzepnUWAAEWMNQ.jpg

GWuM-L0WsAAp-VN.jpg

GWzwPl-WUAATGP2.jpg

GWzwPl4XQAMiH-z.jpg

GW0C7eNa8AIwkE-.jpg



1/2
We have now partially replicated Reflection Llama 3.1 70B’s evaluation claims, but we would caution that these results are not entirely apples-to-apples comparable with other models

Since our first testing of the public release version of Reflection Llama 3.1 70B, @mattshumer_ has shared access to a privately hosted version of the model that does not suffer from the issues with the version publicly released on Hugging Face. Per the charts below, the evaluation scores we’re seeing on this version are impressive - they show performance above the level of Llama 3.1 405B for GPQA and MATH. MMLU is in-line with Meta’s release of Llama 3.1 70B, indicating performance improvements may not be consistent in all contexts.

The chart below is based on our standard methodology and system prompt. When using Reflection’s default system prompt and extracting answers only from within Reflection’s &lt;output&gt; tags, results show substantial improvement: MMLU: 87% (in-line with Llama 405B), GPQA: 54%, Math: 73%.

The model seems to be achieving these results through forcing an output ‘reflection’ response where the model always generates scaffolding of &lt;thinking&gt;, &lt;reflection&gt;, and &lt;output&gt;. In doing this it generates more tokens than other models do on our eval suite with our standard ‘think step by step’ prompting. For GPQA, Reflection 70B generates consistently more output tokens that other models (see below for detailed comparison).

While the benchmark results are impressive, they should not be considered apples-to-apples with traditional instruct-tuned models. The results may be less applicable to generalized non-benchmark measured intelligence for the following reasons:
‣ Reflection scaffolding is an example of test-time compute scaling - using more inference compute to get to an answer. It introduces a new kind of compute and latency trade-off to be considered alongside model size and non-reflection intelligence. Compared to a model of the same size, Reflection 70B appears to use more compute and take longer to get to get to an answer.
‣ This approach to achieving reflection via fine-tuning restricts the flexibility of the model and may make it unsuitable for many use-cases. Compared to achieving chain-of-thought techniques via prompting, fine-tuning in the reflection approach means the form of reasoning cannot be changed. For example, it appears that Reflection 70B is not capable of ‘just responding with the answer’ in response to an instruction to classify something and only respond with a one word category. It may also be limited in the types of reasoning approaches it can pursue (non-reflection oriented).

Ultimately, Reflection 70B appears to demonstrate the potential of fine-tuning with standardized response scaffolding alongside test-time compute scaling. Given the impressive results, further research should be conducted on the advantages and drawbacks of this approach, including the degree to which it generalizes beyond evaluations to real-world use-cases.

All that being said: if applying reflection fine-tuning drives a similar jump in eval performance on Llama 3.1 405B, we expect Reflection 405B to achieve near SOTA results across the board.

Notes on the results:
‣ These were tested on a private API version and not an open-weights version.
‣ We cannot yet independently confirm that these results are not the result of benchmark contamination.
‣ Tests for Reflection were run with 6000 max output tokens (as opposed to our standard 2048 max output tokens). We have not yet studied the effect of a lower max output token setting on Reflection.

2/2
Reflection 70B is not the only model that has trended toward using more tokens at inference time. We see a wide spread in how verbose models are as they walk through chain of thought reasoning in our evaluations.

We compared the average number of characters in each response to the questions in the GPQA dataset with our standard prompting. Reflection 70B generates more characters for each GPQA response than any other models we have tested but the total volume is less than 2x an average across other recent models.

Given that total inference compute is proportional the product of total tokens and parameter count, this means that Reflection 70B uses substantially less total compute to achieve its GPQA score than Llama 3.1 405B.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GW9rPLqbsAA3uwv.jpg

GW9r48VaoAAsytO.jpg
 
Top