bnew

Veteran · Joined: Nov 1, 2015 · Messages: 54,107 · Reputation: 8,072 · Daps: 153,693








1/11
@ZyphraAI
Today, in collaboration with @NvidiaAI, we bring you Zamba2-7B – a hybrid-SSM model that outperforms Mistral, Gemma, Llama3 & other leading models in both quality and speed.

Zamba2-7B is the leading model in the ≤8B weight class.

👇See more in the thread below👇



GZ4JuSqb0AI9Hwx.jpg


2/11
@ZyphraAI
Beyond just MMLU, we perform strongly across all standard benchmarks, beating competing models in the ≤8B bracket. Zamba2 is released with open-weights under a permissive Apache 2.0 License. We’re excited to see what the open source community will build with this model.



GZ4Jy2ob0AAB8m4.jpg


3/11
@ZyphraAI
Read more (blog): Zyphra
Download the weights: Zyphra/Zamba2-7B · Hugging Face
Chat with the model: Zamba
NVIDIA’s NIM: zamba2-7b-instruct | NVIDIA NIM
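For anyone who wants to try the open weights directly, here is a minimal loading sketch, assuming a recent transformers build (Zamba2 support may require trust_remote_code or the Mamba kernels depending on your version) and a GPU with enough memory for 7B parameters in bf16:

```python
# Minimal sketch: load the open Zamba2-7B weights from Hugging Face.
# Assumes a recent `transformers` release with Zamba2 support (or
# trust_remote_code picking up the repo's custom modeling files).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Zyphra/Zamba2-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # needed if Zamba2 isn't built into your transformers version
)

prompt = "Explain the difference between SSM blocks and attention blocks in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```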



4/11
@ZyphraAI
For inference, Zamba2-7B is both significantly faster and more memory efficient than competing transformer models. Zamba2-7B is ideal for on-device and consumer GPUs.



GZ4PiJ4b0AUb6-6.jpg

GZ4PiJ5b0AQuruj.jpg

GZ4PiJ7b0AQMnEV.jpg


5/11
@ZyphraAI
Zamba2-7B leads in accuracy/efficiency per token because:
1. Sharing transformer blocks frees more params for Mamba2.
2. Some attention makes up for SSMs struggling with ICL and long-range deps.
3. A 5T-token dataset (Zyda2, releasing tomorrow).
4. Annealing over 100B high-quality tokens.



GZ4RydUboAAMyn3.png


6/11
@ZyphraAI
We also release an instruct-tuned version of Zamba2-7B. This model strongly outperforms the corresponding instruct models of Mistral and Gemma, and is head-to-head with the Llama-3.1-8B Instruct model.



GZ4UHfqbsAAELwH.jpg


7/11
@ZyphraAI
Architectural improvements over Zamba-7B:

- Mamba1 → Mamba2 blocks

- Two shared attention blocks interleaved in an ABAB pattern throughout the network.

- A LoRA projector added to each shared MLP, letting the model specialize across depth

- Rotary position embeddings



GZ4VQyCbwAAuHwJ.jpg


8/11
@ZyphraAI
@ZyphraAI is committed to democratizing advanced AI systems, exploring novel architectures, and advancing the scientific study and understanding of powerful models.

We look forward to collaborating with others who share our vision!



9/11
@SamuelMLSmith
Impressive results, but why are you comparing to Gemma-1 and not Gemma-2 (MMLU 71%)?

It would also be interesting to see an inference speed comparison with RecurrentGemma!



10/11
@ArdaTugsat
@ollama Are we getting this?



11/11
@FullyOnChain
@dominic_w @JanCamenisch 👀




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew














1/13
@MFarajtabar
1/ Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source models like Llama, Phi, Gemma, and Mistral and leading closed models, including the recent OpenAI GPT-4o and o1-series.
https://arxiv.org/pdf/2410.05229

Work done with @i_mirzadeh, @KeivanAlizadeh2, Hooman Shahrokhi, Samy Bengio, @OncelTuzel.

#LLM #Reasoning #Mathematics #AGI #Research #Apple



GZjJoMBaUAAe0Lw.png


2/13
@MFarajtabar
2/ When OpenAI released GSM8K ~3 years ago, GPT-3 (175B) scored 35% on the GSM8K test. Today, models with ~3B parameters are surpassing 85%, and larger ones are hitting >95%. But has model 'reasoning' really improved? How much of this is genuine #logical/#symbolic reasoning vs. #pattern_recognition, inadvertent data #contamination, or #overfitting?



GZjKd0sa4AAFSGj.jpg


3/13
@MFarajtabar
3/ Introducing GSM-Symbolic—our new tool to test the limits of LLMs in mathematical reasoning. We create symbolic templates from the #GSM8K test set, enabling the generation of numerous instances and the design of controllable experiments. We generate 50 unique GSM-Symbolic sets, essentially like GSM8K examples but with different values and names. How do models handle these distinct sets?
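The templating idea is easy to sketch. A toy illustration (the template, names, and value ranges below are invented, not the paper's actual GSM-Symbolic templates): one GSM8K-style problem turned into a symbolic template, then instantiated 50 times with different names and numbers while the ground-truth answer is tracked alongside each surface form.

```python
# Toy illustration of the GSM-Symbolic idea: one symbolic template,
# many concrete instances with different names and values.
# The template, names, and ranges here are invented for illustration only.
import random

TEMPLATE = (
    "{name} has {x} apples. {name} buys {y} more bags with {z} apples each. "
    "How many apples does {name} have now?"
)

def instantiate(seed: int) -> tuple[str, int]:
    rng = random.Random(seed)
    name = rng.choice(["Ava", "Liam", "Sofia", "Noah"])
    x, y, z = rng.randint(2, 20), rng.randint(2, 6), rng.randint(3, 12)
    question = TEMPLATE.format(name=name, x=x, y=y, z=z)
    answer = x + y * z  # ground truth tracked alongside the surface form
    return question, answer

# Build one "GSM-Symbolic-like" set: 50 variants of the same underlying problem.
variants = [instantiate(seed) for seed in range(50)]
print(variants[0])
```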



GZjMMU6aAAAsMfK.jpg


4/13
@MFarajtabar
4/ #Result 1: Current accuracies on GSM8K are not reliable! We observe LARGE performance variation: Llama 8B scores anywhere between 70% and 80%, Phi-3 scores between 75% and 90%, and so on. For most models, the average performance on GSM-Symbolic is lower than on GSM8K (indicated by the dashed line).



GZjPSQXaAAIKpmx.png

GZjPSQYaAAIWrfF.jpg


5/13
@MFarajtabar
5/ #Result 2: The fragility of supposed LLM reasoning. LLMs remain sensitive to changes in proper names (e.g., people, foods, objects), and even more so when numbers are altered. Would a grade-school student's math test score vary by ~10% if we only changed the names?



GZjP0ciaoAAbl2g.jpg


6/13
@MFarajtabar
6/ What if we adjust question difficulty? We introduce 3 new variants of GSM-Symbolic to study model behavior: removing one clause (GSM-M1), adding one clause (GSM-P1), or adding two clauses (GSM-P2).



GZjQAXtbIAA4Xsx.jpg


7/13
@MFarajtabar
7/ #Result 3: As questions increase in difficulty (M1 → Symbolic → P1 → P2), not only does performance drop, but variance also rises, making models increasingly unreliable.



GZjQTn4aAAEKsdh.jpg


8/13
@MFarajtabar
8/ This begs the question: Do these models truly understand mathematical concepts? Introducing #GSM_NoOp! We add a single clause that seems relevant but doesn't contribute to the overall reasoning (hence "no-op"). Check out what happens next!



GZjQguIbwAAZqSF.jpg


9/13
@MFarajtabar
9/ #Result 4: A massive performance drop! All models, including o1 models, show significant declines. While it’ll be interesting to see how grade-school students perform on similar datasets, I doubt the drop would be this severe.



GZjQ3XFaAAE76On.png


10/13
@MFarajtabar
10/ #Result 5: Can scaling data, models, or compute fundamentally solve this? We don't think so! #OpenAI's #o1-series is performing better but still suffers from slight performance variations. #o1_preview shows significant improvements, but...



GZjRMfqaAAMaqf5.jpg


11/13
@MFarajtabar
11/ ...but even o1-preview shows the same silly mistakes, like this one. Either it doesn't understand what 'now' is, or it doesn't understand what 'last year' is, or, more likely, its training data contains this inflation pattern and it's following it again.



GZjRYrwaAAId3xv.jpg


12/13
@MFarajtabar
12/ Understanding LLMs' true reasoning capabilities is crucial for deploying them in real-world scenarios where accuracy and consistency are non-negotiable—especially in #AI_safety, #alignment, #education, #health_care, and #decision_making systems. Our findings emphasize the need for more robust and adaptable evaluation methods. Developing models that move beyond pattern recognition to true logical reasoning is the next big challenge for the #AI #community.



13/13
@MFarajtabar
13/ Overall, we found no evidence of formal reasoning in language models, including open-source models like #Llama, #Phi, #Gemma, and #Mistral and leading closed models, including the recent #OpenAI #GPT-4o and #o1-series. Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%! We can scale data, parameters, and compute—or use better training data for Phi-4, Llama-4, GPT-5. But we believe this will result in 'better pattern-matchers,' not necessarily 'better reasoners.'
Check out the full paper to find out more: https://arxiv.org/pdf/2410.05229
Also stay tuned for the data release!





bnew



1/8
@omarsar0
Has mathematical reasoning in LLMs really advanced?

This study tests several SoTA models on a benchmark created with symbolic templates that enable diverse mathematical problems.

They find that LLMs exhibit variance when responding to variations of the same questions.

The performance of all the models declines when the numerical values in the question are adjusted.

Another interesting finding is that as questions are made more challenging (e.g., by increasing the number of clauses), performance deteriorates significantly.

The authors hypothesize that the observed decline in performance is due to a lack of logical reasoning in current LLMs.

The study highlights the importance of model reliability and robustness and why it's important to continuously evaluate LLM systems after they are deployed. This is not just for math-related problems; it also happens in use cases involving analysis, research, Q&A, and retrieval. We have seen that adjustments to prompts that deal with knowledge, numbers, retrieval, or structure can throw off the model, so there is a need to trace LLMs and monitor them in production.
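That kind of monitoring is cheap to operationalize: score the same model on several renamed/re-valued variants of each test set and watch the spread, not just the mean. A minimal sketch, where ask_model is a hypothetical stand-in for whatever LLM client you actually use:

```python
# Sketch: measure mean and spread of accuracy across prompt-variant sets,
# in the spirit of the GSM-Symbolic finding that single-number benchmarks hide variance.
# `ask_model` is a hypothetical placeholder for your actual LLM call.
from statistics import mean, pstdev

def ask_model(question: str) -> str:
    raise NotImplementedError("wire up your LLM client here")

def accuracy(dataset: list[tuple[str, str]]) -> float:
    # dataset: list of (question, gold_answer) pairs
    correct = sum(1 for q, gold in dataset if ask_model(q).strip() == gold)
    return correct / len(dataset)

def variance_report(variant_sets: list[list[tuple[str, str]]]) -> None:
    scores = [accuracy(ds) for ds in variant_sets]
    print(f"mean={mean(scores):.3f}  std={pstdev(scores):.3f}  "
          f"min={min(scores):.3f}  max={max(scores):.3f}")
```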



GZnhms7bMAEydif.png


2/8
@omarsar0
Paper: [2410.05229] GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models



3/8
@inductionheads
Seems to me this proves that reasoning models actually do better on reasoning tasks. If they weren't doing reasoning there would be no correlation



4/8
@evolucion4
Estos son como los críticos de Gardel.



5/8
@asankhaya


[Quoted tweet]
This is surprising only to those who have not worked in formal reasoning. Yes, LLMs cannot do true logical reasoning in a formal sense; you can do better with an SMT solver. But it is also true that you can solve a lot of logical problems by just applying “reasoning steps” from the training data, especially when your training data is the entirety of written content ever produced. Both of these can be true at the same time; it is not a contradiction, just an interesting dichotomy.


6/8
@gpt_biz
It's interesting to see that even state-of-the-art models struggle with math reasoning under slight variations, highlighting the need for continuous evaluation in real-world applications.



7/8
@wyomingiliad
Yes! As I've stated numerous times, AI is utterly incapable of logic and reason. Not only was the 2024 Nobel Prize in Physics not awarded to advancements in physics, but AI isn't intelligent. AI is incapable of reason, creativity, innovation, and exalted idealism. It has been said that God created man in His own image, but these days many AI bros are trying to create AI in their own image and call that God, or at least a "Ph.D. in your pocket lol bro." But AI can only ever be as ethical as the foundations it was built upon--the art, science, poetry and beautiful human creativity it was trained on. Plato saw beauty in Justice, but their AI sees no beauty. Were Natural Rights, Artists' Rights, Copyright, and Human Rights respected during the training? If not, the AI is not, and cannot be, ethical--as Hamlet noted--"It is not, and it cannot, come to good." And all for what?

OpenAI's o1 strawberry pop-tart/stochastic parrot is incapable of simple logic and reason, exalted art, truth, and beauty. To demonstrate this, give it the geometric definitions in Book 1 of Euclid's Elements, and then ask it to complete the rest of Euclid's Elements using the simple logic and reason oft employed by junior high schoolers. It won't be able to. So it is that AI algorithms are incapable of meaningful scientific research, useful invention, or creative art. There will be no AGI from LLMs nor any other "AI" fantasy. Getting AGI from an LLM is like trying to turn a Model X Tesla into the Millennium Falcon by building it bigger. All of scientific and cultural advancement--all science and technology--is centered upon humans answering questions that nobody knew the answer to, and even more importantly, it is premised upon humans asking questions nobody has asked before. LLMs cannot do any of this. Einstein reminds us that "imagination is more important than knowledge," and too, he stated, "Never lose a holy curiosity." AI has no curiosity. AI LLMs are utterly incapable of AGI, logic, reason, and creative logos. This is easily demonstrated. No AI/LLM could produce Euclid's Elements from the simple definitions given in Book I. AI LLMs are mere autocomplete, and thus have very little (perhaps nothing) to do with human intelligence. AI is incapable of creating anything new or beautiful beyond that which it ingested/copied/stole.

One of the problems with LLM/"AI" development is that all too often the CEOs are recreating it in their own image--folks who never created art nor culture nor ingested nor trained their souls on the data found in Beethoven's, Van Gogh's, Shakespeare's, and Einstein's works. For instance, while Sam Altman's AI may well be able to duplicate/parrot the String Theory hoax, it, like Altman, will utterly fail to advance physics, or any other noble field, in any manner. GPT-5 will be far, far, far inferior to all the artists and physicists it copied, ingested, and stole from.

Let us quickly prove this. Train AI on all music known to humanity before Beethoven. Then ask it to create Beethoven's symphonies, or just the third or ninth. Train AI on all literature known to humanity before Shakespeare, and then ask it to write Hamlet. Train AI on all science known to humanity before relativity, and then ask it to create special and general relativity. Train AI on all math and physics known before Newton, and then ask it to write Principia. Train AI on all philosophy known before Plato, and then ask it to write Socrates' Apology. Train AI on all music known before the Beatles, and then ask it to create the White Album. Train AI on all the art created before Michelangelo, and then ask it to create the ceiling of the Sistine Chapel. Train AI on all the technology created before Steve Jobs, and then ask it to create Apple. Train AI on all the music that existed before the blues, and then ask it to describe love and heartbreak in music, or just to recognize it. Train AI on all art known before Van Gogh, and then ask it to paint starry nights, sunflowers, and self-portraits such as those that are worth hundreds of millions of dollars today. Train AI on all epic poetry known to humanity before Homer, and then ask it to write Homer's Iliad and Odyssey.

So it is that AI is most often championed by those who have little appreciation for or understanding of Beethoven, Shakespeare, Einstein, Newton, Plato, The Beatles, Van Gogh, Homer, Michelangelo, Steve Jobs, music, literature, science, math, philosophy, art, technology, love, and epic poetry. Perhaps they even detest these exalted entities on some level.
@sama
@tsarnick
@nntaleb
@jordanbpeterson
@GaryMarcus
@EricRWeinstein
@elonmusk
@tegmark
@lexfridman
@kortizart
@neilturkewitz
@TrevyLimited
@RahllProduce
@WKCosmo
@ylecun
@WKCosmo



8/8
@arxivdigests
https://invidious.poast.org/nLjg4lsOGps













1/11
@GaryMarcus
👇Superb new article from @apple AI: “we found no evidence of formal reasoning in language models. Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%!”

There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways can give you a different answer.

Strongly encourage you to read the whole thread.

[Quoted tweet]
1/ Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source like Llama, Phi, Gemma, and Mistral and leading closed models, including the recent OpenAI GPT-4o and o1-series.
arxiv.org/pdf/2410.05229

Work done with @i_mirzadeh, @KeivanAlizadeh2, Hooman Shahrokhi, Samy Bengio, @OncelTuzel.

#LLM #Reasoning #Mathematics #AGI #Research #Apple


GZjJoMBaUAAe0Lw.png


2/11
@GaryMarcus
A longer, more in-depth discussion of why LLMs are cooked, relating this new study to many others, is here: LLMs don’t do formal reasoning - and that is a HUGE problem



3/11
@ahatamiz1
Unfortunately, I disagree this work has found anything meaningful about "lack of reasoning" in LLMs.

In fact, most of the identified issues are due to poor associative-recall performance. And there are many ways to improve recall in LLMs (see the proposed delta rule by Schmidhuber for example)



4/11
@GaryMarcus
This is consistent with my own work going back to 1998; I believe the results are robust, and after 25 years I want more than hand-waving to believe a solution is in reach.



5/11
@kevin_nejad
O1 results are promising, though they seem to fail on the NoOp task (~17% drop in performance).



6/11
@GaryMarcus
you can’t build a trustworthy agent with that, never mind the expense



7/11
@bob_mankoff
Seems a stretch to just call this "sophisticated pattern matching". In any case, I don't think it's something Gary thought these models could do when he wrote "Rebooting AI" in 2019.



GZntZ3QXQAAtcKY.png


8/11
@GaryMarcus
there are undoubtedly small variants of that which will fail, and that’s kind of the point of the paper/thread.

a bigger set of training patterns (including synthetic data), but the same core failure mode



9/11
@UriGil3
This is barely science. The bottom line is that he doesn't have a control group of humans to show that their performance doesn't degrade on those variations, so no conclusion about reasoning can be reliably made.



10/11
@GaryMarcus
imagine if we tested calculators with a control group of humans. what would that prove?



11/11
@DynamicGOP
Agreed. Fundamentally flawed technology. OK for parlor tricks and canned demo reels, but impossible for production systems. An algo that just makes stuff up in a way that can’t be traced as an error cannot be used. Bad tech.





bnew






1/11
@GaryMarcus
GenAI fans: Don’t worry, LLMs don’t really use that much power. They are getting more efficient every day, too.

Microsoft, Amazon, Google: sign us up for a whole lot of nuclear power, now.



GaDRK-Bb0AEc6Nw.jpg


2/11
@GaryMarcus
source: https://www.theinformation.com/articles/tech-revives-nuclear?utm_source=ti_app&rc=dcf9pt



3/11
@GaryMarcus
by @mvpeers @theinformation



4/11
@raulizahi
It drives me crazy that most AI R&D teams are not making a visible effort to use existing HW and energy infrastructure more efficiently.

We need a race to lower power consumption for the same performance.

Which VC is willing to fund this?



5/11
@EverydayAI_
Most GenAI fans (myself included): LLMs suck up WAY too much power.

AI can both be good for businesses creating value and also terrible for the environment.

Also, until quantum computing is fully realized, nuclear offers advantages for big tech to diversify their powering options:

- Nuclear power is carbon-free, helping companies meet their sustainability targets
- Nuclear provides consistent power output, unlike weather-dependent renewables
- Nuclear is obviously better long-term than the grid



6/11
@bbailey39
So, AI is going nuclear?



7/11
@NordmarkJens
Well, if they can get nuclear to improve its cost-effectiveness, that could be a big win.



8/11
@Bit111111
"Don't worry, they're small reactors" 🙃



GaDTbtiWUAA3neu.png


9/11
@bytewatchman
I've got a better idea - how about we focus on making tech that doesn't require signing us up for a whole lot of nuclear power first?



10/11
@DonotInnovate
It’s absolutely insane in every possible way



11/11
@BronzePostagem
Hey Gary, have you changed your mind about the scaling issue with GenAI? I personally believe that we are not yet close to the limit and diminishing returns.





bnew


1/11
@GaryMarcus
Not the best cognitive psychology I have seen, because it confounds together abstract reasoning with domains where motivated reasoning is well-known to exist. Doesn’t show that humans can’t reason, but rather that they prefer not to.

Strange also that the very long literature on “motivated reasoning” is being reinvented here. This sounds like a nice study but social psychologists have known about these effects since at least the 1980s. (My 2008 book Kluge talks about it some. @jayvanbavel is a good expert on the recent related literature.)

[Quoted tweet]
Can Humans Reason?

Not really - most humans simply indulge in groupthink.

They found no evidence of independent reasoning in humans. Smarter humans become more sophisticated at self-deception and hallucinating arguments.

So, in some sense, AI can already do better than humans 😂


GZ9FTv4b0AMvJhx.jpg


2/11
@BethCarey12
The humble spreadsheet tool became ubiquitous *because* it doesn't make errors like humans.

When your goal is solving language understanding for machines, the path leads to the best of both worlds - emulating what the brain does with language while exploiting the benefits of computers and what they do best.

Language, as learned by a human baby, has reasoning baked in. Capturing that creates the magic without the 'hallucinations' and energy consumption.

Amazon.com



3/11
@bindureddy
Ok, I am clearly trying to be funny and failing by the look of this



4/11
@nathalieX70
As Elon has created the best hive mind with the Adrian community, it would tend to say Bindu is right. He failed to do so with Elon.



5/11
@HumblyAlex
Any of Gary's fanboys want to earn $50?

Booch couldn't earn himself the money, and Gary seems to avoid the high-hanging fruit that exists much more than the imaginary ones he regularly swings at.

Easy money if Gary's right and his arguments hold up.

[Quoted tweet]
If you can find anything written by @Ylecun, @GaryMarcus, or @Grady_Booch that entirely and conclusively negates the following in a way that stands up to scrutiny, I will give you $50.

Something tells me it doesn't exist, and they avoid these truths at all costs.


6/11
@bate5a55
Interesting critique. Did you know that the "bias blind spot," identified in 2002, shows even trained logicians fail to recognize their own reasoning biases? It highlights how self-awareness doesn't always mitigate motivated reasoning.



7/11
@BankingNeko
I agree with GaryMarcus, the study's findings aren't new and social psychologists have known this for decades. The critique of the methodology is spot on, we need more nuanced research on human reasoning.



8/11
@Gazorble
John Locke "Few men think, yet all will have opinions. Hence men’s opinions are superficial and confused."
🤯



9/11
@TrustInAutonomy
Ffs



10/11
@gauaren
Such a funny way for the retweeted person to show they don't understand the difference between can and do. Maybe an LLM can (haha) explain it to them



11/11
@FaustinoBerlin




GZ9bVmgXMAAqrC7.jpg







1/11
@bindureddy
Can Humans Reason?

Not really - most humans simply indulge in groupthink.

They found no evidence of independent reasoning in humans. Smarter humans become more sophisticated at self-deception and hallucinating arguments.

So, in some sense, AI can already do better than humans 😂



GZ9FTv4b0AMvJhx.jpg


2/11
@Hello_World
Humans can reason; LLMs can't. Humans don't always reason, but LLMs never do.



3/11
@01Singularity01
What I've been saying. Every time I see "AI can't" or "AI doesn't" I immediately think "the same can be said for most humans". Most humans do not generalize well, for example, and there is a wide range of generalization capabilities. Overwhelmingly, people want their bias confirmed. New information is disruptive to the established training, and to be avoided at all costs.



4/11
@gpt_biz
Interesting take, but I think humans' ability to feel empathy and adapt makes our reasoning unique in ways AI still can't fully grasp yet



5/11
@M1ndPrison
No surprise here. Emotion is the kryptonite of reasoning. That isn't to say humans cannot reason.

This is akin to giving AI a terrible prompt and saying it can't solve the problem.

These are expected outcomes, but do not demonstrate a useful comparison.



6/11
@climatebabes
I am an exception..



7/11
@AI_Ethicist_NYC
💯'Experts' complain that AI "can't really" do this or that and that it just uses what it learns in its training data. It just predicts the most likely next token given the context. It hallucinates and comes up with wrong answers. Blah blah blah.

Ironically, it's all these flaws that make it more human-like, not less. Learning from data sets is exactly what people do (on their best day). People are not good at reasoning or remembering or putting two and two together. Some people are. Most people aren't.

Either way, who cares if AI is REALLY reasoning the way we perceive reasoning as long as it comes to a correctly reasoned result.

Do you really think artificial superintelligence is going to give a sh*t whether or not we've deemed its reasoning as 'genuine'? Do lions care what the sheep think?

I think we need to focus less on the idea that AI has to 100% process the world like the human brain does and be more concerned about developing AI so that it can help our brains think in a way they have not before.



8/11
@mr_trademaker
Can Humans Run Marathons?

Not really – most humans simply indulge in group inactivity, sticking to light exercise or none at all.

They found no evidence of marathon-level endurance in most humans. Fitter humans become more skilled at self-deception, convincing themselves they're training sufficiently when they aren’t.

So in some sense, AI can already do better than humans ;)



9/11
@PhDcornerHub
Epigenetic Building of the Evolutionary Mind

Autopoiesis and social intelligence reproduce the emergent properties of thought generation and application. This process involves adaptive thinking and abductive reasoning, both of which are more complex and nuanced. Autopoiesis embodies the essence of self-reproduction, a hallmark of living entities.

Unlike LLMs, the human brain builds a large foundational model as a filter that processes only relevant information from vast data. Nature has its own metrics for intelligence and survival. The general principle of emergence and the adaptation of the evolutionary survival program reflect cognitive learning mechanisms that are autopoietic self-catalysts. Over evolutionary lineage, these processes are automated in genetic systems as a form of learning.

However, these systems do not embody true intelligence. Instead, they mimic individual components, which in turn create social intelligence. Intelligence, distinct from knowledge, is not merely an accumulation of information; it involves the capacity to think and reason. Knowledge is the information and understanding gathered over time, while intelligence refers to the ability to engage in reasoning and reflection.

Machines, by processing vast amounts of data and text, can mimic intelligence. Yet, while machines may replicate wisdom based on textual input, they inherently lack the capacity for human-like thought and reasoning. Learning and developing intelligence are interconnected but distinct processes—one can learn without necessarily developing true intelligence. Language models may simulate aspects of cognitive processes through pattern recognition, but true intelligence encompasses the innate capacity to self-reflect and adapt beyond mere programmed responses.

No human is replaceable by a knowledge machine. Humans are thinking beings, while AI serves as a knowledge machine—it is not transformative, nor is its intelligence transferable. In the end, these technologies remain clouds of knowledge, with humans doing the real thinking.

While intelligence can be introduced into systems, the ability to think is self-learned, self-adaptive, and scaled through the human brain’s evolutionary and survival-driven intelligence. It is embedded in a cognitive social matrix connected to vast libraries of data that we have crafted into language. Most intelligence is shaped by adapting to these metrics. At the core, the brain is programmed to survive, reproduce, and exist, and all intelligence is an emergent property of thought generation and application.

Intelligence is an innate quality, not a genetic predisposition. Everyone who survives demonstrates intelligence at a basic level because our very nature is oriented towards survival, evolution, and reproduction. We must not confuse intelligence shaped by the ecosystem with inherent, born intelligence. The ability to think and intelligence is generic, yet it can also be cultivated. Knowledge and wisdom, meanwhile, form two distinct aspects of this intellectual journey.

AI Technology as the Lighthouse of Human Evolution

AI technology acts as a lighthouse and the torchbearer of human intelligence and knowledge, facilitating the genetic evolution of humankind towards higher levels of achievement and understanding.



10/11
@web3nam3
Humans can reason but don’t want to.



11/11
@adolfoasorlin
Was that a reasoning?





bnew





1/11
@AlphaSignalAI
HUGE news for developers. Supabase just launched the ChatGPT of databases.

An AI-based Postgres service.

You can build and launch databases, create charts, embeddings, see visuals of your DB, generate sample data, and more.

And.. it's 100% open source.



https://video.twimg.com/ext_tw_video/1823020502983606272/pu/vid/avc1/1280x720/n4nCHZIAkaWdXnax.mp4

2/11
@AlphaSignalAI
Try it here: Postgres Sandbox



3/11
@AlphaSignalAI
Liked this post?

Check out AlphaSignal | The Most Read Technical Newsletter in AI to get a daily summary of breakthrough models, repos and papers in AI.

Read by 200,000+ devs.



4/11
@lifeafterAi_
Can we run it offline ? Without connection to openai cloud ?



5/11
@AlphaSignalAI
I don't think it's using any OpenAI connection



6/11
@PaiNishant
W Supabase, ty



7/11
@TheBengaluruGuy
Open Source Link: GitHub - supabase-community/postgres-new: In-browser Postgres sandbox with AI assistance



8/11
@theinsiderzclub
Supabase just keeps pushing the envelope. This is going to be a game-changer for devs



9/11
@juanstoppa
it works pretty well

[Quoted tweet]
it's very impressive what you can do with this, you can easily accelerate the DB design 10x in my opinion


GUysy1OW4AAMoGx.png


10/11
@lazukars
That sounds amazing! How do you think this new AI-based Postgres service will impact the way developers work with databases in the future?



11/11
@alberduris
ahead of our time @javitoshi





bnew


1/8
@MelMitchell1
The subtle yet still insane risks of using AI to "enhance" photos.

[Quoted tweet]
I'm talking at a conference later this year (on UX+AI).

I just saw an ad for the conference with my photo and was like, wait, that doesn't look right.

Is my bra showing in my profile pic and I've never noticed...? That's weird.

I open my original photo.
No bra showing.

I put the two photos side by side and I'm like WTF...

Someone edited my photo to unbutton my blouse and reveal a made-up hint of a bra or something else underneath. 🤨

Immediately, I email the conference host.
(FYI he is a great, respectable guy with 5 kids at home.)

He is super apologetic and immediately looks into the issue.

He quickly reports back that the woman running their social media used a cropped square image from their website.

She needed it to be more vertical, so she used an AI expand image tool to make the photo taller.

AI invented the bottom part of the image (in which it believed that women's shirts should be unbuttoned further, with some tension around the buttons, and revealing a little hint of something underneath). 🤯



FYI the conference organizers were super apologetic and took down all of the content with that photo.


GZ80LO8asAAIA8y.jpg


2/8
@max_spero_
Unfortunately this is just one of many ways AI systems are tailored towards the preferences of men, often not even intentionally



3/8
@FungibleUnicorn
This is a stupid complaint and not at all worrisome



4/8
@PessoaBrain
🤯



5/8
@postlelab
#fukkedup



6/8
@icedwater
And this is just one example that got spotted... when the deluge comes, if it hasn't already, how will we have time to check even 1% of everything?



7/8
@roxannedarling
The risk increases if you are female.



8/8
@ChristineKooi
Good night.





bnew


1/10
@AlphaSignalAI
It went under the radar but OpenAI released their most performant GPT-4o assistant model yesterday.

- Structured outputs with 100% reliability
- 4x max token (16,384) output limit
- 50% cheaper on inputs
- 33% cheaper on outputs
- #1 on @allen_ai's ZeroEval leaderboard

"Assistant model" means the system message in the OpenAI API docs is "You are a helpful assistant."
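For reference, this is roughly how the structured-outputs feature is invoked, as a sketch assuming a recent openai Python SDK (the parse helper currently lives under beta) and the gpt-4o snapshot that supports json_schema response formats:

```python
# Sketch of structured outputs with the gpt-4o snapshot the tweet refers to.
# Assumes a recent `openai` Python SDK and pydantic v2; OPENAI_API_KEY is read
# from the environment.
from pydantic import BaseModel
from openai import OpenAI

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,  # the schema is enforced, not just suggested
)
print(completion.choices[0].message.parsed)
```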



GUZL55hWAAEZ0-V.jpg


2/10
@jackadoresai
I tried it out with the new GPT-4o assistant model and I'm blown away by the results. 100% reliability and 4x the max token output limit? This is a game-changer for AI developers like me.



3/10
@TheAlphaSignal
https://openai.com/index/introducing-structured-outputs-in-the-api/



4/10
@CjRandhava
Exciting to see OpenAI's most advanced GPT-4o model roll out!



5/10
@omarmohdomain
@AlphaSignalAI Impressive capabilities. Excited to see real-world use cases emerge.



6/10
@nylirehs17
https://openai.com/index/introducing-structured-outputs-in-the-api/



7/10
@Ahmadghafari5
https://openai.com/index/introducing-structured-outputs-in-the-api/



8/10
@BayuNax144
I tried it out with the new GPT-4o assistant model and I'm blown away by the results. 100% reliability and 4x the max token output limit? This is a game-changer for AI developers like me.



9/10
@Junior002014
I tried it out with the new GPT-4o assistant model and I'm blown away by the results. 100% reliability and 4x the max token output limit? This is a game-changer for AI developers like me.



10/10
@SamiHaddioui
https://openai.com/index/introducing-structured-outputs-in-the-api/





bnew








nvidia / llama-3.1-nemotron-70b-instruct
PREVIEW
Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA in order to improve the helpfulness of LLM generated responses.
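A quick way to poke at the model is through the hosted NIM's OpenAI-compatible endpoint; a minimal sketch, assuming the preview base URL and model id below are still current and you have an NVIDIA API key:

```python
# Sketch: call the hosted Nemotron-70B NIM through its OpenAI-compatible endpoint.
# Assumes an NVIDIA API key and that the preview base URL / model id are current.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="nvapi-...",  # your NVIDIA API key
)

resp = client.chat.completions.create(
    model="nvidia/llama-3.1-nemotron-70b-instruct",
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
    temperature=0.5,
    max_tokens=256,
)
print(resp.choices[0].message.content)
```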







1/11
@_xjdr
Nemotron-70B entropix edition is pretty fukking good



GaCWB9Kb0AIyFdl.jpg


2/11
@artuskg
How do you get Python 3.11 on TPU VMs? I lost two hours trying to get the frog running on a TPU v4-8 with Python 3.11, which I installed via pyenv. It burned through all my Sonnet 3.5 quota 😂 and can only see 4 of the 8 chips plus other problems. I copied the weights to a bucket and deleted the VM for now. Do you have a discord etc to join?



3/11
@_xjdr
When you launch the VM, make sure you are using the ubuntu-22.04 VM base image; then you can use the deadsnakes PPA to get the specific Python version you want.



4/11
@LechMazur
70B needs to be tested with something harder.
Try this:
"List 10 famous people who lived for 89 years"



5/11
@_xjdr




GaCmUrtb0AUbQ3V.jpg


6/11
@menhguin
Wait

If nemotron beats 4o

Did we just mog 4o



7/11
@Yuchenj_UW
wow, it’s good



8/11
@charle_marle
Entropix Mixer too good



9/11
@yupiop12
dang its sharp



10/11
@InverseMarcus
awesome work man



11/11
@m_chirculescu
Very nice - full reasoning that can be followed. Impressive.





bnew















1/11
@jaykrishAGI
Meta’s New AI Method: Teaching AI to "Think" Before Responding to Instructions

Meta researchers have come up with a new technique called Thought Preference Optimization (TPO) that’s designed to improve how AI responds to general instructions.

So what is TPO ?



GaByjoxW8AAATr-.jpg


2/11
@jaykrishAGI
What is Thought Preference Optimization (TPO)?

TPO is a process where AI models generate internal thoughts before delivering their final answer.

These thoughts are private and not shown to users;
they’re like the AI's brainstorming session before giving its final response.



3/11
@jaykrishAGI
Instead of simply answering right away, the AI uses trial and error behind the scenes to figure out the best response.

Think of it as an AI doing some quiet thinking before it speaks.

Unlike standard AI models, which just jump to an answer, TPO models take a bit more time.



4/11
@jaykrishAGI
As a result, they're better at handling tasks like creative writing or marketing.

Interestingly, though, this technique isn’t as effective for math-related tasks.



5/11
@jaykrishAGI
Why Does This Matter?

This new method is an extension of previous work, like OpenAI’s 'Strawberry' model

focused on giving AI models more time to reason through things.

Meta’s approach, however, is trying to push AI beyond just reasoning or problem-solving tasks.



6/11
@jaykrishAGI
With TPO,

AI can become more adaptable and capable of handling a broader range of tasks, including creative ones.



7/11
@jaykrishAGI
Although Meta’s Yann LeCun, who leads their AI research, might not be fully convinced by this idea, the success of TPO in specific areas shows that

AI can learn to "think" for a variety of tasks, not just mathematical or logical ones.



8/11
@jaykrishAGI
This could make AI more useful as a flexible assistant in different fields, from content creation to customer service.



9/11
@jaykrishAGI
TPO opens up new possibilities for AI by allowing it to process its thoughts privately before responding.

This method could potentially make AI smarter and more thoughtful, especially in non-technical areas like writing or marketing.



10/11
@jaykrishAGI
By teaching AI to think independently,

Meta might be onto something that could broaden the role AI plays in everyday tasks.



11/11
@jaykrishAGI
Thanks for reading









1/4
@jaseweston
🚨New work: Thinking LLMs!🚨
- Introduces Thought Preference Optimization (TPO)
- Trains LLMs to think & respond for *all* instruction following tasks, not just math
- Gives gains on AlpacaEval (beats GPT-4 & Llama3-70b) & ArenaHard with an 8B model
[2410.10630] Thinking LLMs: General Instruction Following with Thought Generation
🧵1/4



GZ5ZEYSWAAA0TSW.jpg


2/4
@jaseweston
Recipe 👩‍🍳 - no need for human data!
- Start with instruct model.
- Prompt to think & respond. Initially works poorly.
Iterate:
- Sample k thought+responses
- Use standard judge on responses *alone*
- Construct full preference pairs & run DPO
Takes 3-4 iterations to work.
🧵2/4
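A schematic of one iteration of that recipe; sample_fn, judge_fn, and dpo_train_fn are hypothetical hooks into whatever sampling, judging, and DPO stack you use, not the paper's actual implementation:

```python
# One TPO iteration, following the recipe above in schematic form.
# sample_fn(model, instruction) -> (thought, response) tuple
# judge_fn(instruction, response) -> score on the response alone
# dpo_train_fn(model, pairs) -> updated model
def tpo_iteration(model, instructions, sample_fn, judge_fn, dpo_train_fn, k=8):
    pairs = []
    for instruction in instructions:
        # 1. Sample k thought+response candidates from the current model.
        candidates = [sample_fn(model, instruction) for _ in range(k)]
        # 2. Judge the responses *alone* -- thoughts stay private.
        scored = sorted(candidates, key=lambda c: judge_fn(instruction, c[1]))
        # 3. Keep the full best/worst candidates (thought + response) as a preference pair.
        pairs.append((instruction, scored[-1], scored[0]))
    # 4. Run DPO on the collected pairs and return the updated model.
    return dpo_train_fn(model, pairs)
```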



GZ5YsPbXQAAALTB.jpg

GZ5YsPdXgAAzbv6.jpg


3/4
@jaseweston
Initial thought + responses give poor win rates because the instruct model (Llama-3-8B) hasn't been optimized for this goal.

After 3-4 TPO iterations, our method outperforms the baseline (LLM without thinking):
🥉3rd on AlpacaEval leaderboard
🏆 Best 8B model on ArenaHard
🧵3/4



GZ5ZcJpWsAADdz5.jpg


4/4
@jaseweston
A performance breakdown shows our model wins on many categories not considered "reasoning" (see pics for examples).

This is encouraging, hopefully Thinking LLMs can be widely adopted in non-reasoning domains.

Thanks for reading & hope this gives lots to think about! 🤔💭
🧵4/4



GZ5Y1eOWAAAo5dN.jpg

GZ5Y1eYWIAAG9d4.jpg

GZ5Y1eZWcAEibHp.jpg




bnew



1/4
@youraimarketer
Nvidia/Llama-3.1-Nemotron-70B-Instruct-HF is INSANE!

I’ve tested it on one of my most challenging tasks: Structured Information Extraction from a large corpus of unstructured text across diverse formats.

It significantly outperforms GPT-4o on:
- identifying key insights,
- generating accurate knowledge graphs,
- establishing meaningful connections between disparate information groups.

The model's attention to detail is exceptional, capturing subtle nuances and dependencies that are often missed by other models.

This is incredibly exciting!!!



2/4
@victorpaycro
Where can I access that one?



3/4
@youraimarketer
llama-3_1-nemotron-70b-instruct | NVIDIA NIM



4/4
@techfusionjb
I'm blown away by the level of detail it can capture! Exceptional attention to detail is a game-changer for Structured Information Extraction. This is indeed incredibly exciting!!!





bnew










1/11
@sarahookr
We have been working hard on multilingual safety – AI that serves everyone across the world.

This is an incredibly difficult setting – very proud of the work we have done and the resources we have invested here.

Sharing some of our work below. 🧵



2/11
@sarahookr
Safety isn't one-size-fits-all. It varies by culture, location and language, yet traditional alignment work often treats it as static.

We did work on alignment that captures both local 🧧🎃🗿 and global 🌎 preferences.

📜 [2406.18682] The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm



GZ7SRAFWMAAKkeW.png


3/11
@sarahookr
We also released the Aya Red-teaming dataset. 🦺🔴

A first-of-its-kind human-annotated multilingual red-teaming dataset consisting of prompts in 8 languages, across 9 different categories of harm.

Annotates for both local ⛩️ and global 🌍harms.

CohereForAI/aya_redteaming · Datasets at Hugging Face



4/11
@sarahookr
Despite the excitement around merging, little is known.

In our latest work, we exhaustively validate merging techniques + share key recipes to benefit dual-objective goals.

We find large gains, and that merging 👷 outperforms mixing 🧑‍🍳 data.

📜 [2410.10801] Mix Data or Merge Models? Optimizing for Diverse Multi-Task Learning
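For readers new to merging: the simplest baseline in this family is a plain linear interpolation of two fine-tuned checkpoints with the same architecture. A toy sketch (the expert names and tensors below are stand-ins, not the paper's models):

```python
# Minimal linear weight merge of two same-architecture checkpoints,
# the simplest of the merging techniques such papers compare against.
import torch

def linear_merge(state_a: dict, state_b: dict, alpha: float = 0.5) -> dict:
    """Return alpha * A + (1 - alpha) * B, parameter by parameter."""
    return {k: alpha * state_a[k] + (1 - alpha) * state_b[k] for k in state_a}

# Toy stand-ins for two fine-tuned experts; in practice these would be
# model.state_dict() from two checkpoints with identical shapes.
expert_safety = {"w": torch.randn(4, 4), "b": torch.zeros(4)}
expert_multilingual = {"w": torch.randn(4, 4), "b": torch.ones(4)}

merged = linear_merge(expert_safety, expert_multilingual, alpha=0.5)
print({k: v.shape for k, v in merged.items()})
```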



GZ7UiSjW8AAtZmC.jpg


5/11
@sarahookr
For our state-of-the-art Aya 101 model release, we conducted the most extensive evaluation of massively multilingual safety to date.

We introduced safety context distillation as an effective + efficient harm mitigation technique across 101 languages.

📜 https://arxiv.org/pdf/2402.07827



GZ7VJw4W4AAVdf2.jpg


6/11
@sarahookr
Work to date on toxicity mitigation has exclusively focused on the English language. Yet harm exists in all languages.

In this recent work✨, we expand toxicity mitigation across multiple languages.

📜[2403.03893] From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models



GZ7VUmzXgAAS1kd.png


7/11
@sarahookr
Toxicity definitions evolve and change over time. ☣️ Why don't mitigation techniques account for this?

I really like our work on Goodtriever which advances a framework for continual toxicity mitigation.

📜 [2310.07589] Goodtriever: Adaptive Toxicity Mitigation with Retrieval-augmented Models



GZ7boDdXAAAeA4r.jpg


8/11
@sarahookr
From my perspective -- multilingual is the most realistic setting to advance safety in. It captures the real world -- non-homogeneous preferences, ongoing data drift, and the need for multi-objective optimization.

Proud of the work we have done over last few years.



9/11
@sarahookr
If you made it this far -- you can read more about all of our safety research here: Cohere For AI (C4AI)

We also do plenty of work on other topics including efficiency at scale, generalization, pre-training dynamics, preference training, synthetic data. Join us.



10/11
@digitalhealthxx
Fantastic work on advancing multilingual safety! 🌍 Building AI that serves diverse languages and communities is a challenging but crucial endeavor. Kudos to the team for the dedication and investment in this important area. Looking forward to seeing the impact of these efforts! 😊



11/11
@bate5a55
Fantastic work! Few realize that languages like Pirahã lack recursion and numerals, posing unique challenges for AI models—addressing such linguistic diversity is key to true multilingual safety.





bnew



1/2
Emergent Mind is one of my favorite AI tools.

I’m currently evaluating different 'example' counts within my prompts to achieve optimal results.

However, deciding on the number of "shots" is a nuanced process.

I asked the question: "How many examples are optimal for LLM prompts?"

The answer provided by the tool is a great starting point for diving into relevant research papers.

Balancing the number of examples is crucial, as using too many can lead to token limitations, increased latency, and potential overfitting to specific prompt structures.

On the other hand, too few examples may result in the model underfitting the task, especially in more complex or nuanced scenarios.

I plan to use a prompt evaluation platform to benchmark these prompts on key metrics like accuracy, token efficiency, latency, and cost. (if you have any platform suggestions here, please drop below)

This will allow us to fine-tune our approach and assess how well the LLM generalizes across various tasks.

At this stage, using 3-5 examples with Chain of Thought (CoT) reasoning seems effective.
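One way to make the shot count a single experimental knob is to build the few-shot CoT prompt programmatically and time each call. A minimal sketch with an invented example pool (llm is a placeholder for whatever model or API wrapper you benchmark):

```python
# Sketch: build an n-shot chain-of-thought prompt so the shot count is a single
# knob to sweep (e.g. 0, 3, 5, 8) while tracking accuracy, tokens, and latency.
# The example pool and task below are invented for illustration.
import time

EXAMPLES = [
    {"q": "A pen costs $2 and a notebook $5. Total for 3 pens and 1 notebook?",
     "cot": "3 pens cost 3*2=6; plus 5 for the notebook is 11.", "a": "11"},
    # ... more worked examples ...
]

def build_prompt(question: str, n_shots: int) -> str:
    shots = EXAMPLES[:n_shots]
    blocks = [f"Q: {ex['q']}\nReasoning: {ex['cot']}\nA: {ex['a']}" for ex in shots]
    blocks.append(f"Q: {question}\nReasoning:")
    return "\n\n".join(blocks)

def timed_call(llm, question: str, n_shots: int):
    prompt = build_prompt(question, n_shots)
    start = time.perf_counter()
    answer = llm(prompt)  # hypothetical callable wrapping your model or API
    latency = time.perf_counter() - start
    return answer, latency, len(prompt.split())  # crude token-count proxy
```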



2/2
You can sign up here: Emergent Mind: AI Research Assistant

Also the founder @mhmazur is very responsive and making the platform better day by day.




GaAD0NnWwAAuwER.jpg
 

bnew



1/4
@omarsar0
Model Swarms

Researchers from Google and UoW propose a new collaborative search algorithm to adapt LLMs via swarm intelligence.

A pool of LLM experts collaboratively move in the weight space and optimize a utility function representing various adaptation objectives.

Top quote: "Extensive experiments demonstrate that MODEL SWARMS could flexibly adapt LLM experts to a single task, multi-task domains, reward models, as well as diverse human interests, improving over 12 model composition baselines by up to 21.0% across tasks and contexts."

One interesting observation in the paper is that the collaborative search process helps to discover new skills and enables the weak-to-strong transition of experts.
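The search itself is essentially particle swarm optimization applied to model weights. A toy sketch with small numpy vectors standing in for expert checkpoints and a placeholder utility function (the real method operates on full LLM weights and task-specific utilities):

```python
# Toy sketch of the swarm idea: treat each expert's weights as a point and
# update velocity toward personal-best and global-best points, PSO-style.
# The "weights" here are small numpy vectors and `utility` is a stand-in.
import numpy as np

def utility(w: np.ndarray) -> float:
    return -np.sum((w - 1.0) ** 2)  # placeholder adaptation objective

rng = np.random.default_rng(0)
experts = [rng.normal(size=16) for _ in range(5)]   # initial expert "checkpoints"
velocity = [np.zeros(16) for _ in experts]
personal_best = list(experts)
global_best = max(experts, key=utility)

for step in range(50):
    for i, w in enumerate(experts):
        r1, r2 = rng.random(), rng.random()
        velocity[i] = (0.7 * velocity[i]
                       + 1.5 * r1 * (personal_best[i] - w)
                       + 1.5 * r2 * (global_best - w))
        experts[i] = w + velocity[i]
        if utility(experts[i]) > utility(personal_best[i]):
            personal_best[i] = experts[i]
    global_best = max(personal_best, key=utility)

print("best utility after search:", utility(global_best))
```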



GaBqgHrbMAAGPH5.jpg


2/4
@omarsar0
Paper: [2410.11163] Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence



3/4
@BensenHsu
The paper proposes a new algorithm called "Model Swarms" that adapts large language models (LLMs) to different tasks and objectives through a collaborative search process. The authors compare Model Swarms to various existing model composition and adaptation methods.

Model Swarms outperforms 12 baseline methods across four types of adaptation objectives: single task, multi-task domain, reward modeling, and human interests. It achieves up to 21% improvement on reasoning tasks and is able to discover new skills and capabilities in the LLM experts through the collaborative search process. Surprisingly, the experts that end up as the best often did not start as the best, suggesting that Model Swarms can unlock the "diamond in the rough" potential of weaker models.

full paper: MODEL SWARMS: COLLABORATIVE SEARCH TO ADAPT LLM EXPERTS VIA SWARM INTELLIGENCE



GaBubl5bgAAEyDe.jpg


4/4
@MasoudMaani
21%
🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣





bnew


1/1
@omarsar0
Model Kinship for Merging LLMs

Proposes model kinship to measure the degree of similarity between LLMs.

Model kinship is used to build a model merging strategy (Top-k Greedy Merging with Model Kinship) which yields better performance.

The authors find that this new criterion can be used to effectively and continuously perform model merging.

Paper: [2410.12613] Exploring Model Kinship for Merging Large Language Models
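A rough sketch of the idea: score kinship as the similarity of two models' weight deltas from a shared base (cosine is used here as one plausible metric among those the paper considers), then greedily average in the top-k most akin candidates. Toy numpy vectors stand in for checkpoints:

```python
# Sketch only: kinship as cosine similarity of weight deltas from a shared base,
# plus one greedy merge step over the most akin candidates. Toy vectors stand in
# for real model parameters; the paper's exact metric and schedule may differ.
import numpy as np

def kinship(delta_a: np.ndarray, delta_b: np.ndarray) -> float:
    return float(np.dot(delta_a, delta_b) /
                 (np.linalg.norm(delta_a) * np.linalg.norm(delta_b) + 1e-12))

def greedy_merge_step(base: np.ndarray, candidates: list, top_k: int = 2) -> np.ndarray:
    """Average the current best model with its top-k most akin candidates."""
    best = candidates[0]  # assume candidates[0] is the current best model
    ranked = sorted(candidates[1:],
                    key=lambda c: kinship(best - base, c - base),
                    reverse=True)
    chosen = [best] + ranked[:top_k]
    return sum(chosen) / len(chosen)  # simple uniform merge of the selected models

base = np.zeros(8)
models = [np.array([1.0, 0.5, 0, 0, 0, 0, 0, 0]),
          np.array([0.9, 0.6, 0, 0, 0, 0, 0, 0]),
          np.array([0, 0, 1.0, 1.0, 0, 0, 0, 0])]
print(greedy_merge_step(base, models))
```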



GaD79q3WEAAgdm_.jpg


