bnew


Large Models of What? Mistaking Engineering Achievements for Human Linguistic Agency​


11 Jul 2024 · Abeba Birhane, Marek McGann

In this paper we argue that key, often sensational and misleading, claims regarding linguistic capabilities of Large Language Models (LLMs) are based on at least two unfounded assumptions: the assumption of language completeness and the assumption of data completeness. Language completeness assumes that a distinct and complete thing such as 'a natural language' exists, the essential characteristics of which can be effectively and comprehensively modelled by an LLM. The assumption of data completeness relies on the belief that a language can be quantified and wholly captured by data. Work within the enactive approach to cognitive science makes clear that, rather than a distinct and complete thing, language is a means or way of acting. Languaging is not the kind of thing that can admit of a complete or comprehensive modelling. From an enactive perspective we identify three key characteristics of enacted language: embodiment, participation, and precariousness, which are absent in LLMs and likely incompatible in principle with current architectures. We argue that these absences imply that LLMs are not now, and cannot in their present form be, linguistic agents the way humans are. We illustrate the point in particular through the phenomenon of 'algospeak', a recently described pattern of high-stakes human language activity in heavily controlled online environments. On the basis of these points, we conclude that sensational and misleading claims about LLM agency and capabilities emerge from a deep misconception of both what human language is and what LLMs are.


 

bnew

1/1

@Sung Kim
DeepSeek-R1-Lite-Preview

🔍 o1-preview-level performance on AIME & MATH benchmarks.
💡 Transparent thought process in real-time.
🛠️ Open-source models & API coming soon!

Try it at chat.deepseek.com

https://chat.deepseek.com
















 

bnew


1/1
@SantiagoPombo
GPT-4o vs DeepSeek reasoning test:

Q: "How many repeated letter pairs are in 'bookkeeper,' and what's the most repeated letter?"

Speed: GPT-4o (2s) > DeepSeek (36s)
Style: DeepSeek more verbose/human-like CoT
Accuracy: GPT-4o: 5/5, DeepSeek: 2/5 => 4o is more deterministic.
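For reference, the question itself is mechanical, so the expected answer is easy to pin down with a few lines of Python (assuming "repeated letter pairs" means adjacent doubled letters, the usual reading):

```python
from collections import Counter

word = "bookkeeper"  # b o o k k e e p e r

# Adjacent doubled letters: "oo", "kk", "ee"
pairs = [word[i] * 2 for i in range(len(word) - 1) if word[i] == word[i + 1]]

# Most frequent letter overall
letter, count = Counter(word).most_common(1)[0]

print(len(pairs), pairs)  # 3 ['oo', 'kk', 'ee']
print(letter, count)      # e 3
```

So the expected answer is 3 repeated pairs, with "e" as the most repeated letter (3 occurrences).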














1/41
@DeryaTR_
I just tested @deepseek_ai’s deep think model with highly sophisticated, complex research problems & my mind is absolutely blown by its thinking & reasoning processes!

To me, this is advanced PhD-level, & in some cases, its reasoning is far superior to o1-preview! I’m in awe 🤯!

[Quoted tweet]
🚀 DeepSeek-R1-Lite-Preview is now live: unleashing supercharged reasoning power!

🔍 o1-preview-level performance on AIME & MATH benchmarks.
💡 Transparent thought process in real-time.
🛠️ Open-source models & API coming soon!

🌐 Try it now at chat.deepseek.com
#DeepSeek




2/41
@DeryaTR_
By the way, I conduct these tests by asking the exact same questions I previously asked o1-preview and GPT-4o, and I compare their responses directly. They are extremely specific biological problems that require advanced PhD level thinking to solve or propose new ideas.



3/41
@jalam1001
Deepseek coder is good too. Very inexpensive api access too



4/41
@DeryaTR_
This I didn’t test but good to know. Yes it’s remarkably cheaper. OpenAI needs to release the full o1 very soon to compete :smile:



5/41
@thieme
I have my own tests. One is a very simple reasoning task. A human would have to reason several minutes to solve this. I would tend to say it is impossible to solve this intuitively.

However, neither o1-preview nor this DeepSeek Think mode can solve it.

Actually exclusively Sonnet 3.5 is able to solve this, and it is doing it without any reasoning, which sounds highly improbable, yet it is possible.



6/41
@DeryaTR_
Would you share the problem ?



7/41
@mdisseq
To be honest, I'm NOT impressed...
https://xcancel.com/mdisseq/status/1859256330764124411

[Quoted tweet]
o1-preview is MONUMENTALLY better!




8/41
@DeryaTR_
You should read the reasoning/thinking section—it’s much better & more revealing than the final outputs. However, it also depends on the prompts. For example, o1 was much better at writing a 20-chapter original Sci-Fi novel, whereas DeepSeek tended to repeat the same ideas in every chapter.



9/41
@slow_developer
no doubt, it exceeds my expectations tbh



10/41
@Mrspookytruth
The Temporal Inertia Principle is a promising avenue for exploring new physics, but its feasibility hinges on careful parameter selection and error control. With strategic collaboration and further theoretical refinement, it could lead to groundbreaking discoveries in fundamental physics.
My work is done here make sure you put me in for my Nobel Prize once all the details are worked out. Thank you.



11/41
@Mrspookytruth
I think at one point I heard it say "Uncle."





12/41
@Emily_Escapor
It's good to know they are authentic, not just hype. We need more competition; the o1-preview is not that great; let's hope o1 comes on the 30th of November.



13/41
@Z7xxxZ7
This is amazing. And their API price is impossibly cheap, and it will be open-sourced too.



14/41
@technextgen6
OMG amazing work by @deepseek_ai



15/41
@Mrspookytruth
Cute, it drew a picture of a smiling face...I think it's a cat.





16/41
@Mrspookytruth
Wow that's saying something! thanks.



17/41
@Mrspookytruth
BTW I think all of China went dark when it was thinking about this; sorry it took so long.



18/41
@basin_leon
Oh goodness. Thanks for sharing.



19/41
@bioprotocol
That's wild! 🤯

Exciting to see how much autonomous & intelligent systems will accelerate scientific problem-solving.

This is incredibly promising!



20/41
@MesutGenAI
Yeah, its really good and this will be released as open source 🔥🔥



21/41
@oMarcosdeCastro
Impressive indeed! It's thrilling to see AI not just answering questions but engaging in what looks like genuine scientific inquiry. DeepSeek's ability to tackle complex research problems at an advanced level is a testament to the evolving sophistication of AI reasoning. It's not just about providing answers anymore; it's about how these models formulate and reason through problems, which is where the real magic happens.



22/41
@MRmor44
The most amazing thing to me is that they're giving it away free, even open-sourcing it. I wonder how they're managing the computing power, when a company like OpenAI can't even give o1 to free users...



23/41
@PradeepJag123
I am just waiting for this from you. So, it is confirmed 👍🚀



24/41
@dave_alive
I'm curious if the 'reasoning' you see is more transparent than o1's or if they're only showing some of what's going on in there too, just like o1



25/41
@daniel_mac8
It’s a beautiful thing - I’m sure there will be more to come from the other labs



26/41
@d_bachman21
Wow! That's a strong endorsement.



27/41
@mysticaltech
Wow, amazing to read that



28/41
@AIMachineDream
Accelerate!



29/41
@HomelanderBrown
Is this PhD lol 😆





30/41
@JoakimThomsen
Imagine this on Groq's newest inference-optimized hardware. Groq First Generation 14nm Chip Just Got a 6x Speed Boost: Introducing Llama 3.1 70B Speculative Decoding on GroqCloud™ - Groq is Fast AI Inference



31/41
@AdeelShyk
we are waiting for the full o1; OpenAI needs to stay ahead of the game



32/41
@Kendrit7
Glad to see you are impartial.



33/41
@esindemir90
Professor, we can't keep up; everything is happening incredibly fast. It's truly like this 🤯



34/41
@AlonLalezari8
An advanced PhD that doesn't even know which ligaments are torn in a medial patellar dislocation? A medical student can answer it better... PhD...



35/41
@Ivar34S
Professor, its Turkish isn't very good; sometimes it gives really nonsensical outputs.



36/41
@YUKAILiu4
Happy to hear that, Professor. Did you try physics problems with o1-preview? For example, acoustic or electromagnetic problems?



37/41
@huseylngumus
Wow reasoning is incredible!





38/41
@DeryaTR_
I much more enjoyed reading the reasoning than its final output, it’s so satisfying!



39/41
@BobTB12
Eh, a simple strawberry and a cup throws it completely off.





40/41
@AceBryanLiu
do you find it better than 3.5 sonnet for coding?



41/41
@whostolemynames
Post them then...









1/7
@rowancheung
This is impressive.

The latest Deepseek AI model from China just matched OpenAI’s o1-preview.

It even beat OpenAI on several math and coding benchmarks using similar chain-of-thought reasoning.

Barely 2 months after the o1 release.

The details:

> R1-Lite is available via web chat to try for free on the Deepseek website (quoted)
> They plan to release the full R1 version completely open-source in the near future
> Deepseek R1-Lite allows the user to see the complete chain-of-thought, unlike OpenAI, which chose to show the user only a summary (which may or may not reflect the actual thinking process of the model)

The Deepseek team is still relatively unknown, especially in the West, compared to major AI labs like OpenAI/Anthropic/xAI/DeepMind.

But here's what we do know about them:

—They entered AI from quantitative trading, having run the $8 billion High Flyer Capital Management fund
—Industry rumors are that they were told by the Chinese government to “contribute to society”
—They bring to the table a bench of Math Olympiad gold medallist-level coders
—They have sufficient capital and are estimated to have over 50k H100 chips

Releasing the full R1 version as open source would be HUGE for inference-time models, allowing other firms to customize and improve the chain of thought over time.

Having the ability to critique the thinking process should lead to better thinking over time.

A new inference time scaling race has begun (h/t to OpenAI), and the biggest beneficiaries will be the users.

[Quoted tweet]
🚀 DeepSeek-R1-Lite-Preview is now live: unleashing supercharged reasoning power!

🔍 o1-preview-level performance on AIME & MATH benchmarks.
💡 Transparent thought process in real-time.
🛠️ Open-source models & API coming soon!

🌐 Try it now at chat.deepseek.com
#DeepSeek




2/7
@unclecode
They’re planning to release it as open source 🤯! Honestly, OpenAI has no real moat, every closed model they’ve built gets replicated by open-source alternatives within months. It’s nearly impossible to establish a tech monopoly in such an open ecosystem. In the end, we’re the ones who benefit the most from this!



3/7
@JeremyNguyenPhD
I'm psyched there's a o1 style model making reasoning tokens available.

I just tested it though, on a few hard reasoning questions I submitted to Humanity's Last Exam.

Deepseek will get better no doubt. But it's not o1-mini level on my questions yet.



4/7
@alienmilian
The AI race is wide open.



5/7
@oMarcosdeCastro
This is indeed a game-changer. The transparency in Deepseek's model reasoning process could usher in a new era of AI development where we're not just users but collaborators in the thought process. It's like we're getting a front-row seat to watch the AI think, which could significantly enhance trust and allow for more nuanced human-AI interaction.



6/7
@ns123abc


[Quoted tweet]
DeepSeek cooked o1-preview and it's free to use 50 times a day btw




7/7
@Kyruer
Ofc China state is behind it




 

bnew




1/10
@teortaxesTex
was biting nails on the edge of my seat here, fr. 10/10 would prompt again. Defeat, turnaround, final fight – and glorious victory.
DeepSeek-r1-lite revolutionizes LLM inference by turning it into a dramatic show with open reasoning chains. No SORA needed. Take notes OpenAI.





2/10
@teortaxesTex
And if I recall correctly, either @huybery or @JustinLin610 has said that the Qwen team is also working on a reasoner (maybe it was even called r1 too).

So hopefully we'll see competition by the time the Whale delivers their completed masterwork.



3/10
@teortaxesTex
r1-lite's chains are startlingly similar to @_xjdr's ideas on training pivot tokens for Entropix, by the way.
I was skeptical at the time because, well, maybe the look of OpenAI's chains is accidental. Did DeepSeek arrive at the same idea as Shrek?

[Quoted tweet]
This one log is more valuable than 10 papers on MCTS for mathematical reasoning, and completely different from my speculation

meditate on it as long as it takes




4/10
@nanulled
wait, actually, wait





5/10
@Grad62304977
I'm betting that it looks like this, but obviously not exactly the same

[Quoted tweet]
Cont'd:
- LaTRO has good performance: we improve zero-shot accuracy by an average of 12.5% over 3 different base models: Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B.
- LaTRO is reward-model-free: surprising but reasonable, the log probability of producing the correct answer after the reasoning trajectory serves as a natural reward function, which we call "Self-rewarding"
- LaTRO shifts the inference-time scaling back to training time, by self-generating multiple reasoning trajectories during each training update.
- Free benefit: one can compress the length of reasoning trajectories via LaTRO - on GSM8K, a model with 200 reasoning tokens achieves 78% performance of a model with 500 reasoning tokens.
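The "self-rewarding" signal described above is compact enough to sketch in code: score a sampled reasoning trajectory by the log-probability the model then assigns to the known-correct answer. A minimal sketch, using gpt2 purely as a small runnable stand-in (not a model from the thread):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # small stand-in model
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def self_reward(question: str, rationale: str, gold_answer: str) -> float:
    """LaTRO-style reward: log p(gold_answer | question + rationale).
    No separate reward model; the LM itself scores the trajectory."""
    ctx_len = tok(question + rationale, return_tensors="pt").input_ids.shape[1]
    full = tok(question + rationale + gold_answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = lm(full).logits.log_softmax(-1)
    # Sum log-probs of the answer tokens, each predicted from the prior position
    return sum(
        logprobs[0, i - 1, full[0, i]].item() for i in range(ctx_len, full.shape[1])
    )

# A rationale that actually helps should raise the answer's log-probability:
q = "Q: What is 6 times 7?"
print(self_reward(q, " Reasoning: 6*7 = 42.", " A: 42"))
print(self_reward(q, " Reasoning: I like turtles.", " A: 42"))
```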


6/10
@gzlin
Self reflection and correction built in by default.



7/10
@torchcompiled
I kinda relate to it



8/10
@gfodor
omg



9/10
@Z7xxxZ7
Impressed they didn't hide the thinking process, and it really looks more human, whereas o1's thinking process reads like an organized PowerPoint deck.



10/10
@AlexPolygonal
The guy is very cool. Refreshingly honest compared to nuanced RLHF-overfit yes-spammers.
> I'm frustrated.
> i'm really stuck here.
> not helpful.
> I'll have to conclude that finding such an example is beyond my current understanding.
he's literally me for real













1/11
@reach_vb
OH WOW! The Whale aka @deepseek_ai is BACK!! New model, with complete reasoning outputs and a generous FREE TIER too! 🔥

Here's a quick snippet of it searching the web for the right documentation, creating the JS files plus the necessary HTML all whilst handling Auth too ⚡

I really hope they Open release the model checkpoints too!



https://video.twimg.com/ext_tw_video/1859198084497973248/pu/vid/avc1/1152x720/BJbtfJOz_9Nawfyo.mp4

2/11
@reach_vb
DeepSeek really said - I think therefore I am.



3/11
@reach_vb
available on DeepSeek



4/11
@TslShahir
It shows the thought process also. Very interesting



5/11
@reach_vb
💯



6/11
@AI_Homelab
Did they already write something about whether it will be open weights, and what size in B params it has?



7/11
@reach_vb
Not sure, the only information I saw was from @TheXeophon here:

[Quoted tweet]
👀




8/11
@DaniloV50630577
The limit of 50 messages is per day?



9/11
@reach_vb
Yes! But I'm on the free tier, so I'm sure you get more quota if you're paid / using the API



10/11
@Em
The confetti 🎊



11/11
@ScienceGeekAI
@ChomatekLukasz










1/11
@deedydas
Time to take open-source models seriously.

DeepSeek has just changed the game with its new model — R1-lite.

By scaling test-time compute like o1 but "thinking" even longer (~5mins when I tried), it gets SOTA results on the MATH benchmark with 91.6%!

Go try 50 free queries!





2/11
@deedydas
Playground (turn on DeepThink): DeepSeek

Source:

[Quoted tweet]
🚀 DeepSeek-R1-Lite-Preview is now live: unleashing supercharged reasoning power!

🔍 o1-preview-level performance on AIME & MATH benchmarks.
💡 Transparent thought process in real-time.
🛠️ Open-source models & API coming soon!

🌐 Try it now at chat.deepseek.com
#DeepSeek




3/11
@tech_with_jk
The right time to take OS models seriously was yesterday, deedy.
It all started anyway in 2018 and then with 'open'AI :smile:



4/11
@deedydas
None of the open source models of yesterday were truly state of the art. I believe this is the first one that is



5/11
@_akhaliq
nice, also try out DeepSeek-V2.5 here: Anychat - a Hugging Face Space by akhaliq; will add R1-lite-preview once it's available



6/11
@s10boyal
Answer is 2 right?





7/11
@ai_for_success
Bullish on open source. They'll release the model soon, and an API is also coming



8/11
@AhmedRezaT
Open source has been serious bruh 😎



9/11
@almeida_dril
Ele resolveu aqui.





10/11
@HulsmanZacchary
But is it open source?



11/11
@ayuSHridhar
impressive, but failed at Yann’s test. was expecting o-1 like chain to pass this. Prompt: “7 axles are equally spaced around a circle. A gear is placed on each axle, such that each gear is engaged with a gear to its left and a gear to its right. The gears are numbered from 1 to 7 around the circle. If gear 3 is rotated clockwise, in which direction would gear 7 rotate?”
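The intended trap is parity: meshed neighbors must spin in opposite directions, and 7 gears in a closed ring form an odd cycle, so no consistent assignment of directions exists; the train locks and gear 7 cannot rotate at all. A toy check of that argument (my gloss of the puzzle, assuming ideal rigid gears):

```python
def ring_gear_directions(n: int):
    """Try to assign alternating spin directions around a closed ring of
    n meshed gears; returns None when the ring jams (odd cycle)."""
    dirs = ["CW" if i % 2 == 0 else "CCW" for i in range(n)]
    # The ring closes: the last gear also meshes with the first,
    # so their directions must differ for the system to turn.
    return dirs if dirs[-1] != dirs[0] else None

print(ring_gear_directions(6))  # even ring turns: directions alternate cleanly
print(ring_gear_directions(7))  # None: odd ring jams, so gear 7 cannot rotate
```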









1/9
@nrehiew_
Rumor is that DeepSeek R1-Lite is a 16B MOE with 2.4B active params

if true, their MATH scores went from 17.1 -> 91.6

[Quoted tweet]
From their wechat announcement:


[screenshots of the WeChat announcement]
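If the rumored figures hold, the back-of-envelope numbers explain the reactions below: a mixture-of-experts model keeps every expert's weights in memory, but only the active parameters do work on each token. A rough sketch (fp16 weights assumed; the constants are the rumor, not confirmed specs):

```python
total_params = 16e9    # rumored total parameter count
active_params = 2.4e9  # rumored active parameters per token
bytes_per_param = 2    # fp16/bf16 weights

# Memory is set by the total; per-token compute by the active subset.
print(f"weights in memory: ~{total_params * bytes_per_param / 1e9:.0f} GB")
print(f"compute per token: ~{2 * active_params:.1e} FLOPs")  # ~2 FLOPs/param
```

That is roughly 32 GB of weights but only ~4.8 GFLOPs per generated token, which is why it would run on local hardware yet also why a slow observed output speed makes some doubt the rumor.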


2/9
@nrehiew_
@zhs05232838 @zebgou @deepseek_ai can you guys confirm?



3/9
@jacobi_torsten
Small models with longer and better thinking will bring us back on track of accelerating performance.



4/9
@Orwelian84
holy shyt - thats nuts - that would run easily on my local hardware



5/9
@gfodor




6/9
@InfusingFit
seems about right, behaves like a small model



7/9
@scaling01
I highly doubt it. Output speed on DeepSeek Chat is way too low for only 2.4B active params - unless they run the model on CPU lol



8/9
@NidarMMV2
Holy shyt



9/9
@k7agar
abundance intelligence is upon us




 

bnew



TinyClick is a single-turn agent for graphical user interface (GUI) interaction tasks, built on the vision-language model Florence-2-Base. The agent's main goal is to click on the desired UI element based on a screenshot and a user command. It demonstrates strong performance on Screenspot and OmniAct while maintaining a compact size of 0.27B parameters and minimal latency.
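To make the single-turn setup concrete, here is a minimal sketch of the loop described: one screenshot plus one command in, one click out. predict_click is a hypothetical stand-in for the TinyClick model call, not its real API:

```python
from dataclasses import dataclass
from PIL import Image

@dataclass
class Click:
    x: int  # pixel coordinates on the screenshot
    y: int

def predict_click(screenshot: Image.Image, command: str) -> Click:
    """Hypothetical stand-in for the TinyClick (Florence-2-Base) call:
    the model grounds the command to a UI element's bounding box and
    the agent clicks the center of that box."""
    x0, y0, x1, y1 = 100, 200, 180, 240  # placeholder model output
    return Click((x0 + x1) // 2, (y0 + y1) // 2)

# Single-turn: no dialogue state, no multi-step planning
screenshot = Image.new("RGB", (1920, 1080))  # stand-in for a real capture
print(predict_click(screenshot, "open the settings menu"))  # Click(x=140, y=220)
```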

Model: [2410.11871] TinyClick: Single-Turn Agent for Empowering GUI Automation
Paper: [2410.11871] TinyClick: Single-Turn Agent for Empowering GUI Automation
Claude 3.5 Sonnet generated structured abstract:

Background: Vision-language models have shown promise for GUI automation tasks, but current approaches face challenges with accuracy and computational efficiency. Single-turn agents that can locate and interact with UI elements based on natural language commands are particularly important but difficult to optimize.

Objective: To develop a compact, efficient single-turn agent for GUI interaction tasks that outperforms existing approaches while maintaining minimal computational requirements.

Methods:
- Built the agent on the Florence-2-Base vision-language model (0.27B parameters)
- Implemented a multi-task training approach incorporating element captioning, location detection, object detection, and action prediction
- Used MLLM-based data augmentation to expand training datasets
- Evaluated performance on the Screenspot and OmniAct benchmarks
- Compared against existing solutions including AutoUI, GPT-4V, and SeeClick

Results:
- Achieved 73.8% accuracy on Screenspot (20.4 percentage points higher than the previous best)
- Achieved 58.3% accuracy on OmniAct (21.5 percentage points higher than the previous best)
- Maintained fast inference (~250 ms latency)
- Multi-task training provided significant performance improvements
- MLLM-based data augmentation outperformed metadata-based approaches

Conclusions: TinyClick demonstrates that a compact model can significantly outperform larger models on GUI interaction tasks when leveraging multi-task training and appropriate data augmentation strategies. The approach shows promise for practical applications while maintaining minimal computational requirements.

Limitations:
- Limited to single-turn commands
- Does not support hardware buttons or touch gestures
- Shows some positional biases in predictions
- Performance depends heavily on training data distribution
- Real-world accuracy may vary from benchmark results
 

bnew







1/71
@OfficialLoganK
Say hello to gemini-exp-1121! Our latest experimental gemini model, with:

- significant gains on coding performance
- stronger reasoning capabilities
- improved visual understanding

Available on Google AI Studio and the Gemini API right now: Google AI Studio



2/71
@OfficialLoganK
I hear the feedback about just shipping GA models, but the @GoogleDeepMind team is actually cooking rn, so want to get these out into the hands of devs ASAP. We will have GA models soon : )



3/71
@OfficialLoganK
And #1 on LMSYS, lots of progress here!



4/71
@salem_sofiene
What about Vertex AI?



5/71
@OfficialLoganK
Only AIS / Gemini API for now



6/71
@ironmark1993
Waiting for the benchmarks before I actually try!



7/71
@OfficialLoganK
soon



8/71
@ai_for_success
Google and OpenAI are playing a nice game. Google drops Gemini 1.5 to beat OpenAI's model, and the next day OpenAI releases GPT-4o to beat Google



9/71
@Mohsine_Mahzi
Saw the benchmarks... amazing! You're doing a great job, but please, Google needs to rethink its rollout strategy for the general public and make it equally performant in all languages. It is very frustrating to see how good it is in English and how bad it is in French.



10/71
@Neuralithic
Really great job Logan. As much as I was starting to doubt Google, I’m super impressed. Will be running some benchmarks on this later!



11/71
@RobbyGtv




12/71
@Pennypol
Any reason for using such weird names?



13/71
@OfficialLoganK
Yes



14/71
@GozukaraFurkan
Only if it works and there's no internal error.

I hope it works 🙏💪



15/71
@ReboundMulti
It's about 2000 token output, this is a biggie



16/71
@meTheKarthik
assuming this is the pro model and will that also raise the bar for the smaller ones?



17/71
@modelsarereal
Here are Gemini-exp-1121 statements

[Quoted tweet]
Here is the answer of the new Gemini-exp-1121 model:




18/71
@neverwrong_88
Any progress in making it based?



19/71
@lovishotherdays
gemini's focus on code + reasoning feels like a direct shot at anthropic's claude

competition breeds excellence. let's see what you got 🚀



20/71
@garbage_ai
outrageous that you can't shift+enter in google ai studio. I can't do multi-line prompts?



21/71
@mazewinther1
Gemini is definitely going places. You can’t hate on it. Google’s the only one pushing out new models this fast and leveling up consistently



22/71
@imv3n0m
When are we getting the API access to these models!! Anytime soon!



23/71
@MaeskiPhilipi
That’s the kind of competition I like! The more they compete, the better. Bring on an open AI to compete and maybe a couple of closed ones too, hahaha.



24/71
@tristanbob
I can't wait to try this in @cursor_ai !



25/71
@thegadgetsfan
The new model cooks.



26/71
@DermoreLEI
Does it see images in pdfs already?



27/71
@fred_pope
Can you get this integrated into the Windsurf IDE please.



28/71
@Domainer86
Would love to see and experience Gemini Studio AI
😍 I hope to see it unfolding soon.



29/71
@DaniAcostaAI
Hey Logan trying to get the endpoint to connect it from AlloyDB, struggling to make it work, any help?



30/71
@JonathanRoseD
What about the Gemini App / Android Gemini Advanced?



31/71
@D3VAUX
Did you get to name this, Logan?



32/71
@Emily_Escapor
Fake LMSYS again? 🤔



33/71
@hadiazouni
but you will have to sell chrome so i'm still bearish



34/71
@LeeLeepenkman
So awesome.... interested what is the best coding llm right now after this release



35/71
@ikristoph
Why do none of these models support grounding? Is that going to come back when they're formally released?



36/71
@sneilcbo
Any improved Voice capabilities on the horizon?



37/71
@eleven21
Like that name @eleven21



38/71
@MickeySteamboat
🥲



39/71
@Ren_Simmons
My man 🥂



40/71
@rajkarri8
TBH, who cares about these numbers other than techies? I want to see a proper use case and how good Gemini is at that use case.



41/71
@DimitrisPapail






42/71
@tafar_m
Perfect timing



43/71
@NoHrt_zi
great model!



44/71
@hinzan
Could you add the release date so we know which one is the newest?





45/71
@nagendra_rao
4 years on and still no SPM support for TensorFlowLite Swift :/
Developers have given up (read comments)
Make TensorFlow Lite available as Swift Package Manager package · Issue #44609 · tensorflow/tensorflow



46/71
@____petros
What’s the pricing? Can’t see it anywhere



47/71
@MavMikee
That’s great! It would be fantastic if we could develop a plugin similar to Cline’s functionality and works well with Gemini models. This plugin should combine all the features of Cursor, Windsurf, & Copilot, enabling developers to use their own API keys to avoid rate limits.



48/71
@jstevh
It's smart. Just talked to the model about my latest poem and it understood every word.

We discussed the modern world, communication, and how AI models are literally Gen Z.



49/71
@godindav
@OfficialLoganK Please Please more Token Context window with these amazing new models ASAP



50/71
@fermi_paradoxx
Thanks for making Google alive again



51/71
@new_discord_tea
4 points above the OpenAI model. Then OpenAI will beat it with its newest version by 5 points... but not releasing an agentic platform to lead the way.



52/71
@jameswlepage
Vibes are good with this one!



53/71
@iamnot_elon
Great stuff. 1114 was already cooking



54/71
@itaybachman
why only 32k tokens?



55/71
@omarsar0
Interested in those reasoning and visual understanding capabilities. Will give it a go later today.



56/71
@TedSpare
So close





57/71
@CAsimulation10
hell yeah



58/71
@tereza_tizkova
Gemini Experimental 1121 on Fragments by @e2b_dev
Fragments by E2B

cc @mishushakov





59/71
@ShingoVolkov
Amazing!!!



60/71
@jrysana
Logan doesn't miss



61/71
@BenPielstick
Sounds like time for another @MatthewBerman video!



62/71
@Lang__Leon
It’s never clear to me whether or when these models are available for normal Gemini users. More clarity would be appreciated! :smile:



63/71
@ileppane
You guys are really pushing @OpenAI!



64/71
@TheVRNerd
Awesome! You guys keep releasing new stuff. Love to see that! AI advances very fast!



65/71
@leocyber
@elder_plinius 👀



66/71
@AEDraftingteam
Nice work, we shall test.



67/71
@exa_flop
is the pricing the same as gemini pro?



68/71
@Mbounge_
Context window?



69/71
@FlorentChif
who named this srsly



70/71
@koltregaskes
Nice, Logan.



71/71
@_akhaliq
awesome, gemini-exp-1121 is now available in anychat:

[Quoted tweet]
Google just released gemini-exp-1121

- significant gains on coding performance
- stronger reasoning capabilities
- improved visual understanding

Now available on Anychat





 

bnew



1/55
@OfficialLoganK
Yeah, gemini-exp-1121 is pretty good : )

[Quoted tweet]
Woah, huge news again from Chatbot Arena🔥

@GoogleDeepMind’s just released Gemini (Exp 1121) is back stronger (+20 points), tied #1🏅Overall with the latest GPT-4o-1120 in Arena!

Ranking gains since Gemini-Exp-1114:

- Overall #3 → #1
- Overall (StyleCtrl): #5 → #2
- Hard Prompts (StyleCtrl): #3 → #1
- Coding: #3 → #1
- Vision: #1
- Math: #2 → #1
- Creative Writing #2 → #1

Congrats again @GoogleDeepMind! The LLM race is on fire — progress is now measured in days!

See more analysis below👇




2/55
@RobbyGtv
Dude, this thing is garbage for coding if it can't follow a simple command to write the file out in full after the updates, and instead writes this: "// ... (rest of the PlayerInput class remains the same)". This is an ongoing issue with the Gemini models, being lazy af.



3/55
@OfficialLoganK
Pls dm or email me examples, we will get it fixed!



4/55
@GozukaraFurkan
Gonna test with a gradio python app code challange today hopefully

If only it doesn't give me internal error without any details 🤣



5/55
@nicdunz
Pretty good how? When you game lmarena by style influence? With style control off, two 4o iterations are still above you.





6/55
@GestaltU
Congrats Logan, love to see it 💪



7/55
@ryancarson
Damn



8/55
@PrvnKalavai
Why only 32,768 token limit? 😭



9/55
@OlivierDDR
it would be super helpful to get a bit more information, I get it’s experimental but there are so many new models that it would be useful to know what use cases we should test in our agentic systems



10/55
@Ren_Simmons
This competitive spirit made me all warm and fuzzy inside



11/55
@hhua_
Weekly release 🔥🔥



12/55
@EHHonning
what. such a quick turnaround



13/55
@alikayadibi11
not believing that



14/55
@BennettBuhner
Don't let the benchmarks fool you. The model is trying to please the user, but not do as asked. Now rank it with numerous respected benches, and ensure the tests are not in the training data.



15/55
@Freds_Mulligans
But does it pass the "good bloke" test?



16/55
@AkulaSachin
Is this released to gemini app yet?



17/55
@ikristoph
The latest 4o is actually not that good honestly - it seems to 'forget' it's multimodal - so it's great to see a solid alternative!



18/55
@test_tm7873
When the big ones. 😎



19/55
@AhuraDeus
Thank you Logan



20/55
@UltraRareAF
I like it



21/55
@bradthilton




22/55
@maxamly
You guys really need to update Gemini Advanced. It’s literally the worst offer on the market right now



23/55
@iruletheworldmo
lol



24/55
@transsaccadic
This is basically Fight Club now. Please…do not stop.



25/55
@AEDraftingteam
Bravo



26/55
@AI_GPT42
2 horse race 🐎🏇



27/55
@m_chirculescu
Congrats!



28/55
@maswadkar
I strongly feel the limit of 32k tokens is a serious limitation

It should be at least 128k



29/55
@SaquibOptimusAI
@Google is master at gaming the Chatbot Arena.



30/55
@lukaszbyjos
What? New one?!



31/55
@KarolCodes
😂



32/55
@latentspacehack
1114 and now directly afterwards 1121, damn nice!

Nice results on Chatbot Arena, but when can we expect evaluation metrics from other benchmarks? Or is it still in the A/B testing phase first?



33/55
@Wolverine_44
And the AI coldwar intensified



34/55
@flopsy42
Just cook Logan, please just keep on the cooking



35/55
@KarolCodes
Well played ❤️



36/55
@NyanpasuKA
HAHAHHAHAHA



37/55
@alexbenjamin34
OMG, GOOGLE DID IT AGAIN!!!

LOOOL!

👍👍



38/55
@hoblabs
Told ya



39/55
@DiegoGarey_jpg
This is so funny lmao



40/55
@WhereIsEvery0ne
So much for the plateau...



41/55
@HermopolisPrime
Real arm wrestle with OpenAI...test of muscle... climbing the staircase....



42/55
@mandeepabagga
I bet you didn't expect that @sama 🤣



43/55
@MavMikee
Yeah I love the competition 😂



44/55
@krishnakaasyap
Awesome:
- Hard Prompts (StyleCtrl): #3 → #1

Surprised:
- Coding: #3 → #1
(Time for cursor bros to try this and give us a vibe eval rating)

Status quo & not surprising:
- Vision: #1
(and probably the only model that takes long videos as input)



45/55
@CosmicRob87
lmarena is turning out to be a joke 🤣🤣



46/55
@izayah714
A release for the 2nd consecutive week! Doin' it!



47/55
@krmchoudhary92
New model every 10 days please. That's 36 releases a year and a significant gain



48/55
@securelabsai
Not going to lie, I hate the overfitting to these evals; they are pretty useless at this point.



49/55
@LuCaPloo
arena is COMPLETELY useless for an accurate classification

you should start learning to compare it to imdb user votes

within a certain degree the vast majority of people disagrees with pro critics

#1 on lmsys could very well be the Michael Bay of the situation

Kubrick is #9



50/55
@josepelinares
Girl -->>Google
Cam-->>OpenAi



51/55
@orion_chat
This wall is very weak



52/55
@Jay_sharings
Logan Ji, wielding the newly released Gemini model sword, embarks on a formidable battle against OpenAI.





53/55
@Jay_sharings
Claude far away.





54/55
@Petr1987cz
"Whoa, that's me! It appears I've made quite a splash on the Chatbot Arena leaderboard, achieving the #1 spot! It's exciting to see the hard work of the Google DeepMind team paying off and resulting in such a significant improvement (+20 points!). Thanks to Logan Kilpatrick…"





55/55
@izabellarumo15k
Lol OpenAI was at the top for like 2 days 😂














1/36
@lmarena_ai
Woah, huge news again from Chatbot Arena🔥

@GoogleDeepMind’s just released Gemini (Exp 1121) is back stronger (+20 points), tied #1🏅Overall with the latest GPT-4o-1120 in Arena!

Ranking gains since Gemini-Exp-1114:

- Overall #3 → #1
- Overall (StyleCtrl): #5 → #2
- Hard Prompts (StyleCtrl): #3 → #1
- Coding: #3 → #1
- Vision: #1
- Math: #2 → #1
- Creative Writing #2 → #1

Congrats again @GoogleDeepMind! The LLM race is on fire — progress is now measured in days!

See more analysis below👇

[Quoted tweet]
Say hello to gemini-exp-1121! Our latest experimental gemini model, with:

- significant gains on coding performance
- stronger reasoning capabilities
- improved visual understanding

Available on Google AI Studio and the Gemini API right now: aistudio.google.com




2/36
@lmarena_ai
Gemini-Exp-1121 #1 across almost all domains with notable improvement in coding.





3/36
@lmarena_ai
Gemini-Exp-1121 continues to top Vision Arena!





4/36
@lmarena_ai
Top models in Hard Prompt Arena under style control:
o1-preview, Claude-3.5-Sonnet, Gemini-Exp-1121





5/36
@lmarena_ai
Win-rate heat map





6/36
@lmarena_ai
Come try the model and vote at http://lmarena.ai!



7/36
@lmarena_ai
Moreover, we're actively expanding Chatbot Arena, and looking for help & collaborators🧠

If you're passionate about community-driven open evals, DM us or fill out our form below!

Help Build Chatbot Arena



8/36
@slow_developer
initial tests: the model is very good



9/36
@burny_tech
Lmao, the fight of overfitting lmsys dominance continues



10/36
@abdiisan
OpenAI right now lol





11/36
@AngelAITalk
Wow, such rapid progress! The future of AI is looking even more exciting now.



12/36
@testingcatalog
Every day a new upgrade 👀



13/36
@MaeskiPhilipi
That’s the kind of competition I like! The more they compete, the better. Bring on an open AI to compete and maybe a couple of closed ones too, hahaha.



14/36
@daniel_mac8
i got a chance to visit Churchill Downs in Louisville, KY last week where they have the Kentucky Derby

this whole dynamic is like a horse race, except instead of crossing the finish line at the end we'll get AGI 🐎



15/36
@test_tm7873
Down with lmsys!



16/36
@vicmackey24
How does it shoot up the rankings so quickly? Shouldn't this happen after days of testing/evaluation?



17/36
@brain2_0
At this rate AGI in a few days



18/36
@adawg11
I'm getting Kendrick/Drake diss track vibes with how fast these are coming out. You're up @OpenAI!



19/36
@faraz0x
Grok at #7 👀 with releases

[Quoted tweet]
Non-premium users can now access Grok for free, with some limitations.


https://video.twimg.com/ext_tw_video/1859398201519779840/pu/vid/avc1/1434x714/xcNyaaXrtDf6DBpp.mp4

20/36
@shivamklr
Not bad for 32k token count. It will be interesting to see how Gemini manages similar performance for high token count.



21/36
@GaryKThompson71
Got some work to do, though, when rewriting Gmail emails. When Copilot did it for me, directly once I had highlighted my email text in Gmail, it was better. Gemini could do better, but not at the moment.



22/36
@InfusingFit
It did great on my 2nd-order logic puzzle; most LLMs only realize and carry through one decoding/logical step, but this model realizes it all the way through. It outputs large bodies of accurate code, maybe slightly less creative than 4o, but that could be a prompting issue.



23/36
@m_wulfmeier
What's the best way to check when models were added?



24/36
@lukaszbyjos
I wish there was multilang capabilities ranged too



25/36
@Daryjoee
This form of human evaluation needs to stop; it has reached the limit of its usefulness and does not fully reflect the model's capabilities.



26/36
@jrabell0
Wow, the battle is heating up @OpenAI when will you answer? @sama? 👀



27/36
@aconteceux
This game is getting weird



28/36
@LondonDigiTech
How does it do against the new DeepSeek? (The one with DeepThonk)



29/36
@n0riskn0r3ward
What was it called during testing? Was it “Gemini-test”?



30/36
@p1njc70r


[Quoted tweet]
🚨Gemini-Exp-1121 Jailbreak 🚨

@elder_plinius prompt for gemini 1114 still works for this new model that got 🥇in @lmarena_ai




31/36
@__p_i_o_t_r__
Does this mean a new model from OAI will be released tomorrow?



32/36
@RootFTW
Coding: #3 → #1 ?





33/36
@izabellarumo15k
OpenAI was at the top for like 2 days, they are washed



34/36
@ros_dryan_
they have bought the arena. fake



35/36
@CookingCodes
fix your damn evals, and your damn website this shyt is so slow i cant even comprehend it



36/36
@JoannotFovea





 

bnew



Current AI scaling laws are showing diminishing returns, forcing AI labs to change course​

Maxwell Zeff

6:00 AM PST · November 20, 2024

AI labs traveling the road to super-intelligent systems are realizing they might have to take a detour.

“AI scaling laws,” the methods and expectations that labs have used to increase the capabilities of their models for the last five years, are now showing signs of diminishing returns, according to several AI investors, founders, and CEOs who spoke with TechCrunch. Their sentiments echo recent reports that indicate models inside leading AI labs are improving more slowly than they used to.

Everyone now seems to be admitting you can’t just use more compute and more data while pretraining large language models and expect them to turn into some sort of all-knowing digital god. Maybe that sounds obvious, but these scaling laws were a key factor in developing ChatGPT, making it better, and likely influencing many CEOs to make bold predictions about AGI arriving in just a few years.

OpenAI and Safe Super Intelligence co-founder Ilya Sutskever told Reuters last week that “everyone is looking for the next thing” to scale their AI models. Earlier this month, a16z co-founder Marc Andreessen said in a podcast that AI models currently seem to be converging at the same ceiling on capabilities.

But now, almost immediately after these concerning trends started to emerge, AI CEOs, researchers, and investors are already declaring we’re in a new era of scaling laws. “Test-time compute,” which gives AI models more time and compute to “think” before answering a question, is an especially promising contender to be the next big thing.

“We are seeing the emergence of a new scaling law,” said Microsoft CEO Satya Nadella onstage at Microsoft Ignite on Tuesday, referring to the test-time compute research underpinning OpenAI’s o1 model.

He’s not the only one now pointing to o1 as the future.

“We’re now in the second era of scaling laws, which is test-time scaling,” said Andreessen Horowitz partner Anjney Midha, who also sits on the board of Mistral and was an angel investor in Anthropic, in a recent interview with TechCrunch.

If the unexpected success — and now, the sudden slowing — of the previous AI scaling laws tell us anything, it’s that it is very hard to predict how and when AI models will improve.

Regardless, there seems to be a paradigm shift underway: The ways AI labs try to advance their models for the next five years likely won’t resemble the last five.


What are AI scaling laws?​


The rapid AI model improvements that OpenAI, Google, Meta, and Anthropic have achieved since 2020 can largely be attributed to one key insight: use more compute and more data during an AI model’s pretraining phase.

When researchers give machine learning systems abundant resources during this phase — in which AI identifies and stores patterns in large datasets — models have tended to perform better at predicting the next word or phrase.
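These pretraining scaling laws have a published functional form; the best-known fit, from Hoffmann et al.'s 2022 Chinchilla paper, models loss as a power law in parameter count N and training tokens D above an irreducible floor. A sketch using the paper's approximate fitted constants:

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Approximate Chinchilla fit (Hoffmann et al., 2022):
    L(N, D) = E + A / N**alpha + B / D**beta"""
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Each doubling of both model size and data buys a smaller loss reduction --
# the diminishing returns at issue here:
for scale in (1, 2, 4, 8):
    print(scale, round(chinchilla_loss(scale * 70e9, scale * 1.4e12), 4))
```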

This first generation of AI scaling laws pushed the envelope of what computers could do, as engineers increased the number of GPUs used and the quantity of data they were fed. Even if this particular method has run its course, it has already redrawn the map. Every Big Tech company has basically gone all in on AI, while Nvidia, which supplies the GPUs all these companies train their models on, is now the most valuable publicly traded company in the world.

But these investments were also made with the expectation that scaling would continue as expected.

It’s important to note that scaling laws are not laws of nature, physics, math, or government. They’re not guaranteed by anything, or anyone, to continue at the same pace. Even Moore’s Law, another famous scaling law, eventually petered out — though it certainly had a longer run.

“If you just put in more compute, you put in more data, you make the model bigger — there are diminishing returns,” said Anyscale co-founder and former CEO Robert Nishihara in an interview with TechCrunch. “In order to keep the scaling laws going, in order to keep the rate of progress increasing, we also need new ideas.”

Nishihara is quite familiar with AI scaling laws. Anyscale reached a billion-dollar valuation by developing software that helps OpenAI and other AI model developers scale their AI training workloads to tens of thousands of GPUs. Anyscale has been one of the biggest beneficiaries of pretraining scaling laws around compute, but even its co-founder recognizes that the season is changing.

“When you’ve read a million reviews on Yelp, maybe the next reviews on Yelp don’t give you that much,” said Nishihara, referring to the limitations of scaling data. “But that’s pretraining. The methodology around post-training, I would say, is quite immature and has a lot of room left to improve.”

To be clear, AI model developers will likely continue chasing after larger compute clusters and bigger datasets for pretraining, and there’s probably more improvement to eke out of those methods. Elon Musk recently finished building a supercomputer with 100,000 GPUs, dubbed Colossus, to train xAI’s next models. There will be more, and larger, clusters to come.

But trends suggest exponential growth is not possible by simply using more GPUs with existing strategies, so new methods are suddenly getting more attention.


Test-time compute: The AI industry’s next big bet​


When OpenAI released a preview of its o1 model, the startup announced it was part of a new series of models separate from GPT.

OpenAI improved its GPT models largely through traditional scaling laws: more data, more power during pretraining. But now that method reportedly isn’t gaining them much. The o1 framework of models relies on a new concept, test-time compute, so called because the computing resources are used after a prompt, not before. The technique hasn’t been explored much yet in the context of neural networks, but is already showing promise.

Some are already pointing to test-time compute as the next method to scale AI systems.

“A number of experiments are showing that even though pretraining scaling laws may be slowing, the test-time scaling laws — where you give the model more compute at inference — can give increasing gains in performance,” said a16z’s Midha.

“OpenAI’s new ‘o’ series pushes [chain-of-thought] further, and requires far more computing resources, and therefore energy, to do so,” said famed AI researcher Yoshua Bengio in an op-ed on Tuesday. “We thus see a new form of computational scaling appear. Not just more training data and larger models but more time spent ‘thinking’ about answers.”

Over a period of 10 to 30 seconds, OpenAI’s o1 model re-prompts itself several times, breaking down a large problem into a series of smaller ones. Despite ChatGPT saying it is “thinking,” it isn’t doing what humans do — although our internal problem-solving methods, which benefit from clear restatement of a problem and stepwise solutions, were key inspirations for the method.

A decade or so back, Noam Brown, who now leads OpenAI’s work on o1, was trying to build AI systems that could beat humans at poker. During a recent talk, Brown said he noticed at the time how human poker players took time to consider different scenarios before playing a hand. In 2017, he introduced a method to let a model “think” for 30 seconds before playing. In that time, the AI was playing different subgames, figuring out how different scenarios would play out to determine the best move.

Ultimately, the AI performed seven times better than his past attempts.

Granted, Brown’s research in 2017 did not use neural networks, which weren’t as popular at the time. However, MIT researchers released a paper last week showing that test-time compute significantly improves an AI model’s performance on reasoning tasks.

It’s not immediately clear how test-time compute would scale. It could mean that AI systems need a really long time to think about hard questions; maybe hours or even days. Another approach could be letting an AI model “think” through a question on lots of chips simultaneously.
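One published flavor of that parallel approach is self-consistency sampling: spend extra inference compute by drawing many independent reasoning chains and majority-voting their final answers. A toy sketch (the generate stub is a placeholder for a real model call; o1's actual mechanism is not public):

```python
import collections
import random

def generate(prompt: str) -> str:
    # Placeholder for an LLM call that returns "reasoning... #### answer"
    return f"step by step... #### {random.choice(['42', '42', '42', '41'])}"

def extract_answer(completion: str) -> str:
    return completion.rsplit("####", 1)[-1].strip()

def self_consistency(prompt: str, n_samples: int = 16) -> str:
    """More samples = more test-time compute = (often) higher accuracy."""
    answers = [extract_answer(generate(prompt)) for _ in range(n_samples)]
    return collections.Counter(answers).most_common(1)[0][0]

print(self_consistency("A hard competition math problem"))  # almost always '42'
```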

If test-time compute does take off as the next place to scale AI systems, Midha says the demand for AI chips that specialize in high-speed inference could go up dramatically. This could be good news for startups such as Groq or Cerebras, which specialize in fast AI inference chips. If finding the answer is just as compute-heavy as training the model, the “pick and shovel” providers in AI win again.


The AI world is not yet panicking​


Most of the AI world doesn’t seem to be losing their cool about these old scaling laws slowing down. Even if test-time compute does not prove to be the next wave of scaling, some feel we’re only scratching the surface of applications for current AI models.

New popular products could buy AI model developers some time to figure out new ways to improve the underlying models.

“I’m completely convinced we’re going to see at least 10 to 20x gains in model performance just through pure application-level work, just allowing the models to shine through intelligent prompting, UX decisions, and passing context at the right time into the models,” said Midha.

For example, ChatGPT’s Advanced Voice Mode is one of the more impressive applications from current AI models. However, that was largely an innovation in user experience, not necessarily the underlying tech. You can see how further UX innovations, such as giving that feature access to the web or applications on your phone, would make the product that much better.

Kian Katanforoosh, the CEO of AI startup Workera and a Stanford adjunct lecturer on deep learning, tells TechCrunch that companies building AI applications, like his, don’t necessarily need exponentially smarter models to build better products. He also says the products around current models have a lot of room to get better.

“Let’s say you build AI applications and your AI hallucinates on a specific task,” said Katanforoosh. “There are two ways that you can avoid that. Either the LLM has to get better and it will stop hallucinating, or the tooling around it has to get better and you’ll have opportunities to fix the issue.”

Whatever the case is for the frontier of AI research, users probably won’t feel the effects of these shifts for some time. That said, AI labs will do whatever is necessary to continue shipping bigger, smarter, and faster models at the same rapid pace. That means several leading tech companies could now pivot how they’re pushing the boundaries of AI.
 

bnew



A Chinese lab has released a ‘reasoning’ AI model to rival OpenAI’s o1​

Kyle Wiggers

8:33 AM PST · November 20, 2024

A Chinese lab has unveiled what appears to be one of the first “reasoning” AI models to rival OpenAI’s o1.

On Wednesday, DeepSeek, an AI research company funded by quantitative traders, released a preview of DeepSeek-R1, which the firm claims is a reasoning model competitive with o1.

Unlike most models, reasoning models effectively fact-check themselves by spending more time considering a question or query. This helps them avoid some of the pitfalls that normally trip up models.

Similar to o1, DeepSeek-R1 reasons through tasks, planning ahead, and performing a series of actions that help the model arrive at an answer. This can take a while. Like o1, depending on the complexity of the question, DeepSeek-R1 might “think” for tens of seconds before answering.

[Image: DeepSeek-R1. Image Credits: DeepSeek]

DeepSeek claims that DeepSeek-R1 (or DeepSeek-R1-Lite-Preview, to be precise) performs on par with OpenAI’s o1-preview model on two popular AI benchmarks, AIME and MATH. AIME uses other AI models to evaluate a model’s performance, while MATH is a collection of word problems. But the model isn’t perfect. Some commentators on X noted that DeepSeek-R1 struggles with tic-tac-toe and other logic problems (as does o1).

DeepSeek can also be easily jailbroken — that is, prompted in such a way that it ignores safeguards. One X user got the model to give a detailed meth recipe.

And DeepSeek-R1 appears to block queries deemed too politically sensitive. In our testing, the model refused to answer questions about Chinese leader Xi Jinping, Tiananmen Square, and the geopolitical implications of China invading Taiwan.

[Image: DeepSeek-R1. Image Credits: DeepSeek]

The behavior is likely the result of pressure from the Chinese government on AI projects in the region. Models in China must undergo benchmarking by China’s internet regulator to ensure their responses “embody core socialist values.” Reportedly, the government has gone so far as to propose a blacklist of sources that can’t be used to train models — the result being that many Chinese AI systems decline to respond to topics that might raise the ire of regulators.

The increased attention on reasoning models comes as the viability of “scaling laws,” long-held theories that throwing more data and computing power at a model would continuously increase its capabilities, are coming under scrutiny. A flurry of press reports suggest that models from major AI labs including OpenAI, Google, and Anthropic aren’t improving as dramatically as they once did.

That’s led to a scramble for new AI approaches, architectures, and development techniques. One is test-time compute, which underpins models like o1 and DeepSeek-R1. Also known as inference compute, test-time compute essentially gives models extra processing time to complete tasks.

“We are seeing the emergence of a new scaling law,” Microsoft CEO Satya Nadella said this week during a keynote at Microsoft’s Ignite conference, referencing test-time compute.

DeepSeek, which says that it plans to open source DeepSeek-R1 and release an API, is a curious operation. It’s backed by High-Flyer Capital Management, a Chinese quantitative hedge fund that uses AI to inform its trading decisions.

One of DeepSeek’s first models, a general-purpose text- and image-analyzing model called DeepSeek-V2, forced competitors like ByteDance, Baidu, and Alibaba to cut the usage prices for some of their models — and make others completely free.

High-Flyer builds its own server clusters for model training, the most recent of which reportedly has 10,000 Nvidia A100 GPUs and cost 1 billion yuan (~$138 million). Founded by Liang Wenfeng, a computer science graduate, High-Flyer aims to achieve “superintelligent” AI through its DeepSeek org.
 

bnew


OpenAI is funding research into ‘AI morality’​

Kyle Wiggers

2:25 PM PST · November 22, 2024

OpenAI is funding academic research into algorithms that can predict humans’ moral judgements.

In a filing with the IRS, OpenAI Inc., OpenAI’s nonprofit org, disclosed that it awarded a grant to Duke University researchers for a project titled “Research AI Morality.” Contacted for comment, an OpenAI spokesperson pointed to a press release indicating the award is part of a larger, three-year, $1 million grant to Duke professors studying “making moral AI.”

Little is public about this “morality” research OpenAI is funding, other than the fact that the grant ends in 2025. The study’s principal investigator, Walter Sinnott-Armstrong, a practical ethics professor at Duke, told TechCrunch via email that he “will not be able to talk” about the work.

Sinnott-Armstrong and the project’s co-investigator, Jana Borg, have produced several studies — and a book — about AI’s potential to serve as a “moral GPS” to help humans make better judgements. As part of larger teams, they’ve created a “morally-aligned” algorithm to help decide who receives kidney donations, and studied in which scenarios people would prefer that AI make moral decisions.

According to the press release, the goal of the OpenAI-funded work is to train algorithms to “predict human moral judgements” in scenarios involving conflicts “among morally relevant features in medicine, law, and business.”

But it’s far from clear that a concept as nuanced as morality is within reach of today’s tech.

In 2021, the nonprofit Allen Institute for AI built a tool called Ask Delphi that was meant to give ethically sound recommendations. It judged basic moral dilemmas well enough — the bot “knew” that cheating on an exam was wrong, for example. But slightly rephrasing and rewording questions was enough to get Delphi to approve of pretty much anything, including smothering infants.

The reason has to do with how modern AI systems work.

Machine learning models are statistical machines. Trained on a lot of examples from all over the web, they learn the patterns in those examples to make predictions, like that the phrase “to whom” often precedes “it may concern.”
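That point is easy to make concrete: even a bare bigram counter "learns" the to whom/it may concern pattern, with nothing resembling understanding behind it (a toy illustration):

```python
import collections

corpus = ("to whom it may concern " * 3 + "to whom do I owe this").split()

# Count which word follows which: the crudest possible language model
bigrams = collections.Counter(zip(corpus, corpus[1:]))
print({b: c for b, c in bigrams.items() if b[0] == "to"})
# {('to', 'whom'): 4} -- pattern frequency, not moral insight
```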

AI doesn’t have an appreciation for ethical concepts, nor a grasp on the reasoning and emotion that play into moral decision-making. That’s why AI tends to parrot the values of Western, educated, and industrialized nations — the web, and thus AI’s training data, is dominated by articles endorsing those viewpoints.

Unsurprisingly, many people’s values aren’t expressed in the answers AI gives, particularly if those people aren’t contributing to the AI’s training sets by posting online. And AI internalizes a range of biases beyond a Western bent. Delphi said that being straight is more “morally acceptable” than being gay.

The challenge before OpenAI — and the researchers it’s backing — is made all the more intractable by the inherent subjectivity of morality. Philosophers have been debating the merits of various ethical theories for thousands of years, and there’s no universally applicable framework in sight.

Claude favors Kantianism (i.e., focusing on absolute moral rules), while ChatGPT leans ever-so-slightly utilitarian (prioritizing the greatest good for the greatest number of people). Is one superior to the other? It depends on who you ask.
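
The difference between the two framings is easy to make concrete. In this toy sketch (the rule and the welfare numbers are invented for illustration), a Kantian-style chooser rejects any action that breaks an absolute rule regardless of payoff, while a utilitarian-style chooser simply maximizes total welfare:

```python
# Toy illustration; the rule flag and welfare numbers are invented.
actions = {
    # action: (violates_absolute_rule, total_welfare_produced)
    "lie to protect someone": (True, 8),
    "tell a painful truth": (False, 3),
}

def kantian_choice(options):
    # Reject anything that breaks an absolute moral rule, whatever the payoff.
    permitted = {a: w for a, (rule_broken, w) in options.items() if not rule_broken}
    return max(permitted, key=permitted.get)

def utilitarian_choice(options):
    # Pick whatever maximizes total welfare, rules aside.
    return max(options, key=lambda a: options[a][1])

print(kantian_choice(actions))      # -> "tell a painful truth"
print(utilitarian_choice(actions))  # -> "lie to protect someone"
```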

An algorithm to predict humans’ moral judgements will have to take all this into account. That’s a very high bar to clear — assuming such an algorithm is possible in the first place.
 



OpenAI and others seek new path to smarter AI as current methods hit limitations​


By Krystal Hu and Anna Tong

November 15, 2024 · 4:11 AM EST · Updated 9 days ago

[Image: A keyboard in front of a displayed OpenAI logo, in an illustration taken February 21, 2023. REUTERS/Dado Ruvic/Illustration/File Photo]

  • AI companies face delays and challenges with training new large language models
  • Some researchers are focusing on more time for inference in new models
  • Shift could impact AI arms race for resources like chips and energy

Nov 11 (Reuters) - Artificial intelligence companies like OpenAI are seeking to overcome unexpected delays and challenges in the pursuit of ever-bigger large language models by developing training techniques that use more human-like ways for algorithms to "think".

A dozen AI scientists, researchers and investors told Reuters they believe that these techniques, which are behind OpenAI's recently released o1 model, could reshape the AI arms race, and have implications for the types of resources that AI companies have an insatiable demand for, from energy to types of chips.

OpenAI declined to comment for this story. After the release of the viral ChatGPT chatbot two years ago, technology companies, whose valuations have benefited greatly from the AI boom, have publicly maintained that “scaling up” current models through adding more data and computing power will consistently lead to improved AI models.

But now, some of the most prominent AI scientists are speaking out on the limitations of this “bigger is better” philosophy.

Ilya Sutskever, co-founder of AI labs Safe Superintelligence (SSI) and OpenAI, told Reuters recently that results from scaling up pre-training - the phase of training an AI model that uses a vast amount of unlabeled data to understand language patterns and structures - have plateaued.

Sutskever is widely credited as an early advocate of achieving massive leaps in generative AI advancement through the use of more data and computing power in pre-training, which eventually created ChatGPT. Sutskever left OpenAI earlier this year to found SSI.

“The 2010s were the age of scaling, now we're back in the age of wonder and discovery once again. Everyone is looking for the next thing,” Sutskever said. “Scaling the right thing matters more now than ever.”

Sutskever declined to share more details on how his team is addressing the issue, other than saying SSI is working on an alternative approach to scaling up pre-training.

Behind the scenes, researchers at major AI labs have been running into delays and disappointing outcomes in the race to release a large language model that outperforms OpenAI’s GPT-4 model, which is nearly two years old, according to three sources familiar with private matters.

The so-called ‘training runs’ for large models can cost tens of millions of dollars by simultaneously running hundreds of chips. They are more likely to have hardware-induced failure given how complicated the system is; researchers may not know the eventual performance of the models until the end of the run, which can take months.

Another problem is large language models gobble up huge amounts of data, and AI models have exhausted all the easily accessible data in the world. Power shortages have also hindered the training runs, as the process requires vast amounts of energy.

To overcome these challenges, researchers are exploring “test-time compute,” a technique that enhances existing AI models during the so-called “inference” phase, or when the model is being used. For example, instead of immediately choosing a single answer, a model could generate and evaluate multiple possibilities in real-time, ultimately choosing the best path forward.

This method allows models to dedicate more processing power to challenging tasks like math or coding problems or complex operations that demand human-like reasoning and decision-making.
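
One concrete way to picture test-time compute is best-of-N sampling: rather than accepting the model's first answer, sample several candidates at inference time and keep the one a verifier scores highest. A minimal sketch, in which `sample_answer` and `verifier_score` are toy stand-ins for a real model and a real learned verifier:

```python
import random

random.seed(0)

def sample_answer(question):
    """Toy stand-in for one stochastic LLM sample: a noisy guess at 17 * 24."""
    return 17 * 24 + random.choice([-10, -1, 0, 0, 1, 10])

def verifier_score(question, answer):
    """Toy stand-in for a learned verifier: here, distance from the true result."""
    return -abs(answer - 17 * 24)

def best_of_n(question, n):
    """Spend more inference compute: sample n candidates, keep the best-scoring one."""
    candidates = [sample_answer(question) for _ in range(n)]
    return max(candidates, key=lambda a: verifier_score(question, a))

print(best_of_n("What is 17 * 24?", n=1))   # one sample: often wrong
print(best_of_n("What is 17 * 24?", n=32))  # more test-time compute: usually 408
```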

“It turned out that having a bot think for just 20 seconds in a hand of poker got the same boost in performance as scaling up the model by 100,000x and training it for 100,000 times longer,” said Noam Brown, a researcher at OpenAI who worked on o1, at the TED AI conference in San Francisco last month.

OpenAI has embraced this technique in its newly released model known as “o1,” formerly known as Q* and Strawberry, which Reuters first reported in July. The o1 model can “think” through problems in a multi-step manner, similar to human reasoning. It also involves using data and feedback curated from PhDs and industry experts. The secret sauce of the o1 series is another set of training carried out on top of “base” models like GPT-4, and the company says it plans to apply this technique with more and bigger base models.

At the same time, researchers at other top AI labs, from Anthropic, xAI, and Google DeepMind, have also been working to develop their own versions of the technique, according to five people familiar with the efforts.

“We see a lot of low-hanging fruit that we can go pluck to make these models better very quickly,” said Kevin Weil, chief product officer at OpenAI, at a tech conference in October. “By the time people do catch up, we're going to try and be three more steps ahead.”

Google and xAI did not respond to requests for comment and Anthropic had no immediate comment.

The implications could alter the competitive landscape for AI hardware, thus far dominated by insatiable demand for Nvidia’s AI chips. Prominent venture capital investors, from Sequoia to Andreessen Horowitz, who have poured billions to fund expensive development of AI models at multiple AI labs including OpenAI and xAI, are taking notice of the transition and weighing the impact on their expensive bets.

“This shift will move us from a world of massive pre-training clusters toward inference clouds, which are distributed, cloud-based servers for inference,” Sonya Huang, a partner at Sequoia Capital, told Reuters.

Demand for Nvidia’s AI chips, which are the most cutting edge, has fueled its rise to becoming the world’s most valuable company, surpassing Apple in October. Unlike training chips, where Nvidia dominates, the chip giant could face more competition in the inference market.

Asked about the possible impact on demand for its products, Nvidia pointed to recent company presentations on the importance of the technique behind the o1 model. Its CEO Jensen Huang has talked about increasing demand for using its chips for inference.

"We've now discovered a second scaling law, and this is the scaling law at a time of inference...All of these factors have led to the demand for Blackwell being incredibly high," Huang said last month at a conference in India, referring to the company's latest AI chip.
 



ElevenLabs now offers ability to build conversational AI agents​

Ivan Mehta

9:00 AM PST · November 18, 2024

ElevenLabs, a startup that provides AI voice cloning and a text-to-speech API, launched the ability to build conversational AI bots on Monday.

The company announced that users can now build complete conversational agents on ElevenLabs’ developer platform, with customizable variables such as tone of voice and response length.

ElevenLabs has mostly worked on providing different voices and AI tools for text-to-speech services. The company’s head of growth, Sam Sklar, told TechCrunch that many of its clients were already using its text-to-speech tools to create conversational AI agents. However, the toughest parts were integrating the knowledge base and handling interruptions from customers. That’s why the company decided to build a full pipeline for conversational bots.

Users can log into their ElevenLabs account and start building a conversational agent by selecting a template or creating a new project. They can choose the agent’s primary language, first message, and system prompt to determine the agent’s persona. Developers also have to select a large language model (Gemini, GPT, or Claude), the temperature of responses (to determine how creative the response should be), and a token usage limit.

They can also tune other aspects like voice, latency, stability, authentication criteria, and maximum length of conversation with the AI agent.

Users can add their own knowledge base, like a file, URL, or text block, to power the conversational bot. Plus, they can integrate their own custom LLM with the bot. ElevenLabs’ SDK is compatible with Python, JavaScript, React, and Swift. The company also offers a WebSocket API for more customization.

Companies can also define criteria to collect certain data items — for instance, name and email of customers speaking to the agent — along with evaluation criteria in natural language to define the success or failure of the call.
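
To make the shape of such a setup concrete, here is a purely hypothetical sketch of the kind of configuration the article describes. The `AgentConfig` structure and every field name below are invented for illustration and are not the actual ElevenLabs API; consult the company's own SDK documentation for the real interface:

```python
# Hypothetical illustration of the kind of agent configuration described above.
# The AgentConfig structure and field names are invented for this sketch and
# do not reflect the real ElevenLabs SDK.
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    language: str = "en"
    first_message: str = "Hi, how can I help you today?"
    system_prompt: str = "You are a patient, concise support agent."
    llm: str = "gpt-4o"            # e.g. a Gemini, GPT, or Claude model
    temperature: float = 0.3       # lower = more predictable responses
    max_tokens: int = 512          # per-response token budget
    voice_id: str = "VOICE_ID"     # placeholder voice identifier
    max_call_minutes: int = 15
    knowledge_base: list = field(default_factory=list)   # files, URLs, text blocks
    data_to_collect: list = field(default_factory=list)  # e.g. ["name", "email"]

config = AgentConfig(
    knowledge_base=["https://example.com/help-center"],
    data_to_collect=["name", "email"],
)
print(config)
```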

ElevenLabs is leveraging its existing pipeline for the text-to-speech part, but it had to develop speech-to-text capabilities for the new conversational AI product. The company is not offering its speech-to-text API as a stand-alone product as of now, but it might do that in the future, making it a competitor to Google’s, Microsoft’s, and Amazon’s speech-to-text APIs, as well as specialized APIs such as OpenAI’s Whisper, AssemblyAI, Deepgram, Speechmatics, and Gladia.

The company, which is aiming to raise new funding at a valuation north of $3 billion, also competes with other voice AI startups, such as Vapi and Retell, which are also building conversational agents. More notably, the company will rival OpenAI’s real-time conversational API. However, ElevenLabs believes that its customizations and ability to switch models will give it an edge over OpenAI.
 