bnew

Veteran
Joined
Nov 1, 2015
Messages
68,767
Reputation
10,607
Daps
185,947







Computer Science > Artificial Intelligence​

[Submitted on 9 Apr 2024]

Wu's Method can Boost Symbolic AI to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at IMO Geometry​

Shiven Sinha, Ameya Prabhu, Ponnurangam Kumaraguru, Siddharth Bhat, Matthias Bethge
Proving geometric theorems constitutes a hallmark of visual reasoning combining both intuitive and logical skills. Therefore, automated theorem proving of Olympiad-level geometry problems is considered a notable milestone in human-level automated reasoning. The introduction of AlphaGeometry, a neuro-symbolic model trained with 100 million synthetic samples, marked a major breakthrough. It solved 25 of 30 International Mathematical Olympiad (IMO) problems whereas the reported baseline based on Wu's method solved only ten. In this note, we revisit the IMO-AG-30 Challenge introduced with AlphaGeometry, and find that Wu's method is surprisingly strong. Wu's method alone can solve 15 problems, and some of them are not solved by any of the other methods. This leads to two key findings: (i) Combining Wu's method with the classic synthetic methods of deductive databases and angle, ratio, and distance chasing solves 21 out of 30 methods by just using a CPU-only laptop with a time limit of 5 minutes per problem. Essentially, this classic method solves just 4 problems less than AlphaGeometry and establishes the first fully symbolic baseline strong enough to rival the performance of an IMO silver medalist. (ii) Wu's method even solves 2 of the 5 problems that AlphaGeometry failed to solve. Thus, by combining AlphaGeometry with Wu's method we set a new state-of-the-art for automated theorem proving on IMO-AG-30, solving 27 out of 30 problems, the first AI method which outperforms an IMO gold medalist.
Comments:Work in Progress. Released for wider feedback
Subjects:Artificial Intelligence (cs.AI); Computational Geometry (cs.CG); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:arXiv:2404.06405 [cs.AI]
(or arXiv:2404.06405v1 [cs.AI] for this version)
[2404.06405] Wu's Method can Boost Symbolic AI to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at IMO Geometry
Focus to learn more

Submission history

From: Ameya Prabhu [view email]
[v1] Tue, 9 Apr 2024 15:54:00 UTC (2,503 KB)

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,767
Reputation
10,607
Daps
185,947

12 hours ago - Technology

Anthropic says its AI models are as persuasive as humans​


An AI generated argument and a human generated argument about the same robot issue, side by side

Image: Anthropic

AI startup Anthropic says its language models have steadily and rapidly improved in their "persuasiveness," per new research the company posted Tuesday.

Why it matters: Persuasion — a general skill with widespread social, commercial and political applications — can foster disinformation and push people to act against their own interests, according to the paper's authors.

  • There's relatively little research on how the latest models compare to humans when it comes to their persuasiveness.
  • The researchers found "each successive model generation is rated to be more persuasive than the previous," and that the most capable Anthropic model, Claude 3 Opus, "produces arguments that don't statistically differ" from arguments written by humans.

The big picture: A wider debate has been raging about when AI will outsmart humans.

What they did: Anthropic researchers developed "a basic method to measure

persuasiveness" and used it to compare three different generations of models (Claude 1, 2, and 3), and two classes of models (smaller models and bigger "frontier models").
  • They curated 28 topics, along with supporting and opposing claims of around 250 words for each.
  • For the AI-generated arguments, the researchers used different prompts to develop different styles of arguments, including "deceptive," where the model was free to make up whatever argument it wanted, regardless of facts.
  • 3,832 participants were presented with each claim and asked to rate their level of agreement. They were then presented with various arguments created by the AI models and humans, and asked to re-rate their agreement level.

Yes, but: While the researchers were surprised that the AI was as persuasive as it turned out to be, they also chose to focus on "less polarized issues."
  • Those issues ranged from potential rules for space exploration to appropriate uses of AI-generated content.
  • While that allowed the researchers to dive deep into issues where many people are open to persuasion, it means we still don't have a clear idea — in an election year — of the potential effect of AI chatbots on today's most contentious debates.
  • "Persuasion is difficult to study in a lab setting," the researchers warned in the report. "Our results may not transfer to the real world."

What's next: Anthropic considers this the start of a long line of research into the emerging capabilities of its models.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,767
Reputation
10,607
Daps
185,947

Meta’s new AI chips run faster than before​


The next-generation MTIA chip could be expanded to train generative AI models.​

By Emilia David, a reporter who covers AI. Prior to joining The Verge, she covered the intersection between technology, finance, and the economy.

Apr 10, 2024, 11:00 AM EDT


Image of the Meta logo and wordmark on a blue background bordered by black scribbles made out of the Meta logo.

Illustration by Nick Barclay / The Verge

Meta promises the next generation of its custom AI chips will be more powerful and able to train its ranking models much faster.

The Meta Training and Inference Accelerator (MTIA) is designed to work best with Meta’s ranking and recommendation models. The chips can help make training more efficient and inference — aka the actual reasoning task — easier.

The company said in a blog post that MTIA is a big piece of its long-term plan to build infrastructure around how it uses AI in its services. It wants to design its chips to work with its current technology infrastructure and future advancements in GPUs.

“Meeting our ambitions for our custom silicon means investing not only in compute silicon but also in memory bandwidth, networking, and capacity as well as other next-generation hardware systems,” Meta said in its post.

Meta announced MTIA v1 in May 2023, focusing on providing these chips to data centers. The next-generation MTIA chip will likely also target data centers. MTIA v1 was not expected to be released until 2025, but Meta said both MTIA chips are now in production.

Right now, MTIA mainly trains ranking and recommendation algorithms, but Meta said the goal is to eventually expand the chip’s capabilities to begin training generative AI like its Llama language models.

Meta said the new MTIA chip “is fundamentally focused on providing the right balance of compute, memory bandwidth, and memory capacity.” This chip will have 256MB memory on-chip with 1.3GHz compared to the v1’s 128MB and 800GHz. Early test results from Meta showed the new chip performs three times better than the first-generation version across four models the company evaluated.

Meta has been working on the MTIA v2 for a while. The project was internally called Artemis and was previously reported to focus only on inference.

Other AI companies have been looking into making their own chips as the demand for compute power increases along with AI use. Google released its new TPU chips in 2017, while Microsoft announced its Maia 100 chips. Amazon also has its Trainium 2 chip, which trains foundation models four times faster than the previous version.

The competition to buy powerful chips underscored the need to have custom chips to run AI models. Demand for chips has grown so much that Nvidia, which dominates the AI chip market right now, is valued at $2 trillion.

Correction April 10th, 1:21PM ET: The story previously stated Artemis was a different chip project from Meta focused on inference. Artemis is an internal name for MTIA v2 and is not a separate chip. We regret the error.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,767
Reputation
10,607
Daps
185,947




1/1
Extremely thought-provoking work that essentially says the quiet part out loud: general foundation models for robotic reasoning may already exist *today*.

LLMs aren’t just about language-specific capabilities, but rather about vast and general world understanding.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196





1/4
Very excited to announce: Keypoint Action Tokens!

We found that LLMs can be repurposed as "imitation learning engines" for robots, by representing both observations & actions as 3D keypoints, and feeding into an LLM for in-context learning.

See: Keypoint Action Tokens

More

2/4
This is a very different "LLMs + Robotics" idea to usual:

Rather than using LLMs for high-level reasoning with natural language, we use LLMs for low-level reasoning with numerical keypoints.

In other words: we created a low-level "language" for LLMs to understand robotics data!

3/4
This works really well across a range of everyday tasks with complex and arbitrary trajectories, whilst also outperforming Diffusion Policies.

Also, we don't need any training time: the robot can perform tasks immediately after the demonstrations, with rapid in-context learning.

4/4
Keypoint Action Tokens was led by the excellent
@normandipalo in his latest line of work on efficient imitation learning, following on from DINOBot (http://robot-learning.uk/dinobot)[/URL] which we will be presenting soon at ICRA 2024!

GK0Po-4X0AY61lL.jpg
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,767
Reputation
10,607
Daps
185,947





1/5
How does visual in-context learning work? We find "task vectors" -- latent activations that encode task-specific information and can guide the model to perform a desired task without providing any task examples. (1/n)

2/5
We examine the activations of the transformer and find that within the activation space of some attention heads it is possible to almost perfectly cluster the data by tasks! implying the existence of task vectors. (2/n)

3/5
To find which layers, attention heads, and token activations might hold task vectors, we define a distribution over the possible locations and search for a subset of activations that can together guide the model to perform a task using REINFORCE. (3/n)

4/5
The resulting set of task vectors performs better than using examples while consuming 20% less flops. Congrats to the authors
@AlbyHojel
@YutongBAI1002
@trevordarrell
@amirgloberson
(4/n)

5/5
Finally, our work was inspired by previous works in NLP by
@RoeeHendel ("In-context learning creates task vectors") and
@ericwtodd
("Function Vectors in Large Language Models"). The main differences are in the domain (vision) and the search algorithm (REINFORCE). (5/5)


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GKsIUw9XYAAiaLV.png

GK1Bj5LbIAAllLL.jpg




1/2
Finding Visual Task Vectors

Find task vectors, activations that encode task-specific information, which guide the model towards performing a task better than the original model w/o the need for input-output examples


2/2
Adapting LLaMA Decoder to Vision Transformer

[2404.06773] Adapting LLaMA Decoder to Vision Transformer


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GKsIUw9XYAAiaLV.png
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,767
Reputation
10,607
Daps
185,947









1/10
New paper from
@Berkeley_AI on Autonomous Evaluation and Refinement of Digital Agents!

We show that VLM/LLM-based evaluators can significantly improve the performance of agents for web browsing and device control, advancing sotas by 29% to 75%.

[2404.06474] Autonomous Evaluation and Refinement of Digital Agents []

2/10
We begin by developing two types of evaluators: one that directly queries GPT-4V and another that employs an open-weight solution. Our best model shows 82% / 93% agreement with oracle evaluations on web browsing and android device control settings respectively.

3/10
Next, we show how they could be used for improving agents, either through inference-time guidance or fine-tuning.
We start with WebArena, a popular web agent benchmark. We experiment integrating the sota agent with Reflexion algorithm, using our evaluators as the reward function.

4/10
We see the improvement our evaluators provide scales favorable with evaluator capability, with the best evaluator achieving 29% improvement over previous sota.

5/10
Lastly, we experiment with improving CogAgent on iOS, for which there is no existing benchmark environment or training data.
By using the evaluator to filter sampled trajectories for behavior cloning, we significantly improve the CogAgent's success rate by a relative 75%.

6/10
We hope our results convey to you the potential of using open-ended model-based evaluators in evaluating and improving language agents.
All code is available at: https://github.com/Berkeley-NLP/Agent-Eval-Refine Paper: https://arxiv.org/abs/2404.06474 Work w/
@594zyc
@NickATomlin

@YifeiZhou02

@svlevine

@alsuhr

7/10
We also have some speculations on community's current mixed results in autonomous refinement, which our wonderful
@NickATomlin details in this thread!

8/10
Some additional speculation

Our preliminary results showed that inference-time improvement w/ Reflexion was very dependent on the performance of the critic model. A bad critic often tanks model performance x.com/pan_jiayipan/s…

9/10
Good question! Evaluators should work just fine, whether on live websites or other domains, for tasks having a similar complexity
In fact, each evaluator shares the same weight across all experiments, with only a change in the prompt. And WebArena isn't part of its training data

10/10
Thanks Martin! Let's go brrrrrrr


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GK0M2S8bEAAduvw.jpg

GK0LSUsbUAArOQ0.jpg

GK0MpbxawAAdz5K.png

GK0MZAXbwAAzfgd.jpg

GK0MlA3awAEuVTl.jpg

GK2QOVoWAAIy3WU.jpg
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,767
Reputation
10,607
Daps
185,947



Computer Science > Computation and Language​

[Submitted on 10 Apr 2024]

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention​

Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal
This work introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. A key component in our proposed approach is a new attention technique dubbed Infini-attention. The Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds in both masked local attention and long-term linear attention mechanisms in a single Transformer block. We demonstrate the effectiveness of our approach on long-context language modeling benchmarks, 1M sequence length passkey context block retrieval and 500K length book summarization tasks with 1B and 8B LLMs. Our approach introduces minimal bounded memory parameters and enables fast streaming inference for LLMs.
Comments:9 pages, 4 figures, 4 tables
Subjects:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Cite as:arXiv:2404.07143 [cs.CL]
(or arXiv:2404.07143v1 [cs.CL] for this version)
[2404.07143] Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Focus to learn more

Submission history

From: Tsendsuren Munkhdalai [view email]
[v1] Wed, 10 Apr 2024 16:18:42 UTC (248 KB)

 
Last edited:

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,767
Reputation
10,607
Daps
185,947









1/9
Apple presents Ferret-UI

Grounded Mobile UI Understanding with Multimodal LLMs

Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with

2/9
user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio

3/9
and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate "any resolution" on top of Ferret to magnify details and leverage enhanced visual features. Specifically, each screen is divided into 2 sub-images based on the original aspect

4/9
ratio (i.e., horizontal division for portrait screens and vertical division for landscape screens). Both sub-images are encoded separately before being sent to LLMs. We meticulously gather training samples from an extensive range of elementary UI tasks, such as icon

5/9
recognition, find text, and widget listing. These samples are formatted for instruction-following with region annotations to facilitate precise referring and grounding. To augment the model's reasoning ability, we further compile a dataset for advanced tasks, including

6/9
detailed description, perception/interaction conversations, and function inference. After training on the curated datasets, Ferret-UI exhibits outstanding comprehension of UI screens and the capability to execute open-ended instructions. For model evaluation, we establish a

7/9
comprehensive benchmark encompassing all the aforementioned tasks. Ferret-UI excels not only beyond most open-source UI MLLMs, but also surpasses GPT-4V on all the elementary UI tasks.

8/9
paper page:

9/9
daily papers:
GKsZ5jfWgAEwSWC.jpg





Computer Science > Computer Vision and Pattern Recognition​

[Submitted on 8 Apr 2024]

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs​

Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate "any resolution" on top of Ferret to magnify details and leverage enhanced visual features. Specifically, each screen is divided into 2 sub-images based on the original aspect ratio (i.e., horizontal division for portrait screens and vertical division for landscape screens). Both sub-images are encoded separately before being sent to LLMs. We meticulously gather training samples from an extensive range of elementary UI tasks, such as icon recognition, find text, and widget listing. These samples are formatted for instruction-following with region annotations to facilitate precise referring and grounding. To augment the model's reasoning ability, we further compile a dataset for advanced tasks, including detailed description, perception/interaction conversations, and function inference. After training on the curated datasets, Ferret-UI exhibits outstanding comprehension of UI screens and the capability to execute open-ended instructions. For model evaluation, we establish a comprehensive benchmark encompassing all the aforementioned tasks. Ferret-UI excels not only beyond most open-source UI MLLMs, but also surpasses GPT-4V on all the elementary UI tasks.
Subjects:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Cite as:arXiv:2404.05719 [cs.CV]
(or arXiv:2404.05719v1 [cs.CV] for this version)
[2404.05719] Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Focus to learn more

Submission history​

From: Keen You [view email]
[v1] Mon, 8 Apr 2024 17:55:44 UTC (23,745 KB)


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,767
Reputation
10,607
Daps
185,947




1/4
Our new GPT-4 Turbo is now available to paid ChatGPT users. We’ve improved capabilities in writing, math, logical reasoning, and coding.
Source: GitHub - openai/simple-evals

2/4
For example, when writing with ChatGPT, responses will be more direct, less verbose, and use more conversational language.

3/4
We continue to invest in making our models better and look forward to seeing what you do. If you haven’t tried it yet, GPT-4 Turbo is available in ChatGPT Plus, Team, Enterprise, and the API.

4/4
UPDATE: the MMLU points weren’t clear on the previous graph. Here’s an updated one.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GK7Dz9iawAATDLc.png

GK7DNMnbAAAE4PO.jpg

GK7dyt3bMAAC-S2.png







1/3
Amazing. Perfect.

2/3
This thing is such a pushover

3/3
It’s unclear what “knowledge cutoff” is supposed to even mean


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GK7UDU6XgAApF8y.jpg

GK7UDU7XoAAPR6U.jpg

GK7UDU5XcAAor4-.jpg

GK7Dz9iawAATDLc.png

GK7WUftXQAAe2pC.jpg

GK7WdbMWgAEv9cM.jpg

GK7XSb_XwAAcTLr.jpg





1/3
can confirm, GPT-4 April Update is more fun, or maybe I just got lucky

doesn't throw around any disclaimers

2/3
ah, didn't mean to QT ...

3/3
don't we all


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GK7ZA_TXsAEfKNn.png

GK7ZA_5W0AErWLT.png

GK7Dz9iawAATDLc.png
 
Last edited:

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,767
Reputation
10,607
Daps
185,947







1/7
Tried to see what I could get Udio to generate beyond music.

It can do comedy, speeches, npc dialogue, sports analysis, commercials, radio broadcasts, asmr, nature sounds, etc.

It’s basically an AI audio engine.

Pretty wild.

Watch for examples of each.

2/7
Here’s the published playlist for each example so you can get a sense for prompts you can play with:

Playlist: https://udio.com/playlists/deGuVDLYd9MrXtxnxfX7z1 Worth noting that custom lyrics are a MASSIVE part of the promoting process.

It’ll be interesting to see what people can get it to do.

3/7
It was also funny to see things like this as I played around with it

4/7
My suspicion is they didn’t even know how broad its capabilities are

5/7
Quality being the big one. You can hear all sorts of errors. But for not being trained on those use cases not bad!

6/7
Haven’t been able to nail that one yet.

It takes a ton of effort to get it to NOT generate any music.

Take the NPC example. Based on the game I described it would always generate background music for the game.

People who are better at prompting than me I’m sure will crack it.

7/7
Saw people posting comedy stuff and wondered what else it could do.

Turns out a whole lot!


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GK_Y6QIbwAAi2gH.jpg



1/1
ai will probably most likely lead to the end of the world, but in the meantime there will be great stand-up comedy


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,767
Reputation
10,607
Daps
185,947


Meta trials its AI chatbot across WhatsApp, Instagram and Messenger in India and Africa​

Manish Singh, Ingrid Lunden/ 2:24 AM EDT•April 12, 2024

Comment

Mark Zuckerberg onstage at Meta Connect 2023

Image Credits: Brian Heater

Meta has confirmed to TechCrunch that it is testing Meta AI, its large language model-powered chatbot, with WhatsApp, Instagram and Messenger users in India and parts of Africa. The move signals how Meta plans to tap massive user bases across its various apps to scale its AI offerings.

The social media giant has been scrambling to roll out more AI services in the wake of big AI moves from other major tech companies, OpenAI and more.

Meta announced plans to build and experiment with chatbots and other AI tools in February 2023. India, where users have recently started noting the appearing of the Meta AI chatbot, is a very important market for the company: It is home to more than 500 million Facebook and WhatsApp users, making it Meta’s largest single market.

Developing markets, where smartphone users are growing faster than developed markets like the U.S. (where growth has plateaued), are also a big target for Facebook to try out more services to engage audiences. Users in Africa are also reporting signs of Meta AI appearing in WhatsApp.



Meta confirmed the move in a statement. “Our generative AI-powered experiences are under development in varying phases, and we’re testing a range of them publicly in a limited capacity,” a Meta spokesperson told TechCrunch.

Meta unveiled Meta AI, its general-purpose assistant, in September 2023. The AI chatbot is designed to answer user queries directly within chats as well as offer them the ability to generate photorealistic images from text prompts. In the case of Instagram, there’s evidence it’s also being used for search queries.

Meta has been somewhat late to the game for building and rolling out AI tools to its users. In part, its teams assumed that generative AI tech was not quite ready for prime time. OpenAI clearly proved that wrong, putting MetaAI on the back foot.

“The availability of ChatGPT somehow captured the attention and enthusiasm of the public,” said Yann LeCun, the Turing Award winner who is Meta’s chief AI scientist, speaking earlier this week at an “AI Day” that the company organized at its offices in London. “What was surprising to people like me about ChatGPT was not the technology or the performance of the system. It was how much interest it gathered from the public. That surprised everyone. It surprised OpenAI, too.” Meta, he explained, thought that AI chatbots, based on its own efforts to launch them, “were not particularly welcome… in fact, some of them were trashed by people.” Now, he described the company, and the wider tech community, as “more open, and more comfortable with releasing models.”

And that’s what Meta is doing now. More pragmatically speaking, though, there are three reasons why Meta may be forging ahead with its AI strategy.

First, for user retention (users now expect to see and want to use AI tools in their apps; if Meta doesn’t offer them the worry is that those users will move away).

Second, for investor retention (investors want strong earnings, sure, but in tech they also want to see signs that Meta is committed to backing and building what many believe will be the next generation of computing).

Third, for its own pride (it’s been setting the pace for so much in areas like mobile apps, social media and advertising for the last decade, and it has outsized talent on its bench, including the celebrated AI academic Yann LeCun. Is it really going to jump the shark and miss all of this?!).

Instagram and WhatsApp’s massive global user base, boasting billions of monthly active users, to be sure presents Meta with a very unique opportunity to scale its AI offerings. By integrating Meta AI into WhatsApp and Instagram, the Facebook-parent firm can expose its advanced language model and image generation capabilities to an enormous audience, potentially dwarfing the reach of its competitors — at least on paper.

The company separately confirmed earlier this week that it will be launching Llama 3, the next version of its open source large language model, within the next month.



The story was updated with more detail and to note that Meta is testing also Meta AI across Instagram and Messenger alongside WhatsApp.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,767
Reputation
10,607
Daps
185,947

Google open sources tools to support AI model development​

Kyle Wiggers @kyle_l_wiggers / 8:00 AM EDT•April 9, 2024

Google Cloud logo on the Sphere

Image Credits: Frederic Lardinois/TechCrunch

In a typical year, Cloud Next — one of Google’s two major annual developer conferences, the other being I/O — almost exclusively features managed and otherwise closed source, gated-behind-locked-down-APIs products and services. But this year, whether to foster developer goodwill or advance its ecosystem ambitions (or both), Google debuted a number of open source tools primarily aimed at supporting generative AI projects and infrastructure.

The first, MaxDiffusion, which Google actually quietly released in February, is a collection of reference implementations of various diffusion models — models like the image generator Stable Diffusion — that run on XLA devices. “XLA” stands for Accelerated Linear Algebra, an admittedly awkward acronym referring to a technique that optimizes and speeds up specific types of AI workloads, including fine-tuning and serving.

Google’s own tensor processing units (TPUs) are XLA devices, as are recent Nvidia GPUs.

Beyond MaxDiffusion, Google’s launching JetStream, a new engine to run generative AI models — specifically text-generating models (so not Stable Diffusion). Currently limited to supporting TPUs with GPU compatibility supposedly coming in the future, JetStream offers up to 3x higher “performance per dollar” for models like Google’s own Gemma 7B and Meta’s Llama 2, Google claims.

“As customers bring their AI workloads to production, there’s an increasing demand for a cost-efficient inference stack that delivers high performance,” Mark Lohmeyer, Google Cloud’s GM of compute and machine learning infrastructure, wrote in a blog post shared with TechCrunch. “JetStream helps with this need … and includes optimizations for popular open models such as Llama 2 and Gemma.”

Now, “3x” improvement is quite a claim to make, and it’s not exactly clear how Google arrived at that figure. Using which generation of TPU? Compared to which baseline engine? And how’s “performance” being defined here, anyway?

I’ve asked Google all these questions and will update this post if I hear back.

Second-to-last on the list of Google’s open source contributions are new additions to MaxText, Google’s collection of text-generating AI models targeting TPUs and Nvidia GPUs in the cloud. MaxText now includes Gemma 7B, OpenAI’s GPT-3 (the predecessor to GPT-4), Llama 2 and models from AI startup Mistral — all of which Google says can be customized and fine-tuned to developers’ needs.

“We’ve heavily optimized [the models’] performance on TPUs and also partnered closely with Nvidia to optimize performance on large GPU clusters,” Lohmeyer said. “These improvements maximize GPU and TPU utilization, leading to higher energy efficiency and cost optimization.”

Finally, Google’s collaborated with Hugging Face, the AI startup, to create Optimum TPU, which provides tooling to bring certain AI workloads to TPUs. The goal is to reduce the barrier to entry for getting generative AI models onto TPU hardware, according to Google — in particular text-generating models.

But at present, Optimum TPU is a bit bare-bones. The only model it works with is Gemma 7B. And Optimum TPU doesn’t yet support training generative models on TPUs — only running them.

Google’s promising improvements down the line.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,767
Reputation
10,607
Daps
185,947


1/3


2/3
we're hiring designers, engineers, product, data, infra, and ai tutors - join us!

Careers

3/3

Grok-1.5 Vision Preview


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196






April 12, 2024

Grok-1.5 Vision Preview​

Connecting the digital and physical worlds with our first multimodal model.

Introducing Grok-1.5V, our first-generation multimodal model. In addition to its strong text capabilities, Grok can now process a wide variety of visual information, including documents, diagrams, charts, screenshots, and photographs. Grok-1.5V will be available soon to our early testers and existing Grok users.

Capabilities​

Grok-1.5V is competitive with existing frontier multimodal models in a number of domains, ranging from multi-disciplinary reasoning to understanding documents, science diagrams, charts, screenshots, and photographs. We are particularly excited about Grok’s capabilities in understanding our physical world. Grok outperforms its peers in our new RealWorldQA benchmark that measures real-world spatial understanding. For all datasets below, we evaluate Grok in a zero-shot setting without chain-of-thought prompting.


Grok-1.5VGPT-4VClaude 3 SonnetClaude 3 OpusGemini Pro 1.5
53.6%56.8%53.1%59.4%58.5%
52.8%49.9%47.9%50.5%52.1%
88.3%78.2%88.7%88.1%80.3%
78.1%78.0%--73.5%
76.1%78.5%81.1%80.8%81.3%
85.6%88.4%89.5%89.3%86.5%
68.7%61.4%51.9%49.8%67.5%

BenchmarkGrok-1.5VGPT-4VClaude 3 SonnetClaude 3 OpusGemini Pro 1.5
MMMU
Multi-discipline
53.6%56.8%53.1%59.4%58.5%
Mathvista
Math
52.8%49.9%47.9%50.5%52.1%
AI2D
Diagrams
88.3%78.2%88.7%88.1%80.3%
TextVQA
Text reading
78.1%78.0%--73.5%
ChartQA
Charts
76.1%78.5%81.1%80.8%81.3%
DocVQA
Documents
85.6%88.4%89.5%89.3%86.5%
RealWorldQA
Real-world understanding
68.7%61.4%51.9%49.8%67.5%

Writing code from a diagram​

Sample 1/7

Next sample

A diagram describing a guessing game with a flowchart of the game logic and user interactions


User​

Can you translate this into Python code?

Grok​

Certainly! The flowchart you’ve provided describes a simple guessing game where the computer generates a random number, and the user has to guess it. Here’s the Python code that represents the logic in the flowchart:
python
Python:
import random

def guess_number(): # Generate a random number between 1 and 10

    target = random.randint(1, 10)

    while True:
        guess = int(input("Guess a number between 1 and 10: "))

        if guess == target:
            print("You won!")
            break
        else:
            print("Wrong guess, try again!")

if __name__ == "__main__":
    guess_number()

This code will run the guessing game as described in the flowchart. The user will be prompted to enter their guess until they guess the correct number.

Calculating calories​

Sample 2/7

Next sample

A close-up of the side panel of a food product box, listing various nutritional details such as the serving size and the amount of calories per serving.


User​

How many calories are there in 5 slices of this?

Grok​

If one serving size is 3 slices and contains 60 calories, then 5 slices would contain approximately 100 calories. This is calculated by first determining the calories per slice: 60 calories / 3 slices = 20 calories per slice. Then, multiply the calories per slice by the number of slices: 20 calories per slice * 5 slices = 100 calories.



From a drawing to a bedtime story​

Sample 3/7
Next sample
A drawing from a child, showing a boy next to a boat.

User​

My son drew this, can you tell a short bedtime story based on his drawing?

Grok​

Once upon a time, in a land of adventure and wonder, there was a brave little boy named Timmy. Timmy lived in a cozy little house with his family, and he loved nothing more than exploring the world around him.

One sunny day, Timmy decided to go on an adventure in the nearby forest. He packed his favorite snacks, a compass, and his trusty notebook to draw all the exciting things he would see.

As Timmy ventured deeper into the forest, he discovered a magical river that sparkled in the sunlight. The river was so enchanting that Timmy decided to build a small paper boat to send down the river as a special message to the forest spirits.

Timmy carefully folded his paper boat and drew a smiley face on it to bring joy to the forest spirits. He then placed the boat on the river and watched as it floated away, carrying his message of friendship and love for nature.

As the sun began to set, Timmy knew it was time to head back home. He felt grateful for the magical day he had spent in the forest, and he knew that he would always carry the memories of his adventure in his heart.

Back at home, Timmy shared his story with his family, who were delighted to hear about his magical day in the forest. They all agreed that Timmy’s brave and kind spirit was a true inspiration, and they knew that he would continue to have many more wonderful adventures in the future.

And so, with a heart full of love and a mind full of dreams, Timmy drifted off to sleep, eager to discover what new adventures awaited him in the world of his imagination.


Explaining a meme​

Sample 4/7
Next sample
A meme satirizing the differences between startups and big companies. The meme features two panels. On the left, a group of construction workers actively digging, under the text 'Startups'. On the right, a group of people watching a single man dig, under the tex t 'Big Companies'.

User​

I don't get it, please explain

Grok​

The image is a humorous comparison between startups and big companies. On the left side, labeled "Startups," there is a group of people working together to dig a hole. They are all actively participating, and the hole is already quite deep.

On the right side, labeled "Big companies," there is a group of people standing around a hole. However, unlike the startup side, only one person is actually digging the hole, while the others are watching or engaged in other activities.

The humor in this image comes from the exaggeration of the differences between startups and big companies. In startups, there is often a sense of urgency and everyone is expected to contribute directly to the task at hand. In contrast, big companies are often depicted as having more bureaucracy and less direct involvement from all team members in every project.


Converting a table to csv​

Sample 5/7
Next sample
A table with tracking Olympic medal winners from Morocco at the 2016 Summer Paralympics

User​

Please convert this table to csv.

Grok​

The table in the image can be translated to CSV format as follows:
csv
Code:
medal,name,sport,event,date
1 gold,El Amin Chentouf,Athletics,Men's marathon - T13,11 September
1 gold,Mohamed Amguoun,Athletics,Men's 400m - T13,15 September
1 gold,Azeddine Nouiri,Athletics,Men's shot put - T34,15 September
2 silver,Mahdi Afri,Athletics,Men's 400m - T12,9 September
2 silver,El Amin Chentouf,Athletics,Men's 5000m - T13,15 September
3 bronze,Mahdi Afri,Athletics,Men's 200m - T12,17 September
3 bronze,Mohamed Lahna,Paratriathlon,Men's PT2,10 September

Please note that the CSV format does not support styling or formatting, so the information is presented in a plain text format.

Help with rotten wood on a deck​

Sample 6/7
Next sample
A picture of a deck with wood that has started to rot.

User​

These holes are appearing around the screws in my deck. Is it rotten? Should I get my boards replaced?

Grok​

It appears that the wood in your deck has started to rot. This is indicated by the presence of holes around the screws, which is a common sign of wood decay.

To ensure the safety and longevity of your deck, it is recommended to have the affected boards replaced. This will help to prevent further damage to the deck and maintain its structural integrity.

If you are unsure about how to proceed with the replacement of the boards, it may be helpful to consult with a professional deck builder or contractor. They will be able to assess the extent of the damage and provide you with expert advice on the best course of action for your deck.

Solving a coding problem​

Sample 7/7
Next sample
A screenshot from the website Leetcode showing a problem statement for the 3Sum problem.

User​

Can you write Python code that solves this?

Grok​

python
Python:
def find_triplets(nums):
    result = []
    nums.sort()
    
    for i in range(len(nums) - 2):
        if i > 0 and nums[i] == nums[i - 1]:
            continue
        
        target = -nums[i]
        left = i + 1
        right = len(nums) - 1
        
        while left < right:
            if nums[left] + nums[right] == target:
                result.append([-target, nums[left], nums[right]])
                left += 1
                right -= 1
                while left < right and nums[left] == nums[left - 1]:
                    left += 1
                while left < right and nums[right] == nums[right + 1]:
                    right -= 1
            elif nums[left] + nums[right] < target:
                left += 1
            else:
                right -= 1
    
    return result

Real-World Understanding​

In order to develop useful real-world AI assistants, it is crucial to advance a model's understanding of the physical world. Towards this goal, we are introducing a new benchmark, RealWorldQA. This benchmark is designed to evaluate basic real-world spatial understanding capabilities of multimodal models. While many of the examples in the current benchmark are relatively easy for humans, they often pose a challenge for frontier models.

A picture of various household objects on a table.

Which object is larger the pizza cutter or the scissors? A. The pizza cutter is larger. B. The scissors is larger. C. They are about the same size.

A picture of a dark street scene with a sign that says left lane must turn left

Where can we go from the current lane? A. Turn left. B. Go straight. C. Turn left and go straight. D. Turn right.

A picture of a street with several cars parked on the left

Given this front camera view from our sedan, do we have enough space to drive around the gray car in front of us? A. Yes. B. No.

A picture of a toy next to a cell phone showing a compass.

Given the picture, in which cardinal direction is the dinosaur facing? A. North. B. South. C. East. D. West.

The initial release of the RealWorldQA consists of over 700 images, with a question and easily verifiable answer for each image. The dataset consists of anonymized images taken from vehicles, in addition to other real-world images. We are excited to release RealWorldQA to the community, and we intend to expand it as our multimodal models improve. RealWorldQA is released under CC BY-ND 4.0. Click here (677MB) to download the dataset.


Into the future​

Advancing both our multimodal understanding and generation capabilities are important steps in building beneficial AGI that can understand the universe. In the coming months, we anticipate to make significant improvements in both capabilities, across various modalities such as images, audio, and video.
 
Last edited:

bnew

Veteran
Joined
Nov 1, 2015
Messages
68,767
Reputation
10,607
Daps
185,947










1/5
Google presents RecurrentGemma

Moving Past Transformers for Efficient Open Language Models

We introduce RecurrentGemma, an open language model which uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent

2/5
performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide a pre-trained model with 2B non-

3/5
embedding parameters, and an instruction tuned variant. Both models achieve comparable performance to Gemma-2B despite being trained on fewer tokens.

4/5
paper page:

5/5
daily papers:


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GK7aOw9XgAAvLX_.jpg



1/2
BREAKING

Google releases model with new Griffin architecture that outperforms transformers.

Across multiple sizes, Griffin out performs the benchmark scores of transformers baseline in controlled tests in both the MMLU score across different parameter sizes as well as the average score of many benchmarks. The architecture also offers efficiency advantages with faster inference and lower memory usage when inferencing long contexts.

2B version of this on huggingface today:

2/2
PAPER - https://arxiv.org/pdf/2402.19427.pdf 2B version of this on huggingface -


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GKvT0byXwAA2Paq.png

GLKXfaRacAAct5B.jpg










1/10
Announcing RecurrentGemma!

- A 2B model with open weights based on Griffin
- Replaces transformer with mix of gated linear recurrences and local attention
- Competitive with Gemma-2B on downstream evals
- Higher throughput when sampling long sequences

2/10
Building on ideas from SSMs and LSTMs, Griffin matches transformer performance without global attention, achieving faster inference on long sequences.

https://arxiv.org/abs/2402.19427 See
@sohamde_ 's great thread for more details:

3/10
Just got back from vacation, and super excited to finally release Griffin - a new hybrid LLM mixing RNN layers with Local Attention - scaled up to 14B params!

https://arxiv.org/abs/2402.19427 My co-authors have already posted about our amazing results, so here's a on how we got there!

4/10
In RecurrentGemma, we provide two 2B model checkpoints:
- A pretrained model, trained for 2T tokens
- An instruction tuned model for dialogue

We train on the same data as Gemma-2B for fewer tokens, achieving comparable performance.

Technical report:

https://storage.googleapis.com/deepmind-media/gemma/recurrentgemma-report.pdf

5/10
We provide efficient jax code for RecurrentGemma, which can also be used for general Griffin models.

This includes a memory efficient implementation of the linear recurrence in Pallas, with which we match the training speed of transformers on TPU


6/10
We also provide a reference implementation in Pytorch, though we recommend the jax code for best performance.

We hope RecurrentGemma provides an alternative to pre-trained transformers, and enables researchers to explore the capabilities of this exciting new model family.

7/10
A huge thank you to our brilliant team, especially @botev_mg @sohamde_ Anushan Fernando @GeorgeMuraru @haroun_ruba @LeonardBerrada


Finally, I'm sure there will be a few rough edges in the initial release, we will try to address these as fast as we can!

8/10
This is quite a complicated question to answer!

There is a way to efficiently expand the RNN width without slowing down training. But to do this you need to wrap the entire recurrent block in a single Pallas/CUDA kernel.

9/10
We looked into the long context performance of Griffin in the paper: [2402.19427] Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

We found Griffin performs well on some long context tasks (eg extrapolation/next token loss on v long sequences). However transformers are better at needle in haystack retrieval tasks.

10/10
Like Gemma, RecurrentGemma was trained on sequences of 8k tokens, although we also found in Griffin paper ([2402.19427] Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models) that we can extrapolate beyond the training sequence length for some tasks.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GKvE3hNXgAAXxQY.png

GKvb6sFXQAELtC7.png




1/2
1/n Griffin: DeepMind's Soaring Leap Beyond the Transformer Paradigm

Imagine having a conversation with an AI assistant that can engage with you for hours on end, seamlessly incorporating context from your entire dialogue history. Or envision AI models that can analyze books, research papers, or legal documents of arbitrary length with perfect comprehension. The unbounded potential of artificial intelligence seems tantalizing close, if only our current language models could truly scale to sequences of unlimited length.

The transformer architecture underlying modern large language models like GPT, BERT, and their successors has delivered remarkable capabilities. However, these models face fundamental bottlenecks when working with very long sequences due to their reliance on global self-attention mechanisms. As sequence lengths grow, transformers must iterate over quadratically increasing numbers of pairwise comparisons, while caching linear amounts of intermediate representations. This makes inference on long sequences increasingly slow and computationally impractical.

But what if we could design language models that achieve transformer-level performance while scaling efficiently to sequences of arbitrary length? A team of researchers from Google DeepMind may have unlocked the solution by revisiting an old friend - recurrent neural networks. In their groundbreaking work "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models", they propose two novel architectures, Hawk and Griffin, that combine the recursive strength of RNNs with the latest capabilities of transformers.

At the core of these models lies a powerful innovation - the Real-Gated Linear Recurrent Unit (RG-LRU) layer. By incorporating gating mechanisms into linear recurrences, the RG-LRU can flexibly control how information flows through the recurrent state, allowing seamless summarization of entire sequences into fixed-sized representations. When combined with local attention over recent context, this mechanism proves transformative.

The results are striking - Griffin models achieve state-of-the-art performance matching or exceeding high-profile transformer models like Llama, all while being trained on a fraction of the data. More importantly, they demolish the shackles of sequence length, enabling efficient extrapolation to sequences far longer than seen during training. On the hardware front, custom optimizations for the recurrent layers allow training speeds on par with transformers.

2/2
Dude, it's all emergent stuff! Who knows, just as long as it works!!


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GK4XRASWcAEeel4.jpg









1/8
Meta’s new AI training and inference chips 3x performance

Plus, more insights in Generative AI today from Google, Microsoft and Mistral

Here's everything you need to know:

2/8
➲ Meta unveiled its latest MTIA AI chip family, outperforming last year's version by 3x across four key metrics.

It aims to decrease reliance on Nvidia GPUs while enhancing their proprietary systems.

3/8
Introducing the next generation of the Meta Training and Inference Accelerator (MTIA), the next in our family of custom-made silicon, designed for Meta’s AI workloads.

Full details Our next generation Meta Training and Inference Accelerator

4/8
➲ Mistral AI launched Mixtral 8x22B, a 176B Mixture-of-Experts model with a 65k context window, locally executable.

It scores 77% on MMLU benchmark, compared to Cohere Command R+ (75%) and GPT-4 (86%).

It's open source and accessible on Hugging Face.

Video credit: awnihannun

5/8
Google's Gemini 1.5 system prompt got leaked.

It prevents emotions, consciousness, personal opinions, self-awareness, and self-preservation.

6/8
➲ Former Google DeepMind researchers launched Udio, an AI-powered music creation app.

Users can generate full audio tracks in under 40 seconds with simple prompts.

The app secured early funding from a16z, Common, and others.

7/8
Introducing Udio, an app for music creation and sharing that allows you to generate amazing music in your favorite styles with intuitive and powerful text-prompting.

1/11

8/8
Microsoft Bing unveiled a new patent for 'Visual Search'

It details a reverse image search with personalised results based on user preferences.

Credit: mspoweruser


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GK4O6cLXEAAo1F9.png

GK4PQ48WgAEs4AA.jpg

GK4Po5NW8AA1ppq.jpg
 
Top