bnew

Veteran
Joined
Nov 1, 2015
Messages
58,040
Reputation
8,592
Daps
161,672

A new way to build neural networks could make AI more understandable


The simplified approach makes it easier to see how neural networks produce the outputs they do.

By Anil Ananthaswamy


August 30, 2024

builder with a KAN diagram inset in a block

Stephanie Arnett/ MIT Technology Review | Envato

A tweak to the way artificial neurons work in neural networks could make AIs easier to decipher.

Artificial neurons—the fundamental building blocks of deep neural networks—have survived almost unchanged for decades. While these networks give modern artificial intelligence its power, they are also inscrutable.

Existing artificial neurons, used in large language models like GPT-4, work by taking in a large number of inputs, adding them together, and converting the sum into an output using another mathematical operation inside the neuron. Combinations of such neurons make up neural networks, and their combined workings can be difficult to decode.

But the new way to combine neurons works a little differently. Some of the complexity of the existing neurons is both simplified and moved outside the neurons. Inside, the new neurons simply sum up their inputs and produce an output, without the need for the extra hidden operation. Networks of such neurons are called Kolmogorov-Arnold Networks (KANs), after the Russian mathematicians who inspired them.

The simplification, studied in detail by a group led by researchers at MIT, could make it easier to understand why neural networks produce certain outputs, help verify their decisions, and even probe for bias. Preliminary evidence also suggests that as KANs are made bigger, their accuracy increases faster than networks built of traditional neurons.

“It's interesting work,” says Andrew Wilson, who studies the foundations of machine learning at New York University. “It's nice that people are trying to fundamentally rethink the design of these [networks].”

The basic elements of KANs were actually proposed in the 1990s, and researchers kept building simple versions of such networks. But the MIT-led team has taken the idea further, showing how to build and train bigger KANs, performing empirical tests on them, and analyzing some KANs to demonstrate how their problem-solving ability could be interpreted by humans. “We revitalized this idea,” said team member Ziming Liu, a PhD student in Max Tegmark’s lab at MIT. “And, hopefully, with the interpretability… we [may] no longer [have to] think neural networks are black boxes.”

While it's still early days, the team’s work on KANs is attracting attention. GitHub pages have sprung up that show how to use KANs for myriad applications, such as image recognition and solving fluid dynamics problems.

Finding the formula


The current advance came when Liu and colleagues at MIT, Caltech, and other institutes were trying to understand the inner workings of standard artificial neural networks.

Today, almost all types of AI, including those used to build large language models and image recognition systems, include sub-networks known as multilayer perceptrons (MLPs). In an MLP, artificial neurons are arranged in dense, interconnected “layers.” Each neuron has within it something called an “activation function,” a mathematical operation that takes in a bunch of inputs and transforms them in some pre-specified manner into an output.

In an MLP, each artificial neuron receives inputs from all the neurons in the previous layer and multiplies each input with a corresponding “weight” (a number signifying the importance of that input). These weighted inputs are added together and fed to the activation function inside the neuron to generate an output, which is then passed on to neurons in the next layer. An MLP learns to distinguish between images of cats and dogs, for example, by choosing the correct values for the weights of the inputs for all the neurons. Crucially, the activation function is fixed and doesn’t change during training.
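As a rough illustration of the neuron described above, here is a minimal Python sketch. The choice of ReLU as the fixed activation, the optional bias term, and all names and numbers are assumptions made for the example; the article does not specify them.

```python
import numpy as np

def relu(z):
    """Fixed activation: chosen before training and never updated."""
    return np.maximum(z, 0.0)

def mlp_neuron(inputs, weights, bias=0.0):
    """One MLP neuron: weight each input, add them up, apply the fixed activation."""
    weighted_sum = np.dot(weights, inputs) + bias
    return relu(weighted_sum)

def mlp_layer(inputs, weight_matrix, biases):
    """A dense layer: every neuron receives every output of the previous layer."""
    return relu(weight_matrix @ inputs + biases)

x = np.array([1.0, -2.0, 0.5])   # outputs of the previous layer (illustrative)
w = np.array([0.4, 0.1, -0.6])   # learnable weights (illustrative)
print(mlp_neuron(x, w))          # weighted sum is about -0.1, so ReLU gives 0.0
```

Training would adjust only `w` (and the bias); `relu` itself stays the same throughout.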

Once trained, all the neurons of an MLP and their connections taken together essentially act as another function that takes an input (say, tens of thousands of pixels in an image) and produces the desired output (say, 0 for cat and 1 for dog). Understanding what that function looks like, meaning its mathematical form, is an important part of being able to understand why it produces some output. For example, why does it tag someone as creditworthy given inputs about their financial status? But MLPs are black boxes. Reverse-engineering the network is nearly impossible for complex tasks such as image recognition.

And even when Liu and colleagues tried to reverse-engineer an MLP for simpler tasks that involved bespoke “synthetic” data, they struggled.

“If we cannot even interpret these synthetic datasets from neural networks, then it's hopeless to deal with real-world data sets,” says Liu. “We found it really hard to try to understand these neural networks. We wanted to change the architecture.”

Mapping the math


The main change was to remove the fixed activation function and introduce a much simpler learnable function to transform each incoming input before it enters the neuron.

Unlike the activation function in an MLP neuron, which takes in numerous inputs, each simple function outside the KAN neuron takes in one number and spits out another number. Now, during training, instead of learning the individual weights, as happens in an MLP, the KAN just learns how to represent each simple function. In a paper posted this year on the preprint server arXiv, Liu and colleagues showed that these simple functions outside the neurons are much easier to interpret, making it possible to reconstruct the mathematical form of the function being learned by the entire KAN.
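The idea of one learnable single-input function per connection, followed by a plain sum inside the neuron, can be sketched in a few lines of Python. This is a toy under stated assumptions, not the authors' implementation: the class name `ToyKANLayer` is hypothetical, and each function is stored as a piecewise-linear curve on a fixed grid, a simpler stand-in for the spline parameterization used in the KAN paper.

```python
import numpy as np

class ToyKANLayer:
    """Each edge (input i -> output j) carries its own learnable 1-D function.

    Assumption: every function is piecewise linear, stored as its values at
    fixed grid points. Inputs outside the grid are clamped by np.interp.
    """

    def __init__(self, n_in, n_out, grid_size=5, x_min=-1.0, x_max=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.grid = np.linspace(x_min, x_max, grid_size)
        # Learnable parameters: one curve per edge, shape (n_out, n_in, grid_size).
        self.values = rng.normal(scale=0.1, size=(n_out, n_in, grid_size))

    def edge_function(self, j, i, x):
        """Evaluate the function on edge i -> j: one number in, one number out."""
        return np.interp(x, self.grid, self.values[j, i])

    def forward(self, inputs):
        n_out, n_in, _ = self.values.shape
        out = np.zeros(n_out)
        for j in range(n_out):
            # The neuron only sums its transformed inputs; no activation follows.
            out[j] = sum(self.edge_function(j, i, inputs[i]) for i in range(n_in))
        return out

# Pretend training has finished and the two edge functions turned out to be
# x**2 and sin(x), so the whole layer computes x0**2 + sin(x1).
layer = ToyKANLayer(n_in=2, n_out=1)
layer.values[0, 0] = layer.grid ** 2
layer.values[0, 1] = np.sin(layer.grid)
y = layer.forward(np.array([0.5, 0.0]))
print(y)  # first entry is 0.25, i.e. 0.5**2 + sin(0)
```

Because each curve depends on a single number, it can be plotted and compared against known functions one edge at a time, which is the kind of inspection the article describes.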

The team, however, has only tested the interpretability of KANs on simple, synthetic data sets, not on real-world problems, such as image recognition, which are more complicated. “[We are] slowly pushing the boundary,” says Liu. “Interpretability can be a very challenging task.”

Liu and colleagues have also shown that KANs get more accurate at their tasks with increasing size faster than MLPs do. The team proved the result theoretically and showed it empirically for science-related tasks (such as learning to approximate functions relevant to physics). “It's still unclear whether this observation will extend to standard machine learning tasks, but at least for science-related tasks, it seems promising,” Liu says.

Liu acknowledges that KANs come with one important downside: it takes more time and compute power to train a KAN, compared to an MLP.

“This limits the application efficiency of KANs on large-scale data sets and complex tasks,” says Di Zhang, of Xi’an Jiaotong-Liverpool University in Suzhou, China. But he suggests that more efficient algorithms and hardware accelerators could help.

Anil Ananthaswamy is a science journalist and author who writes about physics, computational neuroscience, and machine learning. His new book, WHY MACHINES LEARN: The Elegant Math Behind Modern AI, was published by Dutton (Penguin Random House US) in July.


bnew


AI

The org behind the dataset used to train Stable Diffusion claims it has removed CSAM


Kyle Wiggers

10:39 AM PDT • August 30, 2024


A collage of images created by Stable Diffusion.
Image Credits: Daniel Jeffries

LAION, the German research org that created the data used to train Stable Diffusion, among other generative AI models, has released a new dataset that it claims has been “thoroughly cleaned of known links to suspected child sexual abuse material (CSAM).”

The new dataset, Re-LAION-5B, is actually a re-release of an old dataset, LAION-5B, but with “fixes” implemented with recommendations from the nonprofit Internet Watch Foundation, Human Rights Watch, the Canadian Centre for Child Protection and the now-defunct Stanford Internet Observatory. It’s available for download in two versions, Re-LAION-5B Research and Re-LAION-5B Research-Safe (which also removes additional NSFW content), both of which were filtered for thousands of links to known (and “likely”) CSAM, LAION says.

“LAION has been committed to removing illegal content from its datasets from the very beginning and has implemented appropriate measures to achieve this from the outset,” LAION wrote in a blog post. “LAION strictly adheres to the principle that illegal content is removed ASAP after it becomes known.”

Important to note is that LAION’s datasets don’t — and never did — contain images. Rather, they’re indexes of links to images and image alt text that LAION curated, all of which came from a different dataset — the Common Crawl — of scraped sites and web pages.

The release of Re-LAION-5B comes after an investigation in December 2023 by the Stanford Internet Observatory that found that LAION-5B — specifically a subset called LAION-5B 400M — included at least 1,679 links to illegal images scraped from social media posts and popular adult websites. According to the report, 400M also contained links to “a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes.”

While the Stanford co-authors of the report noted that it would be difficult to remove the offending content and that the presence of CSAM doesn’t necessarily influence the output of models trained on the dataset, LAION said it would temporarily take LAION-5B offline.

The Stanford report recommended that models trained on LAION-5B “should be deprecated and distribution ceased where feasible.” Perhaps relatedly, AI startup Runway recently took down its Stable Diffusion 1.5 model from the AI hosting platform Hugging Face; we’ve reached out to the company for more information. (Runway in 2023 partnered with Stability AI, the company behind Stable Diffusion, to help train the original Stable Diffusion model.)

Of the new Re-LAION-5B dataset, which contains around 5.5 billion text-image pairs and was released under an Apache 2.0 license, LAION says that the metadata can be used by third parties to clean existing copies of LAION-5B by removing the matching illegal content.

LAION stresses that its datasets are intended for research — not commercial — purposes. But, if history is any indication, that won’t dissuade some organizations. Beyond Stability AI, Google once used LAION datasets to train its image-generating models.

“In all, 2,236 links [to suspected CSAM] were removed after matching with the lists of link and image hashes provided by our partners,” LAION continued in the post. “These links also subsume 1008 links found by the Stanford Internet Observatory report in December 2023 … We strongly urge all research labs and organizations who still make use of old LAION-5B to migrate to Re-LAION-5B datasets as soon as possible.”
 

bnew



1/3
The prompt used in the paper 'People Cannot Distinguish GPT-4 from a Human in a Turing Test' is quite revealing about humans.

tldr: "Be dumb"

2/3
Yeah, it's only a 5-minute Turing test.

"GPT-4 was judged to be a human 54% of the time, outperforming ELIZA (22%) but lagging behind actual humans (67%)"

3/3
https://arxiv.org/pdf/2405.08007


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

 

bnew





1/5
CEO of OpenAI Japan says GPT-Next will be released this year, and its effective computational load is 100x greater than GPT-4

More on Orion and Strawberry:

"GPT-4 NEXT, which will be released this year, is expected to be trained using a miniature version of Strawberry with roughly the same computational resources as GPT-4, with an effective computational load 100 times greater.

"The AI model called 'GPT Next' that will be released in the future will evolve nearly 100 times based on past performance. Unlike traditional software, AI technology grows exponentially."

…The slide clearly states 2024 "GPT Next". This 100 times increase probably does not refer to the scaling of computing resources, but rather to the effective computational volume + 2 OOMs, including improvements to the architecture and learning efficiency.

Orion, which has been in the spotlight recently, was trained for several months on the equivalent of 10k H100 compared to GPT-4, adding 10 times the computational resource scale, making it +3 OOMs, and is expected to be released sometime next year."

(Note: this translation is from Google - if you speak Japanese and see anything off, please share!)

2/5
Reminder: scaling laws have held through 15 orders of magnitude

Exponential go brr

3/5
Note: this is a translation of Bioshok's tweet. It contains a mix of new info (the slide saying GPT Next is coming in 2024 and will be +100x effective compute), recent OpenAI news, and some of his speculations to tie it together.

4/5
OpenAI's Head of Developer Experience used a similar graph in a recent presentation

5/5
Yep insiders continue to be very clear about the exponential


 

bnew


1/1
AI2 presents OLMoE: Open Mixture-of-Experts Language Models

- Open-sources SotA LMs w/ MoE: 1B active params out of 7B total.
- Releases model weights, training data, code, and logs.

repo: GitHub - allenai/OLMoE: OLMoE: Open Mixture-of-Experts Language Models
hf: allenai/OLMoE-1B-7B-0924 · Hugging Face
abs: [2409.02060] OLMoE: Open Mixture-of-Experts Language Models


 

bnew


1/4
Improving 2D Feature Representations by 3D-Aware Fine-Tuning

proj: Improving 2D Feature Representations by 3D-Aware Fine-Tuning
abs: [2407.20229] Improving 2D Feature Representations by 3D-Aware Fine-Tuning

2/4
The study aims to improve the 3D understanding abilities of 2D vision foundation models, which are typically trained on 2D image data without considering the underlying 3D structure of the visual world.

The researchers demonstrate that incorporating the 3D-aware fine-tuned features improves the performance of 2D vision models on semantic segmentation and depth estimation tasks, both on in-domain and out-of-domain datasets. The fine-tuned features exhibit cleaner and more detailed feature maps compared to the original 2D features.

full paper: Improving 2D Feature Representations by 3D-Aware Fine-Tuning

3/4
Thanks @arankomatsuzaki for sharing!

4/4
Fascinating approach to improving 2D feature representations! I'm curious to see how this 3D-aware fine-tuning method will impact downstream tasks like semantic segmentation.


 

bnew






1/6
Your LLM may be sparser than you thought!

Excited to announce TEAL, a simple training-free method that achieves up to 40-50% model-wide activation sparsity on Llama-2/3 and Mistral models. Combined with a custom kernel, we achieve end-to-end speedups of up to 1.53x-1.8x!

2/6
Activation sparsity results in decoding speedup by avoiding the transfer of weights associated with zero-valued activations. Older work exploits the high emergent sparsity (~95%!) in the hidden states of ReLU-based LLMs. However, modern SwiGLU-based LLMs aren't very sparse (<1%).
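To see why zero-valued activations let a kernel skip weights, here is a small NumPy sketch of a matrix-vector product that only reads the weight columns paired with non-zero activations. It shows the arithmetic identity only; NumPy does not model GPU memory traffic, and the function name is hypothetical rather than taken from the TEAL code.

```python
import numpy as np

def sparse_matvec(W, x):
    """Compute W @ x while reading only columns whose activation is non-zero."""
    active = np.flatnonzero(x)            # indices of non-zero activations
    return W[:, active] @ x[active], active.size

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))               # weight matrix (illustrative size)
x = rng.normal(size=8)                    # activation vector
x[::2] = 0.0                              # force half the activations to zero

y_sparse, columns_read = sparse_matvec(W, x)
assert np.allclose(y_sparse, W @ x)       # same result as the dense product
print(columns_read, "of", W.shape[1], "weight columns were needed")
```

During decoding the weights dominate memory traffic, so reading fewer columns is where a speedup would come from.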

3/6
It turns out they’re “almost” sparse -- activations across the model tend to cluster around zero. We don’t have a rigorous explanation, but we suspect a large part of this is due to LayerNorm. By clipping entries close to zero, we can potentially reintroduce activation sparsity.

4/6
Indeed, we can! In practice, we set thresholds for each matrix to clip a certain percent of activations based on their magnitudes. On perplexity and on a variety of downstream tasks, we’re able to achieve up to 40-50% model-wide sparsity with minimal degradation.
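A minimal sketch of magnitude-based thresholding as described in this tweet, assuming the threshold is picked as a quantile of absolute activation values on some calibration data. The function names and the Gaussian stand-in data are assumptions for illustration; the actual TEAL code may calibrate differently.

```python
import numpy as np

def calibrate_threshold(calibration_activations, target_sparsity):
    """Return the magnitude below which `target_sparsity` of the entries fall."""
    return np.quantile(np.abs(calibration_activations), target_sparsity)

def sparsify(x, threshold):
    """Zero every entry whose magnitude is at or below the threshold."""
    return np.where(np.abs(x) > threshold, x, 0.0)

rng = np.random.default_rng(0)
calib = rng.normal(size=100_000)          # stand-in for activations clustered around zero
t = calibrate_threshold(calib, 0.5)       # aim for 50% sparsity
x_sparse = sparsify(calib, t)
sparsity = np.mean(x_sparse == 0.0)
print(f"threshold={t:.3f}, achieved sparsity={sparsity:.2%}")
```

In this sketch one threshold would be calibrated per weight matrix, matching the "thresholds for each matrix" wording above.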

5/6
To benchmark end-to-end performance, we integrate a custom sparse GEMV kernel with GPT-Fast. We measure wall-clock speedups of 1.53x and 1.8x at 40% and 50% sparsity respectively.

6/6
Paper: [2408.14690] Training-Free Activation Sparsity in Large Language Models
Code: GitHub - FasterDecoding/TEAL
Blog: TEAL: Training-Free Activation Sparsity in Large Language Models

Thanks to collaborators that made this possible (Pragaash, @tianle_cai , @HanGuo97 , Prof. Yoon Kim, @ben_athi)!


 

bnew


1/4
Meta presents Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

- Can generate images and text on a par with similar scale diffusion models and language models
- Compresses each image to just 16 patches

[2408.11039] Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

2/4
The paper introduces Transfusion, a method for training a single multi-modal model that can understand and generate both discrete (text) and continuous (image) data. This is in contrast to previous approaches that either use separate models for different modalities or quantize continuous data into discrete tokens.

The experiments show that Transfusion models consistently outperform the Chameleon baseline, which discretizes images and trains on the mixed text-image sequence. Transfusion achieves significantly better performance on text-to-image generation, requiring only about 30% of the compute of Chameleon to reach the same level of performance. Transfusion also performs better on text-only tasks, despite modeling text in the same way as Chameleon.

full paper: Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

3/4
Looking forward ⏩ to read this paper

4/4
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model


 

bnew



1/3
We released phi 3.5: mini+MoE+vision

A better mini model with multilingual support: microsoft/Phi-3.5-mini-instruct · Hugging Face

A new MoE model: microsoft/Phi-3.5-MoE-instruct · Hugging Face

A new vision model supporting multiple images: microsoft/Phi-3.5-vision-instruct · Hugging Face

2/3
All 3 models are now live on Azure AI Studio too:
Phi-3.5-mini-instruct: Azure AI Studio
Phi-3.5-MoE-instruct: Azure AI Studio
Phi-3.5-vision-instruct: Azure AI Studio

3/3
I think it does. Please feel free to share your feedback.


 

bnew






1/6
Finally released Hermes 3! A 405b, 70b, and 8b version!

You can find all the details and resources about Hermes 3 on our website:
Hermes 3 - NOUS RESEARCH

We have a H3 405B bot running in our discord, join and try it out here: Join the Nous Research Discord Server!

Or give it a try on @LambdaAPI here: https://lambda.chat/chatui/

2/6
yep!

3/6
The 4bit 70B gguf can fit on 40GB vram - unfortunately not on a single gpu

4/6
405B - our FP8

5/6
fp8

6/6
One tweet below ^_^










1/8
Introducing 𝐇𝐞𝐫𝐦𝐞𝐬 𝟑: The latest version in our Hermes series, a generalist language model 𝐚𝐥𝐢𝐠𝐧𝐞𝐝 𝐭𝐨 𝐲𝐨𝐮.

Hermes 3 - NOUS RESEARCH

Hermes 3 is available in three sizes: 8B, 70B, and 405B parameters. Hermes has improvements across the board, but with particular capability improvements in roleplaying, agentic tasks, more reliable function calling, multi-turn chats, long context coherence and more.

We published a technical report detailing new capabilities, training run information and more:

Paper: https://nousresearch.com/wp-content/uploads/2024/08/Hermes-3-Technical-Report.pdf

This model was trained in collaboration with our great partners @LambdaAPI, and they are now offering it for free in a chat interface here: https://lambda.chat/chatui/

You can also chat with Hermes 405B on our discord, join here: Join the Nous Research Discord Server!

Hermes 3 was a project built with the help of @Teknium1, @TheEmozilla, @nullvaluetensor, @karan4d, @huemin_art, and an uncountable number of people and work in the Open Source community.

2/8
Hermes 3 performs strongly against Llama-3.1 Instruct Models, but with a focus on aligning the model to you, instead of a company or external policy - meaning less censorship and more steerability - with additional capabilities like agentic XML, scratchpads, roleplaying prowess, and more. Step level reasoning and planning, internal monologues, improved RAG, and even LLM as a judge capabilities were also targeted.

Below are benchmark comparisons between Hermes 3 and Llama-3.1 Instruct and a sample of utilizing the agentic XML tags:

3/8
Lambda's Hermes 3 Announcement Post: Unveiling Hermes 3: The First Full-Parameter Fine-Tuned Llama 3.1 405B Model is on Lambda’s Cloud

Nous' blog post on our experience discovering emergent behavior with 405B:
Freedom at the Frontier: Hermes 3 - NOUS RESEARCH

Hermes 3 405B was trained with @LambdaAPI's new 1-Click Cluster offering, check it out here: Lambda GPU Cloud | 1-Click Clusters

Check out our reference inference code for Hermes Function Calling here: GitHub - NousResearch/Hermes-Function-Calling

Thanks to all the other organizations who helped bring this together, including @weights_biases, @neuralmagic, @vllm_project, @huggingface, @WeAreFireworks, @AiEleuther, @togethercompute, @AIatMeta, and many more

4/8
Special shoutouts to @intrstllrninja for all the work on making function calling real, robust, and useful

and a special thanks to our designer @StudioMilitary for the cover art and all the other designs that Nous uses!

5/8
Will work on @MistralAI's 12b model soon!

6/8
Hermes 3 - a NousResearch Collection

7/8
@karan4d gave it a personality all its own there lol

Lambda has a default assistant personality system chat

8/8
Believe Lambda is also hosting an api version, will update when clear


 

bnew










1/10
SEAL Leaderboard Update🥇

We added 3 new models to the SEAL Leaderboards:

- GPT-4o-latest (gpt-4o-2024-08-06)
- Gemini 1.5 Pro (Aug 27, 2024) (gemini-1.5-pro-exp-0827)
- Mistral Large 2 (mistral-large-2407)

More detailed results in thread 🧵

2/10
GPT-4o-latest (gpt-4o-2024-08-06) ranks above the prior GPT-4o (May 2024) on:

- Math (now #2 behind Claude 3.5 Sonnet)
- Coding (now #2 behind Claude 3.5 Sonnet)

but actually performed worse than the May model on Instruction Following (now ranked #8) and Spanish (now ranked #3)

3/10
Gemini-1.5-pro-exp-0827 ranks higher than the May model version on all leaderboards:

- Instruction Following (now #3 behind Llama and Claude)
- Coding (now #4 behind Claude, GPT-4o, and Mistral)
- Math (now #7)

4/10
Mistral Large 2 (mistral-large-2407) also improves on Mistral Large on all leaderboards, and performed quite well:

- Coding (#3 behind Claude and GPT-4o)
- Instruction Following (now #6)
- Math (now #8)
- Spanish (now #4 behind GPT-4o May, Gemini 1.5 Pro, and GPT-4o August)

5/10
Some insights from the leaderboards:

CODING

We notice that GPT-4o (August 2024) performs the best on Code Correctness among all the models, but Claude outdoes it on Prompt Adherence

6/10
INSTRUCTION FOLLOWING (FACTUALITY):

We find that GPT-4 Turbo is still the most factual model among all of the models we've evaluated.

With the trend towards smaller, distilled models, we notice that it seems to come at the cost of performance on factuality.

7/10
INSTRUCTION FOLLOWING:

Claude 3.5 Sonnet and Llama 3.1 405B Instruct stand out as models with BOTH:

- very high (>0.9) Main Request fulfillment
- very high (>0.7) Constraint fulfillment

Also, all Claude models take a hit for Structural Clarity in writing style

8/10
MULTILINGUAL (SPANISH)

Lastly, one interesting pattern we noticed is that every model seemed to perform worse at Instruction Following (Main Request fulfillment and Constraint fulfillment) in Spanish versus English.

This implies there's still meaningful headroom for models to improve on multilingual capabilities.

9/10
Given that our Math leaderboard (GSM1K) is now relatively saturated (most models score >90), we are working on producing a new Math benchmark to properly discern between models.

10/10
Expect more updates, including the addition of Grok 2, in the coming weeks!


 

bnew










1/10
Releasing OLMoE - the first good Mixture-of-Experts LLM that's 100% open-source
- 1B active, 7B total params for 5T tokens
- Best small LLM & matches more costly ones like Gemma, Llama
- Open Model/Data/Code/Logs + lots of analysis & experiments

📜[2409.02060] OLMoE: Open Mixture-of-Experts Language Models
🧵1/9

2/10
OLMoE Performance

- Best model with 1B active params making it the cheapest LM for many use cases
- Exceeds many larger models like DeepSeek, Llama2 & Qwen-Chat
🧵2/9

3/10
OLMoE Efficiency

Thanks to Mixture-of-Experts, better data & hyperparams, OLMoE is much more efficient than OLMo 7B:
- >4x less training FLOPs
- >5x less params used per forward pass
i.e. cheaper training + cheaper inference!
🧵3/9

4/10
OLMoE Experiments

1) Expert granularity i.e. we use 64 small experts per layer
2) Dropless token choice beats expert choice routing
3) Shared experts worse for us
4) Sparse upcycling not helpful in our regime

Check the paper for more🙂
🧵4/9

5/10
OLMoE Analysis 1

With 250 ckpts & all open, interesting analysis research can be done w/ OLMoE! Our early investigations:

Potential domain specialization in OLMoE with some experts active much more for arXiv/GitHub (less so for Mixtral…maybe as it was upcycled?)
🧵5/9

6/10
OLMoE Analysis 2

Potential specialization on token IDs, e.g. Expert 43 seems to be a geography expert & 3 a family expert..🤔

Check the paper for analysis on router saturation, expert co-activation & more details🙂
🧵6/9

7/10
Thank you for reading!

Paper with links to model, data, code, etc: [2409.02060] OLMoE: Open Mixture-of-Experts Language Models ; all is meant to be free & open🙂

Thanks to prior work on MoEs including Qwen, JetMoE, OpenMoE & thanks to @allen_ai @ContextualAI for enabling OLMoE ❤️
🧵7/9

8/10
Thanks to an amazing team: @soldni @mechanicaldirk @kylelostat @jacobcares @sewon__min @WeijiaShi2 @epwalsh Oyvind Tafjord @gu_yuling @natolambert @shanearora0 @AkshytaB93 Dustin Schwenk @_awettig @Tim_Dettmers @huybery @douwekiela Ali Farhadi @nlpnoah @PangWeiKoh 🧵8/9

9/10
@apsdehal @HannaHajishirzi Also grateful to many colleagues & others @ArmenAgha @adityakusupati @AnanyaHarsh @vwxyzjn @strubell @faeze_brh @hamishivi @KarelDoostrlnck @IanMagnusson @jayc481 @liujc1998 @pdasigi @Sahil1V @StasBekman @valentina__py @yanaiela @yizhongwyz 🤗
🧵9/9

10/10
Thanks! My understanding is that expert choice routing breaks autoregressive text generation as it is ambiguous which expert receives the next token. But since we don't use it to generate and don't evaluate it on text generation, we don't run into this problem.
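One way to read the point about expert choice routing: when each expert picks its top tokens across the whole sequence, an earlier token's assignment can change once a later token arrives, whereas with token choice each token's experts depend only on its own router scores. The toy Python sketch below illustrates that reading. The function names and router scores are made up for the example and are not from the OLMoE code.

```python
import numpy as np

def token_choice(scores, k):
    """Each token picks its k highest-scoring experts. scores: (tokens, experts)."""
    return np.argsort(-scores, axis=1)[:, :k]

def expert_choice(scores, capacity):
    """Each expert picks its `capacity` highest-scoring tokens; returns (experts, capacity)."""
    return np.argsort(-scores, axis=0)[:capacity, :].T

# Router scores for three tokens and two experts (illustrative numbers).
prefix = np.array([[0.90, 0.10],
                   [0.20, 0.80],
                   [0.60, 0.40]])
# The same sequence after one more token has been generated.
full = np.vstack([prefix, [0.95, 0.05]])

# Token choice: the first three tokens keep their experts when token 3 arrives.
assert np.array_equal(token_choice(prefix, 1), token_choice(full, 1)[:3])

# Expert choice: expert 0 took token 0 at first, then switches to token 3.
print(expert_choice(prefix, 1)[0], expert_choice(full, 1)[0])
```

With a causal decoder, token 0 has already been processed by the time token 3 exists, which is where the ambiguity during generation would come from.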


 

bnew









1/9
This morning the US Government announced that OpenAI and Anthropic have signed a formal collaboration on AI safety research, testing and evaluation. Under the deal the USAISI will have access to major new models from both OpenAI and Anthropic prior to release.

2/9


3/9
U.S. AI Safety Institute Signs Agreements Regarding AI Safety Research, Testing and Evaluation With Anthropic and OpenAI

4/9
Coverage from Reuters.

5/9
https://www.reuters.com/technology/...-with-us-govt-ai-research-testing-2024-08-29/

6/9
USAISI, which I'm choosing to pronounce You - Sigh - See, was established Nov 1st under the AI Executive Order. It falls under the National Institute of Standards and Technology (NIST) which in turn is an agency of the United States Department of Commerce.

7/9
At the Direction of President Biden, Department of Commerce to Establish U.S. Artificial Intelligence Safety Institute to Lead Efforts on AI Safety

8/9
Statement from Mr A:

9/9
I've seen no details beyond what was said today, I don't think anything is public yet. Even something like access to a 'major' new model has a lot of ambiguity. Some people would say that's not until GPT-5/Claude 4.


 

bnew


1/11
After a recent price reduction by OpenAI, GPT-4o tokens now cost $4 per million tokens (using a blended rate that assumes 80% input and 20% output tokens). GPT-4 cost $36 per million tokens at its initial release in March 2023. This price reduction over 17 months corresponds to about a 79% drop in price per year. (4/36 = (1 - p)^{17/12})
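The annualized figure can be checked by solving the equation in the post for p. A short Python sketch, where the function name is just for illustration:

```python
def annualized_drop(old_price, new_price, months):
    """Solve new_price / old_price = (1 - p) ** (months / 12) for p."""
    return 1.0 - (new_price / old_price) ** (12.0 / months)

# $36 -> $4 per million tokens over 17 months, as stated in the post.
p = annualized_drop(36.0, 4.0, 17)
print(f"{p:.1%}")  # 78.8%
```

That rounds to the "about 79%" per year quoted above.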

As you can see, token prices are falling rapidly! One force that’s driving prices down is the release of open weights models such as Llama 3.1. If API providers, including startups Anyscale, Fireworks, Together AI, and some large cloud companies, do not have to worry about recouping the cost of developing a model, they can compete directly on price and a few other factors such as speed.

Further, hardware innovations by companies such as Groq (a leading player in fast token generation), SambaNova (which serves Llama 3.1 405B at an impressive 114 tokens per second), and wafer-scale computation startup Cerebras (which just announced a new offering this week), as well as the semiconductor giants NVIDIA, AMD, Intel, and Qualcomm, will drive further price cuts.

When building applications, I find it useful to design to where the technology is going rather than only where it has been. Based on the technology roadmaps of multiple software and hardware companies — which include improved semiconductors, smaller models, and algorithmic innovation in inference architectures — I’m confident that token prices will continue to fall rapidly.

This means that even if you build an agentic workload that isn’t entirely economical, falling token prices might make it economical at some point. As I wrote previously, being able to process many tokens is particularly important for agentic workloads, which must call a model many times before generating a result. Further, even agentic workloads are already quite affordable for many applications. Let's say you build an application to assist a human worker, and it uses 100 tokens per second continuously: At $4/million tokens, you'd be spending only $1.44/hour – which is significantly lower than the minimum wage in the U.S. and many other countries.
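The hourly figure above is simple to reproduce. This is a minimal sketch with a hypothetical helper name; the $7.25 U.S. federal minimum wage used for comparison is an assumption added here, not a number from the post.

```python
def hourly_cost(tokens_per_second: float, price_per_million: float) -> float:
    """USD spent per hour when consuming tokens continuously at the given rate."""
    tokens_per_hour = tokens_per_second * 3600
    return tokens_per_hour * price_per_million / 1_000_000


cost = hourly_cost(tokens_per_second=100, price_per_million=4.0)

# Assumption for comparison only: U.S. federal minimum wage in USD per hour.
US_FEDERAL_MINIMUM_WAGE = 7.25

print(f"Agentic workload cost: ${cost:.2f}/hour")
print(f"Share of minimum wage: {cost / US_FEDERAL_MINIMUM_WAGE:.0%}")
```

At 100 tokens per second, the workload consumes 360,000 tokens per hour, which is where the $1.44 comes from.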

So how can AI companies prepare?
- First, I continue to hear from teams that are surprised to find out how cheap LLM usage is when they actually work through cost calculations. For many applications, it isn’t worth too much effort to optimize the cost. So first and foremost, I advise teams to focus on building a useful application rather than on optimizing LLM costs.
- Second, even if an application is marginally too expensive to run today, it may be worth deploying in anticipation of lower prices.
- Finally, as new models get released, it might be worthwhile to periodically examine an application to decide whether to switch to a new model either from the same provider (such as switching from GPT-4 to the latest GPT-4o-2024-08-06) or a different provider, to take advantage of falling prices and/or increased capabilities.

Because multiple providers now host Llama 3.1 and other open-weight models, if you use one of these models, it might be possible to switch between providers without too much testing (though implementation details, specifically quantization, do mean that different offerings of the same model differ in performance). When switching between models, unfortunately, a major barrier is still the difficulty of implementing evals, so carrying out regression testing to make sure your application will still perform after you swap in a new model can be challenging. However, as the science of carrying out evals improves, I’m optimistic that this will become easier.
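As a rough illustration of the regression testing mentioned above, here is a minimal sketch that compares pass rates of two models on a fixed set of eval cases. Everything in it is hypothetical: the two model functions are stand-in stubs rather than real API calls, and the substring checks are far simpler than evals a real application would need.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]
# Each eval case pairs a prompt with a check on the model's output.
EvalCase = Tuple[str, Callable[[str], bool]]


def pass_rate(model: Model, cases: List[EvalCase]) -> float:
    """Fraction of eval cases whose check passes on the model's output."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


def safe_to_switch(current: Model, candidate: Model,
                   cases: List[EvalCase], tolerance: float = 0.0) -> bool:
    """True if the candidate scores no worse than the current model, within tolerance."""
    return pass_rate(candidate, cases) >= pass_rate(current, cases) - tolerance


# Hypothetical stubs standing in for calls to two different hosted models.
def current_model(prompt: str) -> str:
    return "Paris" if "France" in prompt else "4"


def candidate_model(prompt: str) -> str:
    return "The capital is Paris." if "France" in prompt else "5"


CASES: List[EvalCase] = [
    ("What is the capital of France?", lambda out: "Paris" in out),
    ("What is 2 + 2?", lambda out: "4" in out),
]

print("current:", pass_rate(current_model, CASES))
print("candidate:", pass_rate(candidate_model, CASES))
print("safe to switch:", safe_to_switch(current_model, candidate_model, CASES))
```

In this toy run the candidate regresses on the arithmetic case, so the check advises against switching. The hard part in practice is writing cases and checks that reflect what the application actually needs.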

[Original text (with links): AI Restores ALS Patient's Voice, AI Lobby Grows, and more ]

2/11
Why are we considering 4 and 4o to be the same tokens though, if they aren't?

3/11
Let's hope SB-1047 proponents realize that open-source is already vital for customers and to avoid price gouging.

4/11
OpenAI gets a lot of 💩 for announcing but not releasing their innovation - which is most likely not their fault BTW. But they have given us GPT-4o-mini with amazing price-performance. I'm not sure most people realize how awesome it is!

5/11
As AI models become more affordable, it's a great time to explore new possibilities and build innovative applications without worrying too much about costs.

6/11
Prices will continue to go down. LLMs will rapidly become commodities. The value will be created at the application level.
Why Large Language Models Are A Commodity Now And What It Means For The AI Space

7/11
The key lesson from your post is to work on the application of LLM on a use case and, for the time being, accept the high cost

8/11
Mark Zuckerberg is the best. 👍

9/11
The most important factor is that the size of the model has really declined. GPT-4 is an 1800B MoE, while GPT-4o is maybe 100B, I guess. The second is inference optimization through various means like quantization, batching, and caching. The hardware price doesn't decline that fast.

10/11
With easier access to models, there might be a push for more transparency in how decisions are made, affecting both model selection and application design

11/11
Another confirmation that using @LangChainAI, although a pretty crufty library, is a good move

