The A.I Megathread (LLM , GPT , Development)

bnew · Apr 7, 2024

1/6
Intel presents LLaVA-Gemma

Accelerating Multimodal Foundation Models with a Compact Language Model

We train a suite of multimodal foundation models (MMFM) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs). Of particular

2/6
interest is the 2B parameter Gemma model, which provides opportunities to construct capable small-scale MMFMs. In line with findings from other papers in this space, we test the effect of ablating three design features: pretraining the connector, utilizing a more powerful image

3/6
backbone, and increasing the size of the language backbone. The resulting models, which we call LLaVA-Gemma, exhibit moderate performance on an array of evaluations, but fail to improve past the current comparably sized SOTA models. Closer analysis of performance shows mixed

4/6
effects; skipping pretraining tends to reduce performance, larger vision models sometimes improve performance, and increasing language model size has inconsistent effects. We publicly release training recipes, code and weights for our models for the LLaVA-Gemma models.

5/6
paper page:

6/6
daily papers:

Paper page - LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
Daily Papers - Hugging Face

bnew · Apr 7, 2024

1/1
The Power Era in AI and LLMs

Are advanced LLMs edging out even the need for category fine-tuning?

Data and model training are at the heart of the competition with large language models (LLMs). But a compelling narrative is unfolding, one that could very well redefine our approach to crafting artificial intelligence (AI) solutions tailored to specialized domains like finance and medicine.

The protagonists of our story? The generically trained, such as GPT-4 and Claude 3 Opus, are now being benchmarked against the "age-old" practice of fine-tuning models for domain-specific tasks.

The financial sector, with its intricate jargon and nuanced operations, serves as the perfect arena for this showdown. Traditionally, the path to excellence in financial text analytics involved fine-tuning models with domain-specific data, a method akin to giving a neural network a crash course in finance. However, a study from last year suggests a different story. And with the current rapid progress on "generic" models, this may be very important from performance and financial perspectives.

The Untrained Titans

Imagine, if you will, a world where an AI model not explicitly trained on financial data can navigate the complex labyrinths of financial text analytics, outperforming its fine-tuned counterparts. This isn't a fragment of sci-fi imagination anymore. GPT-4, with its vast, generalist training, is making this a reality. And that's just the beginning. Claude 3 Opus is gaining momentum, and the pending launch of GPT-5 further supports this trend.

These models, trained on a diverse array of internet text, have shown an astonishing ability to grasp and perform tasks across various domains, finance included, without the need for additional training. It's as if they've absorbed the internet's collective knowledge, enabling them to be jacks of all trades and, surprisingly, masters, too.

Fine-tuning has been the go-to for achieving peak performance in domain-specific tasks. By tailoring a model to understand the subtleties of financial language, one could expect it to excel in tasks ranging from sentiment analysis to complex question-answering. However, this approach comes with its own set of challenges. The need for domain-specific datasets, the computational resources for training, and the risk of overfitting to a particular domain are but a few hurdles on this path.

An Empirical Verdict

This 2023 study has tested these models across a variety of financial text analytics tasks, from sentiment analysis to question-answering. The results? ChatGPT and GPT-4 not only held their own but, in many cases, outshone the fine-tuned models. Particularly noteworthy is the GPT-4 performance, which showcases significant improvement over ChatGPT across nearly all financial benchmarks. This leap in capability suggests that as these LLMs evolve, their need for domain-specific fine-tuning may diminish.

Prompting Power

Beyond the raw computational prowess and expansive knowledge of the latest large language models, the art and science of prompting emerge as a pivotal layer in unlocking their full potential. The nuanced craft of prompt engineering transforms the way we harness these digital titans, bridging the gap between human ingenuity and AI's vast capabilities. This synergy between sophisticated human prompts and the model's strengths introduces a collaborative dimension to AI interaction, where the precision of the prompt dictates the relevance and depth of the model's response. As we refine our ability to communicate with these models through prompts, we're not just leveraging AI; we're engaging in a dynamic partnership that amplifies our collective intelligence, marking a significant leap forward in our journey with artificial intelligence.

Implications and Musings

What does this mean for the future of AI in specialized domains? Are we approaching a point where the flexibility and general prowess of LLMs could reduce the necessity for fine-tuning? This prospect is both exciting and a bit unsettling. On the one hand, the ability to deploy highly capable AI models without extensive domain-specific training could democratize AI, making powerful tools more accessible across various fields, from finance to medicine. On the other, it raises questions about the future of custom model development and the unique value it offers, particularly in the context of new studies and real-time data.

Technological Muscle Flexing

In this next step in AI's evolution, the rapid advancements in LLMs stand as a testament to the breakneck speed of progress in the field, edging us ever closer to the tantalizing horizon of artificial general intelligence (AGI). As these models, brimming with the potential of human-like understanding and capability, flex their technological muscles, we find ourselves at the cusp of what could be a defining moment for LLMs. The journey towards AGI, marked by this spectrum of advancements, promises to be transformative, reshaping our interaction with technology and establishing the "power era" of intelligent machines.

The Power Era in AI and LLMs

#LLMs #AI #AGI
@openai
@AnthropicAI
#GPT4 #ChatGPT #Claude3
@BrianRoemmele

1/1
The Power Era in AI and LLMs

Are advanced LLMs edging out even the need for category fine-tuning?

Data and model training are at the heart of the competition with large language models (LLMs). But a compelling narrative is unfolding, one that could very well redefine our…

bnew · Apr 7, 2024

Large language models can do jaw-dropping things. But nobody knows exactly why.

And that's a problem. Figuring it out is one of the biggest scientific puzzles of our time and a crucial step towards controlling more powerful future models.

www.technologyreview.com

ARTIFICIAL INTELLIGENCE

Large language models can do jaw-dropping things. But nobody knows exactly why.

And that's a problem. Figuring it out is one of the biggest scientific puzzles of our time and a crucial step towards controlling more powerful future models.

By Will Douglas Heavenarchive page

March 4, 2024

A photo illustration showing speech bubbles full of data.

SARAH ROGERS/MITTR | GETTY

Two years ago, Yuri Burda and Harri Edwards, researchers at the San Francisco–based firm OpenAI, were trying to find out what it would take to get a language model to do basic arithmetic. They wanted to know how many examples of adding up two numbers the model needed to see before it was able to add up any two numbers they gave it. At first, things didn’t go too well. The models memorized the sums they saw but failed to solve new ones.

By accident, Burda and Edwards left some of their experiments running far longer than they meant to—days rather than hours. The models were shown the example sums over and over again, way past the point when the researchers would otherwise have called it quits. But when the pair at last came back, they were surprised to find that the experiments had worked. They’d trained a language model to add two numbers—it had just taken a lot more time than anybody thought it should.

Curious about what was going on, Burda and Edwards teamed up with colleagues to study the phenomenon. They found that in certain cases, models could seemingly fail to learn a task and then all of a sudden just get it, as if a lightbulb had switched on. This wasn’t how deep learning was supposed to work. They called the behavior grokking.

“It’s really interesting,” says Hattie Zhou, an AI researcher at the University of Montreal and Apple Machine Learning Research, who wasn’t involved in the work. “Can we ever be confident that models have stopped learning? Because maybe we just haven’t trained for long enough.”

The weird behavior has captured the imagination of the wider research community. “Lots of people have opinions,” says Lauro Langosco at the University of Cambridge, UK. “But I don’t think there’s a consensus about what exactly is going on.”

With hopes and fears about the technology running wild, it's time to agree on what it can and can't do.

Grokking is just one of several odd phenomena that have AI researchers scratching their heads. The largest models, and large language models in particular, seem to behave in ways textbook math says they shouldn’t. This highlights a remarkable fact about deep learning, the fundamental technology behind today’s AI boom: for all its runaway success, nobody knows exactly how—or why—it works.

“Obviously, we’re not completely ignorant,” says Mikhail Belkin, a computer scientist at the University of California, San Diego. “But our theoretical analysis is so far off what these models can do. Like, why can they learn language? I think this is very mysterious.”

The biggest models are now so complex that researchers are studying them as if they were strange natural phenomena, carrying out experiments and trying to explain the results. Many of those observations fly in the face of classical statistics, which had provided our best set of explanations for how predictive models behave.

So what, you might say. In the last few weeks, Google DeepMind has rolled out its generative models across most of its consumer apps. OpenAI wowed people with Sora, its stunning new text-to-video model. And businesses around the world are scrambling to co-opt AI for their needs. The tech works—isn’t that enough?

But figuring out why deep learning works so well isn’t just an intriguing scientific puzzle. It could also be key to unlocking the next generation of the technology—as well as getting a handle on its formidable risks.

“These are exciting times,” says Boaz Barak, a computer scientist at Harvard University who is on secondment to OpenAI’s superalignment team for a year. “Many people in the field often compare it to physics at the beginning of the 20th century. We have a lot of experimental results that we don’t completely understand, and often when you do an experiment it surprises you.”

Old code, new tricks

Most of the surprises concern the way models can learn to do things that they have not been shown how to do. Known as generalization, this is one of the most fundamental ideas in machine learning—and its greatest puzzle. Models learn to do a task—spot faces, translate sentences, avoid pedestrians—by training with a specific set of examples. Yet they can generalize, learning to do that task with examples they have not seen before. Somehow, models do not just memorize patterns they have seen but come up with rules that let them apply those patterns to new cases. And sometimes, as with grokking, generalization happens when we don’t expect it to.

Large language models in particular, such as OpenAI’s GPT-4 and Google DeepMind’s Gemini, have an astonishing ability to generalize. “The magic is not that the model can learn math problems in English and then generalize to new math problems in English,” says Barak, “but that the model can learn math problems in English, then see some French literature, and from that generalize to solving math problems in French. That’s something beyond what statistics can tell you about.”

When Zhou started studying AI a few years ago, she was struck by the way her teachers focused on the how but not the why. “It was like, here is how you train these models and then here’s the result,” she says. “But it wasn’t clear why this process leads to models that are capable of doing these amazing things.” She wanted to know more, but she was told there weren’t good answers: “My assumption was that scientists know what they’re doing. Like, they’d get the theories and then they’d build the models. That wasn’t the case at all.”

The rapid advances in deep learning over the last 10-plus years came more from trial and error than from understanding. Researchers copied what worked for others and tacked on innovations of their own. There are now many different ingredients that can be added to models and a growing cookbook filled with recipes for using them. “People try this thing, that thing, all these tricks,” says Belkin. “Some are important. Some are probably not.”

“It works, which is amazing. Our minds are blown by how powerful these things are,” he says. And yet for all their success, the recipes are more alchemy than chemistry: “We figured out certain incantations at midnight after mixing up some ingredients,” he says.

Overfitting

The problem is that AI in the era of large language models appears to defy textbook statistics. The most powerful models today are vast, with up to a trillion parameters (the values in a model that get adjusted during training). But statistics says that as models get bigger, they should first improve in performance but then get worse. This is because of something called overfitting.

When a model gets trained on a data set, it tries to fit that data to a pattern. Picture a bunch of data points plotted on a chart. A pattern that fits the data can be represented on that chart as a line running through the points. The process of training a model can be thought of as getting it to find a line that fits the training data (the dots already on the chart) but also fits new data (new dots).

A straight line is one pattern, but it probably won’t be too accurate, missing some of the dots. A wiggly line that connects every dot will get full marks on the training data, but won’t generalize. When that happens, a model is said to overfit its data.

An exclusive conversation with Ilya Sutskever on his fears for the future of AI and why they’ve made him change the focus of his life’s work.

According to classical statistics, the bigger a model gets, the more prone it is to overfitting. That’s because with more parameters to play with, it’s easier for a model to hit on wiggly lines that connect every dot. This suggests there’s a sweet spot between under- and overfitting that a model must find if it is to generalize. And yet that’s not what we see with big models. The best-known example of this is a phenomenon known as double descent.

The performance of a model is often represented in terms of the number of errors it makes: as performance goes up, error rate goes down (or descends). For decades, it was believed that error rate went down and then up as models got bigger: picture a U-shaped curve with the sweet spot for generalization at the lowest point. But in 2018, Belkin and his colleagues found that when certain models got bigger, their error rate went down, then up—and then down again (a double descent, or W-shaped curve). In other words, large models would somehow overrun that sweet spot and push through the overfitting problem, getting even better as they got bigger.

A year later, Barak coauthored a paper showing that the double-descent phenomenon was more common than many thought. It happens not just when models get bigger but also in models with large amounts of training data or models that are trained for longer. This behavior, dubbed benign overfitting, is still not fully understood. It raises basic questions about how models should be trained to get the most out of them.

Researchers have sketched out versions of what they think is going on. Belkin believes there’s a kind of Occam’s razor effect in play: the simplest pattern that fits the data—the smoothest curve between the dots—is often the one that generalizes best. The reason bigger models keep improving longer than it seems they should could be that bigger models are more likely to hit upon that just-so curve than smaller ones: more parameters means more possible curves to try out after ditching the wiggliest.

“Our theory seemed to explain the basics of why it worked,” says Belkin. “And then people made models that could speak 100 languages and it was like, okay, we understand nothing at all.” He laughs: “It turned out we weren’t even scratching the surface.”

For Belkin, large language models are a whole new mystery. These models are based on transformers, a type of neural network that is good at processing sequences of data, like words in sentences.

There’s a lot of complexity inside transformers, says Belkin. But he thinks at heart they do more or less the same thing as a much better understood statistical construct called a Markov chain, which predicts the next item in a sequence based on what’s come before. But that isn’t enough to explain everything that large language models can do. “This is something that, until recently, we thought should not work,” says Belkin. “That means that something was fundamentally missing. It identifies a gap in our understanding of the world.”

Belkin goes further. He thinks there could be a hidden mathematical pattern in language that large language models somehow come to exploit: “Pure speculation but why not?”

“The fact that these things model language is probably one of the biggest discoveries in history,” he says. “That you can learn language by just predicting the next word with a Markov chain—that’s just shocking to me.”

bnew · Apr 7, 2024

Start small

Researchers are trying to figure it out piece by piece. Because large models are too complex to study themselves, Belkin, Barak, Zhou, and others experiment instead on smaller (and older) varieties of statistical model that are better understood. Training these proxies under different conditions and on various kinds of data and observing what happens can give insight into what’s going on. This helps get new theories off the ground, but it is not always clear if those theories will hold for larger models too. After all, it is in the complexity of large models that many of the weird behaviors reside.

Is a theory of deep learning coming? Daniel Hsu, a computer scientist at Columbia University who was one of Belkin’s coauthors on the double-descent paper, doesn’t expect all the answers anytime soon. “We have better intuition now,” he says. “But really explaining everything about why neural networks have this kind of unexpected behavior? We’re still far from doing that.”

Exclusive conversations that take us behind the scenes of a cultural phenomenon.

In 2016, Chiyuan Zhang at MIT and colleagues at Google Brain published an influential paper titled “Understanding Deep Learning Requires Rethinking Generalization.” In 2021, five years later, they republished the paper, calling it “Understanding Deep Learning (Still) Requires Rethinking Generalization.” What about in 2024? “Kind of yes and no,” says Zhang. “There has been a lot of progress lately, though probably many more questions arise than get resolved.”

Meanwhile, researchers continue to wrestle even with the basic observations. In December, Langosco and his colleagues presented a paper at NeurIPS, a top AI conference, in which they claimed that grokking and double descent are in fact aspects of the same phenomenon. “You eyeball them and they look kind of similar,” says Langosco. He believes that an explanation of what’s going on should account for both.

At the same conference, Alicia Curth, who studies statistics at the University of Cambridge, and her colleagues argued that double descent is in fact an illusion. “It didn’t sit very well with me that modern machine learning is some kind of magic that defies all the laws that we’ve established so far,” says Curth. Her team argued that the double-descent phenomenon—where models appear to perform better, then worse, and then better again as they get bigger—arises because of the way the complexity of the models was measured.

Belkin and his colleagues used model size—the number of parameters—as a measure of complexity. But Curth and her colleagues found that the number of parameters might not be a good stand-in for complexity because adding parameters sometimes makes a model more complex and sometimes makes it less so. It depends what the values are, how they get used during training, and how they interact with others—much of which stays hidden inside the model. “Our takeaway was that not all model parameters are created equal,” says Curth.

In short, if you use a different measure for complexity, large models might conform to classical statistics just fine. That’s not to say there isn’t a lot we don’t understand about what happens when models get bigger, says Curth. But we already have all the math we need to explain it.

A great mystery of our time

It's true that such debates can get into the weeds. Why does it matter whether AI models are underpinned by classical statistics or not?

One answer is that better theoretical understanding would help build even better AI or make it more efficient. At the moment, progress has been fast but unpredictable. Many things that OpenAI’s GPT-4 can do came as a surprise even to the people who made it. Researchers are still arguing over what it can and cannotachieve. “Without some sort of fundamental theory, it’s very hard to have any idea what we can expect from these things,” says Belkin.

Barak agrees. “Even once we have the models, it is not straightforward even in hindsight to say exactly why certain capabilities emerged when they did,” he says.

This isn’t only about managing progress—it’s about anticipating risk, too. Many of the researchers working on the theory behind deep learning are motivated by safety concerns for future models. “We don’t know what capabilities GPT-5 will have until we train it and test it,” says Langosco. “It might be a medium-size problem right now, but it will become a really big problem in the future as models become more powerful.”

Barak works on OpenAI’s superalignment team, which was set up by the firm’s chief scientist, Ilya Sutskever, to figure out how to stop a hypothetical superintelligence from going rogue. “I’m very interested in getting guarantees,” he says. “If you can do amazing things but you can’t really control it, then it’s not so amazing. What good is a car that can drive 300 miles per hour if it has a shaky steering wheel?”

But beneath all that there’s also a grand scientific challenge. “Intelligence is definitely up there as one of the great mysteries of our time,” says Barak.

“We’re a very infant science,” he says. “The questions that I’m most excited about this month might be different to the questions that I’m most excited about next month. We are still discovering things. We very much need to experiment and get surprised.”

bnew · Apr 7, 2024

1/7
ChatGLM-Math

Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline

Large language models (LLMs) have shown excellent mastering of human language, but still struggle in real-world applications that require mathematical problem-solving. While many

2/7
strategies and datasets to enhance LLMs' mathematics are developed, it remains a challenge to simultaneously maintain and improve both language and mathematical capabilities in deployed LLM systems. In this work, we tailor the Self-Critique pipeline, which addresses the

3/7
challenge in the feedback learning stage of LLM alignment. We first train a general Math-Critique model from the LLM itself to provide feedback signals. Then, we sequentially employ rejective fine-tuning and direct preference optimization over the LLM's own generations for

4/7
data collection. Based on ChatGLM3-32B, we conduct a series of experiments on both academic and our newly created challenging dataset, MathUserEval. Results show that our

5/7
pipeline significantly enhances the LLM's mathematical problem-solving while still improving its language ability, outperforming LLMs that could be two times larger.

6/7
paper page:

7/7
daily papers:

https://huggingface.co/BlinkDL/rwkv-6-world…
https://stanford.zoom.us/j/99922151759?pwd=dW5CcUtVYkNybGZGY0hMWUZtVkZBZz09…
CS25: Tranformers United!

Paper page - ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Cr...
Daily Papers - Hugging Face

bnew · Apr 7, 2024

1/6
(1/2) Introducing LL3M: Large Language, Multimodal, and Moe Model Open Research Plan

GitHub - jiasenlu/LL3M: LL3M: Large Language and Multi-Modal Model in Jax

LL3M: Large Language and Multi-Modal Model in Jax. Contribute to jiasenlu/LL3M development by creating an account on GitHub.

github.com

With the following goals:
- Build an open-sourced codebase in Jax / Flax that supports large-scale training in LLM, LMM, and MoE models.

- Record and share the journey and tricks to implement and train such a model.

- Come up with something new and exciting -- A multimodal version of Branch-Train-MiX initialized with a learnable model soup.

(More release milestones are listed below)

2/6
(2/2) Here are five release milestones :

1: Language Model and Seqio Dataloader for Dolma dataset that supports existing open-weight LLM, such as LLaMA2, Mistral, Gemma, and Phi.

2: Multimodal LLM and Dataloader that supports Llava 1.5 and Llava 1.6.

3: Learnable Model soup…

3/6
Thanks for the suggestion! Jax/Flax has very good data model parallelism that make it easy to implement MoE type of model. I also mainly test the code based on TPU.

4/6
Thanks!

5/6
Not right now, maybe can use the discussions in (jiasenlu LL3M · Discussions)[/URL]

6/6
Current codebase is based on EazyLM (GitHub - young-geng/EasyLM: Large language models (LLMs) made easy, EasyLM is a one stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Flax.)[/URL] but will add much more functions!

GitHub - jiasenlu/LL3M: LL3M: Large Language and Multi-Modal Model in Jax
https://github.com/TRI-ML/prismatic-vlms…
https://github.com/jiasenlu/LL3M/discussions…
https://github.com/young-geng/EasyLM…

GitHub - TRI-ML/prismatic-vlms: A flexible and efficient codebase for training visually-conditioned...
jiasenlu LL3M · Discussions
GitHub - stanford-crfm/levanter: Legible, Scalable, Reproducible Foundation Models with Named...
GitHub - young-geng/EasyLM: Large language models (LLMs) made easy, EasyLM is a one stop solution...

bnew · Apr 7, 2024

1/1
Multi-Conditional Ranking with Large Language Models

Defines the multi-conditional ranking (MCR) task, introduces a benchmark to evaluate it, and proposes a method to enhance large language models' performance on MCR.

[2404.00211] Multi-Conditional Ranking with Large Language Models

GitHub - megagonlabs/MCR

Contribute to megagonlabs/MCR development by creating an account on GitHub.

github.com

bnew · Apr 7, 2024

1/2
1/n Recursive Reasoning: This AI Unlocks Complexity Through Smart Decomposition

Large language models (LLMs) have shown remarkable capabilities in performing complex decision-making tasks that require planning and adaptation to interactive environments. However, existing approaches for using LLMs as decision-making agents face significant limitations when dealing with highly complex tasks.

The predominant methods can be categorized into two types - iterative executors and plan-and-execute approaches. Iterative executors like ReAct determine the next action for the LLM based on the current state, but struggle with compositional complexity and long action-observation trajectories as tasks become harder. On the other hand, plan-and-execute methods create an upfront high-level plan by having an LLM decompose the task into sub-tasks to be executed by another LLM. While this reduces complexity, these approaches lack flexibility - if any sub-task is too complex for the executor LLM, the overall task fails.

To address these shortcomings, the paper introduces ADAPT (As-Needed Decomposition and Planning with Language Models), a novel approach that can recursively decompose complex sub-tasks further as needed during execution. ADAPT consists of separate LLM modules for planning and execution, coordinated by a controller program. If the executor LLM fails on a sub-task, the planner LLM decomposes it further into smaller steps to align with the executor's capabilities.

The key strength of ADAPT lies in its dynamic, recursive decomposition structure inspired by hierarchical planning. However, unlike classical methods that rely on hand-specified domain knowledge, ADAPT leverages the broad world knowledge encoded in LLMs to autonomously decompose tasks. This allows it to adapt not just to different LLMs with varying execution abilities, but also to the inherent complexity of tasks.

The authors evaluate ADAPT on three diverse interactive environment datasets - ALFWorld (household tasks), WebShop (online shopping), and a new compositional game TextCraft for crafting Minecraft recipes. Using GPT-3.5 as the base LLM, ADAPT demonstrates significant performance gains over strong baselines like ReAct, Plan-and-Solve, and the adaptive Reflexion method. On ALFWorld, WebShop, and TextCraft, ADAPT achieves up to 28.3%, 27%, and 33% absolute improvements in success rates respectively compared to ReAct.

Through extensive analysis, the paper establishes the importance of ADAPT's recursive decomposition in achieving these results. The extent of decomposition automatically adjusts based on the execution capabilities of the LLM employed as well as the inherent complexity of tasks. This ability to dynamically adapt its hierarchical decomposition to the situation at hand is the key innovation that enables ADAPT's state-of-the-art performance.

In summary, the ADAPT paper presents a elegant solution to a critical limitation in using LLMs for complex, interactive decision-making tasks. By integrating recursive task decomposition with LLM-based planning and execution, ADAPT paves the way for more capable and robust AI agents that can gracefully handle complex real-world environments. Despite some limitations like the lack of multi-step lookahead, the promising results position ADAPT as an important step towards leveraging the remarkable knowledge and reasoning capabilities of LLMs for sequential decision-making problems.

2/2
2/n Here is the source code. Thank you for subscribing!

bnew · Apr 8, 2024

1/8
Updated my BitNet CPU project with inference speed estimation benchmark results on workstation Xeon processors: GitHub - catid/bitnet_cpu: Experiments with BitNet inference on CPU - It's about 40 tokens/second. AMD Ryzen 5 7535HS CPU achieves about 7.4 tokens/second for Gemma 3B.

2/8
This is using multi-threading, and unrolled AVX-512 instructions. So, probably this is a kind of upper bound on the performance rather than what you could expect with a full implementation.

3/8
BitNet speedup on AVX2: lithium0003 on Github suggested using the _mm256_sign_epi8() intrinsic to greatly speed up the AVX2 version. It's now running at 28 tokens/second vs 40 tokens/second on AVX5-12. Code is checked in

4/8
Got 50 tokens/second on my 7950X

5/8
Yeah the weights remain as 2 bits through the whole kernel (it never decompresses them)

6/8
Thanks!

7/8
Sounds cool can't wait to see your results

8/8
Seems ALU bound yeah

https://github.com/catid/bitnet_cpu…
https://huggingface.co/collections/m-a-p/chinese-tiny-llm-660d0133dff6856f94ce0fc6…
[2404.04167] Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

GitHub - catid/bitnet_cpu: Experiments with BitNet inference on CPU

bnew · Apr 8, 2024

1/1
Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

- Presents CT-LLM, a 2B LLM
- Open-sourcing the full process of training, including a detailed data processing procedure

hf: Chinese Tiny LLM - a m-a-p Collection
abs: [2404.04167] Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

bnew · Apr 8, 2024

1/2
One of the best blogs I've read on LDMs and LCMs in a while. Fun to read, filled with figures made on @yacineMTB
's @dingboard_
, balances math and code to go with it well, and an awesome explanation of some rather tricky concepts.

explained: latent consistency models | Notion

https://arxiv.org/abs/2310.04378

naklecha.notion.site

Awesome stuff! @naklecha

2/2
Looking forward to it! Would love to collaborate sometime too

@yacineMTB
@dingboard_
https://naklecha.notion.site/explained-latent-consistency-models-13a9290c0fd3427d8d1a1e0bed97bde2…
@naklecha
https://github.com/bethgelab/frequency_determines_performance…
https://huggingface.co/datasets/bethgelab/Let-It-Wag…
[2404.04125] No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
@3blue1brown

explained: latent consistency models | Notion

bnew · Apr 8, 2024

1/3
Watched a super interesting talk on Ring Attention, probably the magic behind Gemini's 1 million context window

You organize your devices (GPU/TPU) in a ring, each computing a part of the final attention output

Each device needs to see all keys/values to produce its part

1/2

2/3
The idea is that the attention output can be computed blockwise (by splitting on the sequence dimension). Each device computes the updated queries of a chunk of the sequence by sending/receiving keys/values

This is a great repo to understand it in code:

3/3
This was the talk, not sure if there's a recording, maybe the slides could be shared https://twitter.com/neurosp1ke/status/1776555529566875674 cc @neurosp1ke (btw thanks for the talk!)

@exists_forall

ring-flash-attention/test/test_ring_flash_attn_func.py at main · zhuzilin/ring-flash-attention
arxiv.org
Striped Attention: Faster Ring Attention for Causal Transformers
To help address the growing demand for ever-longer sequence lengths in transformer models, Liu et al. recently proposed Ring Attention, an exact attention algorithm capable of overcoming...

bnew · Apr 8, 2024

booooooom

Total number of model repos on Hugging Face (private included)

bnew · Apr 8, 2024

1/4
We have decided to update text-generation-inference (TGI)'s license.

We switch the license from HFOIL (our custom license) back to Apache 2, hence making the library fully open-source.

Read below for why we are making this change

2/4
in July 2023, we wanted to experiment with a custom license for this specific project in order to protect our commercial solutions from companies with bigger means than we do, who would just host an exact copy of our cloud services.

The experiment however wasn't successful.

It…

3/4
In the spirit of learning from our experiments fast, we have decided to revert to a more standard, full open-source license, and we'll keep this codebase under Apache 2 in the future.

The change also applies to text-embeddings-inference, our optimized embedding library.

4/4
We welcome everyone and anyone's contributions to the codebase going forward.

With ,

Julien

bnew · Apr 8, 2024

1/5
Introducing 𝐒𝐚𝐢𝐥𝐨𝐫: new family of 𝐋𝐋𝐌𝐬 ranging from 0.5B to 7B params!
Pre-trained from Qwen1.5 models, it demonstrates exceptional performance on South-East Asian languages.
Takes us one step closer to multilingual LLMs that serve the needs of a region & beyond!

2/5
Key tech deets:

Data curation: Merging short examples, document-level code-switching, aggressive data cleaning & dedupe
Tokenization robustness: BPE dropout for dealing with prompt variations
Optimizing data mixture: Automatically balancing capabilities across languages

3/5
Want to experience the power of Sailor firsthand?
Dive deep
Project: GitHub - sail-sg/sailor-llm: Sailor: Open Language Models for South-East Asia
Demo:[/URL] Sailor Chat - a Hugging Face Space by sail
Reach[/URL] out to
@sivil_taram on how Sailor can revolutionize multilingual NLP!
Join us in pushing the boundaries with LLM apps!

4/5
𝐂𝐇𝐀𝐌𝐏: 𝐂𝐨𝐧𝐭𝐫𝐨𝐥𝐥𝐚𝐛𝐥𝐞 𝐚𝐧𝐝 𝐂𝐨𝐧𝐬𝐢𝐬𝐭𝐞𝐧𝐭 𝐇𝐮𝐦𝐚𝐧 𝐈𝐦𝐚𝐠𝐞 𝐀𝐧𝐢𝐦𝐚𝐭𝐢𝐨𝐧

Model weights on
@huggingface
:[/URL] fudan-generative-ai/champ · Hugging Face
Code:[/URL] GitHub - fudan-generative-vision/champ: Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance

Curious[/URL] to see Champ in action? Build your app & submit for Spaces GPU grants!

5/5
Champ incorporates rendered depth images, normal maps, & semantic maps from SMPL sequences, along with skeleton-based motion guidance

Multi-layer motion fusion module seamlessly fuses shape & motion latent representations

High-quality & temporally coherent animations!

https://github.com/sail-sg/sailor-llm…
https://huggingface.co/spaces/sail/Sailor-7B-Chat…
@sivil_taram
@huggingface
https://huggingface.co/fudan-generative-ai/champ…
https://github.com/fudan-generative-vision/champ…

The A.I Megathread (LLM , GPT , Development)

Veteran

Veteran

Veteran

Large language models can do jaw-dropping things. But nobody knows exactly why.​

Old code, new tricks​

Overfitting​

Veteran

Start small​

A great mystery of our time​

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

Large language models can do jaw-dropping things. But nobody knows exactly why.

Old code, new tricks

Overfitting

Start small

A great mystery of our time