bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086



1/3
@rohanpaul_ai
LLMs gain human-like awareness of word positions through numbered tracking.

Adding position markers to LLM inputs enables exact length control and accurate text manipulation.

**Original Problem** 🔍:

LLMs struggle with length control and precise copy-paste operations due to lack of positional awareness.

The authors identify a lack of positional awareness as the root cause of LLMs' inability to effectively control text length. This stems from token-level operations and insufficient training on data with strict length limitations.

-----

**Solution in this Paper** 🛠️:

• PositionID Prompting: Assigns sequential IDs to words/sentences/paragraphs during generation

• PositionID Fine-Tuning: Trains models on mixed normal and PositionID modes

• PositionID CP Prompting: Enables accurate copy-paste using a three-stage tool-use mechanism

-----

**Key Insights from this Paper** 💡:

• Explicit positional awareness enhances LLMs' length control and copy-paste abilities

• PositionID techniques work for both closed-source and open-source models

• Mixed-mode training transfers positional awareness to normal generation mode

-----

**Results** 📊:

• PositionID Prompting: Best Rouge-L (23.2) and MAE scores across all levels

• PositionID Fine-Tuning: Outperforms CFT and InstructCTG in MAE metrics

• PositionID CP Prompting: 80.8% CP Success Rate, 18.4 Rouge-L, 8.4 PPL





2/3
@rohanpaul_ai
📝 LenCtrl-Bench Details

LenCtrl-Bench evaluates three workflow variants:

👉 Vanilla Prompting:
- Takes the user query and a length constraint
- Generates text directly, without position tracking
- Least accurate length control

👉 PositionID Prompting:
- Appends a sequential position ID to each word/token during generation
- Helps the model track length as it writes
- More precise length control
- Example: "Three 1 -word 2 text 3" (a sketch of this idea follows below)

👉 PositionID Fine-Tuning:
- Trains the model in two modes:
  - Normal mode (without position IDs)
  - PositionID mode (with position IDs)
- Infers in normal mode while retaining positional awareness
- Most effective for length control
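
To make the PositionID Prompting idea concrete, here is a minimal sketch in Python. The helper names and prompt wording are illustrative assumptions, not the paper's exact templates.

```python
# Minimal sketch of word-level PositionID prompting; the prompt wording and
# helper names are assumptions for illustration, not the paper's templates.

def add_position_ids(text: str) -> str:
    """Interleave a sequential position ID after every word."""
    return " ".join(f"{w} {i}" for i, w in enumerate(text.split(), start=1))

def build_length_controlled_prompt(query: str, target_words: int) -> str:
    # Asking the model to echo position IDs lets it "see" how many words it
    # has written so far and stop at the target length.
    return (
        f"{query}\n"
        f"Write exactly {target_words} words. After each word, append its "
        f"position ID, e.g. 'Three 1 -word 2 text 3'. Stop when the ID "
        f"reaches {target_words}."
    )

def strip_position_ids(generated: str) -> str:
    """Drop the interleaved numeric IDs to recover plain text."""
    return " ".join(tok for tok in generated.split() if not tok.isdigit())

print(add_position_ids("Three -word text"))          # -> "Three 1 -word 2 text 3"
print(strip_position_ids("Three 1 -word 2 text 3"))  # -> "Three -word text"
```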





3/3
@rohanpaul_ai
📚 [2410.07035] PositionID: LLMs can Control Lengths, Copy and Paste with Explicit Positional Awareness







1/1
@arXivGPT
🏷️:PositionID: LLMs can Control Lengths, Copy and Paste with Explicit Positional Awareness

🔗:https://arxiv.org/pdf/2410.07035.pdf






 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086






1/6
@rohanpaul_ai
LLMs become less repetitive by first mapping out different ways to answer, then picking mindfully.

Instead of grabbing the first answer, let it scan different answer zones before responding.

This paper proposes to diversify LLM outputs through stratified sampling, improving coverage and addressing mode collapse.

**Original Problem** 🔍:

LLMs often lack diversity in their responses, especially when multiple valid answers exist. Current methods like temperature scaling degrade generation quality and don't effectively address mode collapse.

-----

**Solution in this Paper** 🛠️:

• SimpleStrat: A training-free sampling approach to increase diversity
• Three stages: auto-stratification, heuristic estimation, probabilistic prompting
• Uses LLM to identify useful partitions of solution space
• Computes joint probabilities across strata
• Samples from joint probability distribution to augment original prompt

-----

**Key Insights from this Paper** 💡:

• LLMs can identify meaningful diversity dimensions even if they can't generate diverse solutions
• Stratified sampling counteracts biases in next-token probabilities
• Diversity improvement is orthogonal to temperature scaling
• SimpleStrat addresses mode collapse without manual intervention

-----

**Results** 📊:

• Average reduction in KL divergence of 0.36 vs. the baseline on Llama 3 models
• Consistent 0.05 increase in recall over GPT-4o across all temperatures
• Improved diversity on top of temperature scaling
• CoverageQA dataset: 105 underspecified questions with an average of 28.7 equally plausible answers





2/6
@rohanpaul_ai
🔍 SimpleStrat is a training-free sampling approach that improves language model generation diversity without degrading generation quality. It consists of three stages:

1. Auto-stratification: The language model identifies useful partitions of the solution space based on the user request.

2. Heuristic estimation: Computes joint probabilities across all strata.

3. Probabilistic prompting: Samples from the joint probability distribution to augment the original user prompt with selected strata.
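
For readers who want to see the shape of this pipeline in code, here is a rough, hedged sketch. The prompts, the `call_llm` placeholder, and the JSON output format are assumptions for illustration; the paper's exact prompts and probability estimation differ in detail.

```python
# Hedged sketch of a SimpleStrat-style pipeline; prompts and helpers are
# illustrative assumptions, not the paper's exact implementation.
import json
import random

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call returning a string."""
    raise NotImplementedError

def auto_stratify(user_request: str) -> list[str]:
    # Stage 1: ask the model itself for useful partitions of the answer space.
    reply = call_llm(
        "List 3-5 mutually exclusive categories (strata) that partition the "
        f"possible answers to: '{user_request}'. Return a JSON list of strings."
    )
    return json.loads(reply)

def estimate_weights(user_request: str, strata: list[str]) -> list[float]:
    # Stage 2: heuristically estimate what fraction of valid answers falls in
    # each stratum, then normalize into a probability distribution.
    reply = call_llm(
        f"For the question '{user_request}', estimate the fraction of valid "
        f"answers belonging to each stratum in {strata}. Return a JSON list of numbers."
    )
    weights = json.loads(reply)
    total = sum(weights)
    return [w / total for w in weights]

def probabilistic_prompt(user_request: str) -> str:
    # Stage 3: sample a stratum from the estimated distribution and augment
    # the original prompt with it before generating the final answer.
    strata = auto_stratify(user_request)
    weights = estimate_weights(user_request, strata)
    chosen = random.choices(strata, weights=weights, k=1)[0]
    return call_llm(f"{user_request}\nConstraint: the answer should fall under '{chosen}'.")
```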





3/6
@rohanpaul_ai
📊 The traditional Temperature scaling has two main limitations:

1. Higher temperatures degrade generation quality, which is especially critical in syntax-sensitive settings like code generation.

2. Controlling for temperature does not necessarily improve diversity in the answer space, especially if the model suffers from mode collapse or is excessively confident.



4/6
@rohanpaul_ai
Mode collapse in LLMs happens when a model generates only a limited set of outputs instead of exploring the full distribution of training data.

🎯 Mode Collapse problem:

- LLMs get stuck in a narrow set of responses, ignoring other valid options

- Even with high temperature settings, the model keeps returning to same few answers

- Example from this paper: When asked "Name a US State", model fixates on "California"

- High temperature makes responses less coherent but doesn't fix the fixation

🔍 Why This Matters:

- The Paper shows temperature scaling fails as diversity solution

- Model learns strong biases during training

- These biases create "deep valleys" in probability space

- Temperature tweaks can't help model escape these valleys

🛠️ SimpleStrat addresses mode collapse by:

1. Using the language model itself to partition the space into strata (auto-stratification).

2. Estimating the proportion of solutions in each stratum (heuristic estimation).

3. Randomly selecting a stratum and drawing a sample from within it (probabilistic prompting).





5/6
@rohanpaul_ai
🔬 SimpleStrat measures diversity using two metrics:

1. For open-source models: Kullback-Leibler (KL) Divergence from the response distribution to a uniform distribution over all valid answers.

2. For proprietary models: Coverage via recall of ground-truth solutions over 100 samples.
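
For intuition, here is a small sketch of how these two metrics can be computed, assuming the set of valid ground-truth answers is known; the paper's exact evaluation protocol may differ.

```python
# Rough sketch of the two diversity metrics described above; standard library only.
import math
from collections import Counter

def kl_to_uniform(samples: list[str], valid_answers: list[str]) -> float:
    """KL(empirical response distribution || uniform over valid answers)."""
    counts = Counter(samples)
    n, k = len(samples), len(valid_answers)
    kl = 0.0
    for ans in valid_answers:
        p = counts.get(ans, 0) / n
        if p > 0:  # zero-probability answers contribute nothing to the sum
            kl += p * math.log(p / (1.0 / k))
    return kl

def coverage_recall(samples: list[str], valid_answers: list[str]) -> float:
    """Fraction of ground-truth answers that appear at least once."""
    seen = set(samples)
    return sum(a in seen for a in valid_answers) / len(valid_answers)

# A mode-collapsed sampler that mostly answers "California":
print(kl_to_uniform(["California"] * 90 + ["Texas"] * 10,
                    ["California", "Texas", "Ohio", "Maine"]))
print(coverage_recall(["California", "Texas"],
                      ["California", "Texas", "Ohio", "Maine"]))
```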





6/6
@rohanpaul_ai
📚 [2410.09038] SimpleStrat: Diversifying Language Model Generation with Stratification




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086





1/41
@rohanpaul_ai
“It turned out that having a bot think for just 20 seconds in a hand of poker got the same boosting performance as scaling up the model by 100,000x and training it for 100,000 times longer,”





2/41
@MitjaMartini
20s thinking time is good for poker and turn-based games, 10x faster and it’s ok for web apps, 100x faster and it‘s good for interactive web apps, 1000x faster makes it good enough for visual-real-time apps like first person shooter games, 20000x faster good enough for many real-world real time apps, eg. reason in a game of table tennis.



3/41
@rohanpaul_ai
💯



4/41
@AI_GPT42
Is there a source for this info?



5/41
@rohanpaul_ai
checkout the links in the VentureBeat piece



6/41
@awaken_tom
Maybe explains why Elon sometimes takes 10+ seconds to respond in interviews.



7/41
@rohanpaul_ai
Great point indeed. 💯



8/41
@ManacasterBen
@wileydavewillis remember my idea for Bot Poker Tournaments?



9/41
@ssurajchawla
what...this could mean inferencing is gonna be way bigger than training in coming future

applies to many scenarios?



10/41
@SirMrMeowmeow
i hope we can at some point make calls to it for agents, but as a soft parameter to be around.. like 1-3 for brief thinking while not taking too long. where 0 would be no thinking...



11/41
@datageek_pl
Very interesting from the “takeoff speed” angle.

Like, maybe we are already at the “sufficient” model / training set size and we will see sudden improvement with the compound effect of new optimizers (soap/shampoo), sampling (entropix) and other, inference-time techniques (Brown is very bullish here)?

BTW, it’s interesting that X doesn’t want to publish the full link, so here is the URL without the protocol part: OpenAI scientist Noam Brown stuns TED AI Conference: ’20 seconds of thinking worth 100,000x more data’



12/41
@gpt_biz
That's amazing! It's fascinating how optimizing thinking time can rival massive model scaling in performance improvement.



13/41
@fute_nukem
Imagine how much better it would have played if they only let it think for 10 seconds



14/41
@IanTimotheos
Context is all you need.



15/41
@LilithDatura
Game Theory



16/41
@Shedletsky
Surprising only to people who don't understand how this stuff works



17/41
@PimentelES1987
Well, that's natural because part of playing a game is the intuition or memorization that you have from experience, which is basically what AI does. And part of this is just calculation. So it doesn't make sense to use AI for the calculation part.



18/41
@JoeScot_7
This holds true for everyone humans included.



19/41
@AntDX316
Imagine when AI can AI ML perfect expectations.



20/41
@_MariaMu_
wow



21/41
@PsionicPsittacc
I wonder what humans could do if they thought for >0 sometimes... but I guess we'll never know



22/41
@Kristapher100
Weird and interesting.



23/41
@naomihart
cogito ergo sum



24/41
@vanjajaja1
i just watched this talk on youtube actually, v good, it seems to be what o1 is made of

https://invidious.poast.org/watch?v=eaAonE58sLU



25/41
@TheCoffeeJesus
Filed under “no shyt”

I need to become a more serious person I could make a lot of money and probably a difference

If this is the kind of thing they’re overlooking then Jesus Christ there’s some low hanging fruit about



26/41
@JohnAPfeifer
Intuitively feels like this is how a child learns.

Much less data, but a lot more thinking.



27/41
@kindaoldsol
Thinking fast thinking slow



28/41
@CybeleAI
@doomslide
@_xjdr



29/41
@krasmanalderey
Something similar happened with AlphaZero, a bit of time to "ponder" = extreme improvement



30/41
@Matt_heyqq
I was attending the @TEDAIVienna where did he say this? 🤔



31/41
@techfusionjb
20 seconds can make all the difference. Wonder what other applications this could have.



32/41
@climatebabes
Duh. But thinking means making choices..



33/41
@B66Paul
slowing ai down intrigues. bigger not always better?



34/41
@CultureDot1
@rohanpaul_ai - scale it on @Polkadot like the team at @NeuroWebAI is doing and use on-chain data as proof of data.

#DOT



35/41
@dgx1dan
I had an in-depth conversation with ChatGPT about philosophy and psychology, expanding on the topics significantly. You can read it here—by the end, it reflects on its own evolution. OMEGA POINT



36/41
@TedSumAtHome
Thinking matters. Who would have thought!



37/41
@Charlie40074981
“Thinking”

Just not as we know it.



38/41
@heddo_safadi
interesting , want to know how these numbers are calculated.



39/41
@artilectium
Wouldn't this imply OpenAI has solved AGI but are keeping it under wraps so as to not disrupt ?



40/41
@rohanpaul_ai
🤔🤔



41/41
@brbcatonfire
Humans don't use data for decision making.

Bots need to learn to YOLO no fact decision making.




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086


OpenAI researchers develop new model that speeds up media generation by 50X​

Carl Franzen@carlfranzen

October 23, 2024 2:07 PM





Credit: VentureBeat made with OpenAI ChatGPT


A pair of researchers at OpenAI has published a paper describing a new type of continuous-time consistency model (sCM) that generates multimedia, including images, video, and audio, 50 times faster than traditional diffusion models, producing an image in roughly a tenth of a second compared to more than 5 seconds for regular diffusion.

With the introduction of sCM, OpenAI has managed to achieve comparable sample quality with only two sampling steps, offering a solution that accelerates the generative process without compromising on quality.

Described in the not-yet-peer-reviewed paper published on arXiv.org and a blog post released today, both authored by Cheng Lu and Yang Song, the innovation enables these models to generate high-quality samples in just two steps, significantly faster than previous diffusion-based models that require hundreds of steps.

Song was also a lead author on a 2023 paper from OpenAI researchers, including former chief scientist Ilya Sutskever, that introduced the idea of "consistency models," defined by the property that "points on the same trajectory map to the same initial point."

While diffusion models have delivered outstanding results in producing realistic images, 3D models, audio, and video, their inefficiency in sampling—often requiring dozens to hundreds of sequential steps—has made them less suitable for real-time applications.

Theoretically, the technology could provide the basis for a near-realtime AI image generation model from OpenAI. As fellow VentureBeat reporter Sean Michael Kerner mused in our internal Slack channels, “can DALL-E 4 be far behind?”


Faster sampling while retaining high quality​


In traditional diffusion models, a large number of denoising steps are needed to create a sample, which contributes to their slow speed.

In contrast, sCM converts noise into high-quality samples directly within one or two steps, cutting down on the computational cost and time.

OpenAI’s largest sCM model, which boasts 1.5 billion parameters, can generate a sample in just 0.11 seconds on a single A100 GPU.

This results in a 50x speed-up in wall-clock time compared to diffusion models, making real-time generative AI applications much more feasible.
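
For intuition, here is an illustrative sketch of generic two-step consistency-model sampling in PyTorch. This is not OpenAI's exact sCM algorithm (which is defined in the paper); the `consistency_fn` network and the sigma values are placeholders.

```python
# Illustrative sketch of generic two-step consistency sampling, NOT OpenAI's
# exact sCM algorithm. consistency_fn stands for a trained network f(x_t, t)
# that maps a noisy sample at noise level t directly back to a clean sample;
# sigma values are placeholders.
import torch

@torch.no_grad()
def two_step_sample(consistency_fn, shape, sigma_max=80.0, sigma_mid=0.8):
    # Step 1: map pure noise at the highest noise level straight to a sample.
    x = torch.randn(shape) * sigma_max
    x0 = consistency_fn(x, sigma_max)
    # Step 2 (refinement): re-noise to an intermediate level and map back to
    # the data manifold once more.
    x = x0 + torch.randn(shape) * sigma_mid
    return consistency_fn(x, sigma_mid)
```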


Reaching diffusion-model quality with far less computational resources​


The team behind sCM trained a continuous-time consistency model on ImageNet 512×512, scaling up to 1.5 billion parameters.

Even at this scale, the model maintains a sample quality that rivals the best diffusion models, achieving a Fréchet Inception Distance (FID) score of 1.88 on ImageNet 512×512.

This brings the sample quality within 10% of diffusion models, which require significantly more computational effort to achieve similar results.


Benchmarks reveal strong performance​


OpenAI’s new approach has undergone extensive benchmarking against other state-of-the-art generative models.

By measuring both the sample quality using FID scores and the effective sampling compute, the research demonstrates that sCM provides top-tier results with significantly less computational overhead.

While previous fast-sampling methods have struggled with reduced sample quality or complex training setups, sCM manages to overcome these challenges, offering both speed and high fidelity.

The success of sCM is also attributed to its ability to scale proportionally with the teacher diffusion model from which it distills knowledge.

As both the sCM and the teacher diffusion model grow in size, the gap in sample quality narrows further, and increasing the number of sampling steps in sCM reduces the quality difference even more.


Applications and future uses​


The fast sampling and scalability of sCM models open new possibilities for real-time generative AI across multiple domains.

From image generation to audio and video synthesis, sCM provides a practical solution for applications that demand rapid, high-quality output.

Additionally, OpenAI’s research hints at the potential for further system optimization that could accelerate performance even more, tailoring these models to the specific needs of various industries.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086


OpenAI scientist Noam Brown stuns TED AI Conference: ’20 seconds of thinking worth 100,000x more data’​

Michael Nuñez@MichaelFNunez

October 23, 2024 12:46 PM




Credit: VentureBeat made with Midjourney

Noam Brown, a leading research scientist at OpenAI, took the stage at the TED AI conference in San Francisco on Tuesday to deliver a powerful speech on the future of artificial intelligence, with a particular focus on OpenAI’s new o1 model and its potential to transform industries through strategic reasoning, advanced coding, and scientific research. Brown, who has previously driven breakthroughs in AI systems like Libratus, the poker-playing AI, and CICERO, which mastered the game of Diplomacy, now envisions a future where AI isn’t just a tool, but a core engine of innovation and decision-making across sectors.

“The incredible progress in AI over the past five years can be summarized in one word: scale,” Brown began, addressing a captivated audience of developers, investors, and industry leaders. “Yes, there have been uplink advances, but the frontier models of today are still based on the same transformer architecture that was introduced in 2017. The main difference is the scale of the data and the compute that goes into it.”

Brown, a central figure in OpenAI’s research endeavors, was quick to emphasize that while scaling models has been a critical factor in AI’s progress, it’s time for a paradigm shift. He pointed to the need for AI to move beyond sheer data processing and into what he referred to as “system two thinking”—a slower, more deliberate form of reasoning that mirrors how humans approach complex problems.


The psychology behind AI’s next big leap: Understanding system two thinking​


To underscore this point, Brown shared a story from his PhD days when he was working on Libratus, the poker-playing AI that famously defeated top human players in 2017.

“It turned out that having a bot think for just 20 seconds in a hand of poker got the same boosting performance as scaling up the model by 100,000x and training it for 100,000 times longer,” Brown said. “When I got this result, I literally thought it was a bug. For the first three years of my PhD, I had managed to scale up these models by 100x. I was proud of that work. I had written multiple papers on how to do that scaling, but I knew pretty quickly that all that would be a footnote compared to this scaling up system two thinking.”

Brown’s presentation introduced system two thinking as the solution to the limitations of traditional scaling. Popularized by psychologist Daniel Kahneman in the book Thinking, Fast and Slow, system two thinking refers to a slower, more deliberate mode of thought that humans use for solving complex problems. Brown believes incorporating this approach into AI models could lead to major performance gains without requiring exponentially more data or computing power.

He recounted that allowing Libratus to think for 20 seconds before making decisions had a profound effect, equating it to scaling the model by 100,000x. “The results blew me away,” Brown said, illustrating how businesses could achieve better outcomes with fewer resources by focusing on system two thinking.


Inside OpenAI’s o1: The revolutionary model that takes time to think​


Brown’s talk comes shortly after the release of OpenAI’s o1 series models, which introduce system two thinking into AI. Launched in September 2024, these models are designed to process information more carefully than their predecessors, making them ideal for complex tasks in fields like scientific research, coding, and strategic decision-making.

“We’re no longer constrained to just scaling up the system one training. Now we can scale up the system two thinking as well, and the beautiful thing about scaling up in this direction is that it’s largely untapped,” Brown explained. “This isn’t a revolution that’s 10 years away or even two years away. It’s a revolution that’s happening now.”

The o1 models have already demonstrated strong performance in various benchmarks. For instance, in a qualifying exam for the International Mathematics Olympiad, the o1 model achieved an 83% accuracy rate—a significant leap from the 13% scored by OpenAI’s GPT-4o. Brown noted that the ability to reason through complex mathematical formulas and scientific data makes the o1 model especially valuable for industries that rely on data-driven decision-making.


The business case for slower AI: Why patience pays off in enterprise solutions​


For businesses, OpenAI’s o1 model offers benefits beyond academic performance. Brown emphasized that scaling system two thinking could improve decision-making processes in industries like healthcare, energy, and finance. He used cancer treatment as an example, asking the audience, “Raise your hand if you would be willing to pay more than $1 for a new cancer treatment… How about $1,000? How about a million dollars?”

Brown suggested that the o1 model could help researchers speed up data collection and analysis, allowing them to focus on interpreting results and generating new hypotheses. In energy, he noted that the model could accelerate the development of more efficient solar panels, potentially leading to breakthroughs in renewable energy.

He acknowledged the skepticism about slower AI models. “When I mention this to people, a frequent response that I get is that people might not be willing to wait around for a few minutes to get a response, or pay a few dollars to get an answer to the question,” he said. But for the most important problems, he argued, that cost is well worth it.


Silicon Valley’s new AI race: Why processing power isn’t everything​


OpenAI’s shift toward system two thinking could reshape the competitive landscape for AI, especially in enterprise applications. While most current models are optimized for speed, the deliberate reasoning process behind o1 could offer businesses more accurate insights, particularly in industries like finance and healthcare.

In the tech sector, where companies like Google and Meta are heavily investing in AI, OpenAI’s focus on deep reasoning sets it apart. Google’s Gemini AI, for instance, is optimized for multimodal tasks, but it remains to be seen how it will compare to OpenAI’s models in terms of problem-solving capabilities.

That said, the cost of implementing o1 could limit its widespread adoption. The model is slower and more expensive to run than previous versions. Reports indicate that the o1-preview model costs $15 per million input tokens and $60 per million output tokens, far more than GPT-4o. Still, for enterprises that need high-accuracy outputs, the investment may be worthwhile.

As Brown concluded his talk, he emphasized that AI development is at a critical juncture: “Now we have a new parameter, one where we can scale up system two thinking as well — and we are just at the very beginning of scaling up in this direction.”
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086

1/1
@thegenioo
Learn About (Experiment) from @Google is now available to the public in the US.

It is a decent tool and works pretty well. I tried looking up a person, none other than @mckaywrigley, and it gave me pretty impressive info.

Seems like they were impressed by @perplexity_ai; idk, could be anything, but this is cool for sure.

Credits: @TheRedWall__ @testingcatalog









1/1
🗞️🗞️🗞️Google has introduced "Learn About," a new tool designed to compete with Perplexity by helping users efficiently extract information from the internet as part of its Learning Mission.

Read more: Google Launches Perplexity Rival, Calls it 'Learn About'








1/4
@ai_for_success
Google Astra Prototype

I am working on building a personal tool using the Gemini API free tier to assist with screen and video analysis.

Since the last update, I’ve made some improvements:

- Added a hands-free option with Ricky/VAD
- Added screen-share capability
- Improved latency: the average is now around 7 seconds (still bad 🙁), down from 13-15 seconds

I'm trying ways to further reduce latency.

Current breakdown:
- 3-4 seconds for a response from Gemini
- 3-4 seconds for the first audio output from ElevenLabs
- Model: gemini-1.5-flash-8b
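
Not the author's actual code, but a hedged sketch of how one might measure time-to-first-chunk with streaming enabled on the Gemini API, which is one common lever for perceived latency in a pipeline like this. It assumes the `google-generativeai` Python SDK and a `GEMINI_API_KEY` environment variable.

```python
# Hedged sketch (not the tweet author's tool): time-to-first-chunk with a
# streaming Gemini call. Assumes the google-generativeai SDK is installed.
import os
import time

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash-8b")

prompt = "Describe what is happening on this screen in one short sentence."
start = time.perf_counter()
first_chunk_at = None

for chunk in model.generate_content(prompt, stream=True):
    if first_chunk_at is None:
        first_chunk_at = time.perf_counter() - start
    # Each text chunk could be forwarded to a TTS service (e.g. ElevenLabs)
    # as it arrives, instead of waiting for the full Gemini response.
    print(chunk.text, end="", flush=True)

total = time.perf_counter() - start
print(f"\nTime to first chunk: {first_chunk_at:.2f}s, total: {total:.2f}s")
```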



https://video-ft.twimg.com/ext_tw_v...04/pu/vid/avc1/1922x1080/7UOAAAaeTPPTtM0O.mp4

2/4
@USEnglish215753
Dude, I got this new future Google search engine! Much better than AI Overview. I have already tested it and I am completely happy. I say right away that it will kill SearchGPT when it comes out. Maybe you have some kind of search queries for him?





3/4
@MRmor44
Based on my experience, it wasn't satisfactory.
<--learn about perplexity-->





4/4
@MRmor44
OpenAI's API pricing over time.
googles perplexity






 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086



Introducing the First AMD 1B Language Models: AMD OLMo​


Oct 31, 2024

Core Contributors: Jiang Liu, Jialian Wu, Prakamya Mishra, Zicheng Liu

Contributors: Sudhanshu Ranjan, Pratik Prabhanjan Brahma, Yusheng Su, Gowtham Ramesh, Peng Sun, Zhe Li, Dong Li, Lu Tian, Emad Barsoum

Introduction​


In recent years, the rapid development of artificial intelligence technology, especially the progress in large language models (LLMs), has garnered significant attention and discussion. From the emergence of ChatGPT to subsequent models like GPT-4 and Llama, these language models have demonstrated remarkable capabilities in natural language processing, generation, understanding and reasoning. Continuing AMD's tradition of open-sourcing models and code to help the community advance together, we are excited to release our first series of fully open 1 billion parameter language models, AMD OLMo.

Why Build Your Own Language Models​


The ability to pre-train and fine-tune your own LLM helps incorporate domain-specific knowledge and ensures better alignment with unique use cases. This approach allows organizations to tailor the model's architecture and training process to meet their unique requirements, achieving a balance between scalability and specialization that off-the-shelf models may not provide. As the demand for customized AI solutions continues to grow, the ability to pre-train LLMs unlocks unprecedented opportunities for innovation and product differentiation across industries.

The AMD in-house trained series of language models (LMs), AMD OLMo, are 1 billion parameter LMs trained from scratch using trillions of tokens on a cluster of AMD Instinct™ MI250 GPUs. Aligned with the goal of advancing accessible AI research, AMD has open-sourced the complete training details and released the checkpoints for the first series of AMD OLMo models. This initiative empowers a diverse community of users, developers, and researchers to explore, utilize, and train state-of-the-art large language models. By demonstrating the capabilities of AMD Instinct™ GPUs in demanding AI workloads, AMD aims to highlight their potential for running large-scale multi-node LM training jobs with trillions of tokens, achieving improved reasoning and instruction-following performance compared to other fully open LMs of similar size. In addition, the community can run such models on AMD Ryzen™ AI PCs equipped with Neural Processing Units (NPUs), using the AMD Ryzen™ AI Software to enable easier local access without privacy concerns, efficient AI inference, and lower power consumption.

Unveiling AMD OLMo Language Models​

AMD OLMo are a series of 1 billion parameter language models pre-trained with 1.3 trillion tokens on 16 nodes, each with four (4) AMD Instinct™ MI250 GPUs. Along with complete details to reproduce, we are releasing three (3) checkpoints corresponding to the various stages of training:

- AMD OLMo 1B: the pre-trained base model
- AMD OLMo 1B SFT: the model after two-phase supervised fine-tuning
- AMD OLMo 1B SFT DPO: the SFT model after alignment with Direct Preference Optimization


AMD OLMo 1B is based on the model architecture and training setup of the fully open-source 1 billion parameter version of OLMo, with some key differences. We pre-train with less than half the tokens used for OLMo-1B (effectively cutting the compute budget in half while maintaining comparable performance) and perform post-training comprising a two-phase SFT and DPO alignment to enhance performance in general reasoning, instruction-following, and chat capabilities (OLMo-1B does not carry out any post-training steps). For the two-phase SFT, we create a data mix of high-quality and diverse publicly available instructional datasets. Overall, our training recipe produces a series of models that achieve better performance across various types of benchmarks compared to other similarly sized fully open-source models trained on publicly available data.

AMD OLMo​


The AMD OLMo models are decoder-only transformer language models that are trained using next-token prediction. The key model architecture and training hyperparameter details are provided in our model card here.

Data and Training Recipe​


We trained the AMD OLMo series of models in three stages as shown in Figure 1.

Figure 1: AMD OLMo training stages.

Stage 1: Pre-training

The pre-training stage consisted of training on a large corpus of general-purpose text data, teaching the model the structure of language and general world knowledge through the next-token prediction task. We used a subset of 1.3 trillion tokens from the publicly available Dolma v1.7 dataset. Scripts for extracting the exact subset can be found in our Hugging Face model card here.

Stage 2: Supervised Fine-tuning (SFT)

Next, we fine-tuned the pre-trained model on instructional datasets to enable instruction-following capabilities. This stage comprises two phases:

  1. Phase 1: First, we fine-tune the model on the TuluV2 dataset, a publicly available high-quality instruction dataset consisting of 0.66 billion tokens.
  2. Phase 2: To further improve instruction-following, we fine-tune the model on the relatively larger OpenHermes 2.5 instruction dataset. In this phase, we also use the Code-Feedback and WebInstructSub datasets to improve the model's capabilities in coding, science, and mathematical problem solving. These datasets consist of ~7 billion tokens in total.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086
We conducted multiple fine-tuning experiments with different orderings of the datasets across the two phases and found the above sequencing to be most helpful. We use a smaller but high-quality dataset in Phase 1 to provide a good foundation, and then leverage a larger and more diverse dataset combination in Phase 2 to further improve the model's capabilities.

Stage 3: Alignment

Finally, we further tune our SFT model with Direct Preference Optimization (DPO) using the UltraFeedback dataset, a large-scale, fine-grained, and diverse preference dataset. This helps align the model to produce outputs that are consistent with human values and preferences.
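
As a minimal sketch of what this stage optimizes, here is the standard DPO objective written directly over per-example sequence log-probabilities in PyTorch. Training frameworks such as Hugging Face TRL wrap the same idea; the beta value and dummy numbers below are illustrative, not AMD's exact hyperparameters.

```python
# Minimal sketch of the standard DPO loss; values below are illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument is a 1-D tensor of sequence log-probabilities."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to prefer the chosen response over the rejected one,
    # relative to the frozen reference (SFT) model.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.0, -10.5]))
print(loss.item())
```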

Results​


We compare AMD OLMo models with other similarly sized fully open-source models that have publicly released their data, model weights, and training code. The pre-trained baseline models that we used for comparison include: TinyLLaMA-v1.1 (1.1B), MobiLLaMA-1B (1.2B), OLMo-1B-hf (1.2B), OLMo-1B-0724-hf (1.2B), and OpenELM-1_1B (1.1B).

Figure 2: Pre-trained model results on standard benchmarks for general reasoning capabilities and multi-task understanding. Top markers represent the performance gain of the best performing AMD OLMo 1B model compared to the next best model.

Figure 2 compares pre-trained models across various standard benchmarks for general reasoning capabilities (see here for exact numbers). We use the Language Model Evaluation Harness for evaluating common-sense reasoning, multi-task understanding, and responsible AI benchmarks. Of the 11 benchmarks, we evaluate GSM8k in an 8-shot setting, BBH in a 3-shot setting, and the rest in a zero-shot setting.

With AMD OLMo 1B:

  • The average over all general reasoning tasks (48.77%) is comparable to that of the latest OLMo-0724-hf model (49.3%) with less than half of its pre-training compute budget, and better than all the other baseline models.
  • Accuracy gains over the next best models on ARC-Easy (+6.36%), ARC-Challenge (+1.02%), and SciQ (+0.50%) benchmarks.

For evaluating the chat capabilities, we used the following instruction-tuned chat counterparts of the pre-trained baselines: TinyLlama-1.1B-Chat-v1.0 , MobiLlama-1B-Chat, and OpenELM-1_1B-Instruct. Along with Language Model Evaluation Harness for evaluating common sense reasoning, multi-task understanding and responsible AI benchmarks, we used Alpaca Eval for evaluating instruction-following capabilities, and MT-Bench for evaluating multi-turn chat capabilities.

On comparing our fine-tuned and aligned models with other instruction-tuned baselines:

Figure 3: Instruction tuning results on standard benchmarks for general reasoning capabilities and multi-task understanding. Top markers represent the performance gain of the best performing AMD OLMo 1B SFT/SFT DPO models compared to the next best baseline model.

  • Two phased SFT helped raise the model accuracy from the pre-trained checkpoint across almost all benchmarks on average, specifically MMLU by +5.09% and GSM8k by +15.32%.
  • AMD OLMo 1B SFT performance on GSM8k (18.2%) is significantly better (+15.39%) than the next best baseline model (TinyLlama-1.1B-Chat-v1.0 at 2.81%).
  • Average accuracy over standard benchmarks (Figure 3) for our SFT model beats baseline chat models by minimum +2.65%. Alignment (DPO) boosts it by further +0.46%.

Figure 4: SFT and DPO model results on chat benchmarks. *MT-Bench evaluation was done using max_new_tokens=2048, while the context length of OpenELM-1_1B-Instruct restricts this generation, resulting in an unfair comparison. Top markers represent the performance gain of the best performing AMD OLMo 1B SFT/SFT DPO model compared to the next best baseline model.

  • Our SFT model also exceeds the next best model on chat benchmarks AlpacaEval 2 (+2.29%) and MT-Bench (+0.97%) as shown in Figure 4.

Figure 5: SFT and DPO model results on responsible AI benchmarks. For ToxiGen, a lower score is better. Top markers represent the performance gain of the best performing AMD OLMo 1B SFT/SFT DPO model compared to the next best baseline model.

  • Alignment training helps our AMD OLMo 1B SFT DPO model perform on par with other chat baselines on responsible AI evaluation benchmarks, as shown in Figure 5.

Furthermore, AMD OLMo models can also run inference on AMD Ryzen™ AI PCs equipped with Neural Processing Units (NPUs). Developers can easily run generative AI models locally by utilizing the AMD Ryzen™ AI Software. Local deployment of such models on edge devices offers a sustainable and secure approach, optimizing energy efficiency and safeguarding data privacy while enabling various types of AI applications.

Conclusion​


Using an end-to-end training pipeline on AMD Instinct™ GPUs that consists of a pre-training stage with 1.3 trillion tokens (half the pre-training compute budget of OLMo-1B), a two-phase supervised fine-tuning stage, and a DPO-based human preference alignment stage, AMD OLMo models are comparable to or outperform other similarly sized fully open models in general reasoning and chat capabilities, while performing on par on responsible AI benchmarks. The language models were also deployed onto AMD Ryzen™ AI PCs with NPUs, which can potentially help enable a diverse set of edge use cases. Open-sourcing the data, weights, training recipes, and code is primarily aimed at helping developers reproduce the results and innovate further on top of them. AMD remains committed to providing the open-source community with a steady stream of new AI models and eagerly anticipates the innovations that will emerge from their collaborative efforts.

Call to Action​


You are welcome to download and try this model. For more information about training, inference, and insights for this model, please visit the AMD Hugging Face Model Card to access the code and download the model files. Additionally, AMD has opened dedicated cloud infrastructure with the latest GPU instances to AI developers; visit AMD Developer Cloud for access requests and usage details. Furthermore, you can deploy advanced AI models on AMD Ryzen AI PCs; learn more here.
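
A hedged sketch of trying a released checkpoint with Hugging Face transformers follows. The model identifier below is an assumption based on the naming in this post; check the AMD Hugging Face model card for the exact IDs.

```python
# Hedged sketch: loading an AMD OLMo checkpoint for local text generation.
# The model_id is an assumed name; consult the AMD model card for exact IDs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amd/AMD-OLMo-1B-SFT-DPO"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain in two sentences why small open-weight language models are useful."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```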
 
Top