The A.I Megathread (LLM , GPT , Development)

bnew · Sep 5, 2024

1/4
Improving 2D Feature Representations by 3D-Aware Fine-Tuning

proj: Improving 2D Feature Representations by 3D-Aware Fine-Tuning
abs: [2407.20229] Improving 2D Feature Representations by 3D-Aware Fine-Tuning

2/4
The study aims to improve the 3D understanding abilities of 2D vision foundation models, which are typically trained on 2D image data without considering the underlying 3D structure of the visual world.

The researchers demonstrate that incorporating the 3D-aware fine-tuned features improves the performance of 2D vision models on semantic segmentation and depth estimation tasks, both on in-domain and out-of-domain datasets. The fine-tuned features exhibit cleaner and more detailed feature maps compared to the original 2D features.

full paper: Improving 2D Feature Representations by 3D-Aware Fine-Tuning

3/4
Thanks @arankomatsuzaki for sharing!

4/4
Fascinating approach to improving 2D feature representations! I'm curious to see how this 3D-aware fine-tuning method will impact downstream tasks like semantic segmentation.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 5, 2024

1/6
Your LLM may be sparser than you thought!

Excited to announce TEAL, a simple training-free method that achieves up to 40-50% model-wide activation sparsity on Llama-2/3 and Mistral models. Combined with a custom kernel, we achieve end-to-end speedups of up to 1.53x-1.8x!

2/6
Activation sparsity results in decoding speedup by avoiding the transfer of weights associated with zero-valued activations. Older work exploits the high emergent sparsity (~95%!) in the hidden states of ReLU-based LLMs. However, modern SwiGLU-based LLMs aren't very sparse (<1%).

3/6
It turns out they’re “almost” sparse -- activations across the model tend to cluster around zero. We don’t have a rigorous explanation, but we suspect a large part of this is due to LayerNorm. By clipping entries close to zero, we can potentially reintroduce activation sparsity.

4/6
Indeed, we can! In practice, we set thresholds for each matrix to clip a certain percent of activations based on their magnitudes. On perplexity and on a variety of downstream tasks, we’re able to achieve up to 40-50% model-wide sparsity with minimal degradation.
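A rough sketch of the thresholding idea described above, for anyone who wants to play with it. This is not the TEAL code; the function names and the calibration sample are made up for illustration. It assumes you pick a cutoff per matrix from a calibration sample so that a target fraction of the smallest-magnitude activations gets zeroed.

```python
def magnitude_threshold(calibration_values, sparsity):
    """Pick a cutoff so that about `sparsity` of the calibration magnitudes fall below it."""
    if not calibration_values:
        raise ValueError("need at least one calibration value")
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    magnitudes = sorted(abs(v) for v in calibration_values)
    k = int(len(magnitudes) * sparsity)  # how many entries we want to zero
    return magnitudes[k]


def sparsify(activations, threshold):
    """Zero every activation whose magnitude is strictly below the cutoff."""
    return [0.0 if abs(a) < threshold else a for a in activations]


# Hypothetical calibration sample for one matrix input; values cluster near zero.
sample = [0.01, -0.02, 0.5, -1.2, 0.03, 0.9, -0.04, 2.0, 0.05, -0.7]
cutoff = magnitude_threshold(sample, sparsity=0.5)
sparse_sample = sparsify(sample, cutoff)
```

With ties at the cutoff the achieved sparsity can land a little under the target, which is why a real calibration run would use far more samples than this.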

5/6
To benchmark end-to-end performance, we integrate a custom sparse GEMV kernel with GPT-Fast. We measure wall-clock speedups of 1.53x and 1.8x at 40% and 50% sparsity respectively.
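To make the "avoid transferring weights for zero activations" point concrete, here is a toy matrix-vector product in plain Python. It is only an illustration of the idea, not the custom GPU kernel from the thread. It assumes the weights are stored column by column, so a zero activation lets you skip a whole column without ever reading it.

```python
def dense_gemv(columns, x):
    """Reference product: reads every weight column."""
    rows = len(columns[0])
    out = [0.0] * rows
    for j, xj in enumerate(x):
        for i in range(rows):
            out[i] += columns[j][i] * xj
    return out


def sparse_gemv(columns, x):
    """Same product, but columns paired with a zero activation are never touched."""
    rows = len(columns[0])
    out = [0.0] * rows
    columns_read = 0
    for j, xj in enumerate(x):
        if xj == 0.0:
            continue  # skipped column: no weight traffic for it
        columns_read += 1
        for i in range(rows):
            out[i] += columns[j][i] * xj
    return out, columns_read
```

At 50% activation sparsity, roughly half the columns are skipped. Decoding is mostly limited by memory bandwidth, so fewer weight reads is where the speedup would come from.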

6/6
Paper: [2408.14690] Training-Free Activation Sparsity in Large Language Models
Code: GitHub - FasterDecoding/TEAL
Blog: TEAL: Training-Free Activation Sparsity in Large Language Models

Thanks to collaborators that made this possible (Pragaash, @tianle_cai , @HanGuo97 , Prof. Yoon Kim, @ben_athi)!


bnew · Sep 5, 2024

1/4
Meta presents Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

- Can generate images and text on a par with similar scale diffusion models and language models
- Compresses each image to just 16 patches

[2408.11039] Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

2/4
The paper introduces Transfusion, a method for training a single multi-modal model that can understand and generate both discrete (text) and continuous (image) data. This is in contrast to previous approaches that either use separate models for different modalities or quantize continuous data into discrete tokens.

The experiments show that Transfusion models consistently outperform the Chameleon baseline, which discretizes images and trains on the mixed text-image sequence. Transfusion achieves significantly better performance on text-to-image generation, requiring only about 30% of the compute of Chameleon to reach the same level of performance. Transfusion also performs better on text-only tasks, despite modeling text in the same way as Chameleon.

full paper: Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

3/4
Looking forward to reading this paper

4/4
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model


bnew · Sep 5, 2024

1/3
We released phi 3.5: mini+MoE+vision

A better mini model with multilingual support: microsoft/Phi-3.5-mini-instruct · Hugging Face

A new MoE model: microsoft/Phi-3.5-MoE-instruct · Hugging Face

A new vision model supporting multiple images: microsoft/Phi-3.5-vision-instruct · Hugging Face

2/3
All 3 models are now live on Azure AI Studio too:
Phi-3.5-mini-instruct: Azure AI Studio
Phi-3.5-MoE-instruct: Azure AI Studio
Phi-3.5-vision-instruct: Azure AI Studio

3/3
I think it does. Please feel free to share your feedback.


bnew · Sep 5, 2024

1/6
Finally released Hermes 3! A 405b, 70b, and 8b version!

You can find all the details and resources about Hermes 3 on our website:
Hermes 3 - NOUS RESEARCH

We have a H3 405B bot running in our discord, join and try it out here: Join the Nous Research Discord Server!

Or give it a try on @LambdaAPI here: https://lambda.chat/chatui/

2/6
yep!

3/6
The 4bit 70B gguf can fit on 40GB vram - unfortunately not a single gpu
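Back-of-the-envelope check on that 40GB figure, as a tiny script. It counts the weights only. Ignoring KV cache and runtime overhead is my simplification, not something stated in the thread, and real 4-bit GGUF files also store quantization scales, so the effective bits per weight come out somewhat above 4.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# 70B parameters at exactly 4 bits each.
ideal_4bit = weight_memory_gb(70e9, 4)      # 35.0 GB
# Hypothetical effective rate once per-block scales are included.
with_scales = weight_memory_gb(70e9, 4.5)
```

So the weights at a nominal 4 bits need about 35 GB, which leaves only a few GB of a 40 GB budget for everything else.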

4/6
405B - our FP8

5/6
fp8

6/6
One tweet below ^_^


1/8
Introducing 𝐇𝐞𝐫𝐦𝐞𝐬 𝟑: The latest version in our Hermes series, a generalist language model 𝐚𝐥𝐢𝐠𝐧𝐞𝐝 𝐭𝐨 𝐲𝐨𝐮.

Hermes 3 - NOUS RESEARCH

Hermes 3 is available in 3 sizes, 8, 70, and 405B parameters. Hermes has improvements across the board, but with particular capability improvements in roleplaying, agentic tasks, more reliable function calling, multi-turn chats, long context coherence and more.

We published a technical report detailing new capabilities, training run information and more:

Paper: https://nousresearch.com/wp-content/uploads/2024/08/Hermes-3-Technical-Report.pdf

This model was trained in collaboration with our great partners @LambdaAPI, and they are now offering it for free in a chat interface here: https://lambda.chat/chatui/

You can also chat with Hermes 405B on our discord, join here: Join the Nous Research Discord Server!

Hermes 3 was a project built with the help of @Teknium1, @TheEmozilla, @nullvaluetensor, @karan4d, @huemin_art, and an uncountable number of people and work in the Open Source community.

2/8
Hermes 3 performs strongly against Llama-3.1 Instruct Models, but with a focus on aligning the model to you, instead of a company or external policy - meaning less censorship and more steerability - with additional capabilities like agentic XML, scratchpads, roleplaying prowess, and more. Step level reasoning and planning, internal monologues, improved RAG, and even LLM as a judge capabilities were also targeted.

Below are benchmark comparisons between Hermes 3 and Llama-3.1 Instruct and a sample of utilizing the agentic XML tags:

3/8
Lambda's Hermes 3 Announcement Post: Unveiling Hermes 3: The First Full-Parameter Fine-Tuned Llama 3.1 405B Model is on Lambda’s Cloud

Nous' blog post on our experience discovering emergent behavior with 405B:
Freedom at the Frontier: Hermes 3 - NOUS RESEARCH

Hermes 3 405B was trained with @LambdaAPI's new 1-Click Cluster offering, check it out here: Lambda GPU Cloud | 1-Click Clusters

Check out our reference inference code for Hermes Function Calling here: GitHub - NousResearch/Hermes-Function-Calling

Thanks to all the other organizations who helped bring this together, including @weights_biases, @neuralmagic, @vllm_project, @huggingface, @WeAreFireworks, @AiEleuther, @togethercompute, @AIatMeta, and many more

4/8
Special shoutouts to @intrstllrninja for all the work on making function calling real, robust, and useful

and a special thanks to our designer @StudioMilitary for the cover art and all the other designs that Nous uses!

5/8
Will work on @MistralAI's 12b model soon!

6/8
Hermes 3 - a NousResearch Collection

7/8
@karan4d gave it a personality all its own there lol

Lambda has a default assistant personality system chat

8/8
Believe Lambda is also hosting an api version, will update when clear


bnew · Sep 5, 2024

1/10
SEAL Leaderboard Update

We added 3 new models to the SEAL Leaderboards:

- GPT-4o-latest (gpt-4o-2024-08-06)
- Gemini 1.5 Pro (Aug 27, 2024) (gemini-1.5-pro-exp-0827)
- Mistral Large 2 (mistral-large-2407)

More detailed results in thread

2/10
GPT-4o-latest (gpt-4o-2024-08-06) ranks above the prior GPT-4o (May 2024) on:

- Math (now #2 behind Claude 3.5 Sonnet)
- Coding (now #2 behind Claude 3.5 Sonnet)

but actually performed worse than the May model on Instruction Following (now ranked #8) and Spanish (now ranked #3)

3/10
Gemini-1.5-pro-exp-0827 ranks higher than the May model version on all leaderboards:

- Instruction Following (now #3 behind Llama and Claude)
- Coding (now #4 behind Claude, GPT-4o, and Mistral)
- Math (now #7)

4/10
Mistral Large 2 (mistral-large-2407) also improves on Mistral Large on all leaderboards, and performed quite well:

- Coding (#3 behind Claude and GPT-4o)
- Instruction Following (now #6)
- Math (now #8)
- Spanish (now #4 behind GPT-4o May, Gemini 1.5 Pro, and GPT-4o August)

5/10
Some insights from the leaderboards:

CODING

We notice that GPT-4o (August 2024) performs the best on Code Correctness among all the models, but Claude outdoes it on Prompt Adherence

6/10
INSTRUCTION FOLLOWING (FACTUALITY):

We find that GPT-4 Turbo is still the most factual model among all of the models we've evaluated.

With the trend towards smaller, distilled models, we notice that it seems to come at the cost of performance on factuality.

7/10
INSTRUCTION FOLLOWING:

Claude 3.5 Sonnet and Llama 3.1 405B Instruct stand out as models with BOTH:

- very high (>0.9) Main Request fulfillment
- very high (>0.7) Constraint fulfillment

Also, all Claude models take a hit for Structural Clarity in writing style

8/10
MULTILINGUAL (SPANISH)

Lastly, one interesting pattern we noticed is that every model seemed to perform worse at Instruction Following (Main Request fulfillment and Constraint fulfillment) in Spanish versus English.

This implies there's still meaningful headroom for models to improve on multilingual capabilities.

9/10
Given that our Math leaderboard (GSM1K) is now relatively saturated (most models score >90), we are working on producing a new Math benchmark to properly discern between models.

10/10
Expect more updates, including the addition of Grok 2, in the coming weeks!


bnew · Sep 5, 2024

1/2
Over the coming days, start creating and chatting with Gems: customizable versions of Gemini that act as topic experts.

We’re also launching premade Gems for different scenarios - including Learning coach to break down complex topics and Coding partner to level up your skills → New in Gemini: Custom Gems and improved image generation with Imagen 3

2/2
Your Gem can remember a detailed set of instructions to help you save time on tasks and accomplish your goals with less effort.

Try it now on Gemini Advanced or Gemini for Google Workspace → New in Gemini: Custom Gems and improved image generation with Imagen 3


bnew · Sep 5, 2024

1/3
Uploaded more 4bit bnb quants to unsloth (Unsloth AI) for 4x faster downloading!
1. @NousResearch Hermes 8, 70 & 405b
2. @cohere Command R 32b, R+ 104b
3. @akjindal53244 Llama 3.1 Storm
4. Reuploaded Llama 3.1 405b - 50% less VRAM use for inference since KV cache was duplicated

2/3
Thanks for working on Llama 3.1 Storm! Great work!

3/3
Oh would that be helpful? Was also thinking of fp8, but both will require some extra compute to make it work unlike bitsandbytes


bnew · Sep 5, 2024

1/9
Are there any good papers on increasing the “creativity” of an LLM through prompt engineering?

I’m looking for ways to produce more unique ideas/concepts without just randomly pulling words from a wordlist to seed the model's context

2/9
One technique I've had some success with is to just have the LLM start with free writing, just doing stream of thought nonsense at higher temperature. I then submit the normal prompt and tell it to use the free writing as non-literal inspiration to help creativity.
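A minimal sketch of that two-pass approach. The `generate` argument stands in for whatever LLM client you use; it is a hypothetical callable taking a prompt and a temperature, not a real library function. The temperature values and prompt wording are placeholders too.

```python
def creative_answer(task, generate, warmup_temperature=1.3, answer_temperature=0.7):
    """Two passes: hot free writing first, then the real task seeded with it."""
    # Pass 1: unconstrained stream-of-thought text at a high temperature.
    free_writing = generate(
        "Free-write for a paragraph. Stream of consciousness, no topic, no editing.",
        temperature=warmup_temperature,
    )
    # Pass 2: the normal prompt, with the free writing attached as loose inspiration.
    prompt = (
        "Here is some free writing:\n"
        f"{free_writing}\n\n"
        "Use it only as non-literal inspiration. Do not quote it.\n\n"
        f"Task: {task}"
    )
    return generate(prompt, temperature=answer_temperature)
```

Passing `generate` in as a parameter keeps the sketch independent of any particular API, and you can swap in a stub to test the prompt wiring.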

3/9
I wonder if the base models would be particularly good at this type of entropy generation

4/9
Adjusting temperature
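For anyone unsure what the temperature knob does, here is a small standalone illustration. It is the standard softmax-with-temperature formula, not any specific vendor's sampler: the logits are divided by the temperature before being turned into probabilities, so higher values flatten the distribution and make unlikely tokens more likely to be picked.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a higher temperature gives a flatter distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
cold = softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.5)
hot = softmax_with_temperature([2.0, 1.0, 0.0], temperature=2.0)
```

In this example `cold` puts more mass on the top token than `hot` does.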

5/9
It can help a lot to ask it to pretend to be an eccentric expert in the field. You can even name an actual person if you want to flavor the creativity in a specific way.

"You are Richard Feynman and you're trying to figure out why this vacuum cleaner is broken." etc

6/9
I’ve tried some: plot existing ideas on a 3d graph. Choose the labels for each, your answer should be far away from existing ideas when plotted. Prompt it to apply combinations of seemingly unrelated scientific disciplines. Hit or miss but it’s a good starting point sometimes

7/9
I'm diving into the latest research on prompt engineering and LLMs - have you explored the concept of "adversarial prompts" to stimulate more creative outputs?

8/9
Use a base model, instead of an instruction-tuned one

9/9


bnew · Sep 5, 2024

1/10
Releasing OLMoE - the first good Mixture-of-Experts LLM that's 100% open-source
- 1B active, 7B total params for 5T tokens
- Best small LLM & matches more costly ones like Gemma, Llama
- Open Model/Data/Code/Logs + lots of analysis & experiments

[2409.02060] OLMoE: Open Mixture-of-Experts Language Models

1/9

2/10
OLMoE Performance

- Best model with 1B active params making it the cheapest LM for many use cases
- Exceeds many larger models like DeepSeek, Llama2 & Qwen-Chat

2/9

3/10
OLMoE Efficiency

Thanks to Mixture-of-Experts, better data & hyperparams, OLMoE is much more efficient than OLMo 7B:
- >4x less training FLOPs
- >5x less params used per forward pass
i.e. cheaper training + cheaper inference!

3/9

4/10
OLMoE Experiments

1) Expert granularity i.e. we use 64 small experts per layer
2) Dropless token choice beats expert choice routing
3) Shared experts worse for us
4) Sparse upcycling not helpful in our regime

Check the paper for more

4/9

5/10
OLMoE Analysis 1

With 250 ckpts & all open, interesting analysis research can be done w/ OLMoE! Our early investigations:

Potential domain specialization in OLMoE with some experts active much more for arXiv/GitHub (less so for Mixtral…maybe as it was upcycled?)

5/9

6/10
OLMoE Analysis 2

Potential specialization on token IDs, e.g. Expert 43 seems to be a geography expert & 3 a family expert..

Check the paper for analysis on router saturation, expert co-activation & more details

6/9

7/10
Thank you for reading!

Paper with links to model, data, code, etc: [2409.02060] OLMoE: Open Mixture-of-Experts Language Models ; all is meant to be free & open

Thanks to prior work on MoEs including Qwen, JetMoE, OpenMoE & thanks to @allen_ai @ContextualAI for enabling OLMoE

7/9

8/10
Thanks to an amazing team: @soldni @mechanicaldirk @kylelostat @jacobcares @sewon__min @WeijiaShi2 @epwalsh Oyvind Tafjord @gu_yuling @natolambert @shanearora0 @AkshytaB93 Dustin Schwenk @_awettig @Tim_Dettmers @huybery @douwekiela Ali Farhadi @nlpnoah @PangWeiKoh

8/9

9/10
@apsdehal @HannaHajishirzi Also grateful to many colleagues & others @ArmenAgha @adityakusupati @AnanyaHarsh @vwxyzjn @strubell @faeze_brh @hamishivi @KarelDoostrlnck @IanMagnusson @jayc481 @liujc1998 @pdasigi @Sahil1V @StasBekman @valentina__py @yanaiela @yizhongwyz

9/9

10/10
Thanks! My understanding is that expert choice routing breaks autoregressive text generation as it is ambiguous which expert receives the next token. But since we don't use it to generate and don't evaluate it on text generation, we don't run into this problem.
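One way to see the difference between the two routing schemes is a toy example with made-up router scores. This is a simplified sketch, not the OLMoE implementation. In token choice each token picks its own top experts. In expert choice each expert picks its top tokens out of the whole sequence, so the assignment for earlier tokens can change once later tokens exist.

```python
def token_choice(scores, k):
    """Each token independently keeps its k highest-scoring experts."""
    return [
        sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:k]
        for row in scores
    ]


def expert_choice(scores, capacity):
    """Each expert keeps its `capacity` highest-scoring tokens from the whole sequence."""
    n_experts = len(scores[0])
    tokens = range(len(scores))
    return [
        sorted(tokens, key=lambda t: scores[t][e], reverse=True)[:capacity]
        for e in range(n_experts)
    ]


# Hypothetical router scores: one row per token, one column per expert.
scores = [[0.6, 0.4], [0.7, 0.3], [0.9, 0.1]]

prefix_routing = expert_choice(scores[:2], capacity=1)  # only tokens 0 and 1 exist yet
full_routing = expert_choice(scores, capacity=1)        # token 2 has arrived
```

Here expert 0 takes token 1 when only the prefix is visible, then drops it for token 2 once the full sequence is known. With token choice the routing of tokens 0 and 1 stays the same either way.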


bnew · Sep 5, 2024

1/8
JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

- Directly model images and videos via canonical codecs (e.g., JPEG, AVC/H.264)
- More effective than pixel-based modeling and VQ baselines (yields a 31% reduction in FID)

[2408.08459] JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

2/8
@_xjdr better than base64 lol

3/8
The paper aims to show that conventional LLM architectures can be used as generalized models for visual generation by directly modeling canonical codec representations. The authors hypothesize that this approach can mitigate the sequence length infeasibility seen in previous pixel-based methods while being simpler and more effective compared to sophisticated vector quantization techniques.

full paper: JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

4/8
Big caveat: the comparison is between 1B baseline models and the Llama 2 7B architecture

5/8
AI Summary: This paper introduces JPEG-LM, an autoregressive large language model (LLM) designed for image and video generation by utilizing canonical codec representations such as JPEG and AVC/H.264. By d...
JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

6/8
Dark mode for this paper

JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

7/8
JPEG DCTs images within 8px*8px blocks so there seems to be no information leaking from masked region to the prompt region...
JPEG - Wikipedia.
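A small pure-Python sketch of the 8x8 blockwise DCT mentioned above. It covers the transform step only; color conversion, quantization and entropy coding are left out, and it assumes image dimensions that are multiples of 8. The point is that each block's coefficients are computed from that block's pixels alone.

```python
import math

BLOCK = 8


def dct_1d(values):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(values)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            values[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        out.append(scale * total)
    return out


def dct_2d(block):
    """Apply the 1-D DCT along rows, then along columns."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def blockwise_dct(image):
    """Transform each 8x8 tile on its own; returns {(block_row, block_col): coefficients}."""
    height, width = len(image), len(image[0])
    coeffs = {}
    for top in range(0, height, BLOCK):
        for left in range(0, width, BLOCK):
            tile = [row[left:left + BLOCK] for row in image[top:top + BLOCK]]
            coeffs[(top // BLOCK, left // BLOCK)] = dct_2d(tile)
    return coeffs
```

Changing a pixel in one tile leaves the coefficients of every other tile untouched, at least at this stage of the pipeline.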

8/8
JPEG-LM: LLMs as Image Generators with Canonical Codec Representations


bnew · Sep 5, 2024

1/1
Automated Design of Agentic Systems

Presents Meta Agent Search to demonstrate that we can use agents to invent novel and powerful agent designs by programming in code

proj: ADAS
abs: [2408.08435] Automated Design of Agentic Systems
github: GitHub - ShengranHu/ADAS: Automated Design of Agentic Systems


bnew · Sep 5, 2024

1/1
Introducing a new open dataset release, Hermes Function Calling V1, the datamix that gave Hermes 2 Pro its tool use and structured output capabilities.

HuggingFace Repo: NousResearch/hermes-function-calling-v1 · Datasets at Hugging Face

The dataset includes single and multiturn Function Calling and Structured JSON Output datasets, and an updated version of @GlaiveAI's function calling dataset, perfect for training LLMs to be better agents!

Also check out our Hermes Function Calling repo for more information on the format and how to use models trained with this data here: GitHub - NousResearch/Hermes-Function-Calling
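For a feel of what consuming function-calling output can look like, here is a hedged sketch of a parser. The `<tool_call>` tag name and the `name`/`arguments` JSON keys are assumptions made for illustration, as is the `get_weather` example. The Hermes-Function-Calling repo is the place to confirm the actual format.

```python
import json
import re

# Assumed wrapper: a JSON object enclosed in <tool_call> ... </tool_call> tags.
TOOL_CALL_PATTERN = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(model_output):
    """Return the JSON payload of every tagged tool call, skipping malformed ones."""
    calls = []
    for raw in TOOL_CALL_PATTERN.findall(model_output):
        try:
            calls.append(json.loads(raw.strip()))
        except json.JSONDecodeError:
            continue  # ignore payloads that are not valid JSON
    return calls


# Hypothetical model output containing one call.
reply = (
    "Let me check.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
    "</tool_call>"
)
parsed = extract_tool_calls(reply)
```

Skipping malformed payloads instead of raising is a design choice for the sketch; an agent loop might prefer to feed the error back to the model.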

And for help or questions, join our Discord at Join the Nous Research Discord Server!!

This work is a culmination of the contributions of @intrstllrninja, @Teknium1, Glaive AI, @TheodoreGalanos , and many others who provided assistance along the way.


bnew · Sep 5, 2024

1/9
This morning the US Government announced that OpenAI and Anthropic have signed a formal collaboration on AI safety research, testing and evaluation. Under the deal the USAISI will have access to major new models from both OpenAI and Anthropic prior to release.

2/9

3/9
U.S. AI Safety Institute Signs Agreements Regarding AI Safety Research, Testing and Evaluation With Anthropic and OpenAI

4/9
Coverage from Reuters.

5/9
https://www.reuters.com/technology/...-with-us-govt-ai-research-testing-2024-08-29/

6/9
USAISI, which I'm choosing to pronounce You - Sigh - See, was established Nov 1st under the AI Executive Order. It falls under the National Institute of Standards and Technology (NIST) which in turn is an agency of the United States Department of Commerce.

7/9
At the Direction of President Biden, Department of Commerce to Establish U.S. Artificial Intelligence Safety Institute to Lead Efforts on AI Safety

8/9
Statement from Mr A:

9/9
I've seen no details beyond what was said today, I don't think anything is public yet. Even something like access to a 'major' new model has a lot of ambiguity. Some people would say that's not until GPT-5/Claude 4.


bnew · Sep 5, 2024

1/1
I'm teaching a new course! AI Python for Beginners is a series of four short courses that teach anyone to code, regardless of current technical skill. We are offering these courses free for a limited time.

Generative AI is transforming coding. This course teaches coding in a way that’s aligned with where the field is going, rather than where it has been:

(1) AI as a Coding Companion. Experienced coders are using AI to help write snippets of code, debug code, and the like. We embrace this approach and describe best-practices for coding with a chatbot. Throughout the course, you'll have access to an AI chatbot that will be your own coding companion that can assist you every step of the way as you code.

(2) Learning by Building AI Applications. You'll write code that interacts with large language models to quickly create fun applications to customize poems, write recipes, and manage a to-do list. This hands-on approach helps you see how writing code that calls on powerful AI models will make you more effective in your work and personal projects.

With this approach, beginning programmers can learn to do useful things with code far faster than they could have even a year ago.

Knowing a little bit of coding is increasingly helping people in job roles other than software engineers. For example, I've seen a marketing professional write code to download web pages and use generative AI to derive insights; a reporter write code to flag important stories; and an investor automate the initial drafts of contracts.

With this course you’ll be equipped to automate repetitive tasks, analyze data more efficiently, and leverage AI to enhance your productivity.

If you are already an experienced developer, please help me spread the word and encourage your non-developer friends to learn a little bit of coding.

I hope you'll check out the first two short courses here! AI Python for Beginners - DeepLearning.AI

