bnew


Computer Science > Computation and Language​

[Submitted on 27 Nov 2023 (this version), latest version 4 Dec 2023 (v2)]

YUAN 2.0: A Large Language Model with Localized Filtering-based Attention​

Shaohua Wu, Xudong Zhao, Shenling Wang, Jiangang Luo, Lingjun Li, Xi Chen, Bing Zhao, Wei Wang, Tong Yu, Rongguo Zhang, Jiahua Zhang, Chao Wang
In this work, Localized Filtering-based Attention (LFA) is introduced to incorporate prior knowledge of the local dependencies of natural language into attention. Based on LFA, we develop and release Yuan 2.0, a large language model with parameters ranging from 2.1 billion to 102.6 billion. A data filtering and generation method is presented to build high-quality pretraining and fine-tuning datasets. A distributed training method with non-uniform pipeline parallelism, data parallelism, and optimizer parallelism is proposed, which greatly reduces the bandwidth requirements of intra-node communication and achieves good performance in large-scale distributed training. Yuan 2.0 models display impressive ability in code generation, math problem solving, and chat compared with existing models. The latest version of Yuan 2.0, including model weights and source code, is accessible on GitHub.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Cite as: arXiv:2311.15786 [cs.CL]
(or arXiv:2311.15786v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2311.15786

Submission history​

From: Tong Yu
[v1] Mon, 27 Nov 2023 13:01:59 UTC (1,242 KB)
[v2] Mon, 4 Dec 2023 10:20:57 UTC (1,245 KB)
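
The abstract gives no code, but the core LFA idea, biasing attention toward local dependencies, can be sketched. The snippet below is only an illustrative guess in PyTorch, assuming a small causal depthwise 1-D convolution is applied to token representations before ordinary self-attention; the class and parameter names are hypothetical and not taken from the Yuan 2.0 release.

```python
# Illustrative sketch only -- NOT the Yuan 2.0 / LFA implementation.
# Assumption: local dependencies are injected by a small causal depthwise
# 1-D convolution over the sequence before ordinary self-attention.
import torch
import torch.nn as nn

class LocalFilterAttention(nn.Module):  # hypothetical name
    def __init__(self, d_model: int, n_heads: int, kernel_size: int = 3):
        super().__init__()
        self.local_filter = nn.Conv1d(
            d_model, d_model, kernel_size,
            groups=d_model, padding=kernel_size - 1,
        )
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        # convolve over the time axis, then trim the right side to stay causal
        h = self.local_filter(x.transpose(1, 2))[..., :seq_len].transpose(1, 2)
        out, _ = self.attn(h, h, h, need_weights=False)
        return out
```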

 

bnew


Computer Science > Computation and Language​

[Submitted on 28 Nov 2023]

Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine​

Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, Renqian Luo, Scott Mayer McKinney, Robert Osazuwa Ness, Hoifung Poon, Tao Qin, Naoto Usuyama, Chris White, Eric Horvitz
Generalist foundation models such as GPT-4 have displayed surprising capabilities in a wide variety of domains and tasks. Yet, there is a prevalent assumption that they cannot match specialist capabilities of fine-tuned models. For example, most explorations to date on medical competency benchmarks have leveraged domain-specific training, as exemplified by efforts on BioGPT and Med-PaLM. We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training. Rather than using simple prompting to highlight the model's out-of-the-box capabilities, we perform a systematic exploration of prompt engineering. We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks. The prompting methods we explore are general purpose, and make no specific use of domain expertise, removing the need for expert-curated content. Our experimental design carefully controls for overfitting during the prompt engineering process. We introduce Medprompt, based on a composition of several prompting strategies. With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite. The method outperforms leading specialist models such as Med-PaLM 2 by a significant margin with an order of magnitude fewer calls to the model. Steering GPT-4 with Medprompt achieves a 27% reduction in error rate on the MedQA dataset over the best methods to date achieved with specialist models and surpasses a score of 90% for the first time. Beyond medical problems, we show the power of Medprompt to generalize to other domains and provide evidence for the broad applicability of the approach via studies of the strategy on exams in electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology.
Comments: 21 pages, 7 figures
Subjects: Computation and Language (cs.CL)
ACM classes: I.2.7
Cite as: arXiv:2311.16452 [cs.CL]
(or arXiv:2311.16452v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2311.16452

Submission history​

From: Eric Horvitz
[v1] Tue, 28 Nov 2023 03:16:12 UTC (2,654 KB)
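
The abstract describes Medprompt only as "a composition of several prompting strategies"; per the paper these include dynamic few-shot example selection, self-generated chain of thought, and choice-shuffle ensembling. Below is a rough sketch of that last ingredient, majority voting over prompts whose answer options have been shuffled; `ask_model` is a hypothetical stand-in for an LLM API call, not the authors' code.

```python
# Rough sketch of choice-shuffle ensembling, one ingredient of Medprompt.
# `ask_model` is a hypothetical stand-in for an LLM API call, not the paper's code.
import random
from collections import Counter

def ask_model(question: str, options: dict) -> str:
    """Return the letter ('A', 'B', ...) that the model picks. Stub."""
    raise NotImplementedError("plug in your LLM call here")

def choice_shuffle_ensemble(question: str, options: dict, n_votes: int = 5, seed: int = 0) -> str:
    """Majority-vote the model's answer over prompts with shuffled answer order."""
    rng = random.Random(seed)
    letters = sorted(options)            # e.g. ['A', 'B', 'C', 'D']
    votes = Counter()
    for _ in range(n_votes):
        texts = [options[l] for l in letters]
        rng.shuffle(texts)               # reassign answer texts to letters
        shuffled = dict(zip(letters, texts))
        picked = ask_model(question, shuffled)
        votes[shuffled[picked]] += 1     # vote for the underlying answer text
    return votes.most_common(1)[0][0]
```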

 

bnew






OpenHermes-2.5-neural-chat-v3-2-Slerp​


Open LLM Leaderboard Evaluation Results​

Detailed results can be found here

Metric | Value
Avg. | 70.2
ARC (25-shot) | 67.49
HellaSwag (10-shot) | 85.42
MMLU (5-shot) | 64.13
TruthfulQA (0-shot) | 61.05
Winogrande (5-shot) | 80.3
GSM8K (5-shot) | 63.08
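
For context, the "Avg." value is just the arithmetic mean of the six benchmark scores above; a quick check:

```python
# The "Avg." figure is the arithmetic mean of the six benchmark scores above.
scores = [67.49, 85.42, 64.13, 61.05, 80.3, 63.08]
print(round(sum(scores) / len(scores), 1))  # 70.2
```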






Computer Science > Machine Learning​

[Submitted on 2 Jun 2023 (v1), last revised 27 Oct 2023 (this version, v2)]

TIES-Merging: Resolving Interference When Merging Models​

Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, Mohit Bansal
Transfer learning - i.e., further fine-tuning a pre-trained model on a downstream task - can confer significant advantages, including improved downstream performance, faster convergence, and better sample efficiency. These advantages have led to a proliferation of task-specific fine-tuned models, which typically can only perform a single task and do not benefit from one another. Recently, model merging techniques have emerged as a solution to combine multiple task-specific models into a single multitask model without performing additional training. However, existing merging methods often ignore the interference between parameters of different models, resulting in large performance drops when merging multiple models. In this paper, we demonstrate that prior merging techniques inadvertently lose valuable information due to two major sources of interference: (a) interference due to redundant parameter values and (b) disagreement on the sign of a given parameter's values across models. To address this, we propose our method, TRIM, ELECT SIGN & MERGE (TIES-Merging), which introduces three novel steps when merging models: (1) resetting parameters that only changed a small amount during fine-tuning, (2) resolving sign conflicts, and (3) merging only the parameters that are in alignment with the final agreed-upon sign. We find that TIES-Merging outperforms several existing methods in diverse settings covering a range of modalities, domains, number of tasks, model sizes, architectures, and fine-tuning settings. We further analyze the impact of different types of interference on model parameters, and highlight the importance of resolving sign interference. Our code is available at this https URL
Comments: Published at NeurIPS 2023, 23 pages, 13 figures, 14 tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2306.01708 [cs.LG]
(or arXiv:2306.01708v2 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2306.01708

Submission history​

From: Prateek Yadav
[v1] Fri, 2 Jun 2023 17:31:32 UTC (365 KB)
[v2] Fri, 27 Oct 2023 01:09:31 UTC (567 KB)
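
The trim / elect sign / disjoint merge steps map naturally onto simple tensor operations. The sketch below is a simplified illustration on task vectors (fine-tuned minus pre-trained weights), not the authors' released implementation; the keep-fraction `k` and function name are illustrative.

```python
# Simplified sketch of TIES-Merging on task vectors (fine-tuned minus pre-trained
# weights). Not the authors' released code; the keep-fraction `k` is illustrative.
import torch

def ties_merge(task_vectors, k=0.2):
    trimmed = []
    for tv in task_vectors:
        # 1) TRIM: zero out all but the top-k fraction of largest-magnitude changes
        n_keep = max(1, int(k * tv.numel()))
        thresh = tv.abs().flatten().topk(n_keep).values.min()
        trimmed.append(torch.where(tv.abs() >= thresh, tv, torch.zeros_like(tv)))
    stacked = torch.stack(trimmed)                 # (n_models, *param_shape)
    # 2) ELECT SIGN: per parameter, the sign with the larger total magnitude wins
    elected_sign = torch.sign(stacked.sum(dim=0))
    # 3) DISJOINT MERGE: average only the entries that agree with the elected sign
    agree = (torch.sign(stacked) == elected_sign) & (stacked != 0)
    counts = agree.sum(dim=0).clamp(min=1)
    merged = (stacked * agree).sum(dim=0) / counts
    return merged  # add back onto the pre-trained weights to obtain the merged model
```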

 

bnew


QuIP#: QuIP with Lattice Codebooks​

Albert Tseng*, Jerry Chee*, Qingyao Sun, Volodymyr Kuleshov, and Chris De Sa








Large language models (LLMs) exhibit amazing performance on a wide variety of tasks such as text modeling and code generation. However, they are also very large. For example, Llama 2 70B has 70 billion parameters, which require 140GB of memory to store in half precision. This presents many challenges, such as needing multiple GPUs just to serve a single LLM. To address these issues, researchers have developed compression methods that reduce the size of models without destroying performance.



One class of methods, post-training quantization, compresses trained model weights into lower precision formats to reduce memory requirements. For example, quantizing a model from 16 bit to 2 bit precision would reduce the size of the model by 8x, meaning that even Llama 2 70B would fit on a single 24GB GPU. In this work, we introduce QuIP#, which combines lattice codebooks with incoherence processing to create state-of-the-art 2 bit quantized models. These two methods allow QuIP# to significantly close the gap between 2 bit quantized LLMs and unquantized 16 bit models.
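
As a back-of-the-envelope check of the numbers quoted above (140GB at half precision, 8x smaller at 2 bits, fitting on a 24GB GPU), here is the rough arithmetic; real quantized checkpoints carry some extra codebook and scale overhead:

```python
# Back-of-the-envelope memory arithmetic for the figures quoted above.
params = 70e9                      # Llama 2 70B
fp16_gb = params * 16 / 8 / 1e9    # 16 bits = 2 bytes per weight
int2_gb = params * 2 / 8 / 1e9     # 2 bits per weight (ignores codebook/scale overhead)
print(f"fp16: {fp16_gb:.0f} GB, 2-bit: {int2_gb:.1f} GB, ratio: {fp16_gb / int2_gb:.0f}x")
# -> fp16: 140 GB, 2-bit: 17.5 GB, ratio: 8x (17.5 GB fits on a single 24GB GPU)
```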



Quantization results on Llama 2 70B. QuIP# achieves near-native performance at 2 bits, outperforming all other presented baselines.
Method | Precision | Wiki ↓ | C4 ↓ | ArcE ↑ | PiQA ↑
Native | 16 bit | 3.120 | 5.533 | 0.597 | 0.809
OPTQ | 3 bit | 4.577 | 6.838 | 0.544 | 0.786
OPTQ | 2 bit | 109.820 | 62.692 | 0.253 | 0.505
QuIP | 2 bit | 5.574 | 8.268 | 0.544 | 0.751
QuIP# | 2 bit | 4.156 | 6.545 | 0.595 | 0.785
Our method, QuIP#, creates 2 bit LLMs that achieve near-native performance, a previously unseen result. We provide a full suite of 2 bit Llama 1 and 2 models quantized using QuIP#, as well as a full codebase that allows users to quantize and deploy their own models. We also provide CUDA kernels that accelerate inference for QuIP# models. Our code is available here.
 

bnew



Introducing EAGLE, a new method for fast LLM decoding based on compression:
- 3x 🚀 than vanilla decoding
- 2x 🚀 than Lookahead (on its benchmark)
- 1.6x 🚀 than Medusa (on its benchmark)
- provably maintains the text distribution
- trainable (in 1~2 days) and testable on RTX 3090s

Playground: Gradio
Blog: EAGLE
Code: GitHub - SafeAILab/EAGLE: EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation

⚒️ First Principle: Compression! @YiMaTweets We find that the sequence of second-top-layer features is compressible, which makes it easy for a small model to predict subsequent feature vectors from previous ones.

🙏 Acknowledgements: This project is greatly inspired by the Medusa team (@tianle_cai @yli3521 @ZhengyangGeng @Hongwu_Peng @tri_dao), the Lookahead team (@haozhangml @lmsysorg), and others.

Joint work with Yuhui Li and Chao Zhang
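
For intuition, here is a conceptual sketch of the draft-then-verify loop described in the thread: a small head extrapolates upcoming second-top-layer features and drafts tokens, and the full model verifies them so the output distribution is preserved. All object and method names (`big_model.features`, `draft_head.propose`, `big_model.verify`) are hypothetical placeholders, not the SafeAILab/EAGLE API; see the linked repo for the real implementation.

```python
# Conceptual sketch of EAGLE-style draft-then-verify decoding.
# NOT the SafeAILab/EAGLE API: `big_model`, `draft_head`, and their methods
# are hypothetical placeholders used only to show the control flow.
def eagle_decode(prompt_ids, big_model, draft_head, max_new=128, k=4):
    ids = list(prompt_ids)
    # second-top-layer feature of the prompt (the quantity EAGLE extrapolates)
    feature = big_model.features(ids)
    while len(ids) < len(prompt_ids) + max_new:
        # 1) DRAFT: a small head cheaply extrapolates the next k features/tokens
        draft_tokens, feature = draft_head.propose(feature, ids, k)
        # 2) VERIFY: one forward pass of the big model accepts a prefix of the
        #    draft, so the output distribution matches ordinary decoding
        accepted = big_model.verify(ids, draft_tokens)
        if accepted:
            ids.extend(accepted)
        else:
            # nothing accepted: fall back to one ordinary decoding step
            ids.append(big_model.sample_next(ids))
    return ids
```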
 

bnew

How should regulators think about "AI"?
 

bnew


Microsoft’s Edge Copilot AI can’t really summarize every YouTube video


The chatbot’s summarization feature relies on preprocessed video data or existing subtitles and transcripts.​


By Amrita Khalid, one of the authors of audio industry newsletter Hot Pod. Khalid has covered tech, surveillance policy, consumer gadgets, and online communities for more than a decade.

Dec 8, 2023, 9:18 PM EST


[Image: The Microsoft Edge web browser logo against a swirling blue background. Image: The Verge]

One feature added to Microsoft’s AI Copilot in the Edge browser this week is the ability to generate text summaries of videos. But Edge Copilot’s time-saving feature is still fairly limited and only works on pre-processed videos or those with subtitles, as Mikhail Parakhin, Microsoft’s CEO of advertising and web services, explained.

As spotted by MSPowerUser, Parakhin writes, “In order for it to work, we need to pre-process the video. If the video has subtitles - we can always fallback on that, if it does not and we didn’t preprocess it yet - then it won’t work,” in response to a question.



In other words, on its own Edge Copilot doesn’t so much summarize videos as it summarizes the text transcripts of the videos. Copilot can also perform a similar function throughout Microsoft 365, including summarizing Teams video meetings and calls for customer service agents — and in both cases, the audio needs to be transcribed first by Microsoft. Copilot on Microsoft Stream can also summarize any video, but again, it requires users to generate a written transcript.

[Screenshot. Image: Microsoft]


The conversation started after designer Pietro Schirano posted a screen recording of Edge Copilot summarizing a YouTube video about the GTA VI trailer. In this case, Copilot appeared to be doing its job perfectly. The user in the recording presses the “Generate video summary” button in the Copilot sidebar, and mere seconds later, Copilot churns one out, complete with highlights and timestamps.

Of course, many platforms, including YouTube and Vimeo, can automatically generate transcripts and subtitles — if users enable the feature. After The Verge asked Parakhin on X if we could assume most publicly available videos (i.e. YouTube) weren’t pre-processed, he replied: “Should work for most videos.”

Copilot is just the latest example of the generative AI race Microsoft is competing in with Google (and others). Last month, Google upgraded the YouTube extension for its Bard chatbot to enable it to summarize the content of a video and surface specific information from it. Just this week, Google announced a major Gemini update that has its own issues — the company’s editing may have misrepresented some of the AI’s capabilities in a demo, and it doesn’t always have its facts straight.

Parakhin has been candid about the various stages of Copilot’s evolution on social media. While on a plane on Tuesday morning, the machine learning expert posted on X: “Adding ability for Edge Copilot to use information in videos - on a flight.”
 

bnew

well "good" depends on what you intend to use them for. LLM's have their strengths and weaknesses. there are some specialized/fine-tuned ones and general models like chatgpt. GPT-4 is superior overall but some open source ones have surpassed chatgpt 3.5 turbo. heres a list of LLM leaderboards and benchmark sites that list many models.






basic info:


opensource LLM demo sites:



 