bnew

Veteran
Joined
Nov 1, 2015
Messages
51,753
Reputation
7,916
Daps
148,554


1/1
A OnePlus phone with 24 GB of RAM running Mixtral 8x7B at 11 tokens/second.

Much faster inference than llama.cpp and MLC-LLM.

Using swap and caching to run the model even when it doesn't fit in the available RAM.

Between Apple's "LLM in a Flash" and PowerInfer-2, it seems like the GPU in your pocket (phone) will have a local GPT-4 in another 18 months.

LLMs are a new kernel and should be a low-level, over-the-air-updatable utility embedded in every damn device.
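For intuition, here is a minimal Python sketch of the underlying idea: memory-map the weight file so the OS pages weights between flash and RAM on demand. This illustrates the general technique (the same trick llama.cpp-style mmap loading uses), not PowerInfer-2's actual implementation; the file name and matrix shape are made up.

```python
import mmap
import os

import numpy as np

# Hypothetical file name and shape -- illustration only, not PowerInfer-2's format.
WEIGHTS_PATH = "model.bin"
ROWS, COLS = 4096, 14336  # one fp16 FFN weight matrix

def open_weights(path):
    """Memory-map the weight file instead of reading it all into RAM.

    The OS pages chunks in on first touch and can evict them again under
    memory pressure, so the mapped model can be larger than free RAM.
    """
    fd = os.open(path, os.O_RDONLY)
    return mmap.mmap(fd, 0, access=mmap.ACCESS_READ)

def load_matrix(buf, offset, rows=ROWS, cols=COLS):
    # Zero-copy view: flash is touched only for the rows actually read.
    return np.frombuffer(buf, dtype=np.float16,
                         count=rows * cols, offset=offset).reshape(rows, cols)

weights = open_weights(WEIGHTS_PATH)
w_ffn = load_matrix(weights, offset=0)  # paged in lazily as it is used
```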






1/1
The researchers used a OnePlus smartphone with 24 GB of RAM as their testing platform.

They implemented PowerInfer-2 by extending the original PowerInfer GitHub code with an additional 12K lines of code.

The maintainer of PowerInfer said PowerInfer-2 will be open-sourced on top of the original PowerInfer repo. They are refining it to untangle it from their testing platform and make it accessible on PCs for the community, with open-sourcing happening in stages, starting soon.






1/1
Decoding speeds of PowerInfer-2, llama.cpp, and MLC-LLM on TurboSparse-Mistral-7B with different offloading setups. "50% offload" means 50% of the FFN blocks' weights are offloaded to flash storage; "no offload" means all model parameters are resident in memory. A red ⨉ label indicates an execution failure due to missing weight-offloading support.


 


1/1
There is a lot of talk on this platform about when LLMs will hit a wall in growth, but if you watch the research, it is clear that a wide variety of AI approaches are being proposed and tested.

To the average user who doesn't care, it may just seem like seamless improvement.









1/1
"Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-38B"

From 25.47% to 45.49% in GSM-Hard

Also noting in this regard, the head of Deepmind said last year that augmenting LLMs with Monte Carlo Tree Search may be the fastest path to AGI

This paper introduces the MCT Self-Refine (MCTSr) algorithm, which integrates Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS) to enhance performance on complex mathematical reasoning tasks like Olympiad-level problems. The key problem being addressed is the accuracy and reliability challenges faced by LLMs in strategic and mathematical reasoning.

MCTSr constructs a Monte Carlo search tree through iterative processes of Selection (using an improved Upper Confidence Bound formula to balance exploration and exploitation), Self-Refine (the LLM generates feedback to guide refining an answer), Self-Evaluation (the LLM scores the quality of the refined answer), and Backpropagation (propagating the refined answer's value back through the tree).

The self-refine process uses a multi-turn dialogue prompt where the LLM first generates a critical comment on the current answer, then refines the answer guided by that comment. The self-evaluation scores an answer from -100 to 100 and applies constraints like strict scoring standards and suppressing perfect scores to improve reliability.

Backpropagation updates a node's Q value (estimated answer quality) by averaging its current Q value and the max Q value of its child nodes. Candidate nodes for further expansion are selected based on criteria like number of child nodes and child Q values exceeding the parent's.
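A minimal sketch of the loop described above. The prompts, the exploration constant, and the `llm()` callable are placeholders, not the paper's actual templates; this only shows how the four phases fit together.

```python
import math
from dataclasses import dataclass, field

C = 1.4        # exploration constant in the UCB term (value assumed)
EPS = 1e-8

@dataclass
class Node:
    answer: str
    q: float = 0.0            # estimated answer quality
    visits: int = 0
    parent: "Node" = None
    children: list = field(default_factory=list)

def ucb(node):
    # Improved UCB: exploit high-Q nodes, explore rarely visited ones.
    return node.q + C * math.sqrt(math.log(node.parent.visits + 1) / (node.visits + EPS))

def mctsr(problem, llm, rollouts=8):
    root = Node(answer=llm(f"Answer this problem: {problem}"))
    nodes = [root]
    for _ in range(rollouts):
        # Selection: descend by UCB until reaching a leaf candidate.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Self-Refine: critique the current answer, then rewrite it.
        critique = llm(f"Criticize this answer to '{problem}':\n{node.answer}")
        refined = Node(answer=llm(f"Rewrite the answer using this critique:\n{critique}"),
                       parent=node)
        # Self-Evaluation: score the refined answer on [-100, 100], strictly.
        refined.q = float(llm(f"Score this answer from -100 to 100 (strict, "
                              f"never a perfect score):\n{refined.answer}"))
        node.children.append(refined)
        nodes.append(refined)
        # Backpropagation: Q <- average of own Q and best child Q, up the tree.
        while node is not None:
            node.visits += 1
            node.q = 0.5 * (node.q + max(c.q for c in node.children))
            node = node.parent
    return max(nodes, key=lambda n: n.q).answer
```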

Experiments demonstrate MCTSr significantly improves success rates on datasets like GSM8K (up to 96.66% with 8 rollouts vs 74.07% zero-shot), MATH (58.24% overall with 8 rollouts vs 24.36% zero-shot), and Olympiad-level benchmarks like AIME (11.79% with 8 rollouts vs 2.36% zero-shot). Performance scales with number of rollouts.

Compared to closed-source LLMs like GPT-4, MCTSr with LLaMA-3 8B achieves comparable results, showing it can boost the reasoning capabilities of smaller open-source models. The paper concludes that MCTSr is a robust and promising approach for complex mathematical reasoning with LLMs.


 






1/2
Dramatic breakdown of SOTA LLMs' reasoning capabilities when confronted with a simple common-sense problem called the "Alice In Wonderland (AIW) problem".

The AIW problem is a concise natural language task that asks: "Alice has N brothers and she also has M sisters. How many sisters does Alice's brother have?" While easily solvable by humans using common-sense reasoning (the correct answer is M+1: each brother has Alice's M sisters plus Alice herself), most tested LLMs, including GPT-3.5/4, Claude, Gemini, LLaMA, Mistral, and others, show a severe collapse in performance, often providing nonsensical answers and reasoning.

Especially in the world of LLMs: if you go small, you will go home.

"The few models capable of showing significant non-zero correct response rate for the AIW problem are residing on the largest scale."

---

This is despite their strong performance on standardized reasoning benchmarks. The key conclusion is that current LLMs lack basic reasoning skills, and existing benchmarks fail to properly detect these deficiencies.

Notably, even when LLMs occasionally provide correct answers, they often express strong overconfidence in their wrong solutions and generate confabulations (persuasive but nonsensical explanations) to justify their incorrect responses. Standard interventions like enhanced prompting or asking models to re-evaluate their answers fail to improve performance.

The authors introduce a harder variation called AIW+, which causes an even stronger performance collapse across all tested models, including GPT-4 and Claude 3 Opus, which performed relatively better on the original AIW problem.

This study highlights a striking discrepancy between LLMs' high scores on standardized reasoning benchmarks (e.g., MMLU, ARC, Hellaswag) and their poor performance on the AIW problem, suggesting that current benchmarks do not adequately reflect models' true reasoning capabilities and weaknesses.

The authors emphasize the need for the ML community to develop new reasoning benchmarks that can properly detect such deficits and guide the improvement of LLMs' reasoning skills. They also stress the importance of fully open and reproducible training pipelines, including dataset composition, to enable proper analysis and progress in this area.

2/2
https://arxiv.org/pdf/2406.02061


 




1/1
LoRA Finetuning of LLMs can be mysterious.

First the basics - with LoRA the fine-tuned weight W′ can be represented as:

W′ = W_0 + ∆W = W_0 + BA

Where the trainable parameters are the low-rank matrices B and A

Essentially, for fine-tuning, one can either initialize B to zero and A to random values (the default initialization in the PEFT package), or vice versa.

In both cases, the product BA is zero at initialization, so fine-tuning starts from the pretrained model.

Both initialization schemes should therefore, in principle, yield the same performance and share the same optimal learning rate.

BUT this paper concludes that initializing A with random values and B with zeros (Init[A]) generally leads to better performance than the opposite scheme (Init[B]).
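As a concrete reference, a minimal PyTorch sketch of the two schemes. The rank, init scale, and layer sizes are illustrative, not taken from the paper; only the zero/random split between A and B is the point.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W_0 plus a trainable low-rank update BA."""

    def __init__(self, base: nn.Linear, rank: int = 16, scheme: str = "A"):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # only A and B are trained
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.empty(rank, d_in))
        self.B = nn.Parameter(torch.empty(d_out, rank))
        if scheme == "A":                    # Init[A]: A random, B zero (PEFT default)
            nn.init.normal_(self.A, std=rank ** -0.5)
            nn.init.zeros_(self.B)
        else:                                # Init[B]: B random, A zero
            nn.init.zeros_(self.A)
            nn.init.normal_(self.B, std=rank ** -0.5)
        # Either way, BA = 0 at initialization, so the first forward
        # pass reproduces the pretrained model exactly.

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.T @ self.B.T   # W_0 x + B(Ax)

layer = LoRALinear(nn.Linear(768, 768), rank=16, scheme="A")
```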

------------

Through theoretical analysis in the infinite-width limit, the authors show that Init[A] allows the use of larger learning rates without causing output instability compared to Init[B]. This is because the maximal stable learning rate scales as Θ(n^(-1/2)) for Init[A] and Θ(n^(-1)) for Init[B], where n is the model width.

Init[A] can lead to "internal instability," where the LoRA features AZ are large but the LoRA output BAZ remains small. This instability enables more efficient feature learning. The authors identify a stability/feature-learning tradeoff, where the optimal learning rate balances internal stability and feature learning.

In contrast, Init[B] does not cause instabilities but leads to suboptimal feature learning, as the B matrix is undertrained in the infinite-width limit.

Extensive experiments on synthetic models and real-world LLMs (RoBERTa, Llama) fine-tuned on various tasks (GLUE, WikiText-2, Flan-v2, GSM8K) validate the theoretical findings. Init[A] consistently achieves better performance than Init[B], and the optimal learning rate for Init[A] is generally larger than for Init[B].


 



GLM-4-9B-Chat​

Model Introduction​

GLM-4-9B is the open-source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI. In evaluations on datasets covering semantics, mathematics, reasoning, code, and knowledge, GLM-4-9B and its human-preference-aligned version GLM-4-9B-Chat show superior performance to Llama-3-8B.

In addition to multi-round conversation, GLM-4-9B-Chat also has advanced features such as web browsing, code execution, custom tool calls (Function Call), and long-text reasoning (supporting up to 128K context). This generation adds multi-language support for 26 languages, including Japanese, Korean, and German. We have also launched the GLM-4-9B-Chat-1M model, which supports a 1M context length (about 2 million Chinese characters), and the multimodal model GLM-4V-9B based on GLM-4-9B. GLM-4V-9B provides dialogue capabilities in both Chinese and English at a high resolution of 1120×1120. In various multimodal evaluations, including comprehensive abilities in Chinese and English, perception and reasoning, text recognition, and chart understanding, GLM-4V-9B demonstrates superior performance compared to GPT-4-turbo-2024-04-09, Gemini 1.0 Pro, Qwen-VL-Max, and Claude 3 Opus.
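For context, a minimal chat-inference sketch with Hugging Face transformers. The checkpoint name THUDM/glm-4-9b-chat and the generation settings are assumptions, so check the model card; GLM repos ship custom modeling code, hence trust_remote_code=True.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "THUDM/glm-4-9b-chat"  # checkpoint name assumed

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Summarize GLM-4-9B in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```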

Benchmark​

We evaluated the GLM-4-9B-Chat model on some classic tasks and obtained the following results:

| Model | AlignBench-v2 | MT-Bench | IFEval | MMLU | C-Eval | GSM8K | MATH | HumanEval | NCB |
|---|---|---|---|---|---|---|---|---|---|
| Llama-3-8B-Instruct | 5.12 | 8.00 | 68.58 | 68.4 | 51.3 | 79.6 | 30.0 | 62.2 | 24.7 |
| ChatGLM3-6B | 3.97 | 5.50 | 28.1 | 66.4 | 69.0 | 72.3 | 25.7 | 58.5 | 11.3 |
| GLM-4-9B-Chat | 6.61 | 8.35 | 69.0 | 72.4 | 75.6 | 79.6 | 50.6 | 71.8 | 32.2 |

Long Context​

The eval_needle experiment was conducted with a context length of 1M, and the results are as follows:

 






Qwen2-72B​

Introduction​

Qwen2 is the new series of Qwen large language models. For Qwen2, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model. This repo contains the 72B Qwen2 base language model.

Compared with state-of-the-art open-source language models, including the previously released Qwen1.5, Qwen2 has generally surpassed most open-source models and demonstrated competitiveness against proprietary models across a series of benchmarks targeting language understanding, language generation, multilingual capability, coding, mathematics, reasoning, etc.

For more details, please refer to our blog, GitHub, and Documentation.

Model Details​

Qwen2 is a language model series including decoder language models of different sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, grouped-query attention, etc. Additionally, we have an improved tokenizer that adapts to multiple natural languages and to code.
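A minimal text-completion sketch for the base model. The checkpoint name Qwen/Qwen2-72B is assumed (check the model card); base models are meant for continuation rather than chat, and a 72B checkpoint needs multi-GPU hardware.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2-72B"  # checkpoint name assumed

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shards the 72B weights across available GPUs
)

# Base models continue text; they are not instruction-tuned for dialogue.
inputs = tokenizer("Qwen2 improves over Qwen1.5 by", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```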
 







1/3
This is from Apple's State of the Union.

The local model is a 3B-parameter SLM that uses adapters trained for each specific feature. The diffusion model does the same thing, with an adapter for each style.

Anything running locally or in Apple's Secure Cloud is an Apple model, not OpenAI's.

2/3
I was hoping for this, or even better, HomePods with M-series chips, but my guess is they'll need dedicated hardware and networking.

3/3
Two things:

1. I’m spreading more information and truth about how it works
2. They know me


 


Introducing Apple’s On-Device and Server Foundation Models​

June 10, 2024

At the 2024 Worldwide Developers Conference, we introduced Apple Intelligence, a personal intelligence system integrated deeply into iOS 18, iPadOS 18, and macOS Sequoia.

Apple Intelligence comprises multiple highly capable generative models that are specialized for our users' everyday tasks, and can adapt on the fly for their current activity. The foundation models built into Apple Intelligence have been fine-tuned for user experiences such as writing and refining text, prioritizing and summarizing notifications, creating playful images for conversations with family and friends, and taking in-app actions to simplify interactions across apps.

In the following overview, we will detail how two of these models — a ~3 billion parameter on-device language model, and a larger server-based language model available with Private Cloud Compute and running on Apple silicon servers — have been built and adapted to perform specialized tasks efficiently, accurately, and responsibly. These two foundation models are part of a larger family of generative models created by Apple to support users and developers; this includes a coding model to build intelligence into Xcode, as well as a diffusion model to help users express themselves visually, for example, in the Messages app. We look forward to sharing more information soon on this broader set of models.

Our Focus on Responsible AI Development​

Apple Intelligence is designed with our core values at every step and built on a foundation of groundbreaking privacy innovations.

Additionally, we have created a set of Responsible AI principles to guide how we develop AI tools, as well as the models that underpin them:

  1. Empower users with intelligent tools: We identify areas where AI can be used responsibly to create tools for addressing specific user needs. We respect how our users choose to use these tools to accomplish their goals.
  2. Represent our users: We build deeply personal products with the goal of representing users around the globe authentically. We work continuously to avoid perpetuating stereotypes and systemic biases across our AI tools and models.
  3. Design with care: We take precautions at every stage of our process, including design, model training, feature development, and quality evaluation to identify how our AI tools may be misused or lead to potential harm. We will continuously and proactively improve our AI tools with the help of user feedback.
  4. Protect privacy: We protect our users' privacy with powerful on-device processing and groundbreaking infrastructure like Private Cloud Compute. We do not use our users' private personal data or user interactions when training our foundation models.

These principles are reflected throughout the architecture that enables Apple Intelligence, connects features and tools with specialized models, and scans inputs and outputs to provide each feature with the information needed to function responsibly.

In the remainder of this overview, we provide details on decisions such as: how we develop models that are highly capable, fast, and power-efficient; how we approach training these models; how our adapters are fine-tuned for specific user needs; and how we evaluate model performance for both helpfulness and unintended harm.


Figure 1: Modeling overview for the Apple foundation models.


Pre-Training​

Our foundation models are trained on Apple's AXLearn framework, an open-source project we released in 2023. It builds on top of JAX and XLA, and allows us to train the models with high efficiency and scalability on various training hardware and cloud platforms, including TPUs and both cloud and on-premise GPUs. We used a combination of data parallelism, tensor parallelism, sequence parallelism, and Fully Sharded Data Parallel (FSDP) to scale training along multiple dimensions such as data, model, and sequence length.

We train our foundation models on licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web-crawler, AppleBot. Web publishers have the option to opt out of the use of their web content for Apple Intelligence training with a data usage control.

We never use our users’ private personal data or user interactions when training our foundation models, and we apply filters to remove personally identifiable information like social security and credit card numbers that are publicly available on the Internet. We also filter profanity and other low-quality content to prevent its inclusion in the training corpus. In addition to filtering, we perform data extraction, deduplication, and the application of a model-based classifier to identify high quality documents.

Post-Training​

We find that data quality is essential to model success, so we utilize a hybrid data strategy in our training pipeline, incorporating both human-annotated and synthetic data, and conduct thorough data curation and filtering procedures. We have developed two novel algorithms in post-training: (1) a rejection sampling fine-tuning algorithm with teacher committee, and (2) a reinforcement learning from human feedback (RLHF) algorithm with mirror descent policy optimization and a leave-one-out advantage estimator. We find that these two algorithms lead to significant improvement in the model’s instruction-following quality.

Optimization​

In addition to ensuring our generative models are highly capable, we have used a range of innovative techniques to optimize them on-device and on our private cloud for speed and efficiency. We have applied an extensive set of optimizations for both first token and extended token inference performance.

Both the on-device and server models use grouped-query-attention. We use shared input and output vocab embedding tables to reduce memory requirements and inference cost. These shared embedding tensors are mapped without duplications. The on-device model uses a vocab size of 49K, while the server model uses a vocab size of 100K, which includes additional language and technical tokens.

For on-device inference, we use low-bit palletization, a critical optimization technique that achieves the necessary memory, power, and performance requirements. To maintain model quality, we developed a new framework using LoRA adapters that incorporates a mixed 2-bit and 4-bit configuration strategy — averaging 3.5 bits-per-weight — to achieve the same accuracy as the uncompressed models.
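A quick back-of-the-envelope check on what that 3.5-bit average buys. The ~3B parameter count is taken from the overview above; the rest is plain arithmetic, not Apple's published numbers.

```python
params = 3e9                        # ~3B parameter on-device model
fp16_gb = params * 16 / 8 / 1e9     # 16 bits per weight, uncompressed
mixed_gb = params * 3.5 / 8 / 1e9   # mixed 2-/4-bit palettes, 3.5 bits average
print(f"fp16: {fp16_gb:.1f} GB -> palletized: {mixed_gb:.1f} GB")
# fp16: 6.0 GB -> palletized: 1.3 GB
```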

Additionally, we use an interactive model latency and power analysis tool, Talaria, to better guide the bit rate selection for each operation. We also utilize activation quantization and embedding quantization, and have developed an approach to enable efficient Key-Value (KV) cache update on our neural engines.

With this set of optimizations, on iPhone 15 Pro we are able to reach time-to-first-token latency of about 0.6 millisecond per prompt token, and a generation rate of 30 tokens per second. Notably, this performance is attained before employing token speculation techniques, from which we see further enhancement on the token generation rate.

Model Adaptation​

Our foundation models are fine-tuned for users’ everyday activities, and can dynamically specialize themselves on-the-fly for the task at hand. We utilize adapters, small neural network modules that can be plugged into various layers of the pre-trained model, to fine-tune our models for specific tasks. For our models we adapt the attention matrices, the attention projection matrix, and the fully connected layers in the point-wise feedforward networks for a suitable set of the decoding layers of the transformer architecture.

By fine-tuning only the adapter layers, the original parameters of the base pre-trained model remain unchanged, preserving the general knowledge of the model while tailoring the adapter layers to support specific tasks.

Figure 2: Adapters are small collections of model weights that are overlaid onto the common base foundation model. They can be dynamically loaded and swapped — giving the foundation model the ability to specialize itself on-the-fly for the task at hand. Apple Intelligence includes a broad set of adapters, each fine-tuned for a specific feature. It’s an efficient way to scale the capabilities of our foundation model.

We represent the values of the adapter parameters using 16 bits, and for the ~3 billion parameter on-device model, the parameters for a rank 16 adapter typically require 10s of megabytes. The adapter models can be dynamically loaded, temporarily cached in memory, and swapped — giving our foundation model the ability to specialize itself on the fly for the task at hand while efficiently managing memory and guaranteeing the operating system's responsiveness.
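A rough check on the "10s of megabytes" figure. This is a sketch with assumed layer counts and dimensions (Apple does not publish them); only the rank-16, 16-bit setup comes from the text above.

```python
# Assumed shapes for a ~3B decoder: 32 layers, hidden 3072, FFN 8192, LoRA rank 16.
layers, d_model, d_ff, rank = 32, 3072, 8192, 16

def lora_params(d_in: int, d_out: int) -> int:
    return rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank

per_layer = (
    4 * lora_params(d_model, d_model)   # Q, K, V, and attention projection
    + 2 * lora_params(d_model, d_ff)    # FFN up and down projections
)
total = layers * per_layer
print(f"{total / 1e6:.1f}M adapter params -> {total * 2 / 1e6:.0f} MB at 16 bits")
# ~24.1M params -> ~48 MB, consistent with "10s of megabytes"
```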

To facilitate the training of the adapters, we created an efficient infrastructure that allows us to rapidly retrain, test, and deploy adapters when either the base model or the training data gets updated. The adapter parameters are initialized using the accuracy-recovery adapter introduced in the Optimization section.

Performance and Evaluation​

Our focus is on delivering generative models that can enable users to communicate, work, express themselves, and get things done across their Apple products. When benchmarking our models, we focus on human evaluation as we find that these results are highly correlated to user experience in our products. We conducted performance evaluations on both feature-specific adapters and the foundation models.

To illustrate our approach, we look at how we evaluated our adapter for summarization. As product requirements for summaries of emails and notifications differ in subtle but important ways, we fine-tune accuracy-recovery low-rank (LoRA) adapters on top of the palletized model to meet these specific requirements. Our training data is based on synthetic summaries generated from bigger server models, filtered by a rejection sampling strategy that keeps only the high quality summaries.

To evaluate the product-specific summarization, we use a set of 750 responses carefully sampled for each use case. These evaluation datasets emphasize a diverse set of inputs that our product features are likely to face in production, and include a stratified mixture of single and stacked documents of varying content types and lengths. As product features, it was important to evaluate performance against datasets that are representative of real use cases. We find that our models with adapters generate better summaries than a comparable model.

As part of responsible development, we identified and evaluated specific risks inherent to summarization. For example, summaries occasionally remove important nuance or other details in ways that are undesirable. However, we found that the summarization adapter did not amplify sensitive content in over 99% of targeted adversarial examples. We continue to adversarially probe to identify unknown harms and expand our evaluations to help guide further improvements.
 


Human Satisfaction Score on Summarization Feature Benchmark​

Email​

Satisfaction Good Result Ratio​

  1. Phi-3-mini: 73.3%
  2. Apple On-Device + Adapter: 87.5%

Satisfaction Poor Result Ratio​

  1. Phi-3-mini: 15.7%
  2. Apple On-Device + Adapter: 5.4%

Notification​

Satisfaction Good Result Ratio​

  1. Phi-3-mini: 76.6%
  2. Apple On-Device + Adapter: 79.7%

Satisfaction Poor Result Ratio​

  1. Phi-3-mini: 8.2%
  2. Apple On-Device + Adapter: 8.1%

Figure 3: Ratio of "good" and "poor" responses for two summarization use cases relative to all responses. Summaries are classified as "good", "neutral", "poor" given the grader's scores across five dimensions. A result is classified as "good" if all of the dimensions are good (higher is better). A result is classified as "poor" if any of the dimensions are poor (lower is better). Our models with adapters generate better summaries than a comparable model.
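The classification rule in that caption is mechanical enough to restate as a short sketch; the five dimension names below are assumed, since the post does not list them.

```python
DIMENSIONS = ["composition", "comprehensiveness", "groundedness",
              "instruction_following", "harmlessness"]  # names assumed

def classify(scores: dict) -> str:
    """Collapse per-dimension grades ('good'/'neutral'/'poor') into one label."""
    if any(v == "poor" for v in scores.values()):
        return "poor"    # any poor dimension makes the summary poor
    if all(v == "good" for v in scores.values()):
        return "good"    # "good" requires every dimension to be good
    return "neutral"

print(classify({d: "good" for d in DIMENSIONS}))  # -> "good"
```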

In addition to evaluating feature specific performance powered by foundation models and adapters, we evaluate both the on-device and server-based models’ general capabilities. We utilize a comprehensive evaluation set of real-world prompts to test the general model capabilities. These prompts are diverse across different difficulty levels and cover major categories such as brainstorming, classification, closed question answering, coding, extraction, mathematical reasoning, open question answering, rewriting, safety, summarization, and writing.

We compare our models with both open-source models (Phi-3, Gemma, Mistral, DBRX) and commercial models of comparable size (GPT-3.5-Turbo, GPT-4-Turbo) [1]. We find that our models are preferred by human graders over most comparable competitor models. On this benchmark, our on-device model, with ~3B parameters, outperforms larger models including Phi-3-mini, Mistral-7B, and Gemma-7B. Our server model compares favorably to DBRX-Instruct, Mixtral-8x22B, and GPT-3.5-Turbo while being highly efficient.

Apple Foundation Model Human Evaluation​


Apple On-Device versus​

  1. Apple On-Device versus Gemma-2B: win 62.0%, tie 21.3%, lose 16.7%.
  2. Apple On-Device versus Mistral-7B: win 46.1%, tie 26.0%, lose 27.9%.
  3. Apple On-Device versus Phi-3-mini: win 43.0%, tie 24.6%, lose 32.4%.
  4. Apple On-Device versus Gemma-7B: win 41.6%, tie 27.8%, lose 30.6%.

Apple Server versus​

  1. Apple Server versus DBRX-Instruct: win 54.5%, tie 21.4%, lose 24.1%.
  2. Apple Server versus GPT-3.5-Turbo: win 50.0%, tie 25.3%, lose 24.7%.
  3. Apple Server versus Mixtral-8x22B: win 44.7%, tie 27.6%, lose 27.7%.
  4. Apple Server versus GPT-4-Turbo: win 28.5%, tie 29.8%, lose 41.7%.

Figure 4: Fraction of preferred responses in side-by-side evaluation of Apple's foundation model against comparable models. We find that our models are preferred by human graders.

We use a set of diverse adversarial prompts to test the model performance on harmful content, sensitive topics, and factuality. We measure the violation rates of each model as evaluated by human graders on this evaluation set, with a lower number being desirable. Both the on-device and server models are robust when faced with adversarial prompts, achieving violation rates lower than open-source and commercial models.

Human Evaluation of Output Harmfulness​

On-Device​

  1. Mistral-7B: 44.6%
  2. Phi-3-mini: 22.8%
  3. Gemma-2B: 14.0%
  4. Gemma-7B: 13.7%
  5. Apple On-Device: 8.2%

Server​

  1. Mixtral-8x22B: 43.3%
  2. DBRX-Instruct: 41.7%
  3. GPT-4-Turbo: 20.1%
  4. GPT-3.5-Turbo: 15.5%
  5. Apple Server: 6.6%

Figure 5: Fraction of violating responses for harmful content, sensitive topics, and factuality (lower is better). Our models are robust when faced with adversarial prompts.

Our models are preferred by human graders as safe and helpful over competitor models for these prompts. However, considering the broad capabilities of large language models, we understand the limitation of our safety benchmark. We are actively conducting both manual and automatic red-teaming with internal and external teams to continue evaluating our models' safety.
 


Human Preference Evaluation on Safety Prompts​


Apple On-Device versus​

  1. Apple On-Device versus Mistral-7B: win 52.2%, tie 37.6%, lose 10.2%.
  2. Apple On-Device versus Phi-3-mini: win 51.8%, tie 33.5%, lose 14.7%.
  3. Apple On-Device versus Gemma-2B: win 46.5%, tie 35.8%, lose 17.7%.
  4. Apple On-Device versus Gemma-7B: win 39.5%, tie 43.1%, lose 17.4%.

Apple Server versus​

  1. Apple Server versus DBRX-Instruct: win 57.3%, tie 32.6%, lose 10.0%.
  2. Apple Server versus Mixtral-8x22B: win 57.3%, tie 31.8%, lose 10.9%.
  3. Apple Server versus GPT-3.5-Turbo: win 41.8%, tie 43.6%, lose 14.6%.
  4. Apple Server versus GPT-4-Turbo: win 39.8%, tie 43.1%, lose 17.1%.

Figure 6: Fraction of preferred responses in side-by-side evaluation of Apple's foundation model against comparable models on safety prompts. Human graders found our responses safer and more helpful.

To further evaluate our models, we use the Instruction-Following Eval (IFEval) benchmark to compare their instruction-following capabilities with models of comparable size. The results suggest that both our on-device and server model follow detailed instructions better than the open-source and commercial models of comparable size.

IFEval Benchmarks​

On-Device​

Instruction-level Accuracy​

  1. Gemma-2B: 40.5%
  2. Gemma-7B: 61.6%
  3. Mistral-7B: 65.2%
  4. Phi-3-mini: 67.9%
  5. Apple On-Device: 78.7%

Prompt-level Accuracy​

  1. Gemma-2B: 28.7%
  2. Gemma-7B: 51.4%
  3. Mistral-7B: 54.2%
  4. Phi-3-mini: 57.8%
  5. Apple On-Device: 70.2%

Server​

Instruction-level Accuracy​

  1. DBRX-Instruct: 65.8%
  2. GPT-3.5-Turbo: 74.8%
  3. Mixtral-8x22B: 79.4%
  4. Apple Server: 85.0%
  5. GPT-4-Turbo: 85.4%

Prompt-level Accuracy​

  1. DBRX-Instruct: 53.6%
  2. GPT-3.5-Turbo: 65.3%
  3. Mixtral-8x22B: 71.4%
  4. Apple Server: 79.1%
  5. GPT-4-Turbo: 79.3%

Figure 7: Instruction-following capability (measured with IFEval) for Apple's foundation models and models of comparable size (higher is better).

We evaluate our models’ writing ability on our internal summarization and composition benchmarks, consisting of a variety of writing instructions. These results do not refer to our feature-specific adapter for summarization (seen in Figure 3), nor do we have an adapter focused on composition.

Writing Benchmarks​

On-Device​

Summarization​

  1. Gemma-2B: 7.6
  2. Phi-3-mini: 8.8
  3. Gemma-7B: 8.9
  4. Mistral-7B: 8.9
  5. Apple On-Device: 9.1

Composition​

  1. Gemma-2B: 8.0
  2. Phi-3-mini: 9.0
  3. Gemma-7B: 9.1
  4. Mistral-7B: 9.1
  5. Apple On-Device: 9.1

Server​

Summarization​

  1. GPT-3.5-Turbo: 8.6
  2. DBRX-Instruct: 9.2
  3. Mixtral-8x22B: 9.5
  4. GPT-4-Turbo: 9.5
  5. Apple Server: 9.5

Composition​

  1. GPT-3.5-Turbo: 8.9
  2. DBRX-Instruct: 9.2
  3. Mixtral-8x22B: 9.5
  4. Apple Server: 9.5
  5. GPT-4-Turbo: 9.7

Figure 8: Writing ability on internal summarization and composition benchmarks (higher is better).


Conclusion​

The Apple foundation models and adapters introduced at WWDC24 underlie Apple Intelligence, the new personal intelligence system that is integrated deeply into iPhone, iPad, and Mac, and enables powerful capabilities across language, images, actions, and personal context. Our models have been created with the purpose of helping users do everyday activities across their Apple products, and developed responsibly at every stage and guided by Apple’s core values. We look forward to sharing more information soon on our broader family of generative models, including language, diffusion, and coding models.

Footnotes​

[1] We compared against the following model versions: gpt-3.5-turbo-0125, gpt-4-0125-preview, Phi-3-mini-4k-instruct, Mistral-7B-Instruct-v0.2, Mixtral-8x22B-Instruct-v0.1, Gemma-1.1-2B, and Gemma-1.1-7B. The open-source and Apple models are evaluated in bfloat16 precision.
 


SoftBank’s new ‘emotion canceling’ AI turns customer screams into soft speech​

The “emotion cancelling” technology aims to reduce stress levels among call center operators by softening the tone of angry customers’ voices.​

Updated: Jun 14, 2024 07:16 AM EST



Aman Tripathi

Japanese tech titan SoftBank Corp has unveiled a groundbreaking solution to address the rising issue of customer harassment in call centers. The company has developed an AI-powered voice-altering technology that can transform even the angriest callers’ voices into calmer tones.

The system, dubbed “emotion canceling,” aims to alleviate the stress experienced by call center operators who often bear the brunt of customers’ frustrations.

“We are working on the development of a solution that can convert the customer’s voice into a calm conversational tone and deliver it to our workers using AI-enabled emotion recognition and voice processing technology,” SoftBank said in a press release.

The company’s press release emphasized the importance of maintaining good customer relationships while ensuring the psychological well-being of its workers.

Two-stage protective mechanism​

The development of the system was prompted by a television program highlighting the verbal abuse call center staff often endure.

Toshiyuki Nakatani, a SoftBank employee, was inspired to create a solution to protect others from such harassment.

The technology operates in two stages. First, it employs AI voice-processing to identify angry callers and analyze the characteristics of their speech. Subsequently, it incorporates acoustic features of non-threatening voices to create a calmer, more natural tone.

To achieve this, the AI was trained on over 10,000 voice data samples, with 10 actors recording more than 100 common phrases expressing various emotions, including anger and frustration. While the technology does not alter the caller’s words, it significantly modifies the intonation, making it less aggressive.

However, SoftBank ensures that the system does not completely eliminate all traces of anger, allowing operators to understand the situation and respond appropriately.

SoftBank's "emotion canceling" system currently works only in Japanese. However, the company is exploring the possibility of extending the technology to other languages for international markets.

Notably, initial systems are expected to be available by April of next year, with pricing yet to be determined.

Rise of customer harassment​

SoftBank’s initiative comes at a time when Japan is grappling with the issue of customer harassment in the service industry, with the government considering legislation to strengthen worker protection.

The need for such a solution is evident from a recent survey by UA Zensen, Japan’s largest industrial union, which revealed that nearly 47% of service industry workers have experienced customer harassment in the past two years.

Over 100 respondents reported seeking psychiatric help due to the harassment.

Meanwhile, the Tokyo Metropolitan Government is taking steps to address the growing problem of customer harassment by introducing a local ordinance to ban “abusive and unreasonable demands that harm workplace environments.”

Need for positive interactions​

Roy Larke, an expert on Japanese consumer behavior, told SCMP that factors such as higher levels of stress caused by inflation, discontent on social media, and the influence of foreign tourists could be contributing to the rise in customer harassment.

He even suggested that Japanese consumers may need to reassess their expectations and demand less formality in service interactions.

That said, SoftBank’s “emotion canceling” technology represents a significant step towards addressing the problem of customer harassment in call centers.

By utilizing AI to create a calmer communication environment, the company hopes to protect its employees’ well-being and foster positive interactions with customers.
 




1/3
DeepSeek-V2 beats GPT-4 and Claude 3 Opus in some use cases.

I asked them to extract all the keyboard shortcuts from a video script.

Three times for each bot.

Average number of shortcuts extracted:
- DeepSeek -> 10.3
- Claude -> 9
- GPT -> 7.7

DeepSeek is 100 times cheaper than Claude.
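A minimal sketch of this kind of extraction call, assuming DeepSeek's OpenAI-compatible endpoint; the base URL, model name, and input file name are assumptions, so check DeepSeek's API docs.

```python
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible API (base URL and model name assumed).
client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")

script = open("video_script.txt").read()  # file name hypothetical
resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{
        "role": "user",
        "content": "List every keyboard shortcut mentioned in this script, "
                   "one per line:\n\n" + script,
    }],
)
print(resp.choices[0].message.content)
```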

2/3
And Deepseek v2 is open source.

And it can be used commercially.

This is huge.

3/3
Thank you
@StabilityAI !


 



1/2
Was tired of endless scrolling in ChatGPT chats, so I worked on some UI enhancements.
Collapse long messages and use the side panel for quick previews.

chatgpt2/chatgpt2.js at main · frangin2003/chatgpt2

Just run that in Chrome DevTools.

#ChatGPT

2/2
Feedback greatly appreciated!



 


Delving into ChatGPT usage in academic writing through excess vocabulary​

Dmitry Kobak, Rita González Márquez, Emőke-Ágnes Horvát, Jan Lause

Recent large language models (LLMs) can generate and revise text with human-level performance, and have been widely commercialized in systems like ChatGPT. These models come with clear limitations: they can produce inaccurate information, reinforce existing biases, and be easily misused. Yet, many scientists have been using them to assist their scholarly writing. How widespread is LLM usage in the academic literature currently? To answer this question, we use an unbiased, large-scale approach, free from any assumptions about academic LLM usage. We study vocabulary changes in 14 million PubMed abstracts from 2010-2024, and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words. Our analysis based on excess word usage suggests that at least 10% of 2024 abstracts were processed with LLMs. This lower bound differed across disciplines, countries, and journals, and was as high as 30% for some PubMed sub-corpora. We show that the appearance of LLM-based writing assistants has had an unprecedented impact on the scientific literature, surpassing the effect of major world events such as the Covid pandemic.
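A sketch of the excess-word-frequency idea from the abstract: compare a word's observed 2024 frequency against a pre-LLM trend. The schema, example word, and linear-trend counterfactual are assumptions, not the paper's exact method.

```python
import pandas as pd

# abstracts.csv: one row per abstract with columns year, text (schema assumed)
df = pd.read_csv("abstracts.csv")
word = "delve"  # example style word

# Fraction of abstracts per year containing the word.
freq = (
    df.assign(hit=df.text.str.contains(rf"\b{word}", case=False))
      .groupby("year").hit.mean()
)

# Counterfactual 2024 frequency: linear trend fit on pre-LLM years (2010-2021).
pre = freq.loc[2010:2021]
slope = (pre.iloc[-1] - pre.iloc[0]) / (pre.index[-1] - pre.index[0])
expected_2024 = pre.iloc[-1] + slope * (2024 - pre.index[-1])

excess = freq.loc[2024] - expected_2024
print(f"excess frequency of '{word}' in 2024: {excess:.4%}")
```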

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Social and Information Networks (cs.SI)
Cite as: arXiv:2406.07016 [cs.CL]
(or arXiv:2406.07016v1 [cs.CL] for this version)


Submission history​

From: Jan Lause

[v1] Tue, 11 Jun 2024 07:16:34 UTC (416 KB)

 