1/2
This @LaminiAI memory tuning looks quite incredible

Lamini Memory Tuning tunes a massive mixture of memory experts on any open-source LLM.

Each memory expert acts like a LoRA adapter that functionally operates as memory for the model.

Together, the memory experts specialize in a million different ways to ensure faithfulness and factual accuracy to the data the model was tuned on. Inspired by information retrieval, these million memory experts are equivalent to indices from which the model intelligently retrieves and routes.

At inference time, the model retrieves the most relevant experts at each layer and merges them back into the base model to respond to the user query.
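
Lamini hasn't shared code in this thread, so here is a minimal sketch of the general idea: treat each memory expert as a LoRA-style low-rank pair attached to a frozen base layer, with a learned router retrieving the top-k experts per input. All class names, dimensions, and the routing scheme below are illustrative assumptions, not Lamini's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryExpertLayer(nn.Module):
    """A frozen base linear layer plus a bank of LoRA-style 'memory experts';
    a router retrieves the top-k experts per input. Illustrative sketch only,
    not Lamini's implementation."""

    def __init__(self, d_model: int, num_experts: int = 1024,
                 rank: int = 8, top_k: int = 2):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)
        for p in self.base.parameters():
            p.requires_grad_(False)            # the base model stays frozen
        # One low-rank (A, B) pair per memory expert; B starts at zero so
        # each expert is a no-op until tuned on its slice of the data.
        self.A = nn.Parameter(torch.randn(num_experts, rank, d_model) * 0.02)
        self.B = nn.Parameter(torch.zeros(num_experts, d_model, rank))
        self.router = nn.Linear(d_model, num_experts)  # acts as a retrieval index
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = self.base(x)
        for k in range(self.top_k):                        # merge retrieved experts
            A, B = self.A[idx[:, k]], self.B[idx[:, k]]
            delta = torch.einsum('bdr,brm,bm->bd', B, A, x)
            out = out + weights[:, k:k + 1] * delta
        return out
```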

2/2
Sounds like a Mixture of LoRAs

having multiple LoRAs and swapping them out per prompt.

LoRA, especially with this approach, will bring in a whole bunch of related knowledge, embedded in the weights rather than using up context.
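
With the Hugging Face PEFT library, that per-prompt swapping is a few lines. A minimal sketch, assuming a Llama-3 base model and two locally saved adapters; the paths and adapter names here are hypothetical:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Meta-Llama-3-8B"        # assumed base model
base = AutoModelForCausalLM.from_pretrained(base_id)
tok = AutoTokenizer.from_pretrained(base_id)

# Load two task-specific LoRA adapters (paths/names are hypothetical)
model = PeftModel.from_pretrained(base, "./adapters/sql", adapter_name="sql")
model.load_adapter("./adapters/medical", adapter_name="medical")

def answer(prompt: str, adapter: str) -> str:
    model.set_adapter(adapter)                 # swap the active LoRA per prompt
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64)
    return tok.decode(out[0], skip_special_tokens=True)

print(answer("Write a SQL query for the top customer by revenue.", "sql"))
```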



1/7
Introducing HumanPlus - Shadowing part

Humanoids are born for using human data. We build a real-time shadowing system using a single RGB camera and a whole-body policy for cloning human motion. Examples:
- boxing
- playing the piano/ping pong
- tossing
- typing

Open-sourced!

2/7
Which hardware platform should HumanPlus be embodied on?

We build our own 33-DoF humanoid with two dexterous hands using these components:
- Inspire-Robots RH56DFX hands
- @UnitreeRobotics H1 robot
- @ROBOTIS Dynamixel motors
- @Razer webcams

We open-source our hardware design.

3/7
Naively copying joints from humans to humanoids does not work due to gravity and different actuations.

We train a transformer-based whole-body RL policy in IsaacGym simulation with realistic physics, using the AMASS dataset, which contains 40 hours of human motion.

4/7
To retarget from humans to humanoids, we copy the corresponding Euler angles from SMPL-X to our humanoid model.

We use open-sourced SOTA human pose and hand estimation methods (thanks!)
- WHAM for body
- HaMeR for hands
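
As described, the retargeting step is essentially a joint-to-joint copy of Euler angles followed by clipping to the robot's joint limits. A toy sketch of that mapping; the joint names, correspondence table, and limits below are made up for illustration, not the actual HumanPlus pipeline:

```python
import numpy as np

SMPLX_TO_H1 = {            # hypothetical correspondence table
    "left_shoulder": "left_shoulder_pitch",
    "left_elbow": "left_elbow",
    "left_hip": "left_hip_pitch",
}
JOINT_LIMITS = {           # radians, illustrative values only
    "left_shoulder_pitch": (-2.87, 2.87),
    "left_elbow": (-0.92, 2.61),
    "left_hip_pitch": (-1.57, 1.57),
}

def retarget(smplx_euler: dict) -> dict:
    """Copy the relevant Euler angle for each mapped joint, then clip
    to the humanoid's joint limits."""
    targets = {}
    for src, dst in SMPLX_TO_H1.items():
        angle = smplx_euler[src][0]          # take the primary rotation axis
        lo, hi = JOINT_LIMITS[dst]
        targets[dst] = float(np.clip(angle, lo, hi))
    return targets

print(retarget({"left_shoulder": np.array([0.4, 0.0, 0.1]),
                "left_elbow": np.array([1.2, 0.0, 0.0]),
                "left_hip": np.array([-0.3, 0.0, 0.0])}))
```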

5/7
Compared with other teleoperation methods, shadowing
- is affordable
- requires only 1 human operator
- avoids singularities
- natively supports whole-body control

6/7
Shadowing is an efficient data collection pipeline.

We then perform supervised behavior cloning to train skill policies using egocentric vision, allowing humanoids to complete different tasks autonomously by imitating human skills.

7/7
This project would not be possible without our team of experts, spanning computer graphics, robot learning, and robot hardware:
- co-leads: @qingqing_zhao_ @Qi_Wu577
- advisors: @chelseabfinn @GordonWetzstein

project website: HumanPlus: Humanoid Shadowing and Imitation from Humans
hardware:



1/6
Introducing HumanPlus - Autonomous Skills part

Humanoids are born for using human data. Imitating humans, our humanoid learns to:
- fold sweatshirts
- unload objects from warehouse racks
- perform diverse locomotion skills (squatting, jumping, standing)
- greet another robot

Open-sourced!

2/6
We build our customized 33-DoF humanoid and a data collection pipeline based on real-time shadowing in the real world.

3/6
Using the data collected through shadowing, we then perform supervised behavior cloning to train skill policies using egocentric vision.

We introduce the Humanoid Imitation Transformer (HIT). Based on ACT, HIT adds forward-dynamics prediction in image feature space as a regularization.
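
A hedged sketch of what that regularization could look like as a training loss; the class name and the lambda weighting are illustrative assumptions, not HIT's exact formulation:

```python
import torch.nn as nn

class HITLoss(nn.Module):
    """Behavior-cloning loss plus a forward-dynamics term computed in
    image feature space. Illustrative sketch of the idea, not HIT's
    exact loss; the 0.1 weighting is an arbitrary assumption."""

    def __init__(self, lam: float = 0.1):
        super().__init__()
        self.lam = lam
        self.mse = nn.MSELoss()

    def forward(self, pred_actions, actions, pred_next_feat, next_feat):
        bc = self.mse(pred_actions, actions)                # imitate demonstrated actions
        dyn = self.mse(pred_next_feat, next_feat.detach())  # predict next-step image features
        return bc + self.lam * dyn
```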

4/6
Compared to baselines, HIT uses
- binocular vision, providing implicit stereo for depth information
- visual feedback more effectively, avoiding overfitting to proprioception given small demonstration datasets

5/6
Besides vision-based whole-body manipulation skills, our humanoid has strong locomotion skills:
- outperforming H1 default standing controller under strong perturbation forces
- enabling more whole-body skills like squatting and jumping

6/6
This project would not be possible without our team of experts, spanning computer graphics, robot learning, and robot hardware:
- co-leads: @qingqing_zhao_ @Qi_Wu577
- advisors: @chelseabfinn @GordonWetzstein


project website: HumanPlus: Humanoid Shadowing and Imitation from Humans
hardware:




1/1
There's a lack of datasets for genuine long-form video comprehension.

This one is quite impressive for authentic long-form video understanding - CinePile

It contains 305,000 multiple-choice questions (MCQs) covering various visual and multimodal aspects, including temporal comprehension, understanding of human-object interactions, and reasoning about events or actions within a scene.




1/1
A OnePlus phone with 24GB of RAM running Mixtral 8x7B at 11 tokens/second.

Much faster inference speed vs llama.cpp and MLC-LLM.

It uses swap and caching to run the model even when it doesn't fit in the available RAM.

Between Apple's "LLM in a Flash" and PowerInfer-2, it seems like the GPU in your pocket (phone) will have a local GPT-4 in another 18 months.

LLMs are a new kernel and should be a low-level utility embedded in every damn device that can be updated over the air.



1/1
The researchers used a OnePlus smartphone with 24GB of RAM as their testing platform.

They implemented PowerInfer-2 by extending the original PowerInfer GitHub codebase with an additional 12K lines of code.

The maintainer of PowerInfer said PowerInfer-2 will be open-sourced based on the original PowerInfer repo. They are refining it to untangle it from their testing platform and make it accessible on PCs for the community; open-sourcing will happen in stages, starting soon.



1/1
Decoding speeds of PowerInfer-2, llama.cpp, and MLC-LLM on TurboSparse-Mistral-7B with different offloading setups. "50% offload" means 50% of the FFN block weights are offloaded to flash storage. "No offload" means all model parameters are resident in memory. A red ⨉ label indicates an execution failure due to lack of weight-offloading support.




1/1
There is lots of talk on this platform about when LLMs will hit a wall in growth, but if you're watching the research, it is clear that there is a wide variety of AI approaches being proposed and tested.

To the average user who doesn't care, it may just seem like seamless improvement.



1/1
"Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-38B"

From 25.47% to 45.49% in GSM-Hard

Also worth noting: the head of DeepMind said last year that augmenting LLMs with Monte Carlo Tree Search may be the fastest path to AGI.

This paper introduces the MCT Self-Refine (MCTSr) algorithm, which integrates Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS) to enhance performance on complex mathematical reasoning tasks like Olympiad-level problems. The key problem being addressed is the accuracy and reliability challenges faced by LLMs in strategic and mathematical reasoning.

MCTSr constructs a Monte Carlo search tree through iterative processes of Selection (using an improved Upper Confidence Bound formula to balance exploration-exploitation), self-refine (the LLM generates feedback to guide refining an answer), self-evaluation (the LLM scores the quality of the refined answer), and Backpropagation (propagating the refined answer's value back through the tree).

The self-refine process uses a multi-turn dialogue prompt where the LLM first generates a critical comment on the current answer, then refines the answer guided by that comment. The self-evaluation scores an answer from -100 to 100 and applies constraints like strict scoring standards and suppressing perfect scores to improve reliability.

Backpropagation updates a node's Q value (estimated answer quality) by averaging its current Q value and the max Q value of its child nodes. Candidate nodes for further expansion are selected based on criteria like number of child nodes and child Q values exceeding the parent's.
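
A compact sketch of one MCTSr loop as summarized above. The prompts, the UCB constant, and the score-suppression rule are placeholder assumptions; the paper's exact prompts and improved-UCB formula differ:

```python
import math

def llm(prompt: str) -> str:
    """Placeholder for a call to an actual model (e.g. LLaMA-3 8B)."""
    raise NotImplementedError

class Node:
    def __init__(self, answer, parent=None):
        self.answer, self.parent = answer, parent
        self.children, self.q, self.visits = [], 0.0, 0

def ucb(node, c=1.4, eps=1e-8):
    # Stand-in for the paper's improved UCB: exploitation + exploration
    return node.q + c * math.sqrt(math.log(node.parent.visits + 1) / (node.visits + eps))

def self_refine(node):
    critique = llm(f"Critique this answer:\n{node.answer}")
    refined = llm(f"Refine the answer using this critique:\n{critique}\n{node.answer}")
    return Node(refined, parent=node)

def self_evaluate(node):
    # The LLM scores the answer in [-100, 100]; perfect scores are suppressed
    score = float(llm(f"Score this answer from -100 to 100:\n{node.answer}"))
    return min(score, 95.0) / 100.0

def backpropagate(leaf):
    node = leaf
    while node is not None:
        node.visits += 1
        if node.children:
            # Average the node's current Q with the max child Q
            node.q = 0.5 * (node.q + max(c.q for c in node.children))
        node = node.parent

def mctsr(question: str, rollouts: int = 8) -> str:
    root = Node(llm(question))
    root.q = self_evaluate(root)
    best = root
    for _ in range(rollouts):
        node = root
        while node.children:                 # Selection
            node = max(node.children, key=ucb)
        child = self_refine(node)            # Self-refine (expansion)
        child.q = self_evaluate(child)       # Self-evaluation
        node.children.append(child)
        backpropagate(child)                 # Backpropagation
        if child.q > best.q:
            best = child
    return best.answer
```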

Experiments demonstrate MCTSr significantly improves success rates on datasets like GSM8K (up to 96.66% with 8 rollouts vs 74.07% zero-shot), MATH (58.24% overall with 8 rollouts vs 24.36% zero-shot), and Olympiad-level benchmarks like AIME (11.79% with 8 rollouts vs 2.36% zero-shot). Performance scales with number of rollouts.

Compared to closed-source LLMs like GPT-4, MCTSr with LLaMA-3 8B achieves comparable results, showing it can boost the reasoning capabilities of smaller open-source models. The paper concludes MCTSr is a robust and promising approach for complex mathematical reasoning with LLMs.




1/2
Dramatic breakdown of SOTA LLMs' reasoning capabilities when confronted with a simple common sense problem called the "Alice In Wonderland (AIW) problem".

The AIW problem is a concise natural language task that asks: "Alice has N brothers and she also has M sisters. How many sisters does Alice's brother have?" While easily solvable by humans using common sense reasoning (the correct answer is M+1), most tested LLMs, including GPT-3.5/4, Claude, Gemini, LLaMA, Mistral, and others, show a severe collapse in performance, often providing nonsensical answers and reasoning.
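
The task is trivial to generate and grade programmatically, which is part of what makes the collapse striking. A minimal harness sketch; the answer-extraction heuristic is an assumption, not the paper's evaluation protocol:

```python
import re

def aiw_prompt(n_brothers: int, m_sisters: int) -> str:
    return (f"Alice has {n_brothers} brothers and she also has "
            f"{m_sisters} sisters. How many sisters does Alice's brother have?")

def ground_truth(m_sisters: int) -> int:
    # Each brother has Alice's sisters plus Alice herself
    return m_sisters + 1

def grade(response: str, m_sisters: int) -> bool:
    # Naive answer extraction: take the last number in the response
    numbers = re.findall(r"\d+", response)
    return bool(numbers) and int(numbers[-1]) == ground_truth(m_sisters)

# Example: N=3, M=2 -> each brother has 3 sisters
print(aiw_prompt(3, 2))
print(grade("So the brother has 3 sisters.", m_sisters=2))  # True
```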

Especially in the world of LLMs: if you go small, you will go home.

"The few models capable of showing significant non-zero correct response rate for the AIW problem are residing on the largest scale."

---

This is despite their strong performance on standardized reasoning benchmarks. The key conclusion is that current LLMs lack basic reasoning skills, and existing benchmarks fail to properly detect these deficiencies.

Notably, even when LLMs occasionally provide correct answers, they often express strong overconfidence in their wrong solutions and generate confabulations (persuasive but nonsensical explanations) to justify their incorrect responses. Standard interventions like enhanced prompting or asking models to re-evaluate their answers fail to improve performance.

The authors introduce a harder variation called AIW+, which causes an even stronger performance collapse across all tested models, including GPT-4 and Claude 3 Opus, which performed relatively better on the original AIW problem.

This study highlights a striking discrepancy between LLMs' high scores on standardized reasoning benchmarks (e.g., MMLU, ARC, Hellaswag) and their poor performance on the AIW problem, suggesting that current benchmarks do not adequately reflect models' true reasoning capabilities and weaknesses.

The authors emphasize the need for the ML community to develop new reasoning benchmarks that can properly detect such deficits and guide the improvement of LLMs' reasoning skills. They also stress the importance of fully open and reproducible training pipelines, including dataset composition, to enable proper analysis and progress in this area.

2/2
https://arxiv.org/pdf/2406.02061




1/1
LoRA Finetuning of LLMs can be mysterious.

First the basics - with LoRA the fine-tuned weight W′ can be represented as:

W′ = W_0 + ∆W = W_0 + BA

Where the trainable parameters are the low-rank matrices B and A

Essentially, for finetuning, one can either initialize B to zero and A to random values (the default initialization in the PEFT package), or vice versa.

In both cases, the product BA equals zero at initialization, so finetuning starts from the pretrained model.

Both initialization schemes should therefore, in principle, yield the same performance and share the same optimal learning rate.

BUT this paper concludes that initializing A with random values and B with zeros (Init[A]) generally leads to better performance than the opposite scheme (Init[B]).
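
A minimal PyTorch sketch of the two schemes; the width and rank here are arbitrary:

```python
import torch
import torch.nn as nn

d, r = 1024, 8   # model width and LoRA rank (arbitrary here)

# Init[A]: A random, B zero -- the PEFT default and the paper's recommendation
A = nn.Parameter(torch.randn(r, d) / d**0.5)
B = nn.Parameter(torch.zeros(d, r))

# Init[B]: B random, A zero -- the opposite scheme
A_alt = nn.Parameter(torch.zeros(r, d))
B_alt = nn.Parameter(torch.randn(d, r) / d**0.5)

# Either way, Delta W = BA is zero at initialization, so finetuning
# starts exactly from the pretrained weights W0:
assert torch.equal(B @ A, torch.zeros(d, d))
assert torch.equal(B_alt @ A_alt, torch.zeros(d, d))

def finetuned_weight(W0, B, A):
    """W' = W0 + BA, with only B and A trainable."""
    return W0 + B @ A
```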

------------

Through theoretical analysis in the infinite-width limit, the authors show that Init[A] allows the use of larger learning rates without causing output instability, compared to Init[B]: the maximal stable learning rate scales as Θ(n^(-1/2)) for Init[A] but Θ(n^(-1)) for Init[B], where n is the model width.

Init[A] can lead to "internal instability", where the LoRA features AZ are large but the LoRA output BAZ remains small. This instability enables more efficient feature learning. The authors identify a stability/feature-learning tradeoff, where the optimal learning rate balances internal stability against feature learning.

In contrast, Init[B] does not cause instabilities but leads to suboptimal feature learning, as the B matrix is undertrained in the infinite-width limit.

Extensive experiments on synthetic models and real-world LLMs (RoBERTa, Llama) finetuned on various tasks (GLUE, WikiText-2, Flan-v2, GSM8K) validate the theoretical findings. Init[A] consistently achieves better performance than Init[B], and the optimal learning rate for Init[A] is generally larger than for Init[B].




GLM-4-9B-Chat

Model Introduction

GLM-4-9B is the open-source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI. In evaluations on datasets covering semantics, mathematics, reasoning, code, and knowledge, GLM-4-9B and its human-preference-aligned version GLM-4-9B-Chat show superior performance beyond Llama-3-8B.

In addition to multi-round conversations, GLM-4-9B-Chat also has advanced features such as web browsing, code execution, custom tool calls (Function Call), and long-text reasoning (supporting up to 128K context). This generation of models adds multi-language support, covering 26 languages including Japanese, Korean, and German.

We have also launched the GLM-4-9B-Chat-1M model, which supports a 1M context length (about 2 million Chinese characters), and the multimodal model GLM-4V-9B based on GLM-4-9B. GLM-4V-9B possesses dialogue capabilities in both Chinese and English at a high resolution of 1120×1120. In various multimodal evaluations, including comprehensive abilities in Chinese and English, perception & reasoning, text recognition, and chart understanding, GLM-4V-9B demonstrates superior performance compared to GPT-4-turbo-2024-04-09, Gemini 1.0 Pro, Qwen-VL-Max, and Claude 3 Opus.
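
A minimal loading sketch using the standard transformers pattern, assuming the Hugging Face repo id THUDM/glm-4-9b-chat; check the model card for the exact recommended snippet:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "THUDM/glm-4-9b-chat"   # assumed repo id; see the model card
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
).eval()

messages = [{"role": "user", "content": "Summarize GLM-4-9B-Chat's features."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=128)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```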

Benchmark

We evaluated the GLM-4-9B-Chat model on some classic tasks and obtained the following results:

| Model | AlignBench-v2 | MT-Bench | IFEval | MMLU | C-Eval | GSM8K | MATH | HumanEval | NCB |
|---|---|---|---|---|---|---|---|---|---|
| Llama-3-8B-Instruct | 5.12 | 8.00 | 68.58 | 68.4 | 51.3 | 79.6 | 30.0 | 62.2 | 24.7 |
| ChatGLM3-6B | 3.97 | 5.50 | 28.1 | 66.4 | 69.0 | 72.3 | 25.7 | 58.5 | 11.3 |
| GLM-4-9B-Chat | 6.61 | 8.35 | 69.0 | 72.4 | 75.6 | 79.6 | 50.6 | 71.8 | 32.2 |

Long Context

The eval_needle experiment was conducted with a context length of 1M, and the results are as follows:

(needle-in-a-haystack results figure)


Qwen2-72B

Introduction

Qwen2 is the new series of Qwen large language models. For Qwen2, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model. This repo contains the 72B Qwen2 base language model.

Compared with state-of-the-art open-source language models, including the previously released Qwen1.5, Qwen2 has generally surpassed most open-source models and demonstrated competitiveness against proprietary models across a series of benchmarks targeting language understanding, language generation, multilingual capability, coding, mathematics, reasoning, etc.

For more details, please refer to our blog, GitHub, and Documentation.

Model Details

Qwen2 is a language model series including decoder language models of different sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, grouped query attention, etc. Additionally, we have an improved tokenizer adaptive to multiple natural languages and code.
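
A minimal loading sketch using the standard transformers pattern, assuming the Hugging Face repo id Qwen/Qwen2-72B; as a base model it does plain completion rather than chat:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Qwen/Qwen2-72B"   # assumed repo id; this is the base (non-chat) model
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tok("Qwen2 uses grouped query attention because",
             return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```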
 


1/3
This is from Apple's State of the Union

The local model is a 3B-parameter SLM that uses adapters trained for each specific feature. The diffusion model does the same thing, with an adapter for each style.

Anything running locally or on Apple's Secure Cloud is an Apple model, not OpenAI.

2/3
I was hoping for this or even better HomePods with M series chips, but my guess is they’ll need dedicated hardware and networking

3/3
Two things:

1. I’m spreading more information and truth about how it works
2. They know me




Introducing Apple’s On-Device and Server Foundation Models

June 10, 2024

At the 2024 Worldwide Developers Conference, we introduced Apple Intelligence, a personal intelligence system integrated deeply into iOS 18, iPadOS 18, and macOS Sequoia.

Apple Intelligence comprises multiple highly capable generative models that are specialized for our users’ everyday tasks, and can adapt on the fly for their current activity. The foundation models built into Apple Intelligence have been fine-tuned for user experiences such as writing and refining text, prioritizing and summarizing notifications, creating playful images for conversations with family and friends, and taking in-app actions to simplify interactions across apps.

In the following overview, we will detail how two of these models — a ~3 billion parameter on-device language model, and a larger server-based language model available with Private Cloud Compute and running on Apple silicon servers — have been built and adapted to perform specialized tasks efficiently, accurately, and responsibly. These two foundation models are part of a larger family of generative models created by Apple to support users and developers; this includes a coding model to build intelligence into Xcode, as well as a diffusion model to help users express themselves visually, for example, in the Messages app. We look forward to sharing more information soon on this broader set of models.

Our Focus on Responsible AI Development

Apple Intelligence is designed with our core values at every step and built on a foundation of groundbreaking privacy innovations.

Additionally, we have created a set of Responsible AI principles to guide how we develop AI tools, as well as the models that underpin them:

  1. Empower users with intelligent tools: We identify areas where AI can be used responsibly to create tools for addressing specific user needs. We respect how our users choose to use these tools to accomplish their goals.
  2. Represent our users: We build deeply personal products with the goal of representing users around the globe authentically. We work continuously to avoid perpetuating stereotypes and systemic biases across our AI tools and models.
  3. Design with care: We take precautions at every stage of our process, including design, model training, feature development, and quality evaluation to identify how our AI tools may be misused or lead to potential harm. We will continuously and proactively improve our AI tools with the help of user feedback.
  4. Protect privacy: We protect our users' privacy with powerful on-device processing and groundbreaking infrastructure like Private Cloud Compute. We do not use our users' private personal data or user interactions when training our foundation models.

These principles are reflected throughout the architecture that enables Apple Intelligence, connects features and tools with specialized models, and scans inputs and outputs to provide each feature with the information needed to function responsibly.

In the remainder of this overview, we provide details on decisions such as: how we develop models that are highly capable, fast, and power-efficient; how we approach training these models; how our adapters are fine-tuned for specific user needs; and how we evaluate model performance for both helpfulness and unintended harm.


Figure 1: Modeling overview for the Apple foundation models.


Pre-Training

Our foundation models are trained on Apple's AXLearn framework, an open-source project we released in 2023. It builds on top of JAX and XLA, and allows us to train the models with high efficiency and scalability on various training hardware and cloud platforms, including TPUs and both cloud and on-premise GPUs. We used a combination of data parallelism, tensor parallelism, sequence parallelism, and Fully Sharded Data Parallel (FSDP) to scale training along multiple dimensions such as data, model, and sequence length.

We train our foundation models on licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web-crawler, AppleBot. Web publishers have the option to opt out of the use of their web content for Apple Intelligence training with a data usage control.

We never use our users’ private personal data or user interactions when training our foundation models, and we apply filters to remove personally identifiable information like social security and credit card numbers that are publicly available on the Internet. We also filter profanity and other low-quality content to prevent its inclusion in the training corpus. In addition to filtering, we perform data extraction, deduplication, and the application of a model-based classifier to identify high quality documents.

Post-Training

We find that data quality is essential to model success, so we utilize a hybrid data strategy in our training pipeline, incorporating both human-annotated and synthetic data, and conduct thorough data curation and filtering procedures. We have developed two novel algorithms in post-training: (1) a rejection sampling fine-tuning algorithm with teacher committee, and (2) a reinforcement learning from human feedback (RLHF) algorithm with mirror descent policy optimization and a leave-one-out advantage estimator. We find that these two algorithms lead to significant improvement in the model’s instruction-following quality.

Optimization

In addition to ensuring our generative models are highly capable, we have used a range of innovative techniques to optimize them on-device and on our private cloud for speed and efficiency. We have applied an extensive set of optimizations for both first token and extended token inference performance.

Both the on-device and server models use grouped-query-attention. We use shared input and output vocab embedding tables to reduce memory requirements and inference cost. These shared embedding tensors are mapped without duplications. The on-device model uses a vocab size of 49K, while the server model uses a vocab size of 100K, which includes additional language and technical tokens.

For on-device inference, we use low-bit palletization, a critical optimization technique that achieves the necessary memory, power, and performance requirements. To maintain model quality, we developed a new framework using LoRA adapters that incorporates a mixed 2-bit and 4-bit configuration strategy — averaging 3.5 bits-per-weight — to achieve the same accuracy as the uncompressed models.

Additionally, we use an interactive model latency and power analysis tool, Talaria, to better guide the bit rate selection for each operation. We also utilize activation quantization and embedding quantization, and have developed an approach to enable efficient Key-Value (KV) cache update on our neural engines.

With this set of optimizations, on iPhone 15 Pro we are able to reach time-to-first-token latency of about 0.6 millisecond per prompt token, and a generation rate of 30 tokens per second. Notably, this performance is attained before employing token speculation techniques, from which we see further enhancement on the token generation rate.
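
A quick back-of-the-envelope check on those figures; the exact parameter count and the prompt length below are assumptions, not Apple's published values:

```python
# Back-of-the-envelope check on the published performance figures.
params = 3e9                  # "~3 billion parameter" on-device model (assumed exact)
bits_per_weight = 3.5         # mixed 2-bit/4-bit palletization average
print(f"compressed weights: {params * bits_per_weight / 8 / 1e9:.2f} GB")  # ~1.31 GB

ttft_per_token = 0.6e-3       # 0.6 ms per prompt token
prompt_tokens = 1024          # hypothetical prompt length
print(f"time to first token: {ttft_per_token * prompt_tokens:.2f} s")      # ~0.61 s

gen_rate = 30                 # tokens per second
print(f"256 generated tokens: ~{256 / gen_rate:.1f} s")                    # ~8.5 s
```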

Model Adaptation

Our foundation models are fine-tuned for users’ everyday activities, and can dynamically specialize themselves on-the-fly for the task at hand. We utilize adapters, small neural network modules that can be plugged into various layers of the pre-trained model, to fine-tune our models for specific tasks. For our models we adapt the attention matrices, the attention projection matrix, and the fully connected layers in the point-wise feedforward networks for a suitable set of the decoding layers of the transformer architecture.

By fine-tuning only the adapter layers, the original parameters of the base pre-trained model remain unchanged, preserving the general knowledge of the model while tailoring the adapter layers to support specific tasks.

Figure 2: Adapters are small collections of model weights that are overlaid onto the common base foundation model. They can be dynamically loaded and swapped — giving the foundation model the ability to specialize itself on-the-fly for the task at hand. Apple Intelligence includes a broad set of adapters, each fine-tuned for a specific feature. It’s an efficient way to scale the capabilities of our foundation model.

We represent the values of the adapter parameters using 16 bits, and for the ~3 billion parameter on-device model, the parameters for a rank 16 adapter typically require 10s of megabytes. The adapter models can be dynamically loaded, temporarily cached in memory, and swapped — giving our foundation model the ability to specialize itself on the fly for the task at hand while efficiently managing memory and guaranteeing the operating system's responsiveness.
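
A rough order-of-magnitude check on that "10s of megabytes" figure; the hidden size, layer count, and number of adapted matrices below are assumptions, and every adapted matrix is treated as square for simplicity:

```python
# Rough size estimate for a rank-16 LoRA adapter on a ~3B model.
# Hidden size, layer count, and matrices-per-layer are assumptions,
# and every adapted matrix is treated as d x d for simplicity.
d_model, rank, n_layers = 3072, 16, 32
matrices_per_layer = 7        # q, k, v, o + three feedforward projections (assumed)

params_per_matrix = rank * (d_model + d_model)        # A: (r, d) plus B: (d, r)
total_params = params_per_matrix * matrices_per_layer * n_layers
print(f"{total_params / 1e6:.1f}M params -> {total_params * 2 / 1e6:.0f} MB at 16 bits")
# ~22M params -> ~44 MB: consistent with "10s of megabytes"
```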

To facilitate the training of the adapters, we created an efficient infrastructure that allows us to rapidly retrain, test, and deploy adapters when either the base model or the training data gets updated. The adapter parameters are initialized using the accuracy-recovery adapter introduced in the Optimization section.

Performance and Evaluation

Our focus is on delivering generative models that can enable users to communicate, work, express themselves, and get things done across their Apple products. When benchmarking our models, we focus on human evaluation as we find that these results are highly correlated to user experience in our products. We conducted performance evaluations on both feature-specific adapters and the foundation models.

To illustrate our approach, we look at how we evaluated our adapter for summarization. As product requirements for summaries of emails and notifications differ in subtle but important ways, we fine-tune accuracy-recovery low-rank (LoRA) adapters on top of the palletized model to meet these specific requirements. Our training data is based on synthetic summaries generated from bigger server models, filtered by a rejection sampling strategy that keeps only the high quality summaries.

To evaluate the product-specific summarization, we use a set of 750 responses carefully sampled for each use case. These evaluation datasets emphasize a diverse set of inputs that our product features are likely to face in production, and include a stratified mixture of single and stacked documents of varying content types and lengths. As product features, it was important to evaluate performance against datasets that are representative of real use cases. We find that our models with adapters generate better summaries than a comparable model.

As part of responsible development, we identified and evaluated specific risks inherent to summarization. For example, summaries occasionally remove important nuance or other details in ways that are undesirable. However, we found that the summarization adapter did not amplify sensitive content in over 99% of targeted adversarial examples. We continue to adversarially probe to identify unknown harms and expand our evaluations to help guide further improvements.
 


Delving into ChatGPT usage in academic writing through excess vocabulary

Dmitry Kobak, Rita González Márquez, Emőke-Ágnes Horvát, Jan Lause

Recent large language models (LLMs) can generate and revise text with human-level performance, and have been widely commercialized in systems like ChatGPT. These models come with clear limitations: they can produce inaccurate information, reinforce existing biases, and be easily misused. Yet, many scientists have been using them to assist their scholarly writing. How widespread is LLM usage in the academic literature currently? To answer this question, we use an unbiased, large-scale approach, free from any assumptions on academic LLM usage. We study vocabulary changes in 14 million PubMed abstracts from 2010-2024, and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words. Our analysis based on excess word usage suggests that at least 10% of 2024 abstracts were processed with LLMs. This lower bound differed across disciplines, countries, and journals, and was as high as 30% for some PubMed sub-corpora. We show that the appearance of LLM-based writing assistants has had an unprecedented impact in the scientific literature, surpassing the effect of major world events such as the Covid pandemic.

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Social and Information Networks (cs.SI)
Cite as: arXiv:2406.07016 [cs.CL] (or arXiv:2406.07016v1 [cs.CL] for this version)


Submission history

From: Jan Lause

[v1] Tue, 11 Jun 2024 07:16:34 UTC (416 KB)
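
The excess-vocabulary method boils down to comparing observed word frequencies against a pre-LLM trend. A toy sketch of that idea; the numbers are fabricated for illustration, and the linear extrapolation is an assumption, not the paper's exact counterfactual model:

```python
import numpy as np
import pandas as pd

# Toy frequencies (fraction of abstracts containing the word) for one
# LLM-flavored style word; the numbers are fabricated for illustration.
df = pd.DataFrame({
    "year": range(2018, 2025),
    "freq": [0.0031, 0.0032, 0.0030, 0.0033, 0.0034, 0.0091, 0.0274],
})

# Fit a linear trend on the pre-LLM years and extrapolate forward
pre_llm = df[df.year <= 2022]
slope, intercept = np.polyfit(pre_llm.year, pre_llm.freq, 1)
df["expected"] = slope * df.year + intercept
df["excess"] = (df.freq - df.expected).clip(lower=0)
print(df.round(4))
```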

 


OpenAI CEO says company could become for-profit corporation, The Information reports

By Reuters

June 15, 2024, 1:37 AM EDT

Sam Altman, CEO of OpenAI, attends the 54th annual meeting of the World Economic Forum in Davos, Switzerland, January 18, 2024. REUTERS/Denis Balibouse/File Photo
June 14 (Reuters) - OpenAI CEO Sam Altman told some shareholders that the company is considering changing its governance structure to a for-profit business that the firm's nonprofit board doesn't control, The Information reported on Friday.

One scenario Altman said the board is considering is a for-profit benefit corporation, which rivals such as Anthropic and xAI are using, the report said, citing a person who heard the comments.

The restructuring discussions are fluid and Altman and his fellow directors could ultimately decide to take a different approach, The Information added.

In response to Reuters' queries about the report, OpenAI said: "We remain focused on building AI that benefits everyone. The nonprofit is core to our mission and will continue to exist."
 





"We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from DeepSeek-Coder-V2-Base with 6 trillion tokens sourced from a high-quality and multi-source corpus. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-Coder-V2-Base, while maintaining comparable performance in general language tasks. Compared to DeepSeek-Coder, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K."

 