Adapting Large Language Models via Reading Comprehension​

This repo contains the model, code and data for our paper Adapting Large Language Models via Reading Comprehension

We explore continued pre-training on domain-specific corpora for large language models. While this approach enriches LLMs with domain knowledge, it significantly hurts their prompting ability for question answering. Inspired by human learning via reading comprehension, we propose a simple method to transform large-scale pre-training corpora into reading comprehension texts, consistently improving prompting performance across tasks in biomedicine, finance, and law domains. Our 7B model competes with much larger domain-specific models like BloombergGPT-50B. Moreover, our domain-specific reading comprehension texts enhance model performance even on general benchmarks, indicating potential for developing a general LLM across more domains.
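
To make the core idea concrete, here is a toy sketch of turning a raw domain passage into a reading-comprehension-style training example. The task templates below are invented for illustration; the paper mines its tasks from the raw text itself.

```python
# Illustrative sketch only: the paper derives tasks from the raw text with mined
# patterns and verbalizers; the templates below are invented for demonstration.
def to_reading_comprehension(passage: str) -> str:
    """Append simple comprehension tasks to a raw domain passage."""
    tasks = [
        "Summarize the passage above in one sentence.",
        "List the domain-specific terms that appear in the passage.",
        "Based on the passage, answer: what is its main claim?",
    ]
    numbered = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(tasks))
    return f"{passage}\n\nAnswer the following questions based on the passage:\n{numbered}"

print(to_reading_comprehension(
    "The Federal Reserve kept its benchmark rate unchanged, citing cooling inflation."
))
```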

Domain-specific LLMs​

Our models for the different domains are now available on Hugging Face: biomedicine-LLM, finance-LLM, and law-LLM, along with comparisons of AdaptLLM's performance against other domain-specific LLMs.


Domain-specific Tasks​

To easily reproduce our results, we have uploaded the filled-in zero/few-shot input instructions and output completions of each domain-specific task: biomedicine-tasks, finance-tasks, and law-tasks.
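
A hypothetical loading sketch for these task files; the repository id "AdaptLLM/finance-tasks", the "ConvFinQA" config name, and the split are assumptions and should be checked against the Hugging Face pages mentioned above.

```python
# Hypothetical loading sketch: repo and config names are assumptions based on the
# naming used above ("finance-tasks"); the actual ids and fields may differ.
from datasets import load_dataset

tasks = load_dataset("AdaptLLM/finance-tasks", "ConvFinQA", split="test")
sample = tasks[0]
print(sample.keys())  # inspect the filled-in instruction/completion fields
```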
 

Computer Science > Artificial Intelligence​

[Submitted on 22 Dec 2023]

TACO: Topics in Algorithmic COde generation dataset​

Rongao Li (1 and 2), Jie Fu (1), Bo-Wen Zhang (1), Tao Huang (2), Zhihong Sun (2), Chen Lyu (2), Guang Liu (1), Zhi Jin (3), Ge Li (3) ((1) Beijing Academy of Artificial Intelligence, (2) School of Information Science and Engineering, Shandong Normal University, China, (3) Key Lab of HCST (PKU), MOE, SCS, Peking University, China)
We introduce TACO, an open-source, large-scale code generation dataset focused on the topics of algorithms, designed to provide a more challenging training dataset and evaluation benchmark for code generation models. TACO consists of competition-level programming questions intended to enhance or evaluate problem understanding and reasoning abilities in real-world programming scenarios. The training and test sets contain 25,433 and 1,000 coding problems respectively, along with up to 1.55 million diverse solution answers. Moreover, each TACO problem carries several fine-grained labels, such as task topics, algorithms, programming skills, and difficulty levels, providing a more precise reference for training and evaluating code generation models. The dataset and evaluation scripts are available on Hugging Face Hub (this https URL) and Github (this https URL).
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2312.14852 [cs.AI]
(or arXiv:2312.14852v1 [cs.AI] for this version)

Submission history​

From: Bo-Wen Zhang [view email]
[v1] Fri, 22 Dec 2023 17:25:42 UTC (600 KB)
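
A hypothetical loading sketch for the TACO dataset; the Hugging Face id "BAAI/TACO" and the label field names are assumptions to be checked against the dataset card.

```python
# Hypothetical loading sketch: dataset id and field names ("difficulty", "tags")
# are assumptions; consult the Hugging Face dataset card for the actual schema.
from datasets import load_dataset

taco = load_dataset("BAAI/TACO", split="train")
print(len(taco))  # roughly 25,433 training problems, per the abstract

# Filter to a topic/difficulty slice using the fine-grained labels described above.
easy_dp = taco.filter(
    lambda ex: ex.get("difficulty") == "EASY"
    and "Dynamic Programming" in (ex.get("tags") or "")
)
print(len(easy_dp))
```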


 

Generative Multimodal Models are In-Context Learners​

Quan Sun1*, Yufeng Cui1*, Xiaosong Zhang1*, Fan Zhang1*, Qiying Yu2,1*, Zhengxiong Luo1, Yueze Wang1, Yongming Rao1, Jingjing Liu2, Tiejun Huang1,3, Xinlong Wang1†
1Beijing Academy of Artificial Intelligence 2Tsinghua University 3Peking University
*equal contribution †project lead
arXiv Code Demo 🤗HF Demo 🤗HF Model

Abstract​

The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions), is what current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-context learning capabilities of large multimodal models can be significantly enhanced by effective scaling-up. We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences with a unified autoregressive objective. Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning, such as visual prompting and object-grounded generation. The model sets a new record on multiple multimodal understanding tasks in few-shot settings. When instruction-tuned to follow specific instructions, Emu2 further achieves new state-of-the-art on challenging tasks such as question answering benchmarks for large multimodal models and open-ended subject-driven generation. These achievements demonstrate that Emu2 can serve as a base model and general-purpose interface for a wide range of multimodal tasks. Code and models are publicly available to facilitate future research.

Video​



A strong multimodal few-shot learner

An impressive multimodal generalist

A skilled painter

Zero-shot subject-driven generation

Multimodal in-context learning

Strong multimodal understanding

Generate image from any prompt sequence

Generate video from any prompt sequence




AI-generated explanation:

Sure, let’s break down this abstract into simpler terms and provide some examples:
  1. Multimodal tasks in context: This refers to tasks that involve multiple types of data (like text, images, and sound) and require understanding the context. For example, if you see a picture of a dog and read a story about a dog, you can understand that both are related. This is something humans do easily but is challenging for machines.
  2. In-context learning capabilities of large multimodal models: This means the ability of large AI models to learn from the context in which they’re used. For instance, if an AI model is used to recommend movies, it might learn to suggest horror movies when it’s close to Halloween, based on the context of the time of year.
  3. Emu2: This is the name of the new AI model introduced in the paper. It’s a large model with 37 billion parameters, which means it has a lot of capacity to learn from data.
  4. Unified autoregressive objective: This is a fancy way of saying that Emu2 learns to predict the next piece of data (like the next word in a sentence) based on all the data it has seen so far; a toy sketch of this objective appears after this list.
  5. Visual prompting and object-grounded generation: These are examples of tasks that Emu2 can do. Visual prompting might involve generating a description of an image, while object-grounded generation could involve writing a story about a specific object in an image.
  6. Few-shot settings: This means that Emu2 can learn to do new tasks with only a few examples. For instance, if you show it a few examples of cats and then ask it to identify cats in other images, it can do this effectively.
  7. Instruction-tuned: This means that Emu2 can be adjusted to follow specific instructions, like answering questions or generating text on a specific topic.
  8. Code and models are publicly available: This means that the authors have shared their work publicly, so other researchers can use and build upon it.
I hope this helps! Let me know if you have any other questions.
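
Item 4 above is easiest to see in code. Below is a toy, purely illustrative sketch of a next-token (autoregressive) objective over a mixed sequence of token ids; Emu2's actual architecture, visual tokenization, and training losses differ in detail.

```python
# Toy sketch of a unified autoregressive (next-token) objective over a mixed
# sequence of text and discretized image tokens. Purely illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, seq_len = 1000, 64, 16
embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (2, seq_len))          # interleaved text/image token ids
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len - 1)

hidden = backbone(embed(tokens[:, :-1]), mask=causal_mask)   # each position sees only the past
logits = lm_head(hidden)
# Cross-entropy between the prediction at position t and the true token at t+1.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
print(loss.item())
```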
 
Computer Science > Artificial Intelligence​

[Submitted on 8 Dec 2023]

Language Models, Agent Models, and World Models: The LAW for Machine Reasoning and Planning​

Zhiting Hu, Tianmin Shu
Despite their tremendous success in many applications, large language models often fall short of consistent reasoning and planning in various (language, embodied, and social) scenarios, due to inherent limitations in their inference, learning, and modeling capabilities. In this position paper, we present a new perspective of machine reasoning, LAW, that connects the concepts of Language models, Agent models, and World models, for more robust and versatile reasoning capabilities. In particular, we propose that world and agent models are a better abstraction of reasoning, that introduces the crucial elements of deliberate human-like reasoning, including beliefs about the world and other agents, anticipation of consequences, goals/rewards, and strategic planning. Crucially, language models in LAW serve as a backend to implement the system or its elements and hence provide the computational power and adaptability. We review the recent studies that have made relevant progress and discuss future research directions towards operationalizing the LAW framework.
Comments: Position paper. Accompanying NeurIPS 2023 Tutorial: this https URL
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Cite as: arXiv:2312.05230 [cs.AI]
(or arXiv:2312.05230v1 [cs.AI] for this version)

Submission history​

From: Zhiting Hu [view email]
[v1] Fri, 8 Dec 2023 18:25:22 UTC (981 KB)




 


**The first 7B model achieving an 8.8 on MT-Bench, with strong performance on Humanities, Coding, and Writing.**



xDAN-AI • Discord · Twitter · Huggingface






########## First turn ##########

| model | turn | score | size |
| --- | --- | --- | --- |
| gpt-4 | 1 | 8.95625 | - |
| xDAN-L1-Chat-RL-v1 | 1 | 8.87500 | 7b |
| xDAN-L2-Chat-RL-v2 | 1 | 8.78750 | 30b |
| claude-v1 | 1 | 8.15000 | - |
| gpt-3.5-turbo | 1 | 8.07500 | 20b |
| vicuna-33b-v1.3 | 1 | 7.45625 | 33b |
| wizardlm-30b | 1 | 7.13125 | 30b |
| oasst-sft-7-llama-30b | 1 | 7.10625 | 30b |
| Llama-2-70b-chat | 1 | 6.98750 | 70b |

########## Second turn ##########

| model | turn | score | size |
| --- | --- | --- | --- |
| gpt-4 | 2 | 9.025000 | - |
| xDAN-L2-Chat-RL-v2 | 2 | 8.087500 | 30b |
| xDAN-L1-Chat-RL-v1 | 2 | 7.825000 | 7b |
| gpt-3.5-turbo | 2 | 7.812500 | 20b |
| claude-v1 | 2 | 7.650000 | - |
| wizardlm-30b | 2 | 6.887500 | 30b |
| vicuna-33b-v1.3 | 2 | 6.787500 | 33b |
| Llama-2-70b-chat | 2 | 6.725000 | 70b |

########## Average turn ##########

| model | score | size |
| --- | --- | --- |
| gpt-4 | 8.990625 | - |
| xDAN-L2-Chat-RL-v2 | 8.437500 | 30b |
| xDAN-L1-Chat-RL-v1 | 8.350000 | 7b |
| gpt-3.5-turbo | 7.943750 | 20b |
| claude-v1 | 7.900000 | - |
| vicuna-33b-v1.3 | 7.121875 | 33b |
| wizardlm-30b | 7.009375 | 30b |
| Llama-2-70b-chat | 6.856250 | 70b |
Prompt Template (Alpaca)



Instruction:"You are a helpful assistant named DAN.You are an expert in worldly knowledge, skilled in employing a probing questioning strategy, carefully considering each step before providing answers."​

{Question}



Response:​
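
A minimal sketch of filling this template before sending it to the model; the exact section headers of the official xDAN prompt format are assumed here.

```python
# Minimal sketch: filling the Alpaca-style template shown above.
# The precise section headers of the official xDAN template are assumptions.
SYSTEM = (
    "You are a helpful assistant named DAN. You are an expert in worldly knowledge, "
    "skilled in employing a probing questioning strategy, carefully considering each "
    "step before providing answers."
)

def build_prompt(question: str) -> str:
    return f"Instruction: {SYSTEM}\n\n{question}\n\nResponse:"

print(build_prompt("Explain the difference between precision and recall."))
```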

Created By xDAN-AI at 2023-12-15​

Eval by FastChat: GitHub - lm-sys/FastChat: An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

Check: https://www.xdan.ai

 

Computer Science > Artificial Intelligence​

[Submitted on 13 Dec 2023]

PromptBench: A Unified Library for Evaluation of Large Language Models​

Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, Xing Xie
The evaluation of large language models (LLMs) is crucial to assess their performance and mitigate potential security risks. In this paper, we introduce PromptBench, a unified library to evaluate LLMs. It consists of several key components that are easily used and extended by researchers: prompt construction, prompt engineering, dataset and model loading, adversarial prompt attack, dynamic evaluation protocols, and analysis tools. PromptBench is designed to be an open, general, and flexible codebase for research purposes that can facilitate original study in creating new benchmarks, deploying downstream applications, and designing new evaluation protocols. The code is available at: this https URL and will be continuously supported.
Comments: An extension to PromptBench (arXiv:2306.04528) for unified evaluation of LLMs using the same name; code: this https URL
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2312.07910 [cs.AI]
(or arXiv:2312.07910v1 [cs.AI] for this version)

Submission history

From: Jindong Wang [view email]
[v1] Wed, 13 Dec 2023 05:58:34 UTC (288 KB)






PromptBench: A Unified Library for Evaluating and Understanding Large Language Models.
Paper · Documentation · Leaderboard · More papers


News and Updates​

  • [16/12/2023] Add support for Gemini, Mistral, Mixtral, Baichuan, Yi models.
  • [15/12/2023] Add detailed instructions for users to add new modules (models, datasets, etc.) examples/add_new_modules.md.
  • [05/12/2023] Published promptbench 0.0.1.

Introduction​

PromptBench is a PyTorch-based Python package for the evaluation of large language models (LLMs). It provides user-friendly APIs for researchers to conduct evaluations on LLMs. Check the technical report: [2312.07910] PromptBench: A Unified Library for Evaluation of Large Language Models.
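
For a sense of how such an evaluation loop can look, here is a hedged sketch: the DatasetLoader and LLMModel calls follow the project README, but exact class names, arguments, and output post-processing may differ between versions.

```python
# Sketch of a PromptBench-style evaluation loop; class/method names follow the
# project README but are not guaranteed to match every release.
import promptbench as pb

dataset = pb.DatasetLoader.load_dataset("sst2")                       # built-in dataset
model = pb.LLMModel(model="google/flan-t5-large", max_new_tokens=10)  # built-in model wrapper

prompt = "Classify the sentence as positive or negative:\n{content}\nAnswer:"

correct = 0
for example in dataset:
    raw = model(prompt.format(content=example["content"]))
    pred = 1 if "positive" in raw.lower() else 0
    correct += int(pred == example["label"])

print(f"Accuracy: {correct / len(dataset):.3f}")
```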

Code Structure

What does promptbench currently provide?​

  1. Quick model performance assessment: We offer a user-friendly interface that allows for quick model building, dataset loading, and evaluation of model performance.
  2. Prompt Engineering: We implement several prompt engineering methods, such as Few-shot Chain-of-Thought [1], EmotionPrompt [2], and Expert Prompting [3].
  3. Evaluating adversarial prompts: PromptBench integrates prompt attacks [4], enabling researchers to simulate black-box adversarial prompt attacks on models and evaluate their robustness (see details here).
  4. Dynamic evaluation to mitigate potential test data contamination: we integrated the dynamic evaluation framework DyVal [5], which generates evaluation samples on-the-fly with controlled complexity.
 

LongAnimateDiff​

Sapir Weissbuch, Naomi Ken Korem, Daniel Shalem, Yoav HaCohen | Lightricks Research

Hugging Face Spaces

We are pleased to release the "LongAnimateDiff" model, which has been trained to generate videos with a variable frame count, ranging from 16 to 64 frames. This model is compatible with the original AnimateDiff model.

We release two models:

  1. The LongAnimateDiff model, capable of generating videos with frame counts ranging from 16 to 64. You can download the weights from either Google Drive or HuggingFace. For optimal results, we recommend using a motion scale of 1.28.
  2. A specialized model designed to generate 32-frame videos. This model typically produces higher quality videos compared to the LongAnimateDiff model supporting 16-64 frames. Please download the weights from Google Drive or HuggingFace. For better results, use a motion scale of 1.15.
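
If the released weights load as a diffusers MotionAdapter, usage might look roughly like the sketch below. The repository id, the base model choice, and the omission of the motion-scale setting are all assumptions, not confirmed by the release notes above.

```python
# Rough sketch using the diffusers AnimateDiff pipeline. Assumptions: that the
# LongAnimateDiff weights load as a MotionAdapter from "Lightricks/LongAnimateDiff"
# and that a Stable Diffusion 1.5 base such as "runwayml/stable-diffusion-v1-5" is
# appropriate. The recommended motion scale is adapter/UI specific and not passed here.
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained("Lightricks/LongAnimateDiff")
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", motion_adapter=adapter, torch_dtype=torch.float16
).to("cuda")

frames = pipe(
    prompt="A ship in the ocean",
    num_frames=64,                 # LongAnimateDiff supports 16-64 frames
    num_inference_steps=25,
).frames[0]
export_to_gif(frames, "ship.gif")
```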

Results​

Sample prompts: Walking astronaut on the moon · A corgi dog skying on skis down a snowy mountain · A pizza spinning inside a wood fired pizza oven · A young man is dancing in a Paris nice street · A ship in the ocean · A hamster is riding a bicycle on the road · A drone is flying in the sky above a forest · A drone is flying in the sky above the mountains · A swan swims in the lake · A ginger woman in space future · Photo portrait of old lady with glasses · Small fish swimming in an aquarium



 

Computer Science > Computation and Language​

[Submitted on 26 Dec 2023]

Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4​

Sondos Mahmoud Bsharat, Aidar Myrzakhan, Zhiqiang Shen
This paper introduces 26 guiding principles designed to streamline the process of querying and prompting large language models. Our goal is to simplify the underlying concepts of formulating questions for large language models of various scales, examine their abilities, and enhance user comprehension of how models of different scales behave when given different prompts. Extensive experiments are conducted on LLaMA-1/2 (7B, 13B and 70B) and GPT-3.5/4 to verify the effectiveness of the proposed principles for instruction and prompt design. We hope this work provides a better guide for researchers working on prompting large language models. The project page is available at this https URL.
Comments: Github at: this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2312.16171 [cs.CL]
(or arXiv:2312.16171v1 [cs.CL] for this version)

Submission history

From: Zhiqiang Shen [view email]
[v1] Tue, 26 Dec 2023 18:59:33 UTC (1,127 KB)




Microsoft Azure AI Services Launches Personal Voice for Customized AI Audio​


Microsoft's Personal Voice lets users generate their own AI-based voice using a 60-second audio recording.

By Luke Jones
November 17, 2023 3:41 pm CET




Microsoft has introduced Personal Voice, a new feature allowing individuals and businesses to generate AI-based voices using their own vocal samples. Announced during the Ignite 2023 developers conference, the Azure AI Speech service's latest addition is set to revolutionize the way AI voices are created, providing opportunities in gaming, language dubbing, and personalized voice assistants.

Simplifying Voice Synthesis


The new feature builds upon Microsoft's existing custom neural voice capabilities, streamlining the process for creating a synthetic voice that closely resembles a specific person's speech. Compared to traditional methods that may be complex or expensive, Personal Voice enables users to synthesize a voice that mirrors their own with just a 60-second audio recording.

This technological advancement is seen as particularly transformative for the entertainment industry, where it could be employed to dub an actor's voice across various languages, thereby maintaining a consistent vocal presence. In gaming, players might imbue their characters with a voice that reflects their actual speech, offering a more immersive experience.



Microsoft uses similar AI capabilities in Skype's TruVoice feature. Skype now supports real-time translation for video calls, and the translation uses your personal voice (TruVoice), so the person hearing the translated speech hears it in your actual voice.


Ethical Considerations and Availability


In light of potential misuses, such as creating deceptive audio clips, Microsoft emphasizes the importance of ethical conduct. Users must consent via a recorded statement, acknowledging they are aware that a digital version of their voice will be created and utilized. Adherence to Microsoft's established guidelines and conduct code is mandatory for all users.

Initially, Personal Voice will be accessible within limited regions, including West Europe, East US, and South East Asia. The company is preparing a public preview slated to become available on December 1. Microsoft's initiative arguably represents a step forward in naturalistic AI interactions by melding cutting-edge AI with the uniqueness of individual human voices. With meticulous guidelines and responsible usage, Personal Voice may soon redefine synthetic voice applications across various sectors.





 

Microsoft Copilot is now available as a ChatGPT-like app on Android​


You no longer need the Bing mobile app to access Copilot on Android devices.​


By Tom Warren, a senior editor covering Microsoft, PC gaming, console, and tech. He founded WinRumors, a site dedicated to Microsoft news, before joining The Verge in 2012.

Dec 26, 2023, 9:39 AM EST

Illustration of the Copilot logo. Image: Microsoft

Microsoft has quietly launched a dedicated Copilot app for Android. The new app is available in the Google Play Store, offering access to Microsoft’s AI-powered Copilot without the need for the Bing mobile app. Spotted by Neowin, Copilot on Android has been available for nearly a week, but an iOS version isn’t available just yet.

Microsoft’s Copilot app on Android is very similar to ChatGPT, with access to chatbot capabilities, image generation through DALL-E 3, and the ability to draft text for emails and documents. It also includes free access to OpenAI’s latest GPT-4 model, something you have to pay for if you’re using ChatGPT.


The Copilot interface on Android. Image: Microsoft

The launch of the Copilot app for Android comes a little over a month after Microsoft rebranded Bing Chat to Copilot. Microsoft originally launched its AI push earlier this year inside its Bing search engine, integrating a ChatGPT-like interface into search results. While that’s still available, Microsoft has dropped the Bing Chat branding and allowed Copilot to be more of a standalone experience that also exists on its own dedicated domain over at copilot.microsoft.com — much like ChatGPT.

Launching mobile apps for Copilot seems like the next logical step of expanding this standalone Copilot experience, particularly as Bing Chat Enterprise was also rebranded to just Copilot. While there’s no sign of an iOS version of Copilot right now, I’d expect it’s right around the corner. Until then, you can always use the Bing app on an iPhone or iPad to access the existing Copilot features.
 

Fast Inference of Mixture-of-Experts Language Models with Offloading​

Artyom Eliseev (Moscow Institute of Physics and Technology; Yandex School of Data Analysis) lavawolfiee@gmail.com · Denis Mazur (Moscow Institute of Physics and Technology; Yandex Researchcore) denismazur8@gmail.com

Abstract​

With the widespread adoption of Large Language Models (LLMs), many deep learning practitioners are looking for strategies to run these models more efficiently. One such strategy is to use sparse Mixture-of-Experts (MoE), a type of model architecture in which only a fraction of the model's layers are active for any given input. This property allows MoE-based language models to generate tokens faster than their "dense" counterparts, but it also increases model size due to having multiple "experts". Unfortunately, this makes state-of-the-art MoE language models difficult to run without high-end GPUs. In this work, we study the problem of running large MoE language models on consumer hardware with limited accelerator memory. We build upon parameter offloading algorithms and propose a novel strategy that accelerates offloading by taking advantage of innate properties of MoE LLMs. Using this strategy, we can run Mixtral-8x7B with mixed quantization on desktop hardware and free-tier Google Colab instances.
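
To make the offloading idea concrete, here is an illustrative sketch, not the paper's implementation, of the core trick: keep only the most recently used experts in accelerator memory and pull the rest from CPU RAM on demand.

```python
# Illustrative sketch, not the paper's code: an LRU cache that keeps a small
# number of MoE experts on the GPU and offloads the rest to CPU memory.
from collections import OrderedDict
import torch.nn as nn

class ExpertCache:
    def __init__(self, experts: list[nn.Module], gpu_budget: int = 2, device: str = "cuda"):
        self.cpu_experts = {i: e.to("cpu") for i, e in enumerate(experts)}
        self.gpu_experts: OrderedDict[int, nn.Module] = OrderedDict()
        self.gpu_budget = gpu_budget
        self.device = device

    def get(self, expert_id: int) -> nn.Module:
        if expert_id in self.gpu_experts:
            self.gpu_experts.move_to_end(expert_id)            # mark as most recently used
            return self.gpu_experts[expert_id]
        if len(self.gpu_experts) >= self.gpu_budget:           # evict least recently used
            evicted_id, evicted = self.gpu_experts.popitem(last=False)
            self.cpu_experts[evicted_id] = evicted.to("cpu")
        expert = self.cpu_experts[expert_id].to(self.device)   # copy weights to GPU
        self.gpu_experts[expert_id] = expert
        return expert
```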



 

Computer Science > Artificial Intelligence​

[Submitted on 17 Dec 2023 (v1), last revised 26 Dec 2023 (this version, v4)]

A Survey of Reasoning with Foundation Models​

Jiankai Sun, Chuanyang Zheng, Enze Xie, Zhengying Liu, Ruihang Chu, Jianing Qiu, Jiaqi Xu, Mingyu Ding, Hongyang Li, Mengzhe Geng, Yue Wu, Wenhai Wang, Junsong Chen, Zhangyue Yin, Xiaozhe Ren, Jie Fu, Junxian He, Wu Yuan, Qi Liu, Xihui Liu, Yu Li, Hao Dong, Yu Cheng, Ming Zhang, Pheng Ann Heng, Jifeng Dai, Ping Luo, Jingdong Wang, Ji-Rong Wen, Xipeng Qiu, Yike Guo, Hui Xiong, Qun Liu, Zhenguo Li

Reasoning, a crucial ability for complex problem-solving, plays a pivotal role in various real-world settings such as negotiation, medical diagnosis, and criminal investigation. It serves as a fundamental methodology in the field of Artificial General Intelligence (AGI). With the ongoing development of foundation models, there is a growing interest in exploring their abilities in reasoning tasks. In this paper, we introduce seminal foundation models proposed or adaptable for reasoning, highlighting the latest advancements in various reasoning tasks, methods, and benchmarks. We then delve into the potential future directions behind the emergence of reasoning abilities within foundation models. We also discuss the relevance of multimodal learning, autonomous agents, and super alignment in the context of reasoning. By discussing these future research directions, we hope to inspire researchers in their exploration of this field, stimulate further advancements in reasoning with foundation models, and contribute to the development of AGI.
Comments: 20 Figures, 160 Pages, 750+ References, Project Page: this https URL
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2312.11562 [cs.AI]
(or arXiv:2312.11562v4 [cs.AI] for this version)

Submission history​

From: Ruihang Chu [view email]

[v1] Sun, 17 Dec 2023 15:16:13 UTC (3,868 KB)
[v2] Wed, 20 Dec 2023 07:25:58 UTC (3,867 KB)
[v3] Thu, 21 Dec 2023 13:21:59 UTC (3,870 KB)
[v4] Tue, 26 Dec 2023 11:31:54 UTC (3,872 KB)

 


Computer Science > Computer Vision and Pattern Recognition​

[Submitted on 21 Dec 2023 (v1), last revised 22 Dec 2023 (this version, v2)]

AppAgent: Multimodal Agents as Smartphone Users​

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, Gang Yu
Recent advancements in large language models (LLMs) have led to the creation of intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications. Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps. Central to our agent's functionality is its innovative learning method. The agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. This process generates a knowledge base that the agent refers to for executing complex tasks across different applications. To demonstrate the practicality of our agent, we conducted extensive testing over 50 tasks in 10 different applications, including social media, email, maps, shopping, and sophisticated image editing tools. The results affirm our agent's proficiency in handling a diverse array of high-level tasks.
Comments: Project Page is this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2312.13771 [cs.CV]
(or arXiv:2312.13771v2 [cs.CV] for this version)

Submission history

From: Yucheng Han [view email]
[v1] Thu, 21 Dec 2023 11:52:45 UTC (10,766 KB)
[v2] Fri, 22 Dec 2023 02:29:17 UTC (10,766 KB)


About

AppAgent: Multimodal Agents as Smartphone Users, an LLM-based multimodal agent framework designed to operate smartphone apps.

Introduction

We introduce a novel LLM-based multimodal agent framework designed to operate smartphone applications.

Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps.

Central to our agent's functionality is its innovative learning method. The agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. This process generates a knowledge base that the agent refers to for executing complex tasks across different applications.
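
For a sense of what a simplified tap/swipe action space can look like in practice, here is a hedged sketch built on the standard adb shell input commands; AppAgent's own helper functions and parameters may differ.

```python
# Sketch of a tap/swipe action space driven through adb. "adb shell input tap/swipe"
# are standard Android debugging commands; the wrapper itself is illustrative and
# not AppAgent's actual code.
import subprocess

def adb_input(*args: str) -> None:
    subprocess.run(["adb", "shell", "input", *args], check=True)

def tap(x: int, y: int) -> None:
    """Tap the screen at pixel coordinates (x, y)."""
    adb_input("tap", str(x), str(y))

def swipe(x1: int, y1: int, x2: int, y2: int, duration_ms: int = 300) -> None:
    """Swipe from (x1, y1) to (x2, y2) over duration_ms milliseconds."""
    adb_input("swipe", str(x1), str(y1), str(x2), str(y2), str(duration_ms))

# Example: swipe up to open the app drawer, then tap an icon.
swipe(540, 1800, 540, 600)
tap(270, 900)
```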

