bnew

Veteran
Joined
Nov 1, 2015
Messages
58,213
Reputation
8,625
Daps
161,890



Adapting Large Language Models via Reading Comprehension​

This repo contains the model, code and data for our paper Adapting Large Language Models via Reading Comprehension

We explore continued pre-training on domain-specific corpora for large language models. While this approach enriches LLMs with domain knowledge, it significantly hurts their prompting ability for question answering. Inspired by human learning via reading comprehension, we propose a simple method to transform large-scale pre-training corpora into reading comprehension texts, consistently improving prompting performance across tasks in biomedicine, finance, and law domains. Our 7B model competes with much larger domain-specific models like BloombergGPT-50B. Moreover, our domain-specific reading comprehension texts enhance model performance even on general benchmarks, indicating potential for developing a general LLM across more domains.

Domain-specific LLMs​

Our domain-specific models are now available on Hugging Face: biomedicine-LLM, finance-LLM, and law-LLM. The performance of AdaptLLM compared to other domain-specific LLMs is:


Domain-specific Tasks​

To easily reproduce our results, we have uploaded the filled-in zero/few-shot input instructions and output completions of each domain-specific task: biomedicine-tasks, finance-tasks, and law-tasks.
 

bnew





Computer Science > Artificial Intelligence​

[Submitted on 22 Dec 2023]

TACO: Topics in Algorithmic COde generation dataset​

Rongao Li (1 and 2), Jie Fu (1), Bo-Wen Zhang (1), Tao Huang (2), Zhihong Sun (2), Chen Lyu (2), Guang Liu (1), Zhi Jin (3), Ge Li (3) ((1) Beijing Academy of Artificial Intelligence, (2) School of Information Science and Engineering, Shandong Normal University, China, (3) Key Lab of HCST (PKU), MOE, SCS, Peking University, China)
We introduce TACO, an open-source, large-scale code generation dataset, with a focus on the topics of algorithms, designed to provide a more challenging training dataset and evaluation benchmark in the field of code generation models. TACO includes competition-level programming questions that are more challenging, to enhance or evaluate problem understanding and reasoning abilities in real-world programming scenarios. There are 25,433 and 1,000 coding problems in the training and test sets, respectively, as well as up to 1.55 million diverse solution answers. Moreover, each TACO problem includes several fine-grained labels such as task topics, algorithms, programming skills, and difficulty levels, providing a more precise reference for the training and evaluation of code generation models. The dataset and evaluation scripts are available on Hugging Face Hub (this https URL) and Github (this https URL).
Subjects:Artificial Intelligence (cs.AI)
Cite as:arXiv:2312.14852 [cs.AI]
(or arXiv:2312.14852v1 [cs.AI] for this version)

Submission history​

From: Bo-Wen Zhang [view email]
[v1] Fri, 22 Dec 2023 17:25:42 UTC (600 KB)


 

bnew








Generative Multimodal Models are In-Context Learners​

Quan Sun1*, Yufeng Cui1*, Xiaosong Zhang1*, Fan Zhang1*, Qiying Yu2,1*, Zhengxiong Luo1, Yueze Wang1, Yongming Rao1, Jingjing Liu2, Tiejun Huang1,3, Xinlong Wang1†
1Beijing Academy of Artificial Intelligence 2Tsinghua University 3Peking University
*equal contribution †project lead
arXiv Code Demo 🤗HF Demo 🤗HF Model

Abstract​

The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions), is what current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-context learning capabilities of large multimodal models can be significantly enhanced by effective scaling-up. We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences with a unified autoregressive objective. Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning, such as visual prompting and object-grounded generation. The model sets a new record on multiple multimodal understanding tasks in few-shot settings. When instruction-tuned to follow specific instructions, Emu2 further achieves new state-of-the-art on challenging tasks such as question answering benchmarks for large multimodal models and open-ended subject-driven generation. These achievements demonstrate that Emu2 can serve as a base model and general-purpose interface for a wide range of multimodal tasks. Code and models are publicly available to facilitate future research.

Video​



A strong multimodal few-shot learner

[Figure: comparison_fewshot]

An impressive multimodal generalist

[Figure: Radar]

A skilled painter

[Figure: gen_metrics. Zero-shot subject-driven generation]

Multimodal in-context learning

[Figure: multi-modal-incontext-learning]

Strong multimodal understanding

[Figures: hexogon; guide robot; damage car; sample A and B]

Generate image from any prompt sequence

[Figures: generate_from_any_prompt_sequence1; generate_from_any_prompt_sequence2; generate_from_any_prompt_sequence3]

Generate video from any prompt sequence

[Figure: video_generation]




A.I generated explanation:

Sure, let’s break down this abstract into simpler terms and provide some examples:
  1. Multimodal tasks in context: These are tasks that involve multiple types of data (like text and images) and are specified by the context itself, meaning a few demonstrations or a simple instruction. For example, after seeing two photos each paired with a caption in a particular style, a person can caption a third photo in the same style. This is something humans do easily but is challenging for machines.
  2. In-context learning capabilities of large multimodal models: This means the ability of a large AI model to pick up a new task from the examples or instructions placed in its input, without any retraining. For instance, if the prompt contains a few image-and-answer pairs followed by a new image, the model infers the pattern and answers for the new image.
  3. Emu2: This is the name of the new AI model introduced in the paper. It’s a large model with 37 billion parameters, which means it has a lot of capacity to learn from data.
  4. Unified autoregressive objective: This is a fancy way of saying that Emu2 learns to predict the next piece of data (like the next word in a sentence) based on all the data it has seen so far.
  5. Visual prompting and object-grounded generation: These are examples of tasks that Emu2 can do. Visual prompting typically means marking up an image (for example, circling an object) so that the model answers about the marked region, while object-grounded generation means generating an image in which specified objects appear at specified locations.
  6. Few-shot settings: This means that Emu2 can learn to do new tasks with only a few examples. For instance, if you show it a few examples of cats and then ask it to identify cats in other images, it can do this effectively.
  7. Instruction-tuned: This means that Emu2 can be adjusted to follow specific instructions, like answering questions or generating text on a specific topic.
  8. Code and models are publicly available: This means that the authors have shared their work publicly, so other researchers can use and build upon it.
I hope this helps! Let me know if you have any other questions.
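
To make item 4 concrete, the generic next-token (autoregressive) objective can be written as below. This is the standard textbook form, not a formula taken from the Emu2 paper: the symbols x_t (the t-th element of a sequence), T (sequence length), and θ (model parameters) are assumptions for illustration, and the paper's exact loss, especially for visual elements, may differ.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

Minimizing this loss pushes the model to assign high probability to each element given everything that came before it.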
 

bnew


Computer Science > Artificial Intelligence​

[Submitted on 8 Dec 2023]

Language Models, Agent Models, and World Models: The LAW for Machine Reasoning and Planning​

Zhiting Hu, Tianmin Shu
Despite their tremendous success in many applications, large language models often fall short of consistent reasoning and planning in various (language, embodied, and social) scenarios, due to inherent limitations in their inference, learning, and modeling capabilities. In this position paper, we present a new perspective of machine reasoning, LAW, that connects the concepts of Language models, Agent models, and World models, for more robust and versatile reasoning capabilities. In particular, we propose that world and agent models are a better abstraction of reasoning, that introduces the crucial elements of deliberate human-like reasoning, including beliefs about the world and other agents, anticipation of consequences, goals/rewards, and strategic planning. Crucially, language models in LAW serve as a backend to implement the system or its elements and hence provide the computational power and adaptability. We review the recent studies that have made relevant progress and discuss future research directions towards operationalizing the LAW framework.
Comments:Position paper. Accompanying NeurIPS2023 Tutorial: this https URL
Subjects:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Cite as:arXiv:2312.05230 [cs.AI]
(or arXiv:2312.05230v1 [cs.AI] for this version)

Submission history​

From: Zhiting Hu [view email]
[v1] Fri, 8 Dec 2023 18:25:22 UTC (981 KB)




 

bnew



tVAcwKkIH5vkfzqgqHeHi.png

**The first 7B model to achieve 8.8 on MT-Bench, performing well at Humanities, Coding, and Writing.**



xDAN-AI • Discord · Twitter · Huggingface



Image



########## First turn ##########



model                  turn  score    size
gpt-4                  1     8.95625  -
xDAN-L1-Chat-RL-v1     1     8.87500  7b
xDAN-L2-Chat-RL-v2     1     8.78750  30b
claude-v1              1     8.15000  -
gpt-3.5-turbo          1     8.07500  20b
vicuna-33b-v1.3        1     7.45625  33b
wizardlm-30b           1     7.13125  30b
oasst-sft-7-llama-30b  1     7.10625  30b
Llama-2-70b-chat       1     6.98750  70b
########## Second turn ##########



model                  turn  score     size
gpt-4                  2     9.025000  -
xDAN-L2-Chat-RL-v2     2     8.087500  30b
xDAN-L1-Chat-RL-v1     2     7.825000  7b
gpt-3.5-turbo          2     7.812500  20b
claude-v1              2     7.650000  -
wizardlm-30b           2     6.887500  30b
vicuna-33b-v1.3        2     6.787500  33b
Llama-2-70b-chat       2     6.725000  70b
########## Average turn ##########

model                  score     size
gpt-4                  8.990625  -
xDAN-L2-Chat-RL-v2     8.437500  30b
xDAN-L1-Chat-RL-v1     8.350000  7b
gpt-3.5-turbo          7.943750  20b
claude-v1              7.900000  -
vicuna-33b-v1.3        7.121875  33b
wizardlm-30b           7.009375  30b
Llama-2-70b-chat       6.856250  70b
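
The average-turn figures appear to be the plain mean of each model's first-turn and second-turn scores. The sketch below checks that reading against the numbers posted above; the dictionary and function names are made up for illustration and are not part of FastChat or the xDAN release.

```python
# Per-turn MT-Bench scores copied from the first-turn and second-turn tables.
first_turn = {
    "gpt-4": 8.95625,
    "xDAN-L1-Chat-RL-v1": 8.875,
    "xDAN-L2-Chat-RL-v2": 8.7875,
    "claude-v1": 8.15,
    "gpt-3.5-turbo": 8.075,
    "vicuna-33b-v1.3": 7.45625,
    "wizardlm-30b": 7.13125,
    "oasst-sft-7-llama-30b": 7.10625,  # no second-turn score was posted
    "Llama-2-70b-chat": 6.9875,
}
second_turn = {
    "gpt-4": 9.025,
    "xDAN-L2-Chat-RL-v2": 8.0875,
    "xDAN-L1-Chat-RL-v1": 7.825,
    "gpt-3.5-turbo": 7.8125,
    "claude-v1": 7.65,
    "wizardlm-30b": 6.8875,
    "vicuna-33b-v1.3": 6.7875,
    "Llama-2-70b-chat": 6.725,
}


def average_scores(first, second):
    """Mean of the two turn scores, only for models present in both tables."""
    return {m: (first[m] + second[m]) / 2 for m in first if m in second}


averages = average_scores(first_turn, second_turn)

# Print models from best to worst average, mirroring the posted table layout.
for model, score in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<22} {score:.6f}")
```

Models with only one posted turn (here oasst-sft-7-llama-30b) are skipped instead of being averaged over a single value.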
Prompt Template(Alpaca)



Instruction:"You are a helpful assistant named DAN.You are an expert in worldly knowledge, skilled in employing a probing questioning strategy, carefully considering each step before providing answers."​

{Question}



Response:​
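
A minimal sketch of filling in this template in code. It assumes the common Alpaca-style "### Instruction:" and "### Response:" markers; the post shows only the bare "Instruction:" and "Response:" labels, so the markers, the exact whitespace, and the `build_prompt` helper name are all assumptions to be checked against the model card.

```python
# System instruction copied verbatim from the post (including its spacing).
SYSTEM_INSTRUCTION = (
    "You are a helpful assistant named DAN.You are an expert in worldly "
    "knowledge, skilled in employing a probing questioning strategy, "
    "carefully considering each step before providing answers."
)


def build_prompt(question: str) -> str:
    """Fill the Alpaca-style template with a user question.

    Assumption: the "### " prefixes follow the usual Alpaca layout; they are
    not visible in the post itself.
    """
    return (
        "### Instruction:\n"
        f'"{SYSTEM_INSTRUCTION}"\n\n'
        f"{question}\n\n"
        "### Response:\n"
    )


prompt = build_prompt("Explain what MT-Bench measures.")
print(prompt)
```

The model's reply would then be generated as the continuation after the final "### Response:" line.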

Created By xDAN-AI at 2023-12-15​

Eval by FastChat: GitHub - lm-sys/FastChat: An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

Check: https://www.xdan.ai

 

bnew



Computer Science > Artificial Intelligence​

[Submitted on 13 Dec 2023]

PromptBench: A Unified Library for Evaluation of Large Language Models​

Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, Xing Xie
The evaluation of large language models (LLMs) is crucial to assess their performance and mitigate potential security risks. In this paper, we introduce PromptBench, a unified library to evaluate LLMs. It consists of several key components that are easily used and extended by researchers: prompt construction, prompt engineering, dataset and model loading, adversarial prompt attack, dynamic evaluation protocols, and analysis tools. PromptBench is designed to be an open, general, and flexible codebase for research purposes that can facilitate original study in creating new benchmarks, deploying downstream applications, and designing new evaluation protocols. The code is available at: this https URL and will be continuously supported.
Comments:An extension to PromptBench (arXiv:2306.04528) for unified evaluation of LLMs using the same name; code: this https URL
Subjects:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:arXiv:2312.07910 [cs.AI]
(or arXiv:2312.07910v1 [cs.AI] for this version)

Submission history

From: Jindong Wang [view email]
[v1] Wed, 13 Dec 2023 05:58:34 UTC (288 KB)






PromptBench: A Unified Library for Evaluating and Understanding Large Language Models.
Paper · Documentation · Leaderboard · More papers


News and Updates​

  • [16/12/2023] Add support for Gemini, Mistral, Mixtral, Baichuan, Yi models.
  • [15/12/2023] Add detailed instructions for users to add new modules (models, datasets, etc.) examples/add_new_modules.md.
  • [05/12/2023] Published promptbench 0.0.1.

Introduction​

PromptBench is a PyTorch-based Python package for evaluating Large Language Models (LLMs). It provides user-friendly APIs for researchers to run these evaluations. Check the technical report: [2312.07910] PromptBench: A Unified Library for Evaluation of Large Language Models.

Code Structure

What does promptbench currently provide?​

  1. Quick model performance assessment: We offer a user-friendly interface that allows for quick model building, dataset loading, and evaluation of model performance.
  2. Prompt Engineering: We implemented several prompt engineering methods. For example: Few-shot Chain-of-Thought [1], Emotion Prompt [2], Expert Prompting [3] and so on.
  3. Evaluating adversarial prompts: promptbench integrated prompt attacks [4], enabling researchers to simulate black-box adversarial prompt attacks on models and evaluate their robustness (see details here).
  4. Dynamic evaluation to mitigate potential test data contamination: we integrated the dynamic evaluation framework DyVal [5], which generates evaluation samples on-the-fly with controlled complexity.
 

bnew





LongAnimateDiff​

Sapir Weissbuch, Naomi Ken Korem, Daniel Shalem, Yoav HaCohen | Lightricks Research

Hugging Face Spaces

We are pleased to release the "LongAnimateDiff" model, which has been trained to generate videos with a variable frame count, ranging from 16 to 64 frames. This model is compatible with the original AnimateDiff model.

We release two models:

  1. The LongAnimateDiff model, capable of generating videos with frame counts ranging from 16 to 64. You can download the weights from either Google Drive or HuggingFace. For optimal results, we recommend using a motion scale of 1.28.
  2. A specialized model designed to generate 32-frame videos. This model typically produces higher quality videos compared to the LongAnimateDiff model supporting 16-64 frames. Please download the weights from Google Drive or HuggingFace. For better results, use a motion scale of 1.15.

Results​

  • Walking astronaut on the moon
  • A corgi dog skiing on skis down a snowy mountain
  • A pizza spinning inside a wood fired pizza oven
  • A young man is dancing in a Paris nice street
  • A ship in the ocean
  • A hamster is riding a bicycle on the road
  • A drone is flying in the sky above a forest
  • A drone is flying in the sky above the mountains
  • A swan swims in the lake
  • A ginger woman in space future
  • Photo portrait of old lady with glasses
  • Small fish swimming in an aquarium



 

bnew


Computer Science > Computation and Language​

[Submitted on 26 Dec 2023]

Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4​

Sondos Mahmoud Bsharat, Aidar Myrzakhan, Zhiqiang Shen
This paper introduces 26 guiding principles designed to streamline the process of querying and prompting large language models. Our goal is to simplify the underlying concepts of formulating questions for various scales of large language models, examining their abilities, and enhancing user comprehension on the behaviors of different scales of large language models when feeding into different prompts. Extensive experiments are conducted on LLaMA-1/2 (7B, 13B and 70B), GPT-3.5/4 to verify the effectiveness of the proposed principles on instructions and prompts design. We hope that this work provides a better guide for researchers working on the prompting of large language models. Project page is available at this https URL.
Comments:Github at: this https URL
Subjects:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:arXiv:2312.16171 [cs.CL]
(or arXiv:2312.16171v1 [cs.CL] for this version)

Submission history

From: Zhiqiang Shen [view email]
[v1] Tue, 26 Dec 2023 18:59:33 UTC (1,127 KB)




wwITjKb.png




EjyGo3y.png
 

bnew


Microsoft Azure AI Services Launches Personal Voice for Customized AI Audio​


Microsoft's Personal Voice lets users generate their own AI-based voice using a 60-second audio recording.

By Luke Jones
November 17, 2023 3:41 pm CET


Microsoft-Azure-Peronal-Voice-696x375.jpg.webp


Microsoft has introduced Personal Voice, a new feature allowing individuals and businesses to generate AI-based voices using their own vocal samples. Announced during the Ignite 2023 developers conference, the Azure AI Speech service's latest addition is set to revolutionize the way AI voices are created, providing opportunities in gaming, language dubbing, and personalized voice assistants.

Simplifying Voice Synthesis


The new feature builds upon Microsoft's existing custom neural voice capabilities, streamlining the process for creating a synthetic voice that closely resembles a specific person's speech. Compared to traditional methods that may be complex or expensive, Personal Voice enables users to synthesize a voice that mirrors their own with just a 60-second audio recording.

This technological advancement is seen as particularly transformative for the entertainment industry, where it could be employed to dub an actor's voice across various languages, thereby maintaining a consistent vocal presence. In gaming, players might imbue their characters with a voice that reflects their actual speech, offering a more immersive experience.



Microsoft uses similar AI capabilities in the Skype TruVoice feature. Skype now supports real-time translation for video calls and the translation will use your personal voice (TruVoice). This means the person who hears the translation of you speaking will hear it in your actual voice.


Ethical Considerations and Availability


In light of potential misuses, such as creating deceptive audio clips, Microsoft emphasizes the importance of ethical conduct. Users must consent via a recorded statement, acknowledging they are aware that a digital version of their voice will be created and utilized. Adherence to Microsoft's established guidelines and conduct code is mandatory for all users.

Initially, Personal Voice will be accessible within limited regions, including West Europe, East US, and South East Asia. The company prepares for a public preview slated to become available on December 1. Microsoft's initiative arguably represents a step forward in naturalistic AI interactions by melding cutting-edge AI with the uniqueness of individual human voices. With meticulous guidelines and responsible usage, Personal Voice may soon redefine synthetic voice applications across various sectors.





 

bnew


Microsoft Copilot is now available as a ChatGPT-like app on Android​


You no longer need the Bing mobile app to access Copilot on Android devices.​


By Tom Warren, a senior editor covering Microsoft, PC gaming, console, and tech. He founded WinRumors, a site dedicated to Microsoft news, before joining The Verge in 2012.

Dec 26, 2023, 9:39 AM EST

Press_Image_FINAL_16x9_4.jpg
Illustration of the Copilot logo

Image: Microsoft

Microsoft has quietly launched a dedicated Copilot app for Android. The new app is available in the Google Play Store, offering access to Microsoft’s AI-powered Copilot without the need for the Bing mobile app. Spotted by Neowin, Copilot on Android has been available for nearly a week, but an iOS version isn’t available just yet.

Microsoft’s Copilot app on Android is very similar to ChatGPT, with access to chatbot capabilities, image generation through DALL-E 3, and the ability to draft text for emails and documents. It also includes free access to OpenAI’s latest GPT-4 model, something you have to pay for if you’re using ChatGPT.


The Copilot interface on Android


Image: Microsoft

The launch of the Copilot app for Android comes a little over a month after Microsoft rebranded Bing Chat to Copilot. Microsoft originally launched its AI push earlier this year inside its Bing search engine, integrating a ChatGPT-like interface into search results. While that’s still available, Microsoft has dropped the Bing Chat branding and allowed Copilot to be more of a standalone experience that also exists on its own dedicated domain over at copilot.microsoft.com — much like ChatGPT.

Launching mobile apps for Copilot seems like the next logical step of expanding this standalone Copilot experience, particularly as Bing Chat Enterprise was also rebranded to just Copilot. While there’s no sign of an iOS version of Copilot right now, I’d expect it’s right around the corner. Until then, you can always use the Bing app on an iPhone or iPad to access the existing Copilot features.
 

bnew



Fast Inference of Mixture-of-Experts Language Models with Offloading​

Artyom Eliseev, Moscow Institute of Physics and Technology, Yandex School of Data Analysis, lavawolfiee@gmail.com
Denis Mazur, Moscow Institute of Physics and Technology, Yandex Researchcore, denismazur8@gmail.com

Abstract​

With the widespread adoption of Large Language Models (LLMs), many deep learning practitioners are looking for strategies of running these models more efficiently. One such strategy is to use sparse Mixture-of-Experts (MoE), a type of model architecture where only a fraction of model layers are active for any given input. This property allows MoE-based language models to generate tokens faster than their “dense” counterparts, but it also increases model size due to having multiple “experts”. Unfortunately, this makes state-of-the-art MoE language models difficult to run without high-end GPUs. In this work, we study the problem of running large MoE language models on consumer hardware with limited accelerator memory. We build upon parameter offloading algorithms and propose a novel strategy that accelerates offloading by taking advantage of innate properties of MoE LLMs. Using this strategy, we can run Mixtral-8x7B with mixed quantization on desktop hardware and free-tier Google Colab instances.
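
To illustrate the "only a fraction is active" property the abstract relies on, here is a toy top-k routing sketch in plain Python. It is not the authors' implementation and contains no offloading logic: the function names, the hard-coded gate scores, and the scaling "experts" are all invented for illustration. In a real model the gate scores come from a learned router applied to the input.

```python
import math


def top_k_route(gate_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, experts, gate_logits, k=2):
    """Weighted sum of the outputs of only the k selected experts."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(gate_logits, k):
        y = experts[idx](x)  # the other experts are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Eight toy "experts": expert i simply scales its input by (i + 1).
experts = [lambda v, s=s: [s * vi for vi in v] for s in range(1, 9)]

# Hypothetical router scores for one token.
gate_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

routes = top_k_route(gate_logits, k=2)
output = moe_layer([1.0, 2.0], experts, gate_logits, k=2)
print(routes)
print(output)
```

Because only the selected experts are evaluated for a given token, only their weights need to be resident in accelerator memory at that moment, which is what makes moving the remaining experts off the GPU plausible in the first place.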



 

bnew






Computer Science > Artificial Intelligence​

[Submitted on 17 Dec 2023 (v1), last revised 26 Dec 2023 (this version, v4)]

A Survey of Reasoning with Foundation Models​

Jiankai Sun, Chuanyang Zheng, Enze Xie, Zhengying Liu, Ruihang Chu, Jianing Qiu, Jiaqi Xu, Mingyu Ding, Hongyang Li, Mengzhe Geng, Yue Wu, Wenhai Wang, Junsong Chen, Zhangyue Yin, Xiaozhe Ren, Jie Fu, Junxian He, Wu Yuan, Qi Liu, Xihui Liu, Yu Li, Hao Dong, Yu Cheng, Ming Zhang, Pheng Ann Heng, Jifeng Dai, Ping Luo, Jingdong Wang, Ji-Rong Wen, Xipeng Qiu, Yike Guo, Hui Xiong, Qun Liu, Zhenguo Li

Reasoning, a crucial ability for complex problem-solving, plays a pivotal role in various real-world settings such as negotiation, medical diagnosis, and criminal investigation. It serves as a fundamental methodology in the field of Artificial General Intelligence (AGI). With the ongoing development of foundation models, there is a growing interest in exploring their abilities in reasoning tasks. In this paper, we introduce seminal foundation models proposed or adaptable for reasoning, highlighting the latest advancements in various reasoning tasks, methods, and benchmarks. We then delve into the potential future directions behind the emergence of reasoning abilities within foundation models. We also discuss the relevance of multimodal learning, autonomous agents, and super alignment in the context of reasoning. By discussing these future research directions, we hope to inspire researchers in their exploration of this field, stimulate further advancements in reasoning with foundation models, and contribute to the development of AGI.
Comments:20 Figures, 160 Pages, 750+ References, Project Page this https URL
Subjects:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:arXiv:2312.11562 [cs.AI]
(or arXiv:2312.11562v4 [cs.AI] for this version)

Submission history​

From: Ruihang Chu [view email]

[v1] Sun, 17 Dec 2023 15:16:13 UTC (3,868 KB)
[v2] Wed, 20 Dec 2023 07:25:58 UTC (3,867 KB)
[v3] Thu, 21 Dec 2023 13:21:59 UTC (3,870 KB)
[v4] Tue, 26 Dec 2023 11:31:54 UTC (3,872 KB)

 

bnew




Computer Science > Computer Vision and Pattern Recognition​

[Submitted on 21 Dec 2023 (v1), last revised 22 Dec 2023 (this version, v2)]

AppAgent: Multimodal Agents as Smartphone Users​

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, Gang Yu
Recent advancements in large language models (LLMs) have led to the creation of intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications. Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps. Central to our agent's functionality is its innovative learning method. The agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. This process generates a knowledge base that the agent refers to for executing complex tasks across different applications. To demonstrate the practicality of our agent, we conducted extensive testing over 50 tasks in 10 different applications, including social media, email, maps, shopping, and sophisticated image editing tools. The results affirm our agent's proficiency in handling a diverse array of high-level tasks.
Comments:Project Page is this https URL
Subjects:Computer Vision and Pattern Recognition (cs.CV)
Cite as:arXiv:2312.13771 [cs.CV]
(or arXiv:2312.13771v2 [cs.CV] for this version)

Submission history

From: Yucheng Han [view email]
[v1] Thu, 21 Dec 2023 11:52:45 UTC (10,766 KB)
[v2] Fri, 22 Dec 2023 02:29:17 UTC (10,766 KB)


About

AppAgent: Multimodal Agents as Smartphone Users, an LLM-based multimodal agent framework designed to operate smartphone apps.



h8kz4D7.jpeg

 