Artificial Intelligence (AI) has made incredible progress recently. On the one hand, advanced foundation models like ChatGPT offer powerful conversation, in-context learning, and code generation abilities across a broad range of open-domain tasks. They can also generate high-level solution outlines for domain-specific tasks based on the common-sense knowledge they have acquired. However, they still struggle with some specialized tasks, either because they saw too little domain-specific data during pre-training or because their neural network computations often introduce errors on tasks that require exact execution. On the other hand, many existing models and systems (symbolic or neural) already perform some domain-specific tasks very well. However, because their implementations and working mechanisms differ, they are not easily accessible to or compatible with foundation models. Therefore, there is a clear and pressing need for a mechanism that can leverage foundation models to propose task solution outlines and then automatically match some of the sub-tasks in those outlines to off-the-shelf models and systems with specialized functionality. Inspired by this, we introduce TaskMatrix.AI as a new AI ecosystem that connects foundation models with millions of APIs for task completion. Unlike most previous work, which aimed to improve a single AI model, TaskMatrix.AI focuses on using existing foundation models (as a brain-like central system) and the APIs of other AI models and systems (as sub-task solvers) to accomplish diverse tasks in both the digital and physical domains. As a position paper, we present our vision of how to build such an ecosystem, explain each key component, and use case studies to illustrate both the feasibility of this vision and the main challenges we need to address next.
3 Application Scenarios
In this section, we present examples of how TaskMatrix.AI can be applied in different scenarios. We show how TaskMatrix.AI can assist in creating AI-powered content in Sections 3.1 and 3.2. We demonstrate how TaskMatrix.AI can facilitate office automation and cloud service usage in Sections 3.3 and 3.4. We illustrate how TaskMatrix.AI can perform tasks in the physical world by interacting with robots and IoT devices in Section 3.5. All these cases have been implemented in practice and will be supported by the online system of TaskMatrix.AI, which will be released soon. We also explore more potential applications in Section 3.6.
3.1 Visual Task Completion
TaskMatrix.AI enables the user to interact with AI by 1) sending and receiving not only language
but also images; 2) posing complex visual questions or visual editing instructions that require
multiple AI models to collaborate over multiple steps; and 3) providing feedback and asking for
corrected results. We design a series of prompts to inject the visual model information into ChatGPT,
considering models with multiple inputs/outputs and models that require visual feedback. More details
are described in Wu et al. (2023). We demonstrate this with an example in Figure 2. The APIs related
to this include:
• Image Editing This includes removing or replacing objects in an image, or
changing the style of an image. Removing objects from an image involves using image
editing tools or algorithms to get rid of unwanted elements. Replacing
objects involves swapping out an element in an image for another one that
is more suitable. Finally, changing an image using text involves using machine learning
algorithms to generate an image based on a textual description.
• Image Question Answering This refers to the process of using machine learning algorithms
to answer questions about an image, often by analyzing the contents of the image and
providing relevant information. This can be useful in situations where the image contains
important information that needs to be extracted.
• Image Captioning This refers to the process of using machine learning algorithms to
generate textual descriptions of an image, often by analyzing the contents of the image and
providing relevant information.
• Text-to-Image This refers to the process of generating an image from a textual description,
often using machine learning algorithms that can generate realistic images based on textual
input.
• Image-to-Sketch/Depth/Hed/Line This refers to the process of converting an image to a
sketch, depth map, HED (holistically-nested edge detection) edge map, or line drawing, often
using image processing techniques or computer algorithms.
• Sketch/Depth/Hed/Line-to-Image This refers to the process of generating an image from
a sketch, depth map, HED (holistically-nested edge detection) edge map, or line drawing.
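To make the idea of chaining several of the visual APIs listed above more concrete, the following is a minimal sketch in Python. It is not the TaskMatrix.AI or Visual ChatGPT implementation: the `VisualAPI` class, the `REGISTRY` dictionary, the `execute_plan` function, and the three stub functions are hypothetical names introduced only for illustration, and the stubs return descriptive strings instead of calling real models. The plan is also hard-coded here, whereas in the system described above the foundation model would propose it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class VisualAPI:
    """Hypothetical description of one visual API (an assumption, not the paper's schema)."""
    name: str
    inputs: Tuple[str, ...]   # e.g. ("image",) or ("image", "text")
    output: str               # "image" or "text"
    run: Callable[..., str]


# Stub implementations: they only build strings so the example runs without any model.
def _image_to_sketch(image: str) -> str:
    return f"sketch({image})"


def _sketch_to_image(image: str, text: str) -> str:
    return f"image_from({image}, '{text}')"


def _image_captioning(image: str) -> str:
    return f"caption of {image}"


REGISTRY: Dict[str, VisualAPI] = {
    api.name: api
    for api in [
        VisualAPI("image_to_sketch", ("image",), "image", _image_to_sketch),
        VisualAPI("sketch_to_image", ("image", "text"), "image", _sketch_to_image),
        VisualAPI("image_captioning", ("image",), "text", _image_captioning),
    ]
}


def execute_plan(plan: List[dict], initial_image: str) -> List[Tuple[str, str]]:
    """Run the steps in order; an image output becomes the next step's image input."""
    current = initial_image
    trace = []
    for step in plan:
        api = REGISTRY[step["api"]]
        args = [current]
        if "text" in api.inputs:
            args.append(step["text"])
        result = api.run(*args)
        trace.append((api.name, result))
        if api.output == "image":
            current = result  # text outputs (e.g. captions) do not replace the image
    return trace


# A fixed three-step plan standing in for one proposed by the foundation model.
plan = [
    {"api": "image_to_sketch"},
    {"api": "sketch_to_image", "text": "a watercolor house"},
    {"api": "image_captioning"},
]
trace = execute_plan(plan, "house.png")
for name, result in trace:
    print(f"{name}: {result}")
```

Recording each API's input and output types is one simple way to handle "models of multiple inputs/outputs": the executor can decide which arguments to pass and whether a result should feed the next step. A real system would pass image files or handles rather than strings.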