bnew




Both Sora and Stable Diffusion 3 adopt diffusion transformers, but do we really need a super-large DiT for every sampling step during generation?🧐

No🙅‍♂️. We found that the earliest ~40% of DiT-XL's timesteps can be replaced with a 10x faster DiT-S without any drop in image quality!

Introducing Trajectory Stitching (T-Stitch), a training-free method that complements existing efficient sampling methods by dynamically allocating computation across the denoising steps.

Paper: Code: Project page:
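For intuition, here is a minimal sketch of the idea: run a small denoiser for the first ~40% of the sampling trajectory and hand off to the large one for the rest. The intuition from the post is that early, high-noise steps mostly set coarse global structure, so a cheaper model suffices there. The `dit_s`, `dit_xl`, and `scheduler` objects below are placeholders with an assumed diffusers-style interface, not the actual T-Stitch code.

```python
import torch

def t_stitch_sample(dit_s, dit_xl, scheduler, latents, switch_ratio=0.4):
    """Hypothetical sketch of trajectory stitching: a small DiT handles the
    early, high-noise timesteps and the large DiT takes over for the rest.
    The model and scheduler interfaces are assumptions, not T-Stitch's code."""
    timesteps = scheduler.timesteps
    switch_at = int(len(timesteps) * switch_ratio)

    for i, t in enumerate(timesteps):
        model = dit_s if i < switch_at else dit_xl   # stitch the two trajectories
        with torch.no_grad():
            noise_pred = model(latents, t)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```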





 

bnew





Consolidating Attention Features for Multi-view Image Editing
Published on Feb 22 · Featured in Daily Papers on Feb 22
Authors:
Or Patashnik, Rinon Gal, Daniel Cohen-Or, Jun-Yan Zhu, Fernando De la Torre

Abstract
Large-scale text-to-image models enable a wide range of image editing techniques, using text prompts or even spatial controls. However, applying these editing methods to multi-view images depicting a single scene leads to 3D-inconsistent results. In this work, we focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views. We build on two insights: (1) maintaining consistent features throughout the generative process helps attain consistency in multi-view editing, and (2) the queries in self-attention layers significantly influence the image structure. Hence, we propose to improve the geometric consistency of the edited images by enforcing the consistency of the queries. To do so, we introduce QNeRF, a neural radiance field trained on the internal query features of the edited images. Once trained, QNeRF can render 3D-consistent queries, which are then softly injected back into the self-attention layers during generation, greatly improving multi-view consistency. We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps. We compare our method to a range of existing techniques and demonstrate that it can achieve better multi-view consistency and higher fidelity to the input scene. These advantages allow us to train NeRFs with fewer visual artifacts, that are better aligned with the target geometry.
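As a rough illustration of the core mechanism described in the abstract (not the authors' code), soft query injection into a self-attention layer might look like the sketch below, where `rendered_q` stands for the 3D-consistent queries rendered by QNeRF and `alpha` controls how softly they replace the image's own queries. The shapes and the blending rule are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def self_attention_with_query_injection(x, w_q, w_k, w_v, rendered_q, alpha=0.5):
    """Hypothetical sketch: blend the layer's own queries with QNeRF-rendered,
    3D-consistent queries before computing attention. Not the paper's exact
    formulation."""
    q = x @ w_q            # (tokens, dim) queries from the current image
    k = x @ w_k
    v = x @ w_v

    # Soft injection: interpolate toward the 3D-consistent queries.
    q = (1 - alpha) * q + alpha * rendered_q

    attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```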
 

bnew



TinyLLaVA

A Framework of Small-scale Large Multimodal Models

We present the TinyLLaVA framework, which provides a unified perspective for designing and analyzing small-scale Large Multimodal Models (LMMs). We empirically study the effects of different vision encoders, connection modules, language models, training data, and training recipes. Our extensive experiments show that, with higher-quality data and better training recipes, smaller LMMs can consistently achieve performance on par with bigger LMMs. Under our framework, we train a family of small-scale LMMs. Our best model, TinyLLaVA-3.1B, achieves better overall performance than existing 7B models such as LLaVA-1.5 and Qwen-VL. We hope our findings can serve as baselines for future research on data scaling, training setups, and model selection. Our model weights and code will be made public.
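Conceptually, the framework treats a small LMM as three swappable parts: a vision encoder, a connection module, and a small language model. The sketch below illustrates that composition under assumed interfaces; the class and argument names are illustrative, not TinyLLaVA's actual code.

```python
import torch
import torch.nn as nn

class SmallLMM(nn.Module):
    """Hypothetical sketch of the TinyLLaVA-style recipe: any vision encoder,
    connector, and small LM can be plugged in and studied independently."""

    def __init__(self, vision_encoder, connector, language_model):
        super().__init__()
        self.vision_encoder = vision_encoder    # e.g. a CLIP/SigLIP ViT
        self.connector = connector              # e.g. a small MLP projector
        self.language_model = language_model    # e.g. a ~1-3B parameter LM

    def forward(self, image, text_embeds):
        vision_feats = self.vision_encoder(image)      # (B, N, D_v) patch features
        vision_tokens = self.connector(vision_feats)   # project to the LM's hidden dim
        # Prepend projected image tokens to the text embeddings.
        inputs = torch.cat([vision_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```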
 

bnew




Copilot Evaluation Harness

Evaluating LLM-Guided Software Programming

The integration of Large Language Models (LLMs) into Integrated Development Environments (IDEs) has become a focal point in modern software development. LLMs such as OpenAI GPT-3.5/4 and Code Llama offer the potential to significantly augment developer productivity by serving as intelligent, chat-driven programming assistants. However, using LLMs out of the box is unlikely to be optimal for any given scenario; rather, each system requires the LLM to be tuned to its set of heuristics to ensure the best performance. In this paper, we introduce the Copilot evaluation harness: a set of data and tools for evaluating LLM-guided IDE interactions, covering various programming scenarios and languages. We propose our metrics as a more robust and information-dense evaluation than previous state-of-the-art evaluation systems. We design and compute both static and execution-based success metrics for scenarios encompassing a wide range of developer tasks, including code generation from natural language (generate), documentation generation from code (doc), test case generation (test), bug fixing (fix), and workspace understanding and query resolution (workspace). These success metrics are designed to evaluate the performance of LLMs within a given IDE and its respective parameter space. Our learnings from evaluating three common LLMs using these metrics can inform the development and validation of future scenarios in LLM-guided IDEs.
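As a rough sketch of the difference between the two metric families, the snippet below shows what a static check and an execution-based check could look like for generated code: the static metric only verifies that the output parses, while the execution metric runs the project's test suite after the LLM's output has been applied to the workspace. The function names and the pytest command are assumptions, not the harness's actual API.

```python
import ast
import subprocess

def static_success(generated_code: str) -> bool:
    """Hypothetical static metric: does the generated code at least parse?"""
    try:
        ast.parse(generated_code)
        return True
    except SyntaxError:
        return False

def execution_success(repo_dir: str, test_command: str = "pytest -q") -> bool:
    """Hypothetical execution-based metric: after applying the LLM's edit to
    the workspace, count the scenario as a success iff the test suite passes."""
    result = subprocess.run(
        test_command.split(),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        timeout=600,
    )
    return result.returncode == 0
```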
 

bnew




Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception​



Demo: Hugging Face Spaces | ModelScope

Junyang Wang1, Haiyang Xu2†, Jiabo Ye2, Ming Yan2†,
Weizhou Shen2, Ji Zhang2, Fei Huang2, Jitao Sang1†
{junyangwang, jtsang}@bjtu.edu.cn, {shuofeng.xhy, ym119608}@alibaba-inc.com

1Beijing Jiaotong University 2Alibaba Group
†Corresponding author​

📋Introduction​



  • Pure visual solution, independent of XML and system metadata.
  • Unrestricted operation scope, capable of multi-app operations.
  • Multiple visual perception tools for operation localization.
  • No need for exploration or training; plug and play (a simplified loop sketch follows this list).
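A heavily simplified sketch of such a purely visual agent loop is shown below; every callable is a placeholder supplied by the caller (screen capture, OCR/icon detection, a multimodal LLM for decisions, and an executor such as adb), not Mobile-Agent's actual code.

```python
def run_visual_agent(instruction, capture_screen, localize_elements,
                     decide_next_action, execute_on_device, max_steps=20):
    """Hypothetical sketch of a purely visual mobile-agent loop:
    screenshot -> visual grounding (OCR/icon detection) -> multimodal LLM
    decision -> touch action, with no reliance on XML or system metadata."""
    for _ in range(max_steps):
        screenshot = capture_screen()                      # e.g. adb screencap
        elements = localize_elements(screenshot)           # text and icon locations
        action = decide_next_action(instruction, screenshot, elements)
        if action.get("type") == "done":                   # agent decides it is finished
            return action
        execute_on_device(action)                          # e.g. adb tap / swipe / type
    return {"type": "max_steps_reached"}
```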

📢News​

  • [2.21] 🔥🔥We provide a demo that can upload screenshots from mobile devices. Now you can experience it at Hugging Face and ModelScope.
  • [2.5] 🔥🔥We provide a free API and deploy the entire process for experiencing Mobile Agent, even if you don't have an OpenAI API Key. Check out Quick Start.
  • [1.31] 🔥Our code is available! Welcome to try Mobile-Agent.
  • [1.31] 🔥Human-operated data in Mobile-Eval is in preparation and will be open-sourced soon.
  • [1.30] Our paper is available at LINK.
  • [1.30] Our evaluation results on Mobile-Eval are available.
  • [1.30] The code and Mobile-Eval benchmark are coming soon!
 

bnew



OpenCodeInterpreter

Integrating Code Generation with Execution and Refinement

The introduction of large language models has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems like the GPT-4 Code Interpreter. To address this, we introduce OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. Supported by Code-Feedback, a dataset featuring 68K multi-turn interactions, OpenCodeInterpreter integrates execution and human feedback for dynamic code refinement. Our comprehensive evaluation of OpenCodeInterpreter across key benchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus reveals its exceptional performance. Notably, OpenCodeInterpreter-33B achieves an accuracy of 83.2 (76.4) on the average of HumanEval and MBPP (and their plus versions), closely rivaling GPT-4's 84.2 (76.2), and further rises to 91.6 (84.6) with synthesized human feedback from GPT-4. OpenCodeInterpreter bridges the gap between open-source code generation models and proprietary systems like GPT-4 Code Interpreter.
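The generate-execute-refine loop at the heart of such a system can be sketched roughly as follows; `generate_code` stands in for the model call and its interface is an assumption, not OpenCodeInterpreter's API.

```python
import subprocess
import tempfile

def generate_execute_refine(task, generate_code, max_rounds=3):
    """Hypothetical sketch: generate code, run it, and feed execution errors
    back to the model until it succeeds or the round budget is spent."""
    feedback = ""
    code = ""
    for _ in range(max_rounds):
        code = generate_code(task, feedback)          # model call (assumed interface)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(["python", path], capture_output=True,
                                text=True, timeout=30)
        if result.returncode == 0:
            return code, result.stdout                # execution succeeded
        feedback = result.stderr                      # refine using the error trace
    return code, feedback
```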




 

bnew



Introducing Phind-70B – closing the code quality gap with GPT-4 Turbo while running 4x faster​



We're excited to announce Phind-70B, our largest and most performant model to date. Running at up to 80 tokens per second, Phind-70B gives high-quality answers on technical topics without giving users time to make a cup of coffee while they wait. We think it offers the best overall user experience for developers among state-of-the-art models.

Phind-70B is based on the CodeLlama-70B model and is fine-tuned on an additional 50 billion tokens, yielding significant improvements. It also supports a context window of 32K tokens.

Phind-70B scores 82.3% on HumanEval, beating the latest GPT-4 Turbo (gpt-4-0125-preview) score of 81.1% in our evaluation. On Meta's CRUXEval dataset, Phind-70B scores 59% compared to GPT-4's reported 62% on the output prediction benchmark. However, neither of these public datasets fully captures how our users use Phind for real-world workloads. We find that Phind-70B is in the same quality realm as GPT-4 Turbo for code generation and exceeds it on some tasks. Phind-70B is also less "lazy" than GPT-4 Turbo and doesn't hesitate to generate detailed code examples.

[Charts: Phind-70B vs. GPT-4 Turbo on HumanEval and CRUXEval]

Phind-70B is significantly faster than GPT-4 Turbo, running at 80+ tokens per second compared to GPT-4 Turbo's ~20 tokens per second. We achieve this by running NVIDIA's TensorRT-LLM library on H100 GPUs, and we're working on optimizations to further increase Phind-70B's inference speed.

Phind-70B is available today to try for free and without a login. You can get higher limits by subscribing to Phind Pro.

We love the open-source community and will be releasing the weights for the latest Phind-34B model in the coming weeks. We intend to release the weights for Phind-70B in time as well.

We'd like to thank our cloud partners, SF Compute and AWS, for helping us get the infrastructure right for training and serving Phind-70B. We'd also like to thank our partners at Meta and NVIDIA for their support.

Fun fact: We melted an H100 during Phind-70B's training!








 

bnew







MetaVoice-1B​

Playground | Open In Colab

MetaVoice-1B is a 1.2B parameter base model trained on 100K hours of speech for TTS (text-to-speech). It has been built with the following priorities:

  • Emotional speech rhythm and tone in English.
  • Zero-shot cloning for American & British voices, with 30s reference audio.
  • Support for (cross-lingual) voice cloning with finetuning.
    • We have had success with as little as 1 minute of training data for Indian speakers.
  • Synthesis of arbitrary-length text (a chunking sketch follows below).
We're releasing MetaVoice-1B under the Apache 2.0 license, so it can be used without restrictions.

 

bnew






[Submitted on 25 Jan 2024 (v1), last revised 28 Jan 2024 (this version, v2)]

WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models​

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, Dong Yu
The advancement of large language models (LLMs) leads to a new era marked by the development of autonomous applications in the real world, which drives innovation in the creation of advanced web-based agents. Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we propose a new evaluation protocol for web agents to address the challenges of automatic evaluation of open-ended web agent tasks, leveraging the robust multimodal comprehension capabilities of GPT-4V. We create a new benchmark by gathering real-world tasks from 15 widely used websites to evaluate our agents. We show that WebVoyager achieves a 55.7% task success rate, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the exceptional capability of WebVoyager in practical applications. We found that our proposed automatic evaluation achieves 85.3% agreement with human judgment, paving the way for further development of web agents in a real-world setting.
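The GPT-4V-based automatic evaluation described in the abstract can be approximated with a judging prompt over the task, the agent's final answer, and its screenshots. The sketch below uses the OpenAI Python client's vision-capable chat endpoint; the prompt wording and model choice are assumptions, not the paper's exact protocol.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_trajectory(task: str, answer: str, screenshot_paths: list[str]) -> str:
    """Hypothetical sketch of GPT-4V-as-judge: show the task, the agent's final
    answer, and its last screenshots, and ask for SUCCESS or FAILURE."""
    content = [{
        "type": "text",
        "text": (f"Task: {task}\nAgent answer: {answer}\n"
                 "Based on the screenshots, did the agent complete the task? "
                 "Reply with SUCCESS or FAILURE and a one-sentence reason."),
    }]
    for path in screenshot_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",          # vision-capable model (assumed choice)
        messages=[{"role": "user", "content": content}],
        max_tokens=100,
    )
    return response.choices[0].message.content
```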
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2401.13919 [cs.CL]
(or arXiv:2401.13919v2 [cs.CL] for this version)

Submission history

From: Hongliang He [view email]
[v1] Thu, 25 Jan 2024 03:33:18 UTC (18,186 KB)
[v2] Sun, 28 Jan 2024 07:57:21 UTC (18,186 KB)
 