bnew




Both Sora and Stable Diffusion 3 adopt diffusion transformers, but do we really need a super-large DiT for every sampling step during generation?🧐

No🙅‍♂️. We found that the earliest ~40% of DiT-XL's timesteps can be replaced with a 10x faster DiT-S without any drop in image quality!

Introducing Trajectory Stitching (T-Stitch), a training-free method that complements existing efficient sampling methods by dynamically allocating computation across the denoising steps.

Paper: Code: Project page:
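For intuition, here is a minimal sketch of the idea: run a small denoiser for the first ~40% of the sampling trajectory and hand off to the large one for the rest. The intuition from the post is that early, high-noise steps mostly set coarse global structure, so a cheaper model suffices there. The `dit_s`, `dit_xl`, and `scheduler` objects below are placeholders with an assumed diffusers-style interface, not the actual T-Stitch code.

```python
import torch

def t_stitch_sample(dit_s, dit_xl, scheduler, latents, switch_ratio=0.4):
    """Hypothetical sketch of trajectory stitching: a small DiT handles the
    early, high-noise timesteps and the large DiT takes over for the rest.
    The model and scheduler interfaces are assumptions, not T-Stitch's code."""
    timesteps = scheduler.timesteps
    switch_at = int(len(timesteps) * switch_ratio)

    for i, t in enumerate(timesteps):
        model = dit_s if i < switch_at else dit_xl   # stitch the two trajectories
        with torch.no_grad():
            noise_pred = model(latents, t)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```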





 

bnew





Consolidating Attention Features for Multi-view Image Editing
Published on Feb 22 · Featured in Daily Papers on Feb 22
Authors:
Or Patashnik, Rinon Gal, Daniel Cohen-Or, Jun-Yan Zhu, Fernando De la Torre

Abstract
Large-scale text-to-image models enable a wide range of image editing techniques, using text prompts or even spatial controls. However, applying these editing methods to multi-view images depicting a single scene leads to 3D-inconsistent results. In this work, we focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views. We build on two insights: (1) maintaining consistent features throughout the generative process helps attain consistency in multi-view editing, and (2) the queries in self-attention layers significantly influence the image structure. Hence, we propose to improve the geometric consistency of the edited images by enforcing the consistency of the queries. To do so, we introduce QNeRF, a neural radiance field trained on the internal query features of the edited images. Once trained, QNeRF can render 3D-consistent queries, which are then softly injected back into the self-attention layers during generation, greatly improving multi-view consistency. We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps. We compare our method to a range of existing techniques and demonstrate that it can achieve better multi-view consistency and higher fidelity to the input scene. These advantages allow us to train NeRFs with fewer visual artifacts, that are better aligned with the target geometry.
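As a rough illustration of the core mechanism described in the abstract (not the authors' code), soft query injection into a self-attention layer might look like the sketch below, where `rendered_q` stands for the 3D-consistent queries rendered by QNeRF and `alpha` controls how softly they replace the image's own queries. The shapes and the blending rule are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def self_attention_with_query_injection(x, w_q, w_k, w_v, rendered_q, alpha=0.5):
    """Hypothetical sketch: blend the layer's own queries with QNeRF-rendered,
    3D-consistent queries before computing attention. Not the paper's exact
    formulation."""
    q = x @ w_q            # (tokens, dim) queries from the current image
    k = x @ w_k
    v = x @ w_v

    # Soft injection: interpolate toward the 3D-consistent queries.
    q = (1 - alpha) * q + alpha * rendered_q

    attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```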
 

bnew



TinyLLaVA

A Framework of Small-scale Large Multimodal Models

We present the TinyLLaVA framework, which provides a unified perspective for designing and analyzing small-scale Large Multimodal Models (LMMs). We empirically study the effects of different vision encoders, connection modules, language models, training data, and training recipes. Our extensive experiments show that, with higher-quality data and better training recipes, smaller LMMs can consistently achieve performance on par with bigger LMMs. Under our framework, we train a family of small-scale LMMs. Our best model, TinyLLaVA-3.1B, achieves better overall performance than existing 7B models such as LLaVA-1.5 and Qwen-VL. We hope our findings can serve as baselines for future research on data scaling, training setups, and model selection. Our model weights and code will be made public.
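Conceptually, the framework treats a small LMM as three swappable parts: a vision encoder, a connection module, and a small language model. The sketch below illustrates that composition under assumed interfaces; the class and argument names are illustrative, not TinyLLaVA's actual code.

```python
import torch
import torch.nn as nn

class SmallLMM(nn.Module):
    """Hypothetical sketch of the TinyLLaVA-style recipe: any vision encoder,
    connector, and small LM can be plugged in and studied independently."""

    def __init__(self, vision_encoder, connector, language_model):
        super().__init__()
        self.vision_encoder = vision_encoder    # e.g. a CLIP/SigLIP ViT
        self.connector = connector              # e.g. a small MLP projector
        self.language_model = language_model    # e.g. a ~1-3B parameter LM

    def forward(self, image, text_embeds):
        vision_feats = self.vision_encoder(image)      # (B, N, D_v) patch features
        vision_tokens = self.connector(vision_feats)   # project to the LM's hidden dim
        # Prepend projected image tokens to the text embeddings.
        inputs = torch.cat([vision_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```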
 

bnew




Copilot Evaluation Harness

Evaluating LLM-Guided Software Programming

The integration of Large Language Models (LLMs) into Integrated Development Environments (IDEs) has become a focal point in modern software development. LLMs such as OpenAI GPT-3.5/4 and Code Llama offer the potential to significantly augment developer productivity by serving as intelligent, chat-driven programming assistants. However, using LLMs out of the box is unlikely to be optimal for any given scenario; rather, each system requires the LLM to be tuned to its set of heuristics to ensure the best performance. In this paper, we introduce the Copilot evaluation harness: a set of data and tools for evaluating LLM-guided IDE interactions, covering various programming scenarios and languages. We propose our metrics as a more robust and information-dense evaluation than previous state-of-the-art evaluation systems. We design and compute both static and execution-based success metrics for scenarios encompassing a wide range of developer tasks, including code generation from natural language (generate), documentation generation from code (doc), test case generation (test), bug fixing (fix), and workspace understanding and query resolution (workspace). These success metrics are designed to evaluate the performance of LLMs within a given IDE and its respective parameter space. Our learnings from evaluating three common LLMs using these metrics can inform the development and validation of future scenarios in LLM-guided IDEs.
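As a rough sketch of the difference between the two metric families, the snippet below shows what a static check and an execution-based check could look like for generated code: the static metric only verifies that the output parses, while the execution metric runs the project's test suite after the LLM's output has been applied to the workspace. The function names and the pytest command are assumptions, not the harness's actual API.

```python
import ast
import subprocess

def static_success(generated_code: str) -> bool:
    """Hypothetical static metric: does the generated code at least parse?"""
    try:
        ast.parse(generated_code)
        return True
    except SyntaxError:
        return False

def execution_success(repo_dir: str, test_command: str = "pytest -q") -> bool:
    """Hypothetical execution-based metric: after applying the LLM's edit to
    the workspace, count the scenario as a success iff the test suite passes."""
    result = subprocess.run(
        test_command.split(),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        timeout=600,
    )
    return result.returncode == 0
```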
 

bnew




Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception​



Demo: Hugging Face Spaces | ModelScope

Junyang Wang1, Haiyang Xu2†, Jiabo Ye2, Ming Yan2†,
Weizhou Shen2, Ji Zhang2, Fei Huang2, Jitao Sang1†
{junyangwang, jtsang}@bjtu.edu.cn, {shuofeng.xhy, ym119608}@alibaba-inc.com

1Beijing Jiaotong University 2Alibaba Group
†Corresponding author​

📋Introduction​



  • Pure visual solution, independent of XML and system metadata.
  • Unrestricted operation scope, capable of multi-app operations.
  • Multiple visual perception tools for operation localization.
  • No need for exploration or training; plug and play (a simplified loop sketch follows this list).
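A heavily simplified sketch of such a purely visual agent loop is shown below; every callable is a placeholder supplied by the caller (screen capture, OCR/icon detection, a multimodal LLM for decisions, and an executor such as adb), not Mobile-Agent's actual code.

```python
def run_visual_agent(instruction, capture_screen, localize_elements,
                     decide_next_action, execute_on_device, max_steps=20):
    """Hypothetical sketch of a purely visual mobile-agent loop:
    screenshot -> visual grounding (OCR/icon detection) -> multimodal LLM
    decision -> touch action, with no reliance on XML or system metadata."""
    for _ in range(max_steps):
        screenshot = capture_screen()                      # e.g. adb screencap
        elements = localize_elements(screenshot)           # text and icon locations
        action = decide_next_action(instruction, screenshot, elements)
        if action.get("type") == "done":                   # agent decides it is finished
            return action
        execute_on_device(action)                          # e.g. adb tap / swipe / type
    return {"type": "max_steps_reached"}
```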

📢News​

  • [2.21] 🔥🔥We provide a demo that can upload screenshots from mobile devices. Now you can experience it at Hugging Face and ModelScope.
  • [2.5] 🔥🔥We provide a free API and deploy the entire process for experiencing Mobile Agent, even if you don't have an OpenAI API Key. Check out Quick Start.
  • [1.31] 🔥Our code is available! Welcome to try Mobile-Agent.
  • [1.31] 🔥Human-operated data in Mobile-Eval is in preparation and will be open-sourced soon.
  • [1.30] Our paper is available at LINK.
  • [1.30] Our evaluation results on Mobile-Eval are available.
  • [1.30] The code and Mobile-Eval benchmark are coming soon!
 

bnew



OpenCodeInterpreter

Integrating Code Generation with Execution and Refinement

The introduction of large language models has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems like the GPT-4 Code Interpreter. To address this, we introduce OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. Supported by Code-Feedback, a dataset featuring 68K multi-turn interactions, OpenCodeInterpreter integrates execution and human feedback for dynamic code refinement. Our comprehensive evaluation of OpenCodeInterpreter across key benchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus reveals its exceptional performance. Notably, OpenCodeInterpreter-33B achieves an accuracy of 83.2 (76.4) on the average of HumanEval and MBPP (and their plus versions), closely rivaling GPT-4's 84.2 (76.2), and further rises to 91.6 (84.6) with synthesized human feedback from GPT-4. OpenCodeInterpreter bridges the gap between open-source code generation models and proprietary systems like GPT-4 Code Interpreter.
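The generate-execute-refine loop at the heart of such a system can be sketched roughly as follows; `generate_code` stands in for the model call and its interface is an assumption, not OpenCodeInterpreter's API.

```python
import subprocess
import tempfile

def generate_execute_refine(task, generate_code, max_rounds=3):
    """Hypothetical sketch: generate code, run it, and feed execution errors
    back to the model until it succeeds or the round budget is spent."""
    feedback = ""
    code = ""
    for _ in range(max_rounds):
        code = generate_code(task, feedback)          # model call (assumed interface)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(["python", path], capture_output=True,
                                text=True, timeout=30)
        if result.returncode == 0:
            return code, result.stdout                # execution succeeded
        feedback = result.stderr                      # refine using the error trace
    return code, feedback
```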




 

bnew



Introducing Phind-70B – closing the code quality gap with GPT-4 Turbo while running 4x faster​



We're excited to announce Phind-70B, our largest and most performant model to date. Running at up to 80 tokens per second, Phind-70B gives high-quality answers on technical topics without giving users time to make a cup of coffee while they wait. We think it offers the best overall user experience for developers among state-of-the-art models.

Phind-70B is based on the CodeLlama-70B model and is fine-tuned on an additional 50 billion tokens, yielding significant improvements. It also supports a context window of 32K tokens.

Phind-70B scores 82.3% on HumanEval, beating the latest GPT-4 Turbo (gpt-4-0125-preview) score of 81.1% in our evaluation. On Meta's CRUXEval dataset, Phind-70B scores 59% compared to GPT-4's reported 62% on the output prediction benchmark. However, neither of these public datasets fully captures how our users use Phind for real-world workloads. We find that Phind-70B is in the same quality realm as GPT-4 Turbo for code generation and exceeds it on some tasks. Phind-70B is also less "lazy" than GPT-4 Turbo and doesn't hesitate to generate detailed code examples.

[Charts: Phind-70B vs. GPT-4 Turbo on HumanEval and CRUXEval]

Phind-70B is significantly faster than GPT-4 Turbo, running at 80+ tokens per second compared to GPT-4 Turbo's ~20 tokens per second. We achieve this by running NVIDIA's TensorRT-LLM library on H100 GPUs, and we're working on optimizations to further increase Phind-70B's inference speed.

Phind-70B is available today to try for free and without a login. You can get higher limits by subscribing to Phind Pro.

We love the open-source community and will be releasing the weights for the latest Phind-34B model in the coming weeks. We intend to release the weights for Phind-70B in time as well.

We'd like to thank our cloud partners, SF Compute and AWS, for helping us get the infrastructure right for training and serving Phind-70B. We'd also like to thank our partners at Meta and NVIDIA for their support.

Fun fact: We melted an H100 during Phind-70B's training!








 

bnew







MetaVoice-1B​

Playground | Open In Colab

MetaVoice-1B is a 1.2B parameter base model trained on 100K hours of speech for TTS (text-to-speech). It has been built with the following priorities:

  • Emotional speech rhythm and tone in English.
  • Zero-shot cloning for American & British voices, with 30s reference audio.
  • Support for (cross-lingual) voice cloning with finetuning.
    • We have had success with as little as 1 minute of training data for Indian speakers.
  • Synthesis of arbitrary-length text (a chunking sketch follows below).
We're releasing MetaVoice-1B under the Apache 2.0 license, so it can be used without restrictions.

 

bnew






[Submitted on 25 Jan 2024 (v1), last revised 28 Jan 2024 (this version, v2)]

WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models​

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, Dong Yu
The advancement of large language models (LLMs) leads to a new era marked by the development of autonomous applications in the real world, which drives innovation in the creation of advanced web-based agents. Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we propose a new evaluation protocol for web agents to address the challenges of automatic evaluation of open-ended web agent tasks, leveraging the robust multimodal comprehension capabilities of GPT-4V. We create a new benchmark by gathering real-world tasks from 15 widely used websites to evaluate our agents. We show that WebVoyager achieves a 55.7% task success rate, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the exceptional capability of WebVoyager in practical applications. We found that our proposed automatic evaluation achieves 85.3% agreement with human judgment, paving the way for further development of web agents in a real-world setting.
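The GPT-4V-based automatic evaluation described in the abstract can be approximated with a judging prompt over the task, the agent's final answer, and its screenshots. The sketch below uses the OpenAI Python client's vision-capable chat endpoint; the prompt wording and model choice are assumptions, not the paper's exact protocol.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_trajectory(task: str, answer: str, screenshot_paths: list[str]) -> str:
    """Hypothetical sketch of GPT-4V-as-judge: show the task, the agent's final
    answer, and its last screenshots, and ask for SUCCESS or FAILURE."""
    content = [{
        "type": "text",
        "text": (f"Task: {task}\nAgent answer: {answer}\n"
                 "Based on the screenshots, did the agent complete the task? "
                 "Reply with SUCCESS or FAILURE and a one-sentence reason."),
    }]
    for path in screenshot_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",          # vision-capable model (assumed choice)
        messages=[{"role": "user", "content": content}],
        max_tokens=100,
    )
    return response.choices[0].message.content
```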
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2401.13919 [cs.CL]
(or arXiv:2401.13919v2 [cs.CL] for this version)

Submission history

From: Hongliang He [view email]
[v1] Thu, 25 Jan 2024 03:33:18 UTC (18,186 KB)
[v2] Sun, 28 Jan 2024 07:57:21 UTC (18,186 KB)
 