bnew

01.AI just released Yi-VL-34B on Hugging Face

Yi Visual Language (Yi-VL) model is the open-source, multimodal version of the Yi Large Language Model (LLM) series, enabling content comprehension, recognition, and multi-round conversations about images.

huggingface.co/01-ai/Yi-VL-6…

huggingface.co/01-ai/Yi-VL-3…

Yi-VL ranks first among all existing open-source models in the latest benchmarks, including MMMU in English and CMMMU in Chinese.
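For quick local experimentation, here is a minimal sketch of pulling the released checkpoints with huggingface_hub; the repo ids follow the links above, and actual loading and generation should go through the LLaVA-style code on the model card rather than a generic transformers class.

# Minimal sketch: download a released Yi-VL checkpoint from Hugging Face.
# Only the public snapshot_download API is assumed; inference itself uses
# the LLaVA-style code that 01.AI ships with the model card.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="01-ai/Yi-VL-6B",        # or "01-ai/Yi-VL-34B"
    local_dir="./yi-vl-6b",          # where the checkpoint files land
)
print("checkpoint downloaded to", local_dir)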
 

bnew

Adobe presents ActAnywhere

Subject-Aware Video Background Generation

paper page: Paper page - ActAnywhere: Subject-Aware Video Background Generation

We introduce ActAnywhere, a generative model that automates this process, which traditionally requires tedious manual effort. Our model leverages the power of large-scale video diffusion models and is specifically tailored for this task. ActAnywhere takes a sequence of foreground subject segmentations as input and an image that describes the desired scene as the condition, and produces a coherent video with realistic foreground-background interactions while adhering to the condition frame. We train our model on a large-scale dataset of human-scene interaction videos. Extensive evaluations demonstrate the superior performance of our model, which significantly outperforms baselines. Moreover, we show that ActAnywhere generalizes to diverse out-of-distribution samples, including non-human subjects.
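The abstract boils down to a simple interface: a per-frame foreground segmentation sequence plus one condition image in, a composited video out. The sketch below only illustrates those shapes; the model call, class name, and dimensions are illustrative assumptions, not the authors' released code.

# Illustrative input/output shapes for the setup described above.
import torch

T, H, W = 16, 256, 256                    # frames, height, width

fg_frames = torch.rand(1, T, 3, H, W)     # masked foreground subject, per frame
fg_masks  = torch.rand(1, T, 1, H, W)     # binary segmentation masks
condition = torch.rand(1, 3, H, W)        # single image describing the desired scene

# Hypothetical call into a video-diffusion-based generator:
# video = ActAnywhereModel(...)(fg_frames, fg_masks, condition)  # -> (1, T, 3, H, W)
print(fg_frames.shape, fg_masks.shape, condition.shape)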
 

bnew

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

paper page: Paper page - Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads


The inference process in Large Language Models (LLMs) is often limited due to the absence of parallelism in the auto-regressive decoding process, resulting in most operations being restricted by the memory bandwidth of accelerators. While methods such as speculative decoding have been suggested to address this issue, their implementation is impeded by the challenges associated with acquiring and maintaining a separate draft model.

In this paper, we present Medusa, an efficient method that augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. Using a tree-based attention mechanism, Medusa constructs multiple candidate continuations and verifies them simultaneously in each decoding step. By leveraging parallel processing, Medusa introduces only minimal overhead in terms of single-step latency while substantially reducing the number of decoding steps required.

We present two levels of fine-tuning procedures for Medusa to meet the needs of different use cases:

- Medusa-1: Medusa is directly fine-tuned on top of a frozen backbone LLM, enabling lossless inference acceleration.
- Medusa-2: Medusa is fine-tuned together with the backbone LLM, enabling better prediction accuracy of Medusa heads and higher speedup but needing a special training recipe that preserves the backbone model's capabilities.

Moreover, we propose several extensions that improve or expand the utility of Medusa, including a self-distillation to handle situations where no training data is available and a typical acceptance scheme to boost the acceptance rate while maintaining generation quality. We evaluate Medusa on models of various sizes and training procedures. Our experiments demonstrate that Medusa-1 can achieve over 2.2x speedup without compromising generation quality, while Medusa-2 further improves the speedup to 2.3-3.6x.
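To make the decoding loop concrete, here is a heavily simplified, single-path sketch of the idea: extra heads propose the next few tokens from one hidden state, and a single verification pass keeps the longest prefix the base model agrees with. Real Medusa verifies a whole tree of candidates with tree attention; all names below (model.lm_head, medusa_heads) are illustrative assumptions, not the released API.

# Simplified Medusa-style step: propose K extra tokens, verify, accept a prefix.
import torch

def medusa_step(model, medusa_heads, input_ids):
    """One decoding step with speculative extra heads (greedy, single path)."""
    hidden = model(input_ids).last_hidden_state[:, -1]               # last-position hidden state
    next_token = model.lm_head(hidden).argmax(-1)                    # ordinary next token
    proposals = [head(hidden).argmax(-1) for head in medusa_heads]   # K speculative tokens

    candidate = torch.cat(
        [input_ids, next_token[:, None], torch.stack(proposals, dim=1)], dim=1
    )
    # One verification pass over the whole candidate sequence.
    logits = model.lm_head(model(candidate).last_hidden_state)

    accepted = [next_token]
    for k, prop in enumerate(proposals):
        # Base model's own prediction for the position this proposal occupies.
        pred = logits[:, input_ids.shape[1] + k].argmax(-1)
        if not torch.equal(pred, prop):
            break
        accepted.append(prop)
    return torch.cat([input_ids, torch.stack(accepted, dim=1)], dim=1)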
 

bnew

Overview

Human beings possess the capability to multiplex a mélange of multisensory cues while actively exploring and interacting with the 3D world. Current multi-modal large language models, however, passively absorb sensory data as inputs, lacking the capacity to actively interact with the objects in the 3D environment and dynamically collect their multisensory information. To usher in the study of this area, we propose MultiPLY, a multisensory embodied large language model that can incorporate multisensory interactive data, including visual, audio, tactile, and thermal information, into large language models, thereby establishing the correlation among words, actions, and perceptions. MultiPLY can perform a diverse set of multisensory embodied tasks, including multisensory question answering, embodied question answering, task decomposition, object retrieval, and tool use.
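As a rough illustration of what incorporating multisensory data into an LLM can look like, the sketch below projects each modality's features into the language model's embedding space so they can be spliced into the token stream alongside words and actions; the encoders and dimensions are assumptions for illustration, not the MultiPLY architecture.

# Illustrative sketch: map per-modality observations to LLM-space "tokens".
import torch
import torch.nn as nn

d_model = 4096                                   # assumed LLM embedding width

encoders = nn.ModuleDict({
    "visual":  nn.Linear(1024, d_model),
    "audio":   nn.Linear(128,  d_model),
    "tactile": nn.Linear(64,   d_model),
    "thermal": nn.Linear(16,   d_model),
})

def sensory_tokens(observations):
    # One embedding "token" per modality, ready to interleave with word tokens.
    return torch.stack([encoders[name](feat) for name, feat in observations.items()])

obs = {
    "visual":  torch.rand(1024),
    "audio":   torch.rand(128),
    "tactile": torch.rand(64),
    "thermal": torch.rand(16),
}
print(sensory_tokens(obs).shape)                 # torch.Size([4, 4096])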
 

bnew

Exciting news! @intern_lm 7/20B models are now live on the @huggingface Open LLM Leaderboard!

🔍 Highlights:

- 200K context length for base/chat models.
- 20B model is on par with the performance of Yi-34B.
- 7B model is the best in the <= 13B range.
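If you want to try them, here is a minimal loading sketch following the usual transformers pattern; the repo id and the trust_remote_code flag follow the standard InternLM model-card convention and should be double-checked against the actual card.

# Minimal sketch: load an InternLM2 chat model the standard transformers way.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "internlm/internlm2-chat-7b"              # assumed repo id; a 20B variant is also published
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, device_map="auto")

inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))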

 