Computer Science > Computer Vision and Pattern Recognition​

[Submitted on 18 Jan 2024]

The Manga Whisperer: Automatically Generating Transcriptions for Comics​

Ragav Sachdeva, Andrew Zisserman
In the past few decades, Japanese comics, commonly referred to as Manga, have transcended both cultural and linguistic boundaries to become a true worldwide sensation. Yet, the inherent reliance on visual cues and illustration within manga renders it largely inaccessible to individuals with visual impairments. In this work, we seek to address this substantial barrier, with the aim of ensuring that manga can be appreciated and actively engaged with by everyone. Specifically, we tackle the problem of diarisation, i.e. generating a transcription of who said what and when, in a fully automatic way.
To this end, we make the following contributions: (1) we present a unified model, Magi, that is able to (a) detect panels, text boxes and character boxes, (b) cluster characters by identity (without knowing the number of clusters a priori), and (c) associate dialogues to their speakers; (2) we propose a novel approach that is able to sort the detected text boxes in their reading order and generate a dialogue transcript; (3) we annotate an evaluation benchmark for this task using publicly available [English] manga pages. The code, evaluation datasets and the pre-trained model can be found at: this https URL.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2401.10224 [cs.CV]
(or arXiv:2401.10224v1 [cs.CV] for this version)

Submission history​

From: Ragav Sachdeva [view email]
[v1] Thu, 18 Jan 2024 18:59:09 UTC (34,898 KB)
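
As a rough illustration of how the diarisation stages described above fit together, here is a conceptual sketch; every helper function in it is a hypothetical placeholder, not the released Magi API.

```python
# Conceptual sketch of the diarisation pipeline described in the abstract.
# All helper functions (detect_elements, cluster_characters, sort_reading_order,
# associate_speaker, run_ocr) are hypothetical placeholders, not the Magi API.

def transcribe_page(page_image):
    # (a) detect panels, text boxes and character boxes
    panels, text_boxes, char_boxes = detect_elements(page_image)
    # (b) cluster characters by identity, number of clusters unknown a priori
    char_clusters = cluster_characters(char_boxes)
    # contribution (2): sort text boxes into reading order using the panel layout
    ordered_texts = sort_reading_order(panels, text_boxes)

    transcript = []
    for box in ordered_texts:
        # (c) associate each dialogue with the character cluster that said it
        speaker = associate_speaker(box, char_boxes, char_clusters)
        dialogue = run_ocr(page_image, box)
        transcript.append(f"<Character {speaker}>: {dialogue}")
    return transcript
```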





About​

Generate a transcript for your favourite Manga: Detect manga characters, text blocks and panels. Order panels. Cluster characters. Match texts to their speakers. Perform OCR.

The Manga Whisperer: Automatically Generating Transcriptions for Comics

[arXiv]

Ragav Sachdeva, Andrew Zisserman


Computer Science > Computation and Language​

[Submitted on 18 Jan 2024]

Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs​

Haritz Puerto, Martin Tutek, Somak Aditya, Xiaodan Zhu, Iryna Gurevych

Reasoning is a fundamental component for achieving language understanding. Among the multiple types of reasoning, conditional reasoning, the ability to draw different conclusions depending on some condition, has been understudied in large language models (LLMs). Recent prompting methods, such as chain of thought, have significantly improved LLMs on reasoning tasks. Nevertheless, there is still little understanding of what triggers reasoning abilities in LLMs. We hypothesize that code prompts can trigger conditional reasoning in LLMs trained on text and code. We propose a chain of prompts that transforms a natural language problem into code and prompts the LLM with the generated code. Our experiments find that code prompts exhibit a performance boost between 2.6 and 7.7 points on GPT 3.5 across multiple datasets requiring conditional reasoning. We then conduct experiments to discover how code prompts elicit conditional reasoning abilities and through which features. We observe that prompts need to contain natural language text accompanied by high-quality code that closely represents the semantics of the instance text. Furthermore, we show that code prompts are more efficient, requiring fewer demonstrations, and that they trigger superior state tracking of variables or key entities.
Comments: Code, prompt templates, prompts, and outputs are publicly available at this https URL
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2401.10065 [cs.CL]
(or arXiv:2401.10065v1 [cs.CL] for this version)

Submission history​

From: Haritz Puerto [view email]

[v1] Thu, 18 Jan 2024 15:32:24 UTC (9,152 KB)
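
As a rough illustration of the idea (not the authors' actual prompt templates), a conditional-reasoning instance might be recast as a code prompt like the one below, where the original natural-language text is kept as comments next to code that makes the conditions and key entities explicit:

```python
# Hypothetical example of a "code prompt"; the paper's real templates may differ.
# Original question, kept as natural language alongside the code:
#   "Applicants qualify for the visa if they hold a job offer or have family in
#    the country. Alex has no job offer but has a sibling there. Does Alex qualify?"

# State tracking of the key entities mentioned in the text
has_job_offer = False
has_family_in_country = True  # a sibling lives in the country

# The conditional rule, expressed directly as code
qualifies_for_visa = has_job_offer or has_family_in_country

# The LLM is prompted to reason over / continue from this representation
print(qualifies_for_visa)  # -> True
```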


 

About​

Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.

This project proves that it's possible to split the workload of LLMs across multiple devices and achieve a significant speedup. Distributed Llama allows you to run huge LLMs in-house. The project uses TCP sockets to synchronize the state. You can easily configure your AI cluster by using a home router.
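
A minimal sketch of the underlying idea, with the "workers" simulated in-process via NumPy: each layer's weight matrix is sliced across workers, each worker computes a partial result on its slice, and the root node combines the partials. In Distributed Llama the slices live on separate devices and the combine step is synchronized over TCP sockets.

```python
# Conceptual sketch of splitting an LLM layer's workload and RAM across workers.
# Workers are simulated in-process here; Distributed Llama runs them on separate
# machines and synchronizes the partial results over TCP sockets.
import numpy as np

n_workers = 4
d_model, d_ff = 512, 2048

x = np.random.randn(1, d_model)        # activation for one token
W = np.random.randn(d_model, d_ff)     # one feed-forward weight matrix

# Each worker stores only a column slice of W (1/n_workers of the RAM) ...
shards = np.split(W, n_workers, axis=1)

# ... computes its partial output locally ...
partials = [x @ shard for shard in shards]

# ... and the root node concatenates the partials (the synchronization step).
y = np.concatenate(partials, axis=1)

assert np.allclose(y, x @ W)           # same result, memory split across workers
```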
 

TikTok presents Depth Anything

Unleashing the Power of Large-Scale Unlabeled Data

paper page: https://huggingface.co/papers/2401.10891

demo: Depth Anything - a Hugging Face Space by LiheYoung

Depth Anything is trained on 1.5M labeled images and 62M+ unlabeled images jointly, providing the most capable Monocular Depth Estimation (MDE) foundation models with the following features:

zero-shot relative depth estimation, better than MiDaS v3.1 (BEiTL-512)

zero-shot metric depth estimation, better than ZoeDepth

optimal in-domain fine-tuning and evaluation on NYUv2 and KITTI






Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data​

Published on Jan 19 · Featured in Daily Papers on Jan 21
Authors: Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao


Abstract​

This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation model dealing with any images under any circumstances. To this end, we scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M), which significantly enlarges the data coverage and thus is able to reduce the generalization error. We investigate two simple yet effective strategies that make data scaling-up promising. First, a more challenging optimization target is created by leveraging data augmentation tools. It compels the model to actively seek extra visual knowledge and acquire robust representations. Second, an auxiliary supervision is developed to enforce the model to inherit rich semantic priors from pre-trained encoders. We evaluate its zero-shot capabilities extensively, including six public datasets and randomly captured photos. It demonstrates impressive generalization ability. Further, through fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs are set. Our better depth model also results in a better depth-conditioned ControlNet. Our models are released at GitHub - LiheYoung/Depth-Anything: Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data.
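
For reference, the released checkpoints can be run for zero-shot relative depth through the Hugging Face transformers depth-estimation pipeline. This is a minimal sketch, and the checkpoint id below is an assumption, so check the repository/Hub for the exact published model names.

```python
# Minimal zero-shot relative depth estimation sketch using the transformers
# depth-estimation pipeline. The checkpoint id is an assumption; see the
# Depth-Anything repo / Hugging Face Hub for the exact model names.
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline(
    "depth-estimation",
    model="LiheYoung/depth-anything-small-hf",  # assumed checkpoint id
)

image = Image.open("example.jpg")       # any RGB photo
result = depth_estimator(image)

# result["depth"] is a PIL image of per-pixel relative depth
result["depth"].save("example_depth.png")
```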


 

About​

InstantID: Zero-shot Identity-Preserving Generation in Seconds 🔥

InstantID is a new state-of-the-art tuning-free method to achieve ID-Preserving generation with only a single image, supporting various downstream tasks.

 

01.AI just released Yi-VL-34B on Hugging Face

Yi Visual Language (Yi-VL) model is the open-source, multimodal version of the Yi Large Language Model (LLM) series, enabling content comprehension, recognition, and multi-round conversations about images.

huggingface.co/01-ai/Yi-VL-6…

huggingface.co/01-ai/Yi-VL-3…

Yi-VL-34B ranks first among all existing open-source models in the latest benchmarks, including MMMU in English and CMMMU in Chinese.
 


Adobe presents ActAnywhere

Subject-Aware Video Background Generation

paper page: Paper page - ActAnywhere: Subject-Aware Video Background Generation

We introduce ActAnywhere, a generative model that automates this process, which traditionally requires tedious manual effort. Our model leverages the power of large-scale video diffusion models and is specifically tailored for this task. ActAnywhere takes a sequence of foreground subject segmentations as input and an image that describes the desired scene as the condition, and produces a coherent video with realistic foreground-background interactions while adhering to the condition frame. We train our model on a large-scale dataset of human-scene interaction videos. Extensive evaluations demonstrate the superior performance of our model, significantly outperforming baselines. Moreover, we show that ActAnywhere generalizes to diverse out-of-distribution samples, including non-human subjects.
 


Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

paper page: Paper page - Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

The inference process in Large Language Models (LLMs) is often limited due to the absence of parallelism in the auto-regressive decoding process, resulting in most operations being restricted by the memory bandwidth of accelerators. While methods such as speculative decoding have been suggested to address this issue, their implementation is impeded by the challenges associated with acquiring and maintaining a separate draft model. In this paper, we present Medusa, an efficient method that augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. Using a tree-based attention mechanism, Medusa constructs multiple candidate continuations and verifies them simultaneously in each decoding step. By leveraging parallel processing, Medusa introduces only minimal overhead in terms of single-step latency while substantially reducing the number of decoding steps required.

We present two levels of fine-tuning procedures for Medusa to meet the needs of different use cases:

- Medusa-1: Medusa is directly fine-tuned on top of a frozen backbone LLM, enabling lossless inference acceleration.
- Medusa-2: Medusa is fine-tuned together with the backbone LLM, enabling better prediction accuracy of the Medusa heads and higher speedup, but needing a special training recipe that preserves the backbone model's capabilities.

Moreover, we propose several extensions that improve or expand the utility of Medusa, including a self-distillation to handle situations where no training data is available and a typical acceptance scheme to boost the acceptance rate while maintaining generation quality. We evaluate Medusa on models of various sizes and training procedures. Our experiments demonstrate that Medusa-1 can achieve over 2.2x speedup without compromising generation quality, while Medusa-2 further improves the speedup to 2.3-3.6x.
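
A heavily simplified sketch of the decoding loop the abstract describes, using greedy acceptance of a single drafted continuation instead of the paper's tree-based attention over many candidates; base_model_logits, medusa_head_logits and argmax are hypothetical stand-ins for the real model calls.

```python
# Conceptual sketch of Medusa-style decoding (single candidate, greedy acceptance).
# base_model_logits(), medusa_head_logits() and argmax() are hypothetical
# placeholders for the real backbone / head forward passes.

def medusa_decode_step(tokens, K=4):
    # 1) Draft: Medusa head k predicts the token k positions ahead, in parallel.
    draft = [argmax(medusa_head_logits(k, tokens)) for k in range(1, K + 1)]

    # 2) Verify: a single backbone forward pass over tokens + draft gives the
    #    backbone's own next-token prediction at every drafted position.
    logits = base_model_logits(tokens + draft)
    backbone_next = [argmax(logits[len(tokens) - 1 + i]) for i in range(K)]

    # 3) Accept the longest draft prefix that matches the backbone, plus the
    #    backbone's first disagreeing token, so at least one token is gained
    #    per step and greedy outputs are unchanged (lossless acceleration).
    accepted = []
    for drafted, verified in zip(draft, backbone_next):
        accepted.append(verified)
        if drafted != verified:
            break
    return tokens + accepted
```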
 

Overview​

Human beings possess the capability to multiply a mélange of multisensory cues while actively exploring and interacting with the 3D world. Current multi-modal large language models, however, passively absorb sensory data as inputs, lacking the capacity to actively interact with the objects in the 3D environment and dynamically collect their multisensory information. To usher in the study of this area, we propose MultiPLY, a multisensory embodied large language model that could incorporate multisensory interactive data, including visual, audio, tactile, and thermal information into large language models, thereby establishing the correlation among words, actions, and perceptions. MultiPLY can perform a diverse set of multisensory embodied tasks, including multisensory question answering, embodied question answering, task decomposition, object retrieval, and tool use.
 


Exciting news! @intern_lm 7/20B models are now live on the @huggingface Open LLM Leaderboard!

🔍 Highlights:

- 200K context length for base/chat models.
- 20B model is on par with the performance of Yi-34B.
- 7B model is the best in the <= 13B range.

 