bnew

Veteran
Joined
Nov 1, 2015
Messages
56,200
Reputation
8,249
Daps
157,880

Meta is adding AI to its Ray-Ban smart glasses next month​


The Ray-Ban Meta Smart Glasses can do things like identify objects, monuments, and animals, as well as translate text.​

By Emma Roth, a news writer who covers the streaming wars, consumer tech, crypto, social media, and much more. Previously, she was a writer and editor at MUO.

Mar 28, 2024, 9:38 AM EDT

A photo showing the Ray-Ban Meta Smart Glasses on a blue and yellow background

Photo by Amelia Holowaty Krales / The Verge

Meta will bring AI to its Ray-Ban smart glasses starting next month, according to a report from The New York Times. The multimodal AI features, which can perform translation, along with object, animal, and monument identification, have been in early access since last December.

Users can activate the glasses’ smart assistant by saying “Hey Meta,” and then saying a prompt or asking a question. It will then respond through the speakers built into the frames. The NYT offers a glimpse at how well Meta’s AI works when taking the glasses for a spin in a grocery store, while driving, at museums, and even at the zoo.

Although Meta’s AI was able to correctly identify pets and artwork, it didn’t get things right 100 percent of the time. The NYT found that the glasses struggled to identify zoo animals that were far away and behind cages. It also didn’t properly identify an exotic fruit, called a cherimoya, after multiple tries. As for AI translations, the NYT found that the glasses support English, Spanish, Italian, French, and German.

Meta will likely continue refining these features as time goes on. Right now, the AI features in the Ray-Ban Meta Smart Glasses are only available through an early access waitlist for users in the US.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,200
Reputation
8,249
Daps
157,880


1/1
MagicLens: State-of-the-art instruction-following image retrieval model on 10 benchmarks but 50x smaller than prior best!

Check out our paper on huggingface: Paper page - MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions








1/6
Proud to present MagicLens: image retrieval models following open-ended instructions.

Highlights of MagicLens:

>Novel Insights: Naturally occurring image pairs on the same web page contain diverse image relations (e.g., inside and outside views of the same building). Modeling such diverse relations can enable richer search intents beyond just searching for identical images in traditional image retrieval.

>Strong Performance: Trained on 36.7M data mined from the web, a single MagicLens model matches or exceeds prior SOTA methods on 10 benchmarks across various tasks, including multimodal-to-image, image-to-image, and text-to-image retrieval.

>Efficiency: On multiple benchmarks, MagicLens outperforms previous SOTA (>14.6B) but with a 50X smaller model size (267M).

>Open-Ended Search: MagicLens can satisfy various search intents expressed by open-ended instructions, especially complex and beyond visual intents — where prior best methods fall short.

Check out our technical report for more details: [2403.19651] MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions

This is a joint work with awesome collaborators: @YiLuan9, @Hexiang_Hu, @kentonctlee, Siyuan Qiao, @WenhuChen, @ysu_nlp, and @mchang21, from @GoogleDeepMind and @osunlp.

2/6
We mine image pairs from the same web pages and express their diverse semantic relations, which may extend beyond visual similarities (e.g., a charger of a product), as open-ended instructions with #LMM and #LLM.

3/6
How to precisely and explicitly capture the implicit relations between the image pairs?

We build a systematic pipeline with heavy mining, cleaning, and pairing. Then, we annotate massive metadata via #LMMs and generate open-ended instructions via #LLMs.

4/6
After mining 36.7 million triplets (query image, instruction, target image), we train lightweight MagicLens models that take an image and an instruction as input.

With comparable sizes, a single model achieves best results on 10 benchmarks across three image retrieval task forms.
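
The thread does not show the training objective itself, so here is a minimal sketch of the standard in-batch contrastive setup that such triplets suggest: embed (query image + instruction) on one side and the target image on the other, and treat matched pairs as positives. The embedding size, temperature, and encoder interface are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor, target_emb: torch.Tensor, temperature: float = 0.07):
    """In-batch contrastive loss over (query image + instruction, target image) pairs.

    query_emb:  (B, D) embeddings of the query image fused with its instruction
    target_emb: (B, D) embeddings of the corresponding target images
    """
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.T / temperature                     # similarity of every query to every target
    labels = torch.arange(q.size(0), device=q.device)  # matched pairs sit on the diagonal
    return F.cross_entropy(logits, labels)

# Dummy usage with random embeddings standing in for real encoder outputs.
loss = contrastive_loss(torch.randn(32, 256), torch.randn(32, 256))
print(loss.item())
```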

5/6
On several benchmarks, MagicLens outperforms 50X larger SOTA methods by a large margin.

6/6
To simulate a more realistic scenario, we hold out an index set with 1.4 million images, the largest retrieval pool to date.

Human evaluation finds that MagicLens excels at all kinds of instructions, especially those that are complex and go beyond visual similarity.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,200
Reputation
8,249
Daps
157,880




1/4
After enjoying the large MoE model, why not have a look at a small one?

This is it, Qwen1.5-MoE-A2.7B, a 14B MoE model with only 2.7B activated parameters!

HF: Qwen (Qwen); search for repos with “Qwen1.5-MoE-A2.7B” in the model name.

GitHub: GitHub - QwenLM/Qwen1.5: Qwen1.5 is the improved version of Qwen, the large language model series developed by Qwen team, Alibaba Cloud.

Blog: Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters

Features include:

* Matches the 7B model quality
* 75% pretraining cost reduction
* 1.74x inference speed acceleration
* 64 experts per MoE layer, 8 activated for each token: 4 shared across all tokens and 4 selected by routing (sketched in the snippet after this list)
* Upcycled (initialized) from Qwen-1.8B
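
As a rough illustration of the expert layout in that routing bullet (not Qwen's actual implementation), here is a minimal PyTorch sketch of an MoE feed-forward layer with 4 shared experts plus top-4 routing over 60 routed experts; the hidden sizes and the naive per-token dispatch loop are illustrative assumptions chosen for clarity, not speed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """64 experts per layer, 8 active per token: 4 shared + 4 chosen by a router (sketch)."""

    def __init__(self, d_model=1024, d_expert=512, n_shared=4, n_routed=60, top_k=4):
        super().__init__()
        def expert():
            return nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model))
        self.shared = nn.ModuleList(expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(expert() for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        shared_out = sum(e(x) for e in self.shared)         # shared experts see every token
        gates = F.softmax(self.router(x), dim=-1)           # routing probabilities per token
        weight, idx = gates.topk(self.top_k, dim=-1)        # pick the 4 routed experts per token
        weight = weight / weight.sum(dim=-1, keepdim=True)  # renormalize the selected gates
        routed_out = torch.stack([
            sum(w * self.routed[int(i)](x[t]) for w, i in zip(weight[t], idx[t]))
            for t in range(x.size(0))
        ])
        return shared_out + routed_out

print(MoEFeedForward()(torch.randn(5, 1024)).shape)  # torch.Size([5, 1024])
```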

For now it is only supported by HF transformers and vLLM. For both, you need to install from source, as the released versions do not include `qwen2_moe` yet. This is also something new for us. Hope you enjoy it, and feel free to shoot us feedback!
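
A minimal sketch of running it with Hugging Face transformers, assuming a source install that already ships the `qwen2_moe` architecture; the chat repo id below is a guess based on the naming pattern mentioned above, so check the Qwen org page for the exact name.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-MoE-A2.7B-Chat"  # assumed repo id; verify on the Qwen org page
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Explain mixture-of-experts in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```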

2/4
https://arxiv.org/pdf/2212.05055.pdf

3/4
A little bit different. We used upcycling first, and then split the FFNs into more experts, which is known as fine-grained experts. We also used some techniques to bring in randomness, which we may share later in a more formal tech report.
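
As a rough sketch of the "split the FFNs into more experts" step (simplified to a plain two-layer FFN; Qwen's real FFN is gated and the actual recipe adds randomness), slicing the intermediate dimension yields smaller experts whose sum reproduces the original dense layer, since the activation is elementwise:

```python
import torch.nn as nn

def split_ffn_into_experts(up_proj: nn.Linear, down_proj: nn.Linear, n_splits: int):
    """Slice a dense FFN's intermediate dimension into n_splits fine-grained experts."""
    chunk = up_proj.out_features // n_splits
    experts = []
    for s in range(n_splits):
        lo, hi = s * chunk, (s + 1) * chunk
        up = nn.Linear(up_proj.in_features, chunk)
        down = nn.Linear(chunk, down_proj.out_features)
        up.weight.data = up_proj.weight.data[lo:hi].clone()
        up.bias.data = up_proj.bias.data[lo:hi].clone()
        down.weight.data = down_proj.weight.data[:, lo:hi].clone()
        down.bias.data = down_proj.bias.data.clone() / n_splits  # summing all splits recovers the dense output
        experts.append(nn.Sequential(up, nn.SiLU(), down))
    return experts

experts = split_ffn_into_experts(nn.Linear(64, 256), nn.Linear(256, 64), n_splits=4)
```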

4/4
v0.2 achieves 7.6






1/4
Here is the demo of Qwen1.5-MoE-A2.7B:


It is fast. See if it matches your expectations of the model quality!

2/4
@_akhaliq give it a try

3/4
They are working on this

4/4
No, we did not put the self-intro stuff in it. It just hallucinates. Bad
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,200
Reputation
8,249
Daps
157,880





1/5
𝔸𝕘𝕖𝕟𝕥 𝕃𝕦𝕞𝕠𝕤 is one of the first unified and modular frameworks for training open-source LLM-based agents.

New features:
Multimodal Reasoning with 𝕃𝕦𝕞𝕠𝕤
13B-scale 𝕃𝕦𝕞𝕠𝕤 models
𝕃𝕦𝕞𝕠𝕤 data-explorer demo


@ai2_mosaic
@uclanlp


Paper: [2311.05657] Agent Lumos: Unified and Modular Training for Open-Source Language Agents
Code: GitHub - allenai/lumos: Code and data for "Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs"
HF: ai2lumos (Lumos Agents (AI2))
Demo: Agent Lumos - a Hugging Face Space by ai2lumos (1/N)

2/5
🤖Lumos now supports multimodal tasks!

🖼 It accepts image caption input and then solves visual reasoning tasks with planning. We adopt the same strategy for generating multimodal annotations as other complex interactive tasks.

🤗 Multimodal training annotations are here:

🪄 Lumos plan annotations: ai2lumos/lumos_multimodal_plan_iterative · Datasets at Hugging Face

🪄 Lumos ground annotations: ai2lumos/lumos_multimodal_ground_iterative · Datasets at Hugging Face

3/5
Lumos has strong multimodal performance.

Lumos outperforms larger VL models such as MiniGPT-4-13B on A-OKVQA and ScienceQA.

7B-scale Lumos outperforms LLaVA-1.5-7B on ScienceQA.

7B-scale Lumos outperforms AutoAct, which is directly fine-tuned on ScienceQA. (3/N)

4/5
Lumos-13B models are released.

Lumos-13B models for each task type are released.

They further lift performance compared with the 7B-scale models. (4/N)

5/5
The Lumos data demo is released.

It is a conversational demo showing how the planning and grounding modules work and interact with each other.

Link: Agent Lumos - a Hugging Face Space by ai2lumos (5/N)
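
The thread does not include code, so here is a minimal, hypothetical sketch of how a planning module and a grounding module can interact in such a modular agent. The `fake_*` functions are stand-ins, not the released Lumos models; a real setup would replace them with model calls and add an execution module that runs the grounded actions and feeds results back.

```python
def fake_planning_module(task: str) -> list[str]:
    # Stand-in: a real planning module would use an LLM to decompose the task into subgoals.
    return [f"Subgoal 1: identify what '{task}' is asking for",
            "Subgoal 2: derive the final answer from Subgoal 1"]

def fake_grounding_module(subgoal: str) -> str:
    # Stand-in: a real grounding module would turn a subgoal into executable actions (tool/API calls).
    return f"ACTION({subgoal})"

def run_agent(task: str) -> list[str]:
    actions = []
    for subgoal in fake_planning_module(task):          # high-level plan
        actions.append(fake_grounding_module(subgoal))  # low-level, executable step
    return actions  # an execution module would run these and feed results back

print(run_agent("What object is described in the image caption?"))
```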
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,200
Reputation
8,249
Daps
157,880



1/3
Multimodal DPO, large & tiny multimodal matryoshka embeddings, and 1st party ONNX support for 10x lighter deployments

In partnership with @nebiusofficial, Unum is releasing a new set of pocket-sized multimodal models, already available on @huggingface.


2/3
Check out the new collections on @huggingface and let us know which languages we should prioritize for future multilingual releases

unum-cloud (Unum)

3/3
And check out the new inference code with binary and Matryoshka examples in our @GitHub repo; don't forget to star it and spread the word
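
For readers unfamiliar with the two representations mentioned above, here is a generic illustration (not Unum's API) of how Matryoshka truncation and binary quantization are typically derived from a full embedding:

```python
import numpy as np

full = np.random.randn(768).astype(np.float32)  # stand-in for a model's output embedding

# Matryoshka: keep only the leading dimensions and re-normalize.
small = full[:256]
small = small / np.linalg.norm(small)

# Binary: keep only the sign of each dimension, packed to one bit each.
bits = np.packbits(full > 0)                    # 768 float32 values -> 96 bytes

print(small.shape, bits.nbytes)
```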
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,200
Reputation
8,249
Daps
157,880

1/1
Google presents Long-form factuality in large language models

- Proposes that LLM agents can be used as automated evaluators for long-form factuality
- Shows that LLM agents can achieve superhuman rating performance

repo: GitHub - google-deepmind/long-form-factuality: Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
abs: [2403.18802] Long-form factuality in large language models
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,200
Reputation
8,249
Daps
157,880


1/2
Human Image Personalization with High-fidelity Identity Preservation

Presents FlashFace, a practical tool with which users can easily personalize their own photos on the fly by providing one or a few reference face images and a text prompt

proj: FlashFace
abs: [2403.17008] FlashFace: Human Image Personalization with High-fidelity Identity Preservation

2/2
Google presents MagicLens: image retrieval models following open-ended instructions

Outperforms previous SotA but with a 50x smaller model size

proj: https://open-vision-language.github.io/MagicLens/
abs: [2403.19651] MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,200
Reputation
8,249
Daps
157,880






1/6
Are larger vision models always necessary?

We find scaling on **image scales** (e.g., 224->448->672) is usually better than scaling on model size (e.g., Base->Large->Giant).

With one line of code, improve any vision model for Multimodal LLMs or various vision and robotic tasks!

2/6
We enable any vision model to extract multi-scale features by splitting the large-scale image into regular-sized crops, processing them separately, merging the features together, and pooling them back to the regular size.
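
A minimal sketch of that crop, process, merge, and pool recipe (not the authors' released code), assuming a ViT-style backbone that returns a (batch, tokens, channels) feature map:

```python
import torch
import torch.nn.functional as F

def multiscale_features(backbone, image, base=224, scale=2):
    """Run `backbone` on a (base*scale)-sized image by tiling it into base-sized crops,
    stitching the crop features into one large feature map, then pooling it back to the
    base token grid so the output length matches a single base-scale forward pass."""
    B = image.size(0)
    crops = image.unfold(2, base, base).unfold(3, base, base)         # (B, 3, s, s, base, base)
    crops = crops.permute(0, 2, 3, 1, 4, 5).reshape(-1, 3, base, base)
    feats = backbone(crops)                                           # (B*s*s, tokens, C), ViT-style
    t = int(feats.size(1) ** 0.5)                                     # tokens per side of one crop
    C = feats.size(2)
    feats = feats.reshape(B, scale, scale, t, t, C).permute(0, 5, 1, 3, 2, 4)
    feats = feats.reshape(B, C, scale * t, scale * t)                 # stitched feature map
    feats = F.adaptive_avg_pool2d(feats, t)                           # pool back to the base grid
    return feats.flatten(2).transpose(1, 2)                           # (B, tokens, C)

# Dummy usage: a fake backbone emitting 14x14 tokens of width 768, as a 224-input ViT would.
dummy_backbone = lambda x: torch.randn(x.size(0), 196, 768)
out = multiscale_features(dummy_backbone, torch.randn(2, 3, 448, 448), base=224, scale=2)
print(out.shape)  # torch.Size([2, 196, 768])
```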

3/6
This simple approach has the advantage of no additional parameters (so no need to re-train anything) and keeping the same number of output vision tokens (thus keeping the same input length for MLLMs).

4/6
We find that keeping the same model size and using larger image scales is usually comparable to or better than using larger models. Comparison on image classification, semantic segmentation, and depth estimation:

5/6
Similar trends for MLLMs:

6/6
And robotic tasks too!
 