bnew

Veteran
Joined
Nov 1, 2015
Messages
56,200
Reputation
8,249
Daps
157,880

Meta is adding AI to its Ray-Ban smart glasses next month​


The Ray-Ban Meta Smart Glasses can do things like identify objects, monuments, and animals, as well as translate text.​

By Emma Roth, a news writer who covers the streaming wars, consumer tech, crypto, social media, and much more. Previously, she was a writer and editor at MUO.

Mar 28, 2024, 9:38 AM EDT

A photo showing the Ray-Ban Meta Smart Glasses on a blue and yellow background

Photo by Amelia Holowaty Krales / The Verge

Meta will bring AI to its Ray-Ban smart glasses starting next month, according to a report from The New York Times. The multimodal AI features, which can perform translation, along with object, animal, and monument identification, have been in early access since last December.

Users can activate the glasses’ smart assistant by saying “Hey Meta,” and then saying a prompt or asking a question. It will then respond through the speakers built into the frames. The NYT offers a glimpse at how well Meta’s AI works when taking the glasses for a spin in a grocery store, while driving, at museums, and even at the zoo.

Although Meta’s AI was able to correctly identify pets and artwork, it didn’t get things right 100 percent of the time. The NYT found that the glasses struggled to identify zoo animals that were far away and behind cages. It also didn’t properly identify an exotic fruit, called a cherimoya, after multiple tries. As for AI translations, the NYT found that the glasses support English, Spanish, Italian, French, and German.

Meta will likely continue refining these features as time goes on. Right now, the AI features in the Ray-Ban Meta Smart Glasses are only available through an early access waitlist for users in the US.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,200
Reputation
8,249
Daps
157,880


1/1
MagicLens: State-of-the-art instruction-following image retrieval model on 10 benchmarks but 50x smaller than prior best!

Check out our paper on huggingface: Paper page - MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions








1/6
Proud to present MagicLens: image retrieval models following open-ended instructions.

Highlights of MagicLens:

>Novel Insights: Naturally occurring image pairs on the same web page contain diverse image relations (e.g., inside and outside views of the same building). Modeling such diverse relations can enable richer search intents beyond just searching for identical images in traditional image retrieval.

>Strong Performance: Trained on 36.7M data mined from the web, a single MagicLens model matches or exceeds prior SOTA methods on 10 benchmarks across various tasks, including multimodal-to-image, image-to-image, and text-to-image retrieval.

>Efficiency: On multiple benchmarks, MagicLens outperforms previous SOTA (>14.6B) but with a 50X smaller model size (267M).

>Open-Ended Search: MagicLens can satisfy various search intents expressed by open-ended instructions, especially complex and beyond visual intents — where prior best methods fall short.

Check out our technical report for more details: [2403.19651] MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions

This is a joint work with awesome collaborators: @YiLuan9, @Hexiang_Hu, @kentonctlee, Siyuan Qiao, @WenhuChen, @ysu_nlp, and @mchang21, from @GoogleDeepMind and @osunlp.

2/6
We mine image pairs from the same web pages and express their diverse semantic relations, which may extend beyond visual similarities (e.g., a charger of a product), as open-ended instructions with #LMM and #LLM.

3/6
How to precisely and explicitly capture the implicit relations between the image pairs?

We build a systematic pipeline with heavy mining, cleaning, and pairing. Then, we annotate massive metadata via #LMMs and generate open-ended instructions via #LLMs.

4/6
After mining 36.7 million triplets (query image, instruction, target image), we train lightweight MagicLens models that take an image and an instruction as input.

With comparable sizes, a single model achieves best results on 10 benchmarks across three image retrieval task forms.
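
The thread does not show the training objective itself, so here is a minimal sketch of the standard in-batch contrastive setup that such triplets suggest: embed (query image + instruction) on one side and the target image on the other, and treat matched pairs as positives. The embedding size, temperature, and encoder interface are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor, target_emb: torch.Tensor, temperature: float = 0.07):
    """In-batch contrastive loss over (query image + instruction, target image) pairs.

    query_emb:  (B, D) embeddings of the query image fused with its instruction
    target_emb: (B, D) embeddings of the corresponding target images
    """
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.T / temperature                     # similarity of every query to every target
    labels = torch.arange(q.size(0), device=q.device)  # matched pairs sit on the diagonal
    return F.cross_entropy(logits, labels)

# Dummy usage with random embeddings standing in for real encoder outputs.
loss = contrastive_loss(torch.randn(32, 256), torch.randn(32, 256))
print(loss.item())
```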

5/6
On several benchmarks, MagicLens outperforms 50X larger SOTA methods by a large margin.

6/6
To simulate a more realistic scenario, we hold out an index set with 1.4 million images, the largest retrieval pool to date.

Human evaluation finds that MagicLens excels at all kinds of instructions, especially those that are complex and go beyond visual similarity.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,200
Reputation
8,249
Daps
157,880




1/4
After enjoying the large MoE model, why not have a look at a small one?

This is it, Qwen1.5-MoE-A2.7B, a 14B MoE model with only 2.7B activated parameters!

HF: Qwen (Qwen); search for repos with “Qwen1.5-MoE-A2.7B” in the model name.

GitHub: GitHub - QwenLM/Qwen1.5: Qwen1.5 is the improved version of Qwen, the large language model series developed by Qwen team, Alibaba Cloud.

Blog: Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters

Features include:

* Matches the 7B model quality
* 75% pretraining cost reduction
* 1.74x inference speed acceleration
* 64 experts per MoE layer, 8 activated for each token: 4 shared across all tokens and 4 selected by routing (sketched in the snippet after this list)
* Upcycled (initialized) from Qwen-1.8B
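
As a rough illustration of the expert layout in that routing bullet (not Qwen's actual implementation), here is a minimal PyTorch sketch of an MoE feed-forward layer with 4 shared experts plus top-4 routing over 60 routed experts; the hidden sizes and the naive per-token dispatch loop are illustrative assumptions chosen for clarity, not speed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """64 experts per layer, 8 active per token: 4 shared + 4 chosen by a router (sketch)."""

    def __init__(self, d_model=1024, d_expert=512, n_shared=4, n_routed=60, top_k=4):
        super().__init__()
        def expert():
            return nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model))
        self.shared = nn.ModuleList(expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(expert() for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        shared_out = sum(e(x) for e in self.shared)         # shared experts see every token
        gates = F.softmax(self.router(x), dim=-1)           # routing probabilities per token
        weight, idx = gates.topk(self.top_k, dim=-1)        # pick the 4 routed experts per token
        weight = weight / weight.sum(dim=-1, keepdim=True)  # renormalize the selected gates
        routed_out = torch.stack([
            sum(w * self.routed[int(i)](x[t]) for w, i in zip(weight[t], idx[t]))
            for t in range(x.size(0))
        ])
        return shared_out + routed_out

print(MoEFeedForward()(torch.randn(5, 1024)).shape)  # torch.Size([5, 1024])
```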

For now it is only supported by HF transformers and vLLM. For both, you need to install from source, as the released versions do not include `qwen2_moe` yet. This is also something new for us. Hope you enjoy it, and feel free to shoot us feedback!
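
A minimal sketch of running it with Hugging Face transformers, assuming a source install that already ships the `qwen2_moe` architecture; the chat repo id below is a guess based on the naming pattern mentioned above, so check the Qwen org page for the exact name.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-MoE-A2.7B-Chat"  # assumed repo id; verify on the Qwen org page
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Explain mixture-of-experts in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```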

2/4
https://arxiv.org/pdf/2212.05055.pdf

3/4
A little bit different. We used upcycling first, and then split the FFNs into more experts, which is known as fine-grained experts. We also used some techniques to bring in randomness, which we may share later in a more formal tech report.
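
As a rough sketch of the "split the FFNs into more experts" step (simplified to a plain two-layer FFN; Qwen's real FFN is gated and the actual recipe adds randomness), slicing the intermediate dimension yields smaller experts whose sum reproduces the original dense layer, since the activation is elementwise:

```python
import torch.nn as nn

def split_ffn_into_experts(up_proj: nn.Linear, down_proj: nn.Linear, n_splits: int):
    """Slice a dense FFN's intermediate dimension into n_splits fine-grained experts."""
    chunk = up_proj.out_features // n_splits
    experts = []
    for s in range(n_splits):
        lo, hi = s * chunk, (s + 1) * chunk
        up = nn.Linear(up_proj.in_features, chunk)
        down = nn.Linear(chunk, down_proj.out_features)
        up.weight.data = up_proj.weight.data[lo:hi].clone()
        up.bias.data = up_proj.bias.data[lo:hi].clone()
        down.weight.data = down_proj.weight.data[:, lo:hi].clone()
        down.bias.data = down_proj.bias.data.clone() / n_splits  # summing all splits recovers the dense output
        experts.append(nn.Sequential(up, nn.SiLU(), down))
    return experts

experts = split_ffn_into_experts(nn.Linear(64, 256), nn.Linear(256, 64), n_splits=4)
```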

4/4
v0.2 achieves 7.6






1/4
Here is the demo of Qwen1.5-MoE-A2.7B:


It is fast. See if it matches your expectations of the model quality!

2/4
@_akhaliq give it a try

3/4
They are working on this

4/4
No, we did not put the self-intro stuff in it. It just hallucinates. Bad
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,200
Reputation
8,249
Daps
157,880





1/5
𝔸𝕘𝕖𝕟𝕥 𝕃𝕦𝕞𝕠𝕤 is one of the first unified and modular frameworks for training open-source LLM-based agents.

New features:
Multimodal Reasoning with 𝕃𝕦𝕞𝕠𝕤
13B-scale 𝕃𝕦𝕞𝕠𝕤 models
𝕃𝕦𝕞𝕠𝕤 data-explorer demo


@ai2_mosaic
@uclanlp


Paper: [2311.05657] Agent Lumos: Unified and Modular Training for Open-Source Language Agents
Code: GitHub - allenai/lumos: Code and data for "Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs"
HF: ai2lumos (Lumos Agents (AI2))
Demo: Agent Lumos - a Hugging Face Space by ai2lumos (1/N)

2/5
🤖Lumos now supports multimodal tasks!

🖼 It accepts image caption input and then solves visual reasoning tasks with planning. We adopt the same strategy for generating multimodal annotations as other complex interactive tasks.

🤗 Multimodal training annotations are here:

🪄 Lumos plan annotations: ai2lumos/lumos_multimodal_plan_iterative · Datasets at Hugging Face

🪄 Lumos ground annotations: ai2lumos/lumos_multimodal_ground_iterative · Datasets at Hugging Face

3/5
Lumos has strong multimodal performance.

Lumos outperforms larger VL models such as MiniGPT-4-13B on A-OKVQA and ScienceQA.

7B-scale Lumos outperforms LLaVA-1.5-7B on ScienceQA.

7B-scale Lumos outperforms AutoAct, which is directly fine-tuned on ScienceQA. (3/N)

4/5
Lumos-13B models are released.

Lumos-13B models for each task type are released.

They further lift performance compared with the 7B-scale models. (4/N)

5/5
The Lumos data demo is released.

It is a conversational demo showing how the planning and grounding modules work and interact with each other.

Link: Agent Lumos - a Hugging Face Space by ai2lumos (5/N)
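
The thread does not include code, so here is a minimal, hypothetical sketch of how a planning module and a grounding module can interact in such a modular agent. The `fake_*` functions are stand-ins, not the released Lumos models; a real setup would replace them with model calls and add an execution module that runs the grounded actions and feeds results back.

```python
def fake_planning_module(task: str) -> list[str]:
    # Stand-in: a real planning module would use an LLM to decompose the task into subgoals.
    return [f"Subgoal 1: identify what '{task}' is asking for",
            "Subgoal 2: derive the final answer from Subgoal 1"]

def fake_grounding_module(subgoal: str) -> str:
    # Stand-in: a real grounding module would turn a subgoal into executable actions (tool/API calls).
    return f"ACTION({subgoal})"

def run_agent(task: str) -> list[str]:
    actions = []
    for subgoal in fake_planning_module(task):          # high-level plan
        actions.append(fake_grounding_module(subgoal))  # low-level, executable step
    return actions  # an execution module would run these and feed results back

print(run_agent("What object is described in the image caption?"))
```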
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,200
Reputation
8,249
Daps
157,880



1/3
Multimodal DPO, large & tiny multimodal matryoshka embeddings, and 1st party ONNX support for 10x lighter deployments

In partnership with @nebiusofficial, Unum is releasing a new set of pocket-sized multimodal models, already available on @huggingface.


2/3
Check out the new collections on @huggingface and let us know which languages we should prioritize for future multilingual releases

unum-cloud (Unum)

3/3
And check out the new inference code with binary and Matryoshka examples in our @GitHub repo; don't forget to star it and spread the word
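
For readers unfamiliar with the two representations mentioned above, here is a generic illustration (not Unum's API) of how Matryoshka truncation and binary quantization are typically derived from a full embedding:

```python
import numpy as np

full = np.random.randn(768).astype(np.float32)  # stand-in for a model's output embedding

# Matryoshka: keep only the leading dimensions and re-normalize.
small = full[:256]
small = small / np.linalg.norm(small)

# Binary: keep only the sign of each dimension, packed to one bit each.
bits = np.packbits(full > 0)                    # 768 float32 values -> 96 bytes

print(small.shape, bits.nbytes)
```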
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,200
Reputation
8,249
Daps
157,880

1/1
Google presents Long-form factuality in large language models

- Proposes that LLM agents can be used as automated evaluators for long-form factuality
- Shows that LLM agents can achieve superhuman rating performance

repo: GitHub - google-deepmind/long-form-factuality: Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
abs: [2403.18802] Long-form factuality in large language models
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,200
Reputation
8,249
Daps
157,880


1/2
Human Image Personalization with High-fidelity Identity Preservation

Presents FlashFace, a practical tool with which users can easily personalize their own photos on the fly by providing one or a few reference face images and a text prompt

proj: FlashFace
abs: [2403.17008] FlashFace: Human Image Personalization with High-fidelity Identity Preservation

2/2
Google presents MagicLens: image retrieval models following open-ended instructions

Outperforms previous SotA but with a 50x smaller model size

proj: https://open-vision-language.github.io/MagicLens/
abs: [2403.19651] MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,200
Reputation
8,249
Daps
157,880






1/6
Are larger vision models always necessary?

We find scaling on **image scales** (e.g., 224->448->672) is usually better than scaling on model size (e.g., Base->Large->Giant).

With one line of code, improve any vision model for Multimodal LLMs or various vision and robotic tasks!

2/6
We enable any vision model to extract multi-scale features by splitting the large-scale image into regular-sized crops, processing them separately, merging the features together, and pooling them back to the regular size.
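
A minimal sketch of that crop, process, merge, and pool recipe (not the authors' released code), assuming a ViT-style backbone that returns a (batch, tokens, channels) feature map:

```python
import torch
import torch.nn.functional as F

def multiscale_features(backbone, image, base=224, scale=2):
    """Run `backbone` on a (base*scale)-sized image by tiling it into base-sized crops,
    stitching the crop features into one large feature map, then pooling it back to the
    base token grid so the output length matches a single base-scale forward pass."""
    B = image.size(0)
    crops = image.unfold(2, base, base).unfold(3, base, base)         # (B, 3, s, s, base, base)
    crops = crops.permute(0, 2, 3, 1, 4, 5).reshape(-1, 3, base, base)
    feats = backbone(crops)                                           # (B*s*s, tokens, C), ViT-style
    t = int(feats.size(1) ** 0.5)                                     # tokens per side of one crop
    C = feats.size(2)
    feats = feats.reshape(B, scale, scale, t, t, C).permute(0, 5, 1, 3, 2, 4)
    feats = feats.reshape(B, C, scale * t, scale * t)                 # stitched feature map
    feats = F.adaptive_avg_pool2d(feats, t)                           # pool back to the base grid
    return feats.flatten(2).transpose(1, 2)                           # (B, tokens, C)

# Dummy usage: a fake backbone emitting 14x14 tokens of width 768, as a 224-input ViT would.
dummy_backbone = lambda x: torch.randn(x.size(0), 196, 768)
out = multiscale_features(dummy_backbone, torch.randn(2, 3, 448, 448), base=224, scale=2)
print(out.shape)  # torch.Size([2, 196, 768])
```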

3/6
This simple approach has the advantage of no additional parameters (so no need to re-train anything) and keeping the same number of output vision tokens (thus keeping the same input length for MLLMs).

4/6
We find that keeping the same model size and using larger image scales is usually comparable to or better than using larger models. Comparison on image classification, semantic segmentation, and depth estimation:

5/6
Similar trends for MLLMs:

6/6
And robotic tasks too!
 