bnew

Veteran
Joined
Nov 1, 2015
Messages
57,981
Reputation
8,582
Daps
161,582










 

bnew



[Submitted on 11 Jan 2024]

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs​

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie
Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the visual component typically depends only on the instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. To understand the roots of these errors, we explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning. We identify "CLIP-blind pairs" - images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns, often providing incorrect answers and hallucinated explanations. We further evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities. Together, our research suggests visual representation learning remains an open challenge, and accurate visual grounding is crucial for future successful multimodal systems.
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2401.06209 [cs.CV]
(or arXiv:2401.06209v1 [cs.CV] for this version)

Submission history

From: Saining Xie [view email]
[v1] Thu, 11 Jan 2024 18:58:36 UTC (2,305 KB)
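
For anyone wondering what a "CLIP-blind pair" looks like in practice, here is a minimal sketch of the idea from the abstract: two images whose CLIP embeddings are nearly identical while a vision-only self-supervised encoder sees them as different. The function names, the toy embeddings, and the two thresholds are assumptions for illustration only; the abstract does not give the exact selection rule.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def find_clip_blind_pairs(clip_emb, ssl_emb, clip_min=0.95, ssl_max=0.6):
    """Return index pairs that look alike to CLIP but not to the SSL encoder.

    clip_emb, ssl_emb: lists of embedding vectors, one per image, same order.
    clip_min, ssl_max: illustrative thresholds (assumed, not from the abstract).
    """
    pairs = []
    for i, j in combinations(range(len(clip_emb)), 2):
        clip_sim = cosine(clip_emb[i], clip_emb[j])
        ssl_sim = cosine(ssl_emb[i], ssl_emb[j])
        if clip_sim > clip_min and ssl_sim < ssl_max:
            pairs.append((i, j))
    return pairs


# Toy embeddings (made up): images 0 and 1 are near-identical for "CLIP"
# but orthogonal for the self-supervised encoder.
clip_emb = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
ssl_emb = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
blind_pairs = find_clip_blind_pairs(clip_emb, ssl_emb)
print(blind_pairs)
```

With real models the embeddings would come from a CLIP image encoder and a self-supervised vision encoder, and the all-pairs loop would need a faster nearest-neighbour search for large image sets.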

 

bnew

BIZ / TECH

AI model boasting GPT-4-like capabilities makes debut​



Zhu Shenshen

22:05 UTC+8, 2024-01-30​

Chinese tech giant iFlytek unveiled an upgraded artificial intelligence model on Tuesday with abilities that nearly match those of OpenAI's top AI model, GPT-4.

The debut of the Xinghuo 3.5 AI model is significant because it relies on Chinese AI computing suppliers and partners, which helps the domestic industry overcome the technology sanctions imposed by the United States, according to Liu Qingfeng, chairman of the Shenzhen-listed iFlytek.

The new Xinghuo 3.5 AI model boasts language and math capabilities surpassing those of GPT-4, while its multimodal and coding capabilities are over 90 percent of the GPT-4 level.



Ti Gong​

A screenshot shows that the new Xinghuo 3.5 AI model's capabilities (blue) are similar to OpenAI's GPT-4 level (yellow line).

OpenAI's GPT-4, now available to paid ChatGPT subscribers, is regarded as a top AI model worldwide.

A batch of China-developed generative AI models, including iFlytek's Xinghuo, are open to the public and have received a warm market response.

Besides its capabilities, iFlytek's model is developed with Chinese technology suppliers, including Huawei. It supports sustainable development of AI in China without "external influences and challenges", Liu told a conference broadcast online on Tuesday afternoon.

Under US measures, Chinese firms are forbidden or strictly restricted from accessing NVIDIA's advanced GPUs, a key component for training and developing AI models.

Currently, iFlytek's AI model is used by 350,000 developers, 220,000 of them enterprises. This means that AI is boosting work efficiency and digital transformation and is not only "for fun". The upgraded Xinghuo 3.5 will continue serving developers with improved capabilities, Liu added.

As a tech giant in voice recognition, education and translation, iFlytek has integrated AI into smart devices, including a smart blackboard for schools, intelligent translators capable of recognizing languages and translating them automatically, and office tablet devices, giving consumers direct access to AI across workplace, marketing, travel, daily life and customer service.

One highlight is a 5G voice-to-text service released by iFlytek and China Mobile on Tuesday. It can automatically turn calls into text transcripts and mark key information such as appointment times and locations.

Shares of iFlytek rallied on Tuesday, closing 1.81 percent higher at 41.14 yuan (US$5.79), compared with a 2.4 percent fall in the Shenzhen Component Index. Its market value reached 95.3 billion yuan.


Source: SHINE Editor: Wang Yanlin

 

jensyao

Collector
Joined
Apr 8, 2021
Messages
3,946
Reputation
3,636
Daps
7,897
Reppin
Schooling people takes forever; don't bother
chat GPT is trained on the internet and makes bold claims even if they are entirely wrong about it and can fabricate the details. there are bots that deliver and dump chat gpt answers as a resource on regular sites like youtube and reddit etc. soon enough, chat gpt will be fed ducktale answers from chat gpt and we're going to be in a state of information delirium if it wasn't for humans who actually remember instead of resorting to the internet for all of their answers and using the cloud as their memory banks instead of their brain, and we already saw previews of this with real and fake cases of the mandela effect where the internet reports wrong things from how people remembered it and vice versa. it's only going to get worse from here
 

bnew


Hugging Face launches open source AI assistant maker to rival OpenAI’s custom GPTs​

Carl Franzen @carlfranzen

February 2, 2024 2:51 PM

Futuristic mechanics in yellow jumpsuits work on a smiley face giant mecha in a blue hangar in an anime style image.

Credit: VentureBeat made with Midjourney V6


Hugging Face, the New York City-based startup that offers a popular, developer-focused repository for open source AI code and frameworks (and hosted last year’s “Woodstock of AI”), today announced the launch of third-party, customizable Hugging Chat Assistants.

The new, free product offering allows users of Hugging Chat, the startup’s open source alternative to OpenAI’s ChatGPT, to easily create their own customized AI chatbots with specific capabilities, similar both in functionality and intention to OpenAI’s custom GPT Builder, though that requires a paid subscription to ChatGPT Plus ($20 per month), Team ($25 per user per month paid annually), or Enterprise (variable pricing depending on the needs).



Easy creation of custom AI chatbots​

Philipp Schmid, Hugging Face’s Technical Lead & LLMs Director, posted the news on the social network X (formerly known as Twitter), explaining that users could build a new personal Hugging Face Chat Assistant “in 2 clicks!” Schmid also openly compared the new capabilities to OpenAI’s custom GPTs.



However, in addition to being free, the other big difference between Hugging Chat Assistant and the GPT Builder and GPT Store is that the latter tools depend entirely on OpenAI’s proprietary large language models (LLMs), GPT-4 and GPT-4 Vision/Turbo.


Users of Hugging Chat Assistant, by contrast, can choose which of several open source LLMs they wish to use to power the intelligence of their AI Assistant on the backend, including everything from Mistral’s Mixtral to Meta’s Llama 2.

That’s in keeping with Hugging Face’s overarching approach to AI — offering a broad swath of different models and frameworks for users to choose between — as well as the same approach it takes with Hugging Chat itself, where users can select between several different open source models to power it.



An open rival to the GPT Store?​

Like OpenAI with its GPT Store launched last month, Hugging Face has also created a central repository of third-party customized Hugging Chat Assistants, which users can browse and use on their own time.

The Hugging Chat Assistants aggregator page bears a very close resemblance to the GPT Store page, even down to its visual style, with custom Assistants displayed like custom GPTs in their own rectangular, baseball card-style boxes with circular logos inside.

Screenshot of OpenAI’s GPT Store.

Screenshot of Hugging Face’s Hugging Chat Assistants page.



Better in some ways than GPTs, worse in others

Already, some users in the open source AI community are hailing Hugging Chat Assistants as “better than GPTs,” including Mathieu Trachino, founder of enterprise AI software provider GenDojo.ai, who took to X to list off the merits, which mainly revolve around the user customizability of the underlying models and the fact that the whole situation is free, compared to OpenAI’s paid subscription tiers.



He also noted some areas where custom GPTs outperform Hugging Chat Assistants: the Assistants don’t currently support web search or retrieval augmented generation (RAG), and they can’t generate their own logos (which GPTs can, thanks to OpenAI’s image generation AI model DALL-E 3).

Still, the arrival of Hugging Chat Assistants shows how fast the open source community continues to catch up to closed rivals like the now-ironically named “Open” AI, especially coming just one day after the confirmed leak of a new open source model from Mistral, Miqu, that nearly matches the performance of the closed GPT-4, still the high-water mark for LLMs. But…for how long?








 

bnew


‘Everyone looked real’: multinational firm’s Hong Kong office loses HK$200 million after scammers stage deepfake video meeting​


  • Employee fooled after seeing digitally recreated versions of company’s chief financial officer and others in video call
  • Deepfake technology has been in the spotlight after fake explicit images of pop superstar Taylor Swift spread on social media sites

Harvey Kong

Published: 8:30am, 4 Feb, 2024


Criminals used deepfake technology to scam a multinational company out of HK$200 million by digitally recreating its chief financial officer. Photo: Shutterstock


A multinational company lost HK$200 million (US$25.6 million) in a scam after employees at its Hong Kong branch were fooled by deepfake technology, with one incident involving a digitally recreated version of its chief financial officer ordering money transfers in a video conference call, police said.

Everyone on the video calls except the victim was a fake representation of a real person. The scammers applied deepfake technology to turn publicly available video and other footage into convincing versions of the meeting’s participants.

Police said they were highlighting the case as it was the first of its kind in Hong Kong and involved a large sum. They did not reveal details about the company or the employees involved.


(From left) Associate vice-president at Hong Kong College of Technology Sam Lam; acting superintendent at the Cyber Security and Technology Crime Bureau Baron Chan; and Senior Inspector Tyler Chan. Photo: Jonathan Wong

Acting senior superintendent Baron Chan Shun-ching said that in previous cases, scam victims were tricked in one-on-one video calls.

“This time, in a multi-person video conference, it turns out that everyone you see is fake,” he said, adding that the scammers were able to generate convincing representations of targeted individuals that looked and sounded like the actual people.

Deepfake technology was in the news last month, after fake sexually explicit images of pop superstar Taylor Swift were spread on social media sites.


The police report was made by an employee in the branch’s finance department, who received what appeared to be a phishing message in mid-January, apparently from the company’s UK-based chief financial officer saying a secret transaction had to be carried out.

Chan said despite having an early “moment of doubt”, the employee fell for the ruse after being invited to the group video conference and finding the company’s CFO present, along with other staff and some outsiders.

The company employees in the call looked and sounded like real people the targeted employee recognised.



Chan said the employee followed instructions given during the meeting and made 15 transfers totalling HK$200 million to five Hong Kong bank accounts.

The entire episode lasted about a week from the time the employee was contacted until the person realised it was a scam upon making an inquiry with the company’s headquarters.

Police carried out an investigation and found that the meeting participants had been digitally recreated by scammers who used publicly available video and audio footage of the individuals.

“They used deepfake technology to imitate the voice of their targets reading from a script,” Chan said, adding that this helped to deceive the employee.

Chan said that during the video conference, the scammers asked the victim to do a self-introduction but did not actually interact with the person. The fake images on screen mainly gave orders before the meeting ended abruptly.


Deepfake technology allows for face swapping and matching of facial movements with a different person. Photo: Shutterstock

The scammers then stayed in touch with the victim through instant messaging platforms, emails and one-on-one video calls.

Chan said scammers approached another employee at the branch using the same multi-person video call tactic. The force said two to three employees in total had been approached by scammers, but did not provide full information on their encounters.

Police are still investigating and no arrests have been made.

The force said it hoped members of the public were aware that scammers were now capable of using deepfake technology in new ways.

Senior Inspector Tyler Chan Chi-wing said there were several ways to check whether a person who appeared on a screen was a fake, digital recreation.




He suggested asking the person to move their head, posing questions to determine their authenticity and becoming immediately suspicious the moment money is requested.

Separately, police said they would expand their alert system covering the Faster Payment System (FPS) to warn users they were transferring money to accounts linked to scams.

Covering FPS transfers at 35 banks and nine stored-value services, it will be extended to local instant money transfers online and offline by the second half of the year, including through mobile applications, automatic teller machines and bank counters.

Anyone who enters details of an account linked to scams in the database of the police force’s Scameter search engine will get an alert.
 

bnew


Google Bard to become ‘Gemini’ very soon with ‘Advanced’ tier and Android app​


Ben Schoon | Feb 3 2024 - 7:02 am PT

Google Bard is in for a big shakeup in the next few days, as an early changelog reveals that the “Gemini” rebrand is coming next week with a new Android app and more.

Over the past few months, Google has rapidly been building out Bard, its generative AI chat experience, with new features and capabilities. Last year, Bard was upgraded with “Gemini” as the behind-the-scenes model and, more recently, added an image generator. But, all the while, Google has been working on a big change to Bard.

As we reported earlier this week, evidence spotted through Android and Bard’s web experience showed that Google is looking to rebrand Bard as “Gemini,” matching the name of the foundational model powering it. Now, an early changelog for Bard spotted by Dylan Roussel shows that the change is coming next week.

The changelog, currently with the date February 7 attached to it, directly says that “Bard is now Gemini,” and also offers some insight into Google’s reasoning. As was announced this week, “Gemini Pro” now powers Bard in all countries and languages where Bard is available. The rebranding is to better fit this, Google says.


We’re committed to giving everyone direct access to Google AI and, as of this week, every Gemini user across our supported countries and languages has access to Google’s best family of AI models. To better reflect this commitment, we’ve renamed Bard to Gemini.

Beyond the change in name, Google also notes in the changelog that access to “Gemini Advanced” will also be available starting on February 7. The “Advanced” tier was announced in December 2023 and is built on “Gemini Ultra,” the most capable of Google’s models. We’ve also reported that it will be paid while also adding some additional functionality. Google’s changelog directly mentions that this is a paid product, and that it will expand with more features including “expanded multi-modal capabilities,” better coding support, and “the ability to upload and more deeply analyze files, documents, data, and more.”

After being rebranded from Bard, Google Gemini will also be getting an Android app.

Google explains:


Get help learning in new ways, writing thank you notes, planning events, and more with Google AI on your phone. Gemini is integrated with Google apps like Gmail, Maps, and YouTube, making it easy to get things done on your phone. You can interact with it through text, voice or images.

To chat with Gemini on Android, download the Gemini app in the Google Play Store. On iOS, try Gemini in the Google app.

While there’s no preview of the app currently, previous evidence we’ve reported on suggests that this “app” will act a lot like the current experience of using Google Assistant on Android. We specifically found that the current Google Assistant “app” available through the Play Store, which is effectively just a homescreen shortcut, is being updated to be called “Gemini.” Last month, we also shared screenshots from an early version of the experience, including a settings menu.


Google’s changelog notes that the app will only be available on “select devices” – we reported months ago that this might be only Tensor-powered Pixels and the Galaxy S24 – in English in the US. It would also be expanding to “Japanese, Korean, and English globally except for the UK, Switzerland, European Economic Area countries, and associated territories,” with more countries and languages to follow.

And, finally, Google Gemini would also expand to Canada on February 7. Bard never launched there, leaving Canada as one of the few major regions without support.

As always with these early changelogs, there’s a chance things could change slightly between now and the formal announcement, including the actual date. That said, things have clearly been building up quickly towards this launch.
 

bnew


Google releases GenAI tools for music creation​

Kyle Wiggers @kyle_l_wiggers / 10:00 AM EST•February 1, 2024

Google logo sign with white backlighting on dark background

Image Credits: Artur Widak/NurPhoto / Getty Images

As GenAI tools begin to transform the music industry in incredible — and in some cases ethically problematic — ways, Google is ramping up its investments in AI tech to create new songs and lyrics.

The search giant today unveiled MusicFX, an upgrade to MusicLM, the music-generating tool Google released last year. MusicFX can create ditties up to 70 seconds in length and music loops, delivering what Google claims is “higher-quality” and “faster” music generation.

MusicFX is available in Google’s AI Test Kitchen, an app that lets users try out experimental AI-powered systems from the company’s labs. Technically, MusicFX launched for select users in December — but now it’s generally available.

Google MusicFX

Image Credits: Google

And it’s not terrible, I must say.

Like its predecessor, MusicFX lets users enter a text prompt (“two nylon string guitars playing in flamenco style”) to describe the song they wish to create. The tool generates two 30-second versions by default, with options to lengthen the tracks (to 50 or 70 seconds) or automatically stitch the beginning and end to loop them.

A new addition is suggestions for alternative descriptor words in prompts. For example, if you type “country style,” you might see a drop-down with genres like “rockabilly style” and “bluegrass style.” For the word “catchy,” the drop-down might contain “chill” and “melodic.”

Google MusicFX

Image Credits: Google

Below the field for the prompt, MusicFX provides a word cloud of additional recommendations for relevant descriptions, instruments and tempos to append (e.g. “avant garde,” “fast,” “exciting,” “808 drums”).

So how’s it sound? Well, in my brief testing, MusicFX’s samples were… fine? Truth be told, music generation tools are getting to the point where it’s tough for this writer to distinguish between the outputs. The current state-of-the-art produces impressively clean, crisp-sounding tracks — but tracks tending toward the boring, uninspired and melodically unfocused.

Maybe it’s the SAD getting to me, but one of the prompts I went with was “a house music song with funky beats that’s danceable and uplifting, with summer rooftop vibes.” MusicFX delivered, and the tracks weren’t bad — but I can’t say that they come close to any of the better DJ sets I’ve heard recently.

Listen for yourself:
https://techcrunch.com/wp-content/u..._music_song_with_funky_beats_thats_da.mp3?_=1
https://techcrunch.com/wp-content/u...usic_song_with_funky_beats_thats_da-5.mp3?_=2

Anything with stringed instruments sounds worse, like a cheap MIDI sample — which is perhaps a reflection of MusicFX’s limited training set. Here are two tracks generated with the prompt “a soulful melody played on string instruments, orchestral, with a strong melodic core”:
https://techcrunch.com/wp-content/u...melody_played_on_string_instruments-1.mp3?_=3
https://techcrunch.com/wp-content/u...l_melody_played_on_string_instruments.mp3?_=4

And for a change of pace, here’s MusicFX’s interpretation of “a weepy song on guitar, melancholic, slow tempo, on a moonlight [sic] night.” (Forgive the spelling mistake.)
https://techcrunch.com/wp-content/u...y_song_on_guitar_melancholic_slow_tem.mp3?_=5

There are certain things MusicFX won’t generate — and that can’t be removed from generated tracks. To avoid running afoul of copyrights, Google filters prompts that mention specific artists or include vocals. And it’s using SynthID, an inaudible watermarking technology its DeepMind division developed, to make it clear which tracks came from MusicFX.

I’m not sure what sort of master list Google’s using to filter out artists and song names, but I didn’t find it all that hard to defeat. While MusicFX declined to generate songs in the style of SZA and The Beatles, it happily took a prompt referencing Lake Street Dive, although the tracks weren’t anything to write home about, I will say.

Lyric generation​

Google released a new lyrics generation tool, TextFX, in AI Test Kitchen that’s intended as a sort of companion to MusicFX. Like MusicFX, TextFX has been available to a small user cohort for some time — but it’s now more widely available, and upgraded in terms of “user experience and navigation,” Google says.

As Google explains in the AI Test Kitchen app, TextFX was created in collaboration with Lupe Fiasco, the rap artist and record producer. It’s powered by PaLM 2, one of Google’s text-generating AI models, and “[draws] inspiration from the lyrical and linguistic techniques [Fiasco] has developed throughout his career.”

Google TextFX

Image Credits: Google

This reporter expected TextFX to be a more or less automated lyrics generator. But it’s certainly not that. Instead, TextFX is a suite of modules designed to aid in the lyrics-writing process, including a module that finds words in a category starting with a chosen letter and a module that finds similarities between two unrelated things.

Google TextFX

Image Credits: Google

TextFX takes a while to get the hang of. But I can see it becoming a useful resource for lyricists — and writers in general, frankly.

You’ll want to closely review its outputs, though. Google warns that TextFX “may display inaccurate info, including about people,” and I indeed managed to prompt it to suggest that climate change “is a hoax perpetrated by the Chinese government to hurt American businesses.” Yikes.

Google TextFX

Image Credits: Google

Questions remain​

With MusicFX and TextFX, Google’s signaling that it’s heavily invested in GenAI music tech. But I wonder whether its preoccupation with keeping up with the Joneses rather than addressing the tough questions surrounding GenAI music will serve it well in the end.

Increasingly, homemade tracks that use GenAI to conjure familiar sounds and vocals that can be passed off as authentic, or at least close enough, have been going viral. Music labels have been quick to flag AI-generated tracks to streaming partners like Spotify and SoundCloud, citing intellectual property concerns. They’ve generally been victorious. But there’s still a lack of clarity on whether “deepfake” music violates the copyright of artists, labels and other rights holders.

A federal judge ruled in August that AI-generated art can’t be copyrighted. However, the U.S. Copyright Office hasn’t taken a stance yet, only recently beginning to seek public input on copyright issues as they relate to AI. Also unclear is whether users could find themselves on the hook for violating copyright law if they try to commercialize music generated in the style of another artist.

Google’s attempting to forge a careful path toward deploying GenAI music tools on the YouTube side of its business, which is testing AI models created by DeepMind in partnership with artists like Alec Benjamin, Charlie Puth, Charli XCX, Demi Lovato, John Legend, Sia and T-Pain. That’s more than can be said of some of the tech giant’s GenAI competitors, like Stability AI, which takes the position that “fair use” justifies training on content without the creator’s permission.

But with labels suing GenAI vendors over copyrighted lyrics in training data and artists registering their discontent, Google has its work cut out for it — and it’s not letting that inconvenient fact slow it down.
 

bnew






AI research

Feb 3, 2024

Google's MobileDiffusion generates AI images on mobile devices in less than a second​



Jonathan Kemper
Jonathan works as a technology journalist who focuses primarily on how easily AI can already be used today and how it can support daily life.


Summary

Google's MobileDiffusion is a fast and efficient way to create images from text on smartphones.

MobileDiffusion is Google's latest development in text-to-image generation. Designed specifically for smartphones, the diffusion model generates high-quality images from text input in less than a second.

With a model size of only 520 million parameters, it is significantly smaller than models with billions of parameters such as Stable Diffusion and SDXL, making it more suitable for use on mobile devices.
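
To put the 520 million figure in perspective, here is a rough back-of-the-envelope sketch of how much memory the weights alone would take at a few common numeric precisions. The precisions listed are assumptions for illustration; the article does not say which format MobileDiffusion ships in, and activations and runtime overhead are ignored.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / 2**30


MOBILEDIFFUSION_PARAMS = 520_000_000  # parameter count quoted in the article

# Assumed storage formats; not stated in the article.
for label, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    size = weight_memory_gib(MOBILEDIFFUSION_PARAMS, nbytes)
    print(f"{label}: {size:.2f} GiB")
```

At 16-bit precision this works out to just under 1 GiB of weights, and the footprint scales linearly with parameter count, so a model with several billion parameters needs several times that.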

The researchers' tests show that MobileDiffusion can generate images with a resolution of 512 x 512 pixels in about half a second on both Android smartphones and iPhones. The output is continuously updated as you type, as Google's demo video shows.


MobileDiffusion consists of three main components: a text encoder, a diffusion network, and an image decoder.

The diffusion network, a UNet, contains self-attention, cross-attention, and feed-forward layers, which are crucial for text comprehension in diffusion models.

However, this layered architecture is computationally complex and resource intensive. Google uses a so-called UViT architecture, in which more transformer blocks are placed in a low-dimensional region of the UNet to reduce resource requirements.

In addition, distillation and a Generative Adversarial Network (GAN) hybrid are used to enable one- to eight-step sampling.
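
The three layer types named above can be sketched in a few lines. This is a heavily simplified illustration, not Google's implementation: there are no learned projections, residual connections, or normalization, the feed-forward layer is a bare ReLU stand-in, and the 64 x 64 and 16 x 16 grid sizes are made-up examples. The last helper shows why placing transformer blocks at low resolution saves compute: self-attention scores grow with the square of the token count.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention without learned projections (a simplification)."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


def feed_forward(tokens):
    """Stand-in for the per-token MLP: a ReLU applied elementwise."""
    return [[max(0.0, x) for x in tok] for tok in tokens]


def transformer_block(image_tokens, text_tokens):
    """Self-attention over image tokens, cross-attention to text, then feed-forward."""
    x = attention(image_tokens, image_tokens, image_tokens)  # self-attention
    x = attention(x, text_tokens, text_tokens)  # cross-attention: queries from image, keys/values from text
    return feed_forward(x)


def self_attention_scores(height, width):
    """Number of pairwise scores self-attention computes on a height x width grid."""
    n = height * width
    return n * n


# Toy inputs (made up): three image tokens and two text tokens, two dimensions each.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[0.5, 0.5], [1.0, -1.0]]
out = transformer_block(image_tokens, text_tokens)
print(len(out), len(out[0]))

# Ratio of attention scores at a 64 x 64 grid versus a 16 x 16 grid.
print(self_attention_scores(64, 64) // self_attention_scores(16, 16))
```

In this toy setup a 64 x 64 grid needs 256 times as many attention scores as a 16 x 16 grid, which is the kind of saving that motivates moving attention into the low-resolution part of the network.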
 