



DiffuseKronA

A Parameter Efficient Fine-tuning Method for Personalized Diffusion Model

In the realm of subject-driven text-to-image (T2I) generative models, recent developments like DreamBooth and BLIP-Diffusion have led to impressive results yet encounter limitations due to their intensive fine-tuning demands and substantial parameter requirements. While the low-rank adaptation (LoRA) module within DreamBooth offers a reduction in trainable parameters, it introduces a pronounced sensitivity to hyperparameters, leading to a compromise between parameter efficiency and the quality of T2I personalized image synthesis. Addressing these constraints, we introduce DiffuseKronA, a novel Kronecker product-based adaptation module that not only significantly reduces the parameter count by 35% and 99.947% compared to LoRA-DreamBooth and the original DreamBooth, respectively, but also enhances the quality of image synthesis. Crucially, DiffuseKronA mitigates the issue of hyperparameter sensitivity, delivering consistent high-quality generations across a wide range of hyperparameters, thereby diminishing the necessity for extensive fine-tuning. Furthermore, a more controllable decomposition makes DiffuseKronA more interpretable and can even achieve up to a 50% reduction with results comparable to LoRA-DreamBooth. Evaluated against diverse and complex input images and text prompts, DiffuseKronA consistently outperforms existing models, producing diverse images of higher quality with improved fidelity and a more accurate color distribution of objects, all the while upholding exceptional parameter efficiency, thus presenting a substantial advancement in the field of T2I generative modeling.

https://arxiv.org/pdf/2402.17412.pdf
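For intuition, here is a minimal sketch in PyTorch of the core idea: replace LoRA's low-rank update BA with a Kronecker product A ⊗ B, so a full-size weight update is expressed with two small factors. The class name, shapes, and initialization below are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class KroneckerAdapterLinear(nn.Module):
    # Frozen pretrained linear layer plus a trainable Kronecker-product update (illustrative).
    def __init__(self, base: nn.Linear, a1: int, b1: int, a2: int, b2: int, scale: float = 1.0):
        super().__init__()
        out_f, in_f = base.weight.shape
        assert a1 * b1 == out_f and a2 * b2 == in_f, "factor shapes must tile the base weight"
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # pretrained weights stay frozen
        self.A = nn.Parameter(torch.zeros(a1, a2))        # zero init => no change at step 0
        self.B = nn.Parameter(torch.randn(b1, b2) * 0.01)
        self.scale = scale

    def forward(self, x):
        delta_w = torch.kron(self.A, self.B)              # shape (a1*b1, a2*b2) == base weight shape
        return self.base(x) + self.scale * F.linear(x, delta_w)

# Trainable parameters: a1*a2 + b1*b2, versus r*(out_f + in_f) for a rank-r LoRA adapter.
layer = KroneckerAdapterLinear(nn.Linear(768, 768), a1=64, b1=12, a2=64, b2=12)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))   # 64*64 + 12*12 = 4240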



Unraveling Textual Descriptions into Artistic Creations




Our method, DiffuseKronA, achieves superior image quality and accurate text-image correspondence across diverse input images and prompts, all the while upholding exceptional parameter efficiency. In this context, [V] denotes a unique token used for fine-tuning a specific subject in the text-to-image diffusion model.
For more results, please visit the gallery!

Superior Fidelity and Colour Distribution​

Our approach consistently produces images of superior fidelity compared to LoRA-DreamBooth, as illustrated below. Notably, the clock generated by our method faithfully reproduces intricate details, such as the exact depiction of the numeral 3, mirroring the original image. In contrast, the output from LoRA-DreamBooth exhibits difficulties in achieving such high fidelity; it also struggles to maintain fidelity to the numeral 1 on the chest of the sitting toy. Additionally, our method demonstrates improved color distribution in the generated images, a feature clearly evident in the RC Car images below.

Text Alignment​

Our method comprehends the intricacies and complexities of text prompts provided as input, producing images that align with the given text prompts, as depicted below. The generated image of the character in response to the prompt exemplifies the meticulous attention our method pays to detail. It elegantly captures the presence of a shop in the background, a bowl with noodles in front of the character, and accompanying soup bowls. In contrast, LoRA-DreamBooth struggles to generate an image that aligns seamlessly with the complex input prompt. Our method not only generates images that align with text but is also proficient in producing a diverse range of images for a given input.
 









1/8
SUPIR released their weights today

Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild


@chenxi_jw got this cool upscaler up on @replicate




(CVPR2024) Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild​

[Paper]   [Project Page]   [Online Demo (Coming soon)]
Fanghua Yu, Jinjin Gu, Zheyuan Li, Jinfan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, Chao Dong
Shenzhen Institute of Advanced Technology; Shanghai AI Laboratory; University of Sydney; The Hong Kong Polytechnic University; ARC Lab, Tencent PCG; The Chinese University of Hong Kong


⚠ Due to the large RAM (60G) and VRAM (30G x2) costs of SUPIR, we are working on releasing the online demo.

 











1/12
1/ We are releasing Playground v2.5, our latest foundation model to create images.

We tested our model across 20K+ users in a rigorous benchmark that went beyond anything we've seen to date.

This model is open weights. More information in the tweets below.

2/12
2/ You can use it on http:// right now, world-wide.

Playground v2.5 improves dramatically on color & contrast, multi aspect ratio, and aesthetics to push image quality as high as possible while not changing the model arch that the community has built tools on.

3/12
3/ We did rigorous benchmarking with real users of these models, beating several state-of-the-art models.







1/9
Playground AI releases Playground v2.5

latest foundation model to create images.

tested model across 20K+ users in a rigorous benchmark

This model is open weights

awful at text


https://arxiv.org/abs/2402.17245





1/3
playground-v2.5-1024px-aesthetic is very cool

Check it out on @replicate

Link below

2/3
Try it out here:

3/3
Or try out one of the prompts from here:






Playground v2.5 is a diffusion-based text-to-image generative model, and a successor to Playground v2.

Playground v2.5 is the state-of-the-art open-source model in aesthetic quality. Our user studies demonstrate that our model outperforms SDXL, Playground v2, PixArt-α, DALL-E 3, and Midjourney 5.2.

For details on the development and training of our model, please refer to our blog post and technical report.

Model Description​

  • Developed by: Playground
  • Model type: Diffusion-based text-to-image generative model
  • License: Playground v2.5 Community License
  • Summary: This model generates images based on text prompts. It is a Latent Diffusion Model that uses two fixed, pre-trained text encoders (OpenCLIP-ViT/G and CLIP-ViT/L). It follows the same architecture as Stable Diffusion XL.
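Since it shares the SDXL architecture, it should load with the standard diffusers pipeline. Here is a minimal sketch; the Hugging Face model ID is assumed from the "playground-v2.5-1024px-aesthetic" name quoted earlier in the thread, and the exact loading arguments may differ, so verify against the model card before running.

import torch
from diffusers import DiffusionPipeline

# Model ID assumed from the name mentioned earlier in the thread; check the model card.
pipe = DiffusionPipeline.from_pretrained(
    "playgroundai/playground-v2.5-1024px-aesthetic",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="a vintage RC car on a workbench, warm studio lighting",
    num_inference_steps=30,
    guidance_scale=3.0,
).images[0]
image.save("playground_v25_sample.png")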
 


1/11
GPT-4 with simple engineering can predict the future around as well as crowds:
https://arxiv.org/abs/2402.18563
On hard questions, it can do better than crowds.

If these systems become extremely good at seeing the future, they could serve as an objective, accurate third party. This would help us better anticipate the long-term consequences of our actions and make more prudent decisions.

"The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom." - Asimov

I didn't write this paper, but we called for AI forecasting research in Unsolved Problems in ML Safety some years back (http://arxiv.org/abs/2109.13916), and concretized it as a research avenue a year later in Forecasting Future World Events with Neural Networks (https://arxiv.org/abs/2206.15474). Hopefully AI companies will add this feature as the election season begins.

2/11
paper written by @dannyhalawi15 @FredZhang0 @jcyhc_ai @JacobSteinhardt

3/11
For anyone wanting to reuse the prompt from the image:

Question: {question}

Question Background: {background}

Resolution Criteria: {resolution_criteria}

Today's date: {date_begin}
Question close date: {date_end}

We have retrieved the following information for this question:…
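If you want to drop that template into a script, here is a small sketch of filling it with str.format. The wrapper function and the retrieved_info placeholder (standing in for the truncated retrieval section above) are hypothetical conveniences, not the authors' code.

FORECAST_PROMPT = """Question: {question}

Question Background: {background}

Resolution Criteria: {resolution_criteria}

Today's date: {date_begin}
Question close date: {date_end}

We have retrieved the following information for this question:
{retrieved_info}"""

def build_forecast_prompt(question, background, resolution_criteria,
                          date_begin, date_end, retrieved_info):
    # Plain str.format fill; swap in your own news-retrieval output for retrieved_info.
    return FORECAST_PROMPT.format(
        question=question,
        background=background,
        resolution_criteria=resolution_criteria,
        date_begin=date_begin,
        date_end=date_end,
        retrieved_info=retrieved_info,
    )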


5/11
Rip Prediction markets

6/11
Anyone thought about making a GPT agent with the Tetlock principles from superforecasting built into the model?

Could also combine with upstream thinking principles

7/11
: language needs revision with humility :: eg seeing the future…

8/11


9/11
Just wait until people start asking controversial macrosocial questions and you'll find out the limits of these "foundation models" are not baked into the foundation models so much as they are baked into the organizations that create the foundation models.

In AGI formal theory,…

10/11
Travelers tv series got it right !

11/11
Predicting the future is cool, but having robots as third-party sounds sketchy.












1/12
Can we build an LLM system to forecast geo-political events at the level of human forecasters?

Introducing our work Approaching Human-Level Forecasting with Language Models!

Arxiv: https:// Joint work with @dannyhalawi15, @FredZhang0, and @jcyhc_ai

2/12
In this work, we build an LM pipeline for automated forecasting. Given any question about a future event, it retrieves and summarizes relevant articles, reasons about them, and predicts the probability that the event occurs.

3/12
We compare our system to ensembles of competitive human forecasters ("the crowd"). We approach the performance of the crowd across all questions, and beat the crowd on questions where they are less confident (probabilities between 0.3 and 0.7).

4/12
Moreover, averaging our prediction with the crowd consistently outperforms the crowd itself (as measured by Brier score, the most commonly-used metric of forecasting performance).
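For reference, the Brier score for binary questions is just the mean squared error between the forecast probability and the 0/1 outcome (lower is better). A quick sketch with made-up numbers:

def brier_score(probs, outcomes):
    # Mean squared error between forecast probabilities and 0/1 outcomes; lower is better.
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Blending a model's forecasts with the crowd's, as described above (illustrative values only).
model_probs = [0.62, 0.35, 0.80]
crowd_probs = [0.55, 0.40, 0.70]
blend = [(m + c) / 2 for m, c in zip(model_probs, crowd_probs)]
print(brier_score(blend, [1, 0, 1]))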

5/12
Our system has a number of interesting properties. For instance, our forecasted probabilities are well-calibrated, even though we perform no explicit calibration and even though the base models themselves are not (!).

6/12
Second, our model underperforms on "easy" questions (where the answer is nearly certain), because it is unwilling to give probabilities very close to 0 or 1. This is possibly an artifact of its safety training.

7/12
Finally, we provide a self-supervised method that fine-tunes models to forecast better, based on having them mimic rationales and forecasts that outperform the crowd. This is effective enough that fine-tuned GPT-3.5 can beat a carefully prompted GPT-4.

8/12
For some cool related work, see https://, which examines human-LLM forecasting teams, and https:// and https://-forecasting-tournament…, which introduce AI forecasting competitions.

9/12
We are excited to continue this work! Please email @dannyhalawi15 at dannyhalawi15@gmail.com to get in touch.

10/12
Cool! Weird question, but: reading the paper, you are finetuning a model and then testing it again. But are you testing it on the same questions you finetuned it on? Couldn't find a discussion of this on a quick read of the first 15 pages.


12/12
Dark mode for this paper for those who read at night


1/1
Large Language Models are getting better at forecasting - indeed many are capable of becoming superforecasters

Approaching Human-Level Forecasting with Language Models #LLM #AI #GenAI

In this work, during the fine-tuning phase, one of the prompts the researchers use to generate strong reasonings asks the model to build a decision tree and assign probabilities. The fine-tuned model learns the reasoning path (without being explicitly prompted to do so).

https://lnkd.in/dBReygRD




Computer Science > Machine Learning​

[Submitted on 28 Feb 2024]

Approaching Human-Level Forecasting with Language Models​

Danny Halawi, Fred Zhang, Chen Yueh-Han, Jacob Steinhardt
Forecasting future events is important for policy and decision making. In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions. To facilitate our study, we collect a large dataset of questions from competitive forecasting platforms. Under a test set published after the knowledge cut-offs of our LMs, we evaluate the end-to-end performance of our system against the aggregates of human forecasts. On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it. Our work suggests that using LMs to forecast the future could provide accurate predictions at scale and help to inform institutional decision making.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Cite as: arXiv:2402.18563 [cs.LG]
(or arXiv:2402.18563v1 [cs.LG] for this version)
[2402.18563] Approaching Human-Level Forecasting with Language Models

Submission history

From: Danny Halawi [view email]
[v1] Wed, 28 Feb 2024 18:54:18 UTC (3,212 KB)


 




1/7
Microsoft presents ResLoRA

Identity Residual Mapping in Low-Rank Adaption

As one of the most popular parameter-efficient fine-tuning (PEFT) methods, low-rank adaptation (LoRA) is commonly applied to fine-tune large language models (LLMs). However, updating the weights of LoRA blocks effectively and expeditiously is challenging due to the long calculation path in the original model. To address this, we propose ResLoRA, an improved framework of LoRA. By adding residual paths during training and using merging approaches to eliminate these extra paths during inference, our method can achieve better results in fewer training steps without any extra trainable parameters or inference cost compared to LoRA. The experiments on NLG, NLU, and text-to-image tasks demonstrate the effectiveness of our method. To the best of our knowledge, ResLoRA is the first work that combines the residual path with LoRA.
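A minimal PyTorch sketch of the general idea, a LoRA block with an extra residual path active during training; this is an illustration under my own assumptions, not the paper's exact formulation or its merging procedure.

import torch
import torch.nn as nn

class ResLoRALinearSketch(nn.Module):
    # Frozen linear layer + low-rank update + an extra residual path used only during training.
    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # pretrained weights stay frozen
        in_f, out_f = base.in_features, base.out_features
        self.lora_a = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_f, rank))
        self.scale = scale
        self.use_residual = True                          # turned off (merged away) at inference

    def lora(self, x):
        return (x @ self.lora_a.T) @ self.lora_b.T * self.scale

    def forward(self, x, prev_block_input=None):
        out = self.base(x) + self.lora(x)
        if self.use_residual and prev_block_input is not None:
            # Shortcut from an earlier block's input, shortening the gradient path to this block.
            out = out + self.lora(prev_block_input)
        return out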
 


StarCoder 2 is a code-generating AI that runs on most GPUs​

Kyle Wiggers @kyle_l_wiggers / 9:00 AM EST•February 28, 2024


Developers are adopting AI-powered code generators — services like GitHub Copilot and Amazon CodeWhisperer, along with open access models such as Meta’s CodeLlama — at an astonishing rate. But the tools are far from ideal. Many aren’t free. Others are, but only under licenses that preclude them from being used in common commercial contexts.

Perceiving the demand for alternatives, AI startup Hugging Face several years ago teamed up with ServiceNow, the workflow automation platform, to create StarCoder, an open source code generator with a less restrictive license than some of the others out there. The original came online early last year, and work has been underway on a follow-up, StarCoder 2, ever since.

StarCoder 2 isn’t a single code-generating model, but rather a family. Released today, it comes in three variants, the first two of which can run on most modern consumer GPUs:
  • A 3-billion-parameter (3B) model trained by ServiceNow
  • A 7-billion-parameter (7B) model trained by Hugging Face
  • A 15-billion-parameter (15B) model trained by Nvidia, the newest supporter of the StarCoder project.
(Note that “parameters” are the parts of a model learned from training data and essentially define the skill of the model on a problem, in this case generating code.)

Like most other code generators, StarCoder 2 can suggest ways to complete unfinished lines of code as well as summarize and retrieve snippets of code when asked in natural language. Trained with 4x more data than the original StarCoder (67.5 terabytes versus 6.4 terabytes), StarCoder 2 delivers what Hugging Face, ServiceNow and Nvidia characterize as “significantly” improved performance at lower costs to operate.
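As a rough sketch of that "complete unfinished code" use on a consumer GPU, using the Hugging Face transformers library; the model ID below is my assumption of the Hub naming for the 3B variant, so check the project page before running.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"   # assumed Hub ID for the ServiceNow-trained 3B variant
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

# Complete an unfinished function, the core use case described above.
prompt = "def fibonacci(n):\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))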

StarCoder 2 can be fine-tuned “in a few hours” using a GPU like the Nvidia A100 on first- or third-party data to create apps such as chatbots and personal coding assistants. And, because it was trained on a larger and more diverse data set than the original StarCoder (~619 programming languages), StarCoder 2 can make more accurate, context-aware predictions — at least hypothetically.

“StarCoder 2 was created especially for developers who need to build applications quickly,” Harm de Vries, head of ServiceNow’s StarCoder 2 development team, told TechCrunch in an interview. “With StarCoder2, developers can use its capabilities to make coding more efficient without sacrificing speed or quality.”

Now, I’d venture to say that not every developer would agree with De Vries on the speed and quality points. Code generators promise to streamline certain coding tasks — but at a cost.

A recent Stanford study found that engineers who use code-generating systems are more likely to introduce security vulnerabilities in the apps they develop. Elsewhere, a poll from Sonatype, the cybersecurity firm, shows that the majority of developers are concerned about the lack of insight into how code from code generators is produced and “code sprawl” from generators producing too much code to manage.

StarCoder 2’s license might also prove to be a roadblock for some.

StarCoder 2 is licensed under Hugging Face’s RAIL-M, which aims to promote responsible use by imposing “light touch” restrictions on both model licensees and downstream users. While less constraining than many other licenses, RAIL-M isn’t truly “open” in the sense that it doesn’t permit developers to use StarCoder 2 for every conceivable application (medical advice-giving apps are strictly off limits, for example). Some commentators say RAIL-M’s requirements may be too vague to comply with in any case — and that RAIL-M could conflict with AI-related regulations like the EU AI Act.

Setting all this aside for a moment, is StarCoder 2 really superior to the other code generators out there — free or paid?

Depending on the benchmark, it appears to be more efficient than one of the versions of CodeLlama, CodeLlama 33B. Hugging Face says that StarCoder 2 15B matches CodeLlama 33B on a subset of code completion tasks at twice the speed. It’s not clear which tasks; Hugging Face didn’t specify.

StarCoder 2, as an open source collection of models, also has the advantage of being able to deploy locally and “learn” a developer’s source code or codebase — an attractive prospect to devs and companies wary of exposing code to a cloud-hosted AI. In a 2023 survey from Portal26 and CensusWide, 85% of businesses said that they were wary of adopting GenAI like code generators due to the privacy and security risks — like employees sharing sensitive information or vendors training on proprietary data.

Hugging Face, ServiceNow and Nvidia also make the case that StarCoder 2 is more ethical — and less legally fraught — than its rivals.

All GenAI models regurgitate — in other words, spit out a mirror copy of data they were trained on. It doesn’t take an active imagination to see why this might land a developer in trouble. With code generators trained on copyrighted code, it’s entirely possible that, even with filters and additional safeguards in place, the generators could unwittingly recommend copyrighted code and fail to label it as such.

A few vendors, including GitHub, Microsoft (GitHub’s parent company) and Amazon, have pledged to provide legal coverage in situations where a code generator customer is accused of violating copyright. But coverage varies vendor-to-vendor and is generally limited to corporate clientele.

As opposed to code generators trained using copyrighted code (GitHub Copilot, among others), StarCoder 2 was trained only on data under license from the Software Heritage, the nonprofit organization providing archival services for code. Ahead of StarCoder 2’s training, BigCode, the cross-organizational team behind much of StarCoder 2’s roadmap, gave code owners a chance to opt out of the training set if they wanted.

As with the original StarCoder, StarCoder 2’s training data is available for developers to fork, reproduce or audit as they please.

Leandro von Werra, a Hugging Face machine learning engineer and co-lead of BigCode, pointed out that while there’s been a proliferation of open code generators recently, few have been accompanied by information about the data that went into training them and, indeed, how they were trained.

“From a scientific standpoint, an issue is that training is not reproducible, but also as a data producer (i.e. someone uploading their code to GitHub), you don’t know if and how your data was used,” Von Werra said in an interview. “StarCoder 2 addresses this issue by being fully transparent across the whole training pipeline from scraping pretraining data to the training itself.”

StarCoder 2 isn’t perfect, that said. Like other code generators, it’s susceptible to bias. De Vries notes that it can generate code with elements that reflect stereotypes about gender and race. And because StarCoder 2 was trained on predominantly English-language comments, Python and Java code, it performs worse on languages other than English and on “lower-resource” code like Fortran and Haskell.

Still, Von Werra asserts it’s a step in the right direction.

“We strongly believe that building trust and accountability with AI models requires transparency and auditability of the full model pipeline including training data and training recipe,” he said. “StarCoder 2 [showcases] how fully open models can deliver competitive performance.”

You might be wondering — as was this writer — what incentive Hugging Face, ServiceNow and Nvidia have to invest in a project like StarCoder 2. They’re businesses, after all — and training models isn’t cheap.

So far as I can tell, it’s a tried-and-true strategy: foster goodwill and build paid services on top of the open source releases.

ServiceNow has already used StarCoder to create Now LLM, a product for code generation fine-tuned for ServiceNow workflow patterns, use cases and processes. Hugging Face, which offers model implementation consulting plans, is providing hosted versions of the StarCoder 2 models on its platform. So is Nvidia, which is making StarCoder 2 available through an API and web front-end.

For devs expressly interested in the no-cost offline experience, StarCoder 2 — the models, source code and more — can be downloaded from the project’s GitHub page.
 






1/5
Upload's done, here she comes: 𝐦𝐢𝐪𝐮-𝟏-𝟏𝟎𝟑𝐛! This 103B LLM is based on Miqu, the leaked Mistral 70B model. This version is smaller than the 120B version so it fits on more GPUs and requires less memory usage. Have fun! https://huggingface.co/wolfram/miqu-1-103b… #AI #LLM #AmyLovesHashTags


3/5
GGUF and EXL2 quants are still in the making/uploading, so please be patient – or kindly quant them yourself, and if you can, share them as well (will gladly link your HF page). Also welcome benchmarks, or comparisons with miqu-1-120b or miquliz-120b-v2.0 (still on my TODO list).

4/5


5/5
Miqu 103B GGUF quants (Q2_K – Q8_0) are now available. Thanks, Michael Radermacher! Linked from the HF model page https://-103b… to https://iqu-1-103b-GGUF… – enjoy!
 








1/8
Announcing TTS Arena!

*sound on*

One place to test, rate and find the champion of current open models.

A continually updated space with the greatest and the best of the current TTS landscape!

Rate once, rate twice - help us find the best out there.

Starting with five open models:
1. XTTSv2
2. Pheme
3. Metavoice
4. Whisperspeech
5. StyleTTS 2

And ElevenLabs v2, too (OpenAI coming soon too) ;)

Which models would you like to see next in the arena!

Help us get more and more ratings!

 




1/2
Meet TinyLLaVA: The Game-Changer in Machine Learning with Smaller Multimodal Frameworks Outperforming Larger Models

Quick read: https://marktechpost.com/2024/03/01/meet-tinyllava-the-game-changer-in-machine-learning-with-smaller-multimodal-frameworks-outperforming-larger-models/…

Researchers from Beihang University and Tsinghua University in China have introduced TinyLLaVA, a novel framework that utilizes small-scale LLMs for multimodal tasks. This framework comprises a vision encoder, a small-scale LLM decoder, an intermediate connector, and tailored training pipelines. TinyLLaVA aims to achieve high performance in multimodal learning while minimizing computational demands.

The framework trains a family of small-scale LMMs, with the best model, TinyLLaVA-3.1B, outperforming existing 7B models such as LLaVA-1.5 and Qwen-VL. It combines vision encoders like CLIP-Large and SigLIP with small-scale LMMs for better performance. The training data consists of two different datasets, LLaVA-1.5 and ShareGPT4V, used to study the impact of data quality on LMM performance. It allows the adjustment of partially learnable parameters of the LLM and vision encoder during the supervised fine-tuning stage. It also provides a unified analysis of model selections, training recipes, and data contributions to the performance of small-scale LMMs.
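Schematically, that is a vision encoder feeding a small connector that projects image features into the language model's embedding space. A minimal sketch with hypothetical module names follows (not the released TinyLLaVA code).

import torch
import torch.nn as nn

class TinyMultimodalSketch(nn.Module):
    # Vision encoder -> connector (MLP projection) -> small-scale LLM decoder; illustrative only.
    def __init__(self, vision_encoder, llm, vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a CLIP- or SigLIP-style image tower
        self.connector = nn.Sequential(           # maps image features into the LLM token space
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.llm = llm                            # small causal language model taking inputs_embeds

    def forward(self, pixel_values, text_embeds):
        image_feats = self.vision_encoder(pixel_values)    # (B, num_patches, vision_dim)
        image_tokens = self.connector(image_feats)         # (B, num_patches, llm_dim)
        # Prepend projected image tokens to the text embeddings and decode autoregressively.
        inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)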

Paper: [2402.14289] TinyLLaVA: A Framework of Small-scale Large Multimodal Models

#ArtificialIntelligence #DataScience

2/2
UC Berkeley Researchers Unveil LoRA+: A Breakthrough in Machine Learning Model Finetuning with Optimized Learning Rates for Superior Efficiency and Performance


In deep learning, the quest for efficiency has led to a paradigm shift in how we…





1/3
TinyLLaVA

A Framework of Small-scale Large Multimodal Models

We present the TinyLLaVA framework, which provides a unified perspective in designing and analyzing small-scale Large Multimodal Models (LMMs). We empirically study the effects of different vision encoders, connection modules, language models, training data and training recipes. Our extensive experiments showed that, with better quality data combined with better training recipes, smaller LMMs can consistently achieve performance on par with bigger LMMs. Under our framework, we train a family of small-scale LMMs. Our best model, TinyLLaVA-3.1B, achieves better overall performance against existing 7B models such as LLaVA-1.5 and Qwen-VL. We hope our findings can serve as baselines for future research in terms of data scaling, training setups and model selections. Our model weights and codes will be made public.
 







Beyond Language Models

Byte Models are Digital World Simulators

Traditional deep learning often overlooks bytes, the basic units of the digital world, where all forms of information and operations are encoded and manipulated in binary format. Inspired by the success of next token prediction in natural language processing, we introduce bGPT, a model with next byte prediction to simulate the digital world. bGPT matches specialized models in performance across various modalities, including text, audio, and images, and offers new possibilities for predicting, simulating, and diagnosing algorithm or hardware behaviour. It has almost flawlessly replicated the process of converting symbolic music data, achieving a low error rate of 0.0011 bits per byte in converting ABC notation to MIDI format. In addition, bGPT demonstrates exceptional capabilities in simulating CPU behaviour, with an accuracy exceeding 99.99% in executing various operations. Leveraging next byte prediction, models like bGPT can directly learn from vast binary data, effectively simulating the intricate patterns of the digital world.

paper page:
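The training objective is the familiar next-token loss, just over raw bytes with a 256-symbol vocabulary. A minimal sketch of my own toy setup (not the released bGPT code; "example.mid" is a placeholder path):

import torch
import torch.nn as nn
import torch.nn.functional as F

def bytes_to_tensor(path, max_len=1024):
    # Any file becomes a training sequence by reading its raw bytes (values in 0..255).
    data = open(path, "rb").read()[:max_len]
    return torch.tensor(list(data), dtype=torch.long)

vocab_size, d_model = 256, 512
embed = nn.Embedding(vocab_size, d_model)
decoder = nn.TransformerEncoder(                 # the causal mask below makes it act as a decoder
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
head = nn.Linear(d_model, vocab_size)

seq = bytes_to_tensor("example.mid").unsqueeze(0)          # (1, T); placeholder file path
x, targets = seq[:, :-1], seq[:, 1:]                       # predict byte t+1 from bytes <= t
mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
logits = head(decoder(embed(x), mask=mask))                # (1, T-1, 256)
loss = F.cross_entropy(logits.transpose(1, 2), targets)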
 








MOSAIC

A Modular System for Assistive and Interactive Cooking

We present MOSAIC, a modular architecture for home robots to perform complex collaborative tasks, such as cooking with everyday users. MOSAIC tightly collaborates with humans, interacts with users using natural language, coordinates multiple robots, and manages an open vocabulary of everyday objects. At its core, MOSAIC employs modularity: it leverages multiple large-scale pre-trained models for general tasks like language and image recognition, while using streamlined modules designed for task-specific control. We extensively evaluate MOSAIC on 60 end-to-end trials where two robots collaborate with a human user to cook a combination of 6 recipes. We also extensively test individual modules with 180 episodes of visuomotor picking, 60 episodes of human motion forecasting, and 46 online user evaluations of the task planner. We show that MOSAIC is able to efficiently collaborate with humans by running the overall system end-to-end with a real human user, completing 68.3% (41/60) collaborative cooking trials of 6 different recipes with a subtask completion rate of 91.6%. Finally, we discuss the limitations of the current system and exciting open challenges in this domain.

paper page:
 







Humanoid Locomotion as Next Token Prediction

We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, like video trajectories without actions. We train our model on a collection of simulated trajectories coming from prior neural network policies, model-based controllers, motion capture data, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training like walking backward.

paper page:
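A rough sketch of what modality-aligned next-token prediction over interleaved observation/action tokens can look like; shapes, names, and the MSE loss here are my assumptions, not the authors' setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Interleave per-timestep tokens as [o_0, a_0, o_1, a_1, ...]; each position predicts the
# next token of its own modality, and missing actions (e.g. video-only data) can simply be
# masked out of the loss.
T, obs_dim, act_dim, d_model = 16, 32, 12, 256
obs_proj, act_proj = nn.Linear(obs_dim, d_model), nn.Linear(act_dim, d_model)
obs_head, act_head = nn.Linear(d_model, obs_dim), nn.Linear(d_model, act_dim)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)

obs, act = torch.randn(1, T, obs_dim), torch.randn(1, T, act_dim)
tokens = torch.stack([obs_proj(obs), act_proj(act)], dim=2).flatten(1, 2)    # (1, 2T, d_model)
mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
hidden = backbone(tokens, mask=mask)                                         # causal transformer

obs_pred = obs_head(hidden[:, 0::2][:, :-1])      # observation slots predict the next observation
act_pred = act_head(hidden[:, 1::2][:, :-1])      # action slots predict the next action
loss = F.mse_loss(obs_pred, obs[:, 1:]) + F.mse_loss(act_pred, act[:, 1:])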
 