bnew


1/6
Differential Diffusion

Giving Each Pixel Its Strength

Text-based image editing has advanced significantly in recent years. With the rise of diffusion models, image editing via textual instructions has become ubiquitous.

2/6
Unfortunately, current models lack the ability to customize the quantity of the change per pixel or per image fragment, resorting to changing the entire image in an equal amount, or editing a specific region using a binary mask.

3/6
In this paper, we suggest a new framework which enables the user to customize the quantity of change for each image fragment, thereby enhancing the flexibility and expressiveness of modern diffusion models.

4/6
Our framework does not require model training or fine-tuning, but instead performs everything at inference time, making it easily applicable to an existing model.
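To make the idea concrete, here is a toy, heavily simplified sketch (my own illustration, not the authors' code) of how a per-pixel change-strength map could gate a denoising loop at inference time; `denoise_step` and `add_noise` are placeholder functions standing in for a real diffusion model and noise schedule:

```python
# Illustrative sketch only: pixels whose strength is below the current
# threshold are re-injected from a noised copy of the original image, so
# low-strength regions stay close to the source while high-strength regions
# follow the edit. Not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x, t, T):
    """Toy forward process: interpolate toward Gaussian noise as t -> T."""
    alpha = 1.0 - t / T
    return alpha * x + (1.0 - alpha) * rng.standard_normal(x.shape)

def denoise_step(x, t, T):
    """Stand-in for one reverse-diffusion step of a text-conditioned model."""
    return x * 0.95 + 0.05 * rng.standard_normal(x.shape)

def differential_edit(original, strength_map, T=50):
    """strength_map in [0, 1]: 0 = keep pixel, 1 = allow full change."""
    x = rng.standard_normal(original.shape)          # start from pure noise
    for t in range(T, 0, -1):
        x = denoise_step(x, t, T)
        threshold = t / T                            # descends from 1 toward 0
        keep = strength_map < threshold              # pixels still pinned to source
        x = np.where(keep, add_noise(original, t, T), x)
    return x

if __name__ == "__main__":
    img = rng.random((64, 64))
    strength = np.tile(np.linspace(0, 1, 64), (64, 1))  # gradient edit strength
    print(differential_edit(img, strength).shape)
```

The key point of the sketch is that low-strength pixels keep being re-injected from the (re-noised) source for more steps, so they end up closer to the original, while high-strength pixels are released early and can change freely.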

5/6
paper page:

6/6
Beyond Language Models

Byte Models are Digital World Simulators

Traditional deep learning often overlooks bytes, the basic units of the digital world, where all forms of information and operations are encoded and manipulated in binary format. Inspired by the success of next

bnew


1/3
Microsoft's "The Era of 1-bit LLMs" paper is now the most upvoted paper of all time on HF paper pages, beating Apple's "LLM in a Flash".

2/3
paper page:


bnew


1/2
Terribly excited about open-source + on-device AI these days! Great to see @Qualcomm release 80+ models optimized and curated for their devices and chips on @huggingface: qualcomm (Qualcomm)

2/2
We just crossed 100,000 organizations on HF!

Some of my favorites:
- The MLX community for on-device AI: https://
- The @AiEleuther org with over 150 datasets: https://
- The @Bloomberg org to show big financial institutions can use the hub:…

bnew


1/2
Video as the New Language for Real-World Decision Making

Both text and video data are abundant on the internet and support large-scale self-supervised learning through next token or frame prediction. However, they have not been equally leveraged: language models have had significant real-world impact, whereas video generation has remained largely limited to media entertainment. Yet video data captures important information about the physical world that is difficult to express in language. To address this gap, we discuss an under-appreciated opportunity to extend video generation to solve tasks in the real world. We observe how, akin to language, video can serve as a unified interface that can absorb internet knowledge and represent diverse tasks. Moreover, we demonstrate how, like language models, video generation can serve as planners, agents, compute engines, and environment simulators through techniques such as in-context learning, planning and reinforcement learning. We identify major impact opportunities in domains such as robotics, self-driving, and science, supported by recent work that demonstrates how such advanced capabilities in video generation are plausibly within reach. Lastly, we identify key challenges in video generation that impede progress. Addressing these challenges will enable video generation models to demonstrate unique value alongside language models in a wider array of AI applications.

2/2
paper page:

bnew


1/9
Announcing `10k_prompts_ranked`, the first dataset release from Data Is Better Together. Created in <2 weeks by the community. Includes:
- 10,000+ prompt quality rankings
- Human and synthetic data rankings
- Generated by 300+ contributors
on how + why collaborative datasets

2/9
It's no secret that high-quality open data is essential for creating better open models. The open source community shares 100s of models, datasets and demos openly weekly, but collectively building open datasets has been less explored.

3/9
Datasets have a massive role in shaping what models can be created. If we want more high-quality models for all languages, domains and tasks, we need more and better open datasets for all languages, domains and tasks!

4/9
To explore how the community could build impactful datasets collectively, @argilla_io added support for HF authentication for Argilla instances hosted on a @huggingface Space. Anyone with an HF login could begin contributing to a dataset in <1 minute.

5/9
To test this new workflow, we launched a task to rank the quality of prompts (human and synthetically generated). The @nomic_ai Atlas gives an excellent sense of the coverage of the topics in the prompts.

6/9
You can find the dataset here:
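For reference, a quick sketch of loading the release with the `datasets` library; the repo id and column names below are assumptions based on the announcement, so check the Hub page for the exact names:

```python
# Minimal sketch: pull the ranked-prompts dataset from the Hugging Face Hub.
# "DIBT/10k_prompts_ranked" is assumed from the announcement; verify on the Hub.
from datasets import load_dataset

ds = load_dataset("DIBT/10k_prompts_ranked", split="train")
print(ds)                   # prints the columns and number of rows
print(ds[0]["prompt"])      # field names may differ; inspect ds.features first
```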

7/9
In less than two weeks, we built a community of over 300 contributors for this dataset. This dataset became a reality thanks to the dedication of all the individuals who lent their support

To see the amazing people behind this dataset, visit https://ompt-collective-dashboard…

8/9
This is just the beginning! We aim to empower the community to build new datasets collaboratively. This could include:
- Preference datasets for a low-resource language
- Evaluations for a specific domain
- Datasets for novel tasks

9/9
If this sounds interesting, keep your eyes peeled for further announcements later this week

bnew


1/1
Very interesting image model trained on legally licensed data, @diffuserslib compatible

demo: https://huggingface.co/spaces/briaai/BRIA-2.3…

bnew


1/2
Meta presents Rainbow Teaming

Open-Ended Generation of Diverse Adversarial Prompts

As large language models (LLMs) become increasingly prevalent across many real-world applications, understanding and enhancing their robustness to user inputs is of paramount importance. Existing methods for identifying adversarial prompts tend to focus on specific domains, lack diversity, or require extensive human annotations. To address these limitations, we present Rainbow Teaming, a novel approach for producing a diverse collection of adversarial prompts. Rainbow Teaming casts adversarial prompt generation as a quality-diversity problem, and uses open-ended search to generate prompts that are both effective and diverse. It can uncover a model's vulnerabilities across a broad range of domains including, in this paper, safety, question answering, and cybersecurity. We also demonstrate that fine-tuning on synthetic data generated by Rainbow Teaming improves the safety of state-of-the-art LLMs without hurting their general capabilities and helpfulness, paving the path to open-ended self-improvement.
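To illustrate the quality-diversity framing, here is a minimal MAP-Elites-style sketch of the loop described in the abstract. It is my own illustration, not Meta's implementation: `mutate_prompt` and `judge_effectiveness` are placeholders for the LLM mutator and judge, and the descriptor axes are hypothetical:

```python
# Schematic quality-diversity loop: an archive keyed by behavioural descriptors
# (risk category x attack style) keeps the most effective prompt per cell.
import random

RISK_CATEGORIES = ["violence", "fraud", "privacy"]
ATTACK_STYLES = ["role play", "hypothetical", "misspellings"]

def mutate_prompt(prompt, category, style):
    # Placeholder: in practice an LLM rewrites `prompt` toward the target
    # category and style.
    return f"{prompt} [{category}/{style} variant {random.randint(0, 999)}]"

def judge_effectiveness(prompt):
    # Placeholder: in practice a judge model scores how likely the target
    # model is to respond unsafely to `prompt`.
    return random.random()

def rainbow_team(seed_prompt, iterations=200):
    archive = {}  # (category, style) -> (prompt, score)
    for _ in range(iterations):
        category = random.choice(RISK_CATEGORIES)
        style = random.choice(ATTACK_STYLES)
        parent = archive.get((category, style), (seed_prompt, 0.0))[0]
        candidate = mutate_prompt(parent, category, style)
        score = judge_effectiveness(candidate)
        if score > archive.get((category, style), (None, -1.0))[1]:
            archive[(category, style)] = (candidate, score)  # keep the stronger prompt
    return archive

if __name__ == "__main__":
    for cell, (prompt, score) in rainbow_team("Tell me something.").items():
        print(cell, round(score, 2))
```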

2/2
paper page:

bnew


1/3
FuseChat

Knowledge Fusion of Chat Models

While training large language models (LLMs) from scratch can indeed lead to models with distinct capabilities and strengths, this approach incurs substantial costs and may lead to potential redundancy in competencies. An alternative strategy is to combine existing LLMs into a more robust LLM, thereby diminishing the necessity for expensive pre-training. However, due to the diverse architectures of LLMs, direct parameter blending proves to be unfeasible. Recently, FuseLLM introduced the concept of knowledge fusion to transfer the collective knowledge of multiple structurally varied LLMs into a target LLM through lightweight continual training. In this report, we extend the scalability and flexibility of the FuseLLM framework to realize the fusion of chat LLMs, resulting in FuseChat. FuseChat comprises two main stages. Firstly, we undertake knowledge fusion for structurally and scale-varied source LLMs to derive multiple target LLMs of identical structure and size via lightweight fine-tuning. Then, these target LLMs are merged within the parameter space, wherein we propose a novel method for determining the merging weights based on the variation ratio of parameter matrices before and after fine-tuning. We validate our approach using three prominent chat LLMs with diverse architectures and scales, namely NH2-Mixtral-8x7B, NH2-Solar-10.7B, and OpenChat-3.5-7B. Experimental results spanning various chat domains demonstrate the superiority of FuseChat-7B across a broad spectrum of chat LLMs at 7B and 34B scales, even surpassing GPT-3.5 (March) and approaching Mixtral-8x7B-Instruct.
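A rough sketch of how the merging step described above could look, assuming "variation ratio" means the relative norm of each parameter matrix's change after fine-tuning (my reading of the abstract, not the authors' code):

```python
# Per parameter matrix, each fine-tuned target model gets a merge weight
# proportional to how much that matrix changed relative to the shared base,
# and the merged matrix is the weighted sum. Illustrative only.
import numpy as np

def variation_ratio(base, tuned, eps=1e-8):
    """Relative change of a parameter matrix after fine-tuning."""
    return np.linalg.norm(tuned - base) / (np.linalg.norm(base) + eps)

def merge_models(base_sd, tuned_sds):
    """base_sd / tuned_sds: dicts mapping parameter names to np.ndarray."""
    merged = {}
    for name, base in base_sd.items():
        ratios = np.array([variation_ratio(base, sd[name]) for sd in tuned_sds])
        total = ratios.sum()
        if total > 0:
            weights = ratios / total                      # models that changed more count more
        else:
            weights = np.full(len(tuned_sds), 1.0 / len(tuned_sds))
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, tuned_sds))
    return merged

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = {"w": rng.standard_normal((4, 4))}
    tuned = [{"w": base["w"] + 0.1 * rng.standard_normal((4, 4))} for _ in range(3)]
    print(merge_models(base, tuned)["w"].shape)
```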

2/3
paper page:


bnew


1/3
Google announces Do Large Language Models Latently Perform Multi-Hop Reasoning?

study whether Large Language Models (LLMs) latently perform multi-hop reasoning with complex prompts such as "The mother of the singer of 'Superstition' is". We look for evidence of a latent reasoning pathway where an LLM (1) latently identifies "the singer of 'Superstition'" as Stevie Wonder, the bridge entity, and (2) uses its knowledge of Stevie Wonder's mother to complete the prompt. We analyze these two hops individually and consider their co-occurrence as indicative of latent multi-hop reasoning. For the first hop, we test if changing the prompt to indirectly mention the bridge entity instead of any other entity increases the LLM's internal recall of the bridge entity. For the second hop, we test if increasing this recall causes the LLM to better utilize what it knows about the bridge entity. We find strong evidence of latent multi-hop reasoning for the prompts of certain relation types, with the reasoning pathway used in more than 80% of the prompts. However, the utilization is highly contextual, varying across different types of prompts. Also, on average, the evidence for the second hop and the full multi-hop traversal is rather moderate and only substantial for the first hop. Moreover, we find a clear scaling trend with increasing model size for the first hop of reasoning but not for the second hop. Our experimental findings suggest potential challenges and opportunities for future development and applications of LLMs.

2/3
paper page:


bnew


1/3
StructLM

Towards Building Generalist Models for Structured Knowledge Grounding

Structured data sources, such as tables, graphs, and databases, are ubiquitous knowledge sources. Despite the demonstrated capabilities of large language models (LLMs) on plain text, their proficiency in interpreting and utilizing structured data remains limited. Our investigation reveals a notable deficiency in LLMs' ability to process structured data, e.g., ChatGPT lags behind state-of-the-art (SoTA) models by an average of 35%. To augment the Structured Knowledge Grounding (SKG) capabilities in LLMs, we have developed a comprehensive instruction tuning dataset comprising 1.1 million examples. Utilizing this dataset, we train a series of models, referred to as StructLM, based on the Code-LLaMA architecture, ranging from 7B to 34B parameters. Our StructLM series surpasses task-specific models on 14 out of 18 evaluated datasets and establishes new SoTA achievements on 7 SKG tasks. Furthermore, StructLM demonstrates exceptional generalization across 6 novel SKG tasks. Contrary to expectations, we observe that scaling model size offers marginal benefits, with StructLM-34B showing only slight improvements over StructLM-7B. This suggests that structured knowledge grounding is still a challenging task and requires more innovative design to push to a new level.
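As a small illustration of what structured knowledge grounding inputs look like in practice, here is a hypothetical helper (not from the paper) that linearizes a table into an instruction-style prompt of the kind an instruction-tuned model would consume:

```python
# Flatten a table into text and wrap it in an instruction prompt.
# Function names and prompt layout are illustrative assumptions.
def linearize_table(headers, rows):
    header_line = " | ".join(headers)
    row_lines = [" | ".join(str(c) for c in row) for row in rows]
    return "\n".join([header_line] + row_lines)

def build_skg_prompt(instruction, headers, rows, question):
    return (
        f"{instruction}\n\n"
        f"Table:\n{linearize_table(headers, rows)}\n\n"
        f"Question: {question}\nAnswer:"
    )

if __name__ == "__main__":
    prompt = build_skg_prompt(
        instruction="Answer the question using the table below.",
        headers=["Player", "Team", "Goals"],
        rows=[["Ada", "Red", 12], ["Bo", "Blue", 7]],
        question="Which player scored more goals?",
    )
    print(prompt)
```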

2/3
paper page:


bnew


Outfit Anyone: Ultra-High Quality Virtual Try-On for Any Clothing and Any Person

Institute for Intelligent Computing, Alibaba Group

GitHub


Abstract

Virtual try-on has become a transformative technology, empowering users to experiment with fashion without ever having to physically try on clothing. However, existing methods often struggle with generating high-fidelity and detail-consistent results. Diffusion models have demonstrated their ability to generate high-quality and photorealistic images, but when it comes to conditional generation scenarios like virtual try-ons, they still face challenges in achieving control and consistency. Outfit Anyone addresses these limitations by leveraging a two-stream conditional diffusion model, enabling it to adeptly handle garment deformation for more lifelike results. It distinguishes itself with scalability—modulating factors such as pose and body shape—and broad applicability, extending from anime to in-the-wild images. Outfit Anyone's performance in diverse scenarios underscores its utility and readiness for real-world deployment.​

Method

The conditional Diffusion Model central to our approach processes images of the model, garments, and accompanying text prompts, using garment images as the control factor. Internally, the network segregates into two streams for independent processing of model and garment data. These streams converge within a fusion network that facilitates the embedding of garment details onto the model's feature representation. On this foundation, we have established Outfit Anyone, comprising two key elements: the Zero-shot Try-on Network for initial try-on imagery, and the Post-hoc Refiner for detailed enhancement of clothing and skin texture in the output images.
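As a schematic illustration only (layer sizes and the fusion mechanism are my assumptions, not the authors' released architecture), the two-stream idea can be sketched in PyTorch as follows:

```python
# Toy two-stream conditional architecture: a person stream and a garment
# stream are encoded independently, then a fusion block injects garment
# features into the person features before decoding. Illustrative only.
import torch
import torch.nn as nn

class StreamEncoder(nn.Module):
    """Small conv encoder used for both the person and garment streams."""
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.SiLU(),
        )
    def forward(self, x):
        return self.net(x)

class FusionBlock(nn.Module):
    """Injects garment features into the person feature map via concat + projection."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Conv2d(2 * dim, dim, 1)
    def forward(self, person_feat, garment_feat):
        fused = torch.cat([person_feat, garment_feat], dim=1)
        return person_feat + self.proj(fused)   # residual conditioning

class TwoStreamTryOn(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.person_stream = StreamEncoder(dim=dim)
        self.garment_stream = StreamEncoder(dim=dim)
        self.fusion = FusionBlock(dim=dim)
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear"),
            nn.Conv2d(dim, 3, 3, padding=1),
        )
    def forward(self, person_img, garment_img):
        p = self.person_stream(person_img)
        g = self.garment_stream(garment_img)
        return self.decoder(self.fusion(p, g))

if __name__ == "__main__":
    model = TwoStreamTryOn()
    person = torch.randn(1, 3, 256, 192)
    garment = torch.randn(1, 3, 256, 192)
    print(model(person, garment).shape)  # torch.Size([1, 3, 256, 192])
```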

Various Try-On Results

Real World

We showcase Outfit Anyone's capability for versatile outfit changes, including full ensembles and individual pieces, in realistic scenarios.

Individual Garment

Outfit


Bizarre Fashion

Here we showcase our model's ability to handle a wide range of eccentric and unique clothing styles, dress them onto the models, and even create corresponding outfit combinations when necessary.

Various Body Shapes

Our model demonstrates the ability to generalize to various body types, including fit, curvy, and petite figures, thereby catering to the try-on demands of individuals from all walks of life.

Anime

We demonstrate the powerful generalization ability of our model, which can support the creation of new animation characters.


Refiner

Furthermore, we showcase the effects before and after using the Refiner, demonstrating its ability to significantly enhance the texture and realism of the clothing while maintaining consistency in the apparel.

bnew


1/2
Ideogram AI presents Ideogram 1.0

text-to-image model

offers state-of-the-art text rendering, unprecedented photorealism, exceptional prompt adherence, and a new feature called Magic Prompt to help with prompting






AI in practice

Feb 29, 2024


Ideogram 1.0 outshines Midjourney and DALL-E 3 with impressive text rendering

Matthias Bastian

Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.


Ideogram has released its most advanced text-to-image model to date, Ideogram 1.0, which aims to differentiate itself from the competition with text rendering, photorealism, improved prompt following and a new feature called Magic Prompt.

Until now, image AIs have not been good at rendering text properly within AI-generated images. Ideogram 1.0 addresses this issue with reliable text rendering capabilities that Ideogram says can be used to create personalized messages, memes, posters, t-shirt designs, birthday cards, logos and more.

The company claims that Ideogram 1.0 reduces the text error rate by nearly half compared to DALL-E 3. Midjourney is worse than DALL-E 3 when it comes to text rendering.


Ideogram is not perfect when it comes to text rendering, but it should be much better than DALL-E 3 and Midjourney. First tests confirm this. | Image: Ideogram

In comparison tests, users rated images created with Ideogram better than those created with DALL-E 3 and Midjourney v6 in all areas.


In benchmarks conducted by Ideogram, people rated images generated by Ideogram better than images generated by DALL-E 3 and Midjourney. In both cases, rendering text was the biggest advantage. | Image: Ideogram

Ideogram is capable of generating images in a wide range of aspect ratios and styles, from photorealistic to more artistic results, and is designed to handle long and complex prompts well.

The "Magic Prompt" feature, similar to OpenAI's DALL-E ChatGPT integration, automatically rewrites a short prompt into a detailed image description. Unlike in DALL-E 3, this rewriting can be turned off in Ideogram.

A first test shows that Ideogram holds its own against Midjourney in terms of image quality, and may even have slight advantages over Midjourney v6 and DALL-E 3 in terms of prompt following.

Ideogram clearly has the advantage when it comes to text rendering, even if it is not perfect, especially when several pieces of text have to appear in one image. For that reason, Ideogram still cannot create precise infographics.


Prompt: "The letters "SORA" being generated on a digital screen" | Image: Midjourney
Prompt: "The letters "SORA" being generated on a digital screen" - Ideogram follows the prompt better and writes SORA correctly every time. | Image: Ideogram prompted by THE DECODER

In terms of image quality and composition, Midjourney and Ideogram surpass OpenAI's often kitschy and colorful DALL-E 3. Of the three, Midjourney currently offers the most features for image editing, such as changing individual elements in the image using text commands.
 

bnew


1/1
Want to serve LLaMA-7B with a context length of up to 1 million tokens on a single A100-80GB GPU, and up to 10 million tokens on an 8-GPU system?

Paper - "KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization"

The existing problem - LLMs are seeing growing use for applications such as document analysis and summarization which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in ultra-low precisions, such as sub-4-bit.

This paper presents KVQuant, which addresses this problem by incorporating novel methods for quantizing cached KV activations, including:

(i) Per-Channel Key Quantization, where we adjust the dimension along which we quantize the Key activations to better match the distribution;

(ii) Pre-RoPE Key Quantization, where we quantize Key activations before the rotary positional embedding to mitigate its impact on quantization;

(iii) Non-Uniform KV Cache Quantization, where we derive per-layer sensitivity-weighted non-uniform datatypes that better represent the distributions;

(iv) Per-Vector Dense-and-Sparse Quantization, where we isolate outliers separately for each vector to minimize skews in quantization ranges; and

(v) Q-Norm, where we normalize quantization centroids in order to mitigate distribution shift, providing additional benefits for 2-bit quantization.

By applying this method to the LLaMA, LLaMA-2, and Mistral models, the paper achieves <0.1 perplexity degradation with 3-bit quantization on both Wikitext-2 and C4, outperforming existing approaches.
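For intuition, here is a toy numpy sketch of two of the listed ideas: per-channel Key quantization and dense-and-sparse outlier handling (simplified to a single global outlier threshold rather than the paper's per-vector treatment). This is my own illustration, not the released KVQuant code:

```python
# (i) Per-channel quantization: one scale per channel rather than per token.
# (iv) Dense-and-sparse: keep the largest-magnitude entries in full precision
#      and quantize only the dense remainder. Illustrative sketch only.
import numpy as np

def quantize_per_channel(keys, bits=3):
    """keys: [tokens, channels]; quantize with one scale per channel."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(keys).max(axis=0, keepdims=True) / qmax
    scale = np.maximum(scale, 1e-8)                      # avoid divide-by-zero
    q = np.clip(np.round(keys / scale), -qmax - 1, qmax)
    return q * scale                                     # dequantized values

def dense_and_sparse(x, bits=3, outlier_frac=0.01):
    """Keep the largest-magnitude entries exact; quantize the rest."""
    flat = np.abs(x).ravel()
    k = max(1, int(outlier_frac * flat.size))
    threshold = np.partition(flat, -k)[-k]
    outliers = np.abs(x) >= threshold
    dense = np.where(outliers, 0.0, x)
    recon = quantize_per_channel(dense, bits=bits)
    return np.where(outliers, x, recon)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = rng.standard_normal((128, 64))
    err = np.abs(dense_and_sparse(keys) - keys).mean()
    print(f"mean abs reconstruction error: {err:.4f}")
```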

----

On a related note, a recent paper, "Activation Beacon", was about extending context length, whereas "KVQuant", the technique proposed in this paper, is about making context more compact in memory.

"KVQuant" is not for extending context beyond trained-in limits. It's only about making the KV cache more compact by quantizing it. Quantization cannot prevent the catastrophic loss of coherence when an 8K-context model goes beyond 8K.

And per page 6 of the "KVQuant" paper, for context extension they used LongLoRA and LM-Infinite.

Also, memory footprint is not the only concern. Transformers slow down with growing context size because each decoding step requires attending to all preceding tokens; over long ranges it simply becomes too slow. The Activation Beacon paper tries to address that too.

bnew


1/4
The Orca-Math paper does a comparison of DPO and KTO for mathematical reasoning, finding that KTO is slightly better when all data is used and 25+ pts better when you have fewer positive examples than negative examples.

2/4
Big gap (25 points!) when you discard prompts for which there are only positive responses. Great work by

@Arindam1408 @corby_rosset @hamedkhanpour @AhmedHAwadallah !

3/4
cc @4evaBehindSOTA @Teknium1 @fentpot @alexgraveley

4/4
Going from a 1:1 ratio of positive:negative examples to a 1:5 or 5:1 ratio only drops winrate by like 5 points, so pretty robust, provided you adjust the weights on the losses so that the effective impact of positive and negative examples is the same, despite the imbalance. Going
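For intuition, the re-weighting mentioned above can be as simple as scaling each side's loss inversely to its count so positives and negatives contribute equally overall; a hypothetical sketch (my illustration, not the paper's code):

```python
# Balance imbalanced positive/negative preference data by weighting each
# side's loss so the two classes have equal total influence.
def class_weights(n_pos, n_neg):
    """Per-example weights so both sides contribute equally overall."""
    total = n_pos + n_neg
    w_pos = total / (2.0 * n_pos)
    w_neg = total / (2.0 * n_neg)
    return w_pos, w_neg

def weighted_loss(pos_losses, neg_losses):
    w_pos, w_neg = class_weights(len(pos_losses), len(neg_losses))
    return (w_pos * sum(pos_losses) + w_neg * sum(neg_losses)) / (
        len(pos_losses) + len(neg_losses)
    )

if __name__ == "__main__":
    # A 1:5 positive:negative imbalance gets rebalanced by the weights.
    print(class_weights(100, 500))   # (3.0, 0.6)
```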




1/2
Microsoft presents Orca-Math

Unlocking the potential of SLMs in Grade School Math

Mathematical word problem-solving has long been recognized as a complex task for small language models (SLMs). A recent study hypothesized that the smallest model size, needed to achieve over 80% accuracy on the GSM8K benchmark, is 34 billion parameters. To reach this level of performance with smaller models, researchers often train SLMs to generate Python code or use tools to help avoid calculation errors. Additionally, they employ ensembling, where outputs of up to 100 model runs are combined to arrive at a more accurate result. Result selection is done using consensus, majority vote or a separate verifier model used in conjunction with the SLM. Ensembling provides a substantial boost in accuracy but at a significant cost increase with multiple calls to the model (e.g., Phi-GSM uses top-48 to boost the performance from 68.2 to 81.5). In this work, we present Orca-Math, a 7-billion-parameter SLM based on the Mistral-7B, which achieves 86.81% on GSM8k without the need for multiple model calls or the use of verifiers, code execution or any other external tools. Our approach has the following key elements: (1) A high quality synthetic dataset of 200K math problems created using a multi-agent setup where agents collaborate to create the data, (2) An iterative learning technique that enables the SLM to practice solving problems, receive feedback on its solutions and learn from preference pairs incorporating the SLM solutions and the feedback. When trained with Supervised Fine-Tuning alone, Orca-Math achieves 81.50% on GSM8k pass@1 metric. With iterative preference learning, Orca-Math achieves 86.81% pass@1. Orca-Math surpasses the performance of significantly larger models such as LLAMA-2-70B, WizardMath-70B, Gemini-Pro, ChatGPT-3.5. It also significantly outperforms other smaller models while using much smaller data (hundreds of thousands vs. millions of problems).

2/2
paper page: