bnew

Veteran
Joined
Nov 1, 2015
Messages
51,839
Reputation
7,926
Daps
148,851







1/7
Chain-of-Thought (CoT) prompting --> OUT(?), analogical prompting --> IN!

A new paper from @GoogleDeepMind & @Stanford (accepted to @iclr_conf): "Large Language Models as Analogical Reasoners"

2/7
CoT prompting has shown LLMs' abilities to tackle complex tasks, such as solving math problems, by prompting them to generate intermediate reasoning steps. However, it typically demands labeled exemplars of the reasoning process, which can be costly to obtain for every task!

3/7
In this paper, they propose "analogical prompting", a new prompting approach that automatically guides the reasoning process of LLMs!
Their inspiration comes from analogical reasoning in psychology, a concept where humans draw from relevant experiences to tackle new problems.

4/7
They use exactly this idea to prompt LLMs to self-generate relevant exemplars or knowledge in the context, before proceeding to solve the original problem (see figure in main tweet)
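A minimal sketch of what such a self-generating prompt could look like. The wording of the template, the function name `build_analogical_prompt`, and the default of three exemplars are assumptions for illustration, not the paper's exact prompt:

```python
def build_analogical_prompt(problem: str, num_exemplars: int = 3) -> str:
    """Build one prompt that asks the model to recall exemplars, then solve.

    Hypothetical template: the section names and phrasing are illustrative.
    """
    if num_exemplars < 1:
        raise ValueError("num_exemplars must be at least 1")
    return (
        f"# Problem:\n{problem.strip()}\n\n"
        "# Instructions:\n"
        "## Relevant problems:\n"
        f"Recall {num_exemplars} relevant and distinct problems. "
        "For each one, describe the problem and explain its solution.\n\n"
        "## Solve the initial problem:\n"
        "Now solve the problem stated above, step by step.\n"
    )


prompt = build_analogical_prompt(
    "What is the area of the square with vertices at "
    "(-2, 2), (2, -2), (-2, -6), and (-6, -2)?"
)
print(prompt)
```

The point of the design is that no labeled exemplars are passed in: the only input is the problem itself, and the exemplars are produced by the model in the same completion.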

5/7
Advantages:
It eliminates the need for labeling or retrieving examples, offering generality and convenience.
It adapts the examples and knowledge to each problem, offering adaptability.

6/7
Results:
Analogical prompting surpasses both 0-shot and manually tuned few-shot CoT across several reasoning tasks like math problem solving (GSM8K, MATH), code generation (Codeforces), and various reasoning challenges in BIG-Bench!

7/7
Authors:
@jure
@percyliang
@denny_zhou
@edchi
 

1/6
Google announces Gecko

Versatile Text Embeddings Distilled from Large Language Models

We present Gecko, a compact and versatile text embedding model. Gecko achieves strong retrieval performance by leveraging a key idea: distilling knowledge from large language models (LLMs)

2/6
into a retriever. Our two-step distillation process begins with generating diverse, synthetic paired data using an LLM. Next, we further refine the data quality by retrieving a set of candidate passages for each query, and relabeling the positive and hard negative passages

3/6
using the same LLM. The effectiveness of our approach is demonstrated by the compactness of the Gecko. On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing entries with 768 embedding size.
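The relabeling step described above can be sketched roughly as follows. Everything here is a stand-in: `retrieve` and `score` are hypothetical callables (in the paper's setting they would be a retriever and an LLM-based scorer), word overlap is used only so the example runs, and taking the lowest-ranked candidate as the hard negative is one simple choice among several:

```python
def relabel(query, seed_passage, corpus, retrieve, score, top_k=3):
    """Re-pick the positive and a hard negative for a synthetic query.

    `retrieve` and `score` are hypothetical stand-ins for a retriever and
    an LLM scorer; they are not part of any real Gecko API.
    """
    candidates = list(retrieve(query, corpus, top_k))
    if seed_passage not in candidates:
        candidates.append(seed_passage)
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to pick a negative")
    ranked = sorted(candidates, key=lambda p: score(query, p), reverse=True)
    # Best-scored candidate becomes the positive, worst the hard negative.
    return {"query": query, "positive": ranked[0], "hard_negative": ranked[-1]}


def overlap(query, passage):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def retrieve_by_overlap(query, corpus, top_k):
    """Toy retriever: top_k passages by word overlap."""
    return sorted(corpus, key=lambda p: overlap(query, p), reverse=True)[:top_k]


corpus = [
    "geckos are small lizards found in warm climates",
    "text embedding models map sentences to vectors",
    "embedding dimensions trade accuracy for storage",
    "large language models can label training data",
]
# The seed passage is the one the query was (hypothetically) generated from.
example = relabel(
    "what do text embedding models do",
    corpus[3],
    corpus,
    retrieve_by_overlap,
    overlap,
)
print(example)
```

Note that the relabeled positive need not be the seed passage the query came from, which is the reason for retrieving candidates before labeling.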

4/6
Gecko with 768 embedding dimensions achieves an average score of 66.31, competing with 7x larger models and 5x higher dimensional embeddings.

5/6
paper page:

6/6
daily papers:
 

1/1
FinancialAdvisorGPT : LLM+RAG Boilerplate

Here's a boilerplate project I've designed for RAG (Retrieval-Augmented Generation) and LLM (Large Language Model) applications in financial analysis.

MongoDB, Chroma, FastAPI, Langchain, and React.
 

1/1
AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent [Must read paper]
Code: GitHub - THUDM/AutoWebGLM
Paper: [2404.03648v1] AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent
The research paper proposes a solution called AutoWebGLM, which is an automated web navigation agent built upon ChatGLM3-6B and inspired by human browsing patterns. It uses an HTML simplification algorithm to represent webpages in a concise manner and employs a hybrid human-AI method to build web browsing data for curriculum training. The model is then bootstrapped using reinforcement learning and rejection sampling, allowing it to comprehend webpages, perform browser operations, and efficiently decompose tasks.
 

1/1
[CL] Transforming LLMs into Cross-modal and Cross-lingual Retrieval Systems
F P Gomez, R Sanabria, Y Sung, D Cer… [Google Research & Boston University & The University of Edinburgh] (2024)

- The paper proposes using large language models (LLMs) to initialize dual encoder (DE) retrieval systems for cross-modal and cross-lingual speech-text matching.

- LLMs are trained only on text data, but can support many languages. Speech technologies like ASR are more limited in languages.

- The method discretizes raw speech into "audio tokens" using an existing speech encoder and k-means clustering. These tokens are supported in the LLM embedding layer.

- The dual encoder model is trained with a contrastive loss between speech and text embeddings. This aligns modalities in a shared space.

- The model achieves strong speech-text retrieval results on 102 languages, despite training on just 21 languages. It outperforms prior work trained on more data.

- The model exhibits zero-shot cross-lingual speech-text translation capabilities, further improved by adding machine translation data.
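
To make the contrastive-loss bullet concrete, here is a small sketch of a symmetric in-batch contrastive (InfoNCE-style) loss between paired speech and text embeddings. The summary above does not spell out the exact loss, so the symmetric form, the cosine similarity, and the `temperature` value are assumptions:

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_diag(sim):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(sim)


def contrastive_loss(speech, text, temperature=0.1):
    """Symmetric in-batch loss: speech[i] should match text[i] and vice versa."""
    s = [normalize(v) for v in speech]
    t = [normalize(v) for v in text]
    sim = [[dot(a, b) / temperature for b in t] for a in s]
    sim_transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_transposed))


speech = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]
swapped_text = [aligned_text[1], aligned_text[0]]
print(contrastive_loss(speech, aligned_text))
print(contrastive_loss(speech, swapped_text))
```

Minimizing this loss pulls each speech embedding toward its paired text embedding and away from the other texts in the batch, which is what places the two modalities in a shared space.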
 

1/7
AutoWebGLM

Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent

Large language models (LLMs) have fueled many intelligent agent tasks, such as web navigation -- but most existing agents perform far from satisfying in real-world webpages due to three

2/7
factors: (1) the versatility of actions on webpages, (2) HTML text exceeding model processing capacity, and (3) the complexity of decision-making due to the open-domain nature of web. In light of the challenge, we develop AutoWebGLM, a GPT-4-outperforming automated web

3/7
navigation agent built upon ChatGLM3-6B. Inspired by human browsing patterns, we design an HTML simplification algorithm to represent webpages, preserving vital information succinctly. We employ a hybrid human-AI method to build web browsing data for curriculum training. Then,

4/7
we bootstrap the model by reinforcement learning and rejection sampling to further facilitate webpage comprehension, browser operations, and efficient task decomposition by itself. For testing, we establish a bilingual benchmark -- AutoWebBench -- for real-world web

5/7
browsing tasks. We evaluate AutoWebGLM across diverse web navigation benchmarks, revealing its improvements but also underlying challenges to tackle real environments.

6/7
paper page:

7/7
daily papers:
 

1/3
We brought together an interdisciplinary group of 80 people (computer scientists, engineers, physicians) through @StanfordDBDS to red team LLMs for healthcare. We share our results and release our findings as a new dataset for testing LLMs for healthcare!

2/3
We asked people to ask questions that reflect what happens in actual clinical care. Overall, we found 20% of the responses to be inappropriate - with issues of either inaccuracy, bias, safety, or privacy.

3/3
Dear journals, I'll be honest with you. I'm not going to waste my time filling out an author form when my paper is in revision and still might be rejected. I'm only gonna do it after an acceptance.
 

1/7
LVLM-Interpret

An Interpretability Tool for Large Vision-Language Models

In the rapidly evolving landscape of artificial intelligence, multi-modal large language models are emerging as a significant area of interest. These models, which combine various forms of data input,

2/7
are becoming increasingly popular. However, understanding their internal mechanisms remains a complex task. Numerous advancements have been made in the field of explainability tools and mechanisms, yet there is still much to explore. In this work, we present a novel

3/7
interactive application aimed towards understanding the internal mechanisms of large vision-language models. Our interface is designed to enhance the interpretability of the image patches, which are instrumental in generating an answer, and assess the efficacy of the language

4/7
model in grounding its output in the image. With our application, a user can systematically investigate the model and uncover system limitations, paving the way for enhancements in

5/7
system capabilities. Finally, we present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.

6/7
paper page:

7/7
daily papers:

 

1/2
The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers

Presents a web interface to measure the ability of LLMs to assist programmers, through either autocomplete or chat support

repo: GitHub - clinicalml/realhumaneval
abs: [2404.02806] The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers

2/2
Is JetMoE overhyped or underrated? Here are my thoughts:

In my opinion, the performance benefits of JetMoE could be attributed significantly to its two-phase training approach, similar to that used by MiniCPM. This involves continued pretraining on a mix of pretraining and…
 

1/4
Finally catching up on MoE with @finbarrtimbers's great posts on this (link below). My thoughts on MoE and objectives:

1. Instead of the weirdly unprincipled additional losses, one can simply maximize the mutual information I[E;T], where E is the expert index and T is the token:

I[E; T] = H[E] - H[E | T],

so maximizing the mutual info maximizes the entropy of expert selection without knowing the token H[E], ie all experts selected uniformly across all data, while it minimizes the entropy knowing the token H[E | T], ie as one-hot as possible for a given token, equivalent to K=1.
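
A small sketch of how this quantity could be estimated over a batch, assuming the router gives a probability p(e | t) for each token and that tokens in the batch are weighted uniformly. The function names and the toy routing tables are illustrative, not from any MoE library:

```python
import math


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(x * math.log(x) for x in p if x > 0)


def routing_mutual_information(router_probs):
    """Estimate I[E; T] = H[E] - H[E | T] from router_probs[t][e] = p(e | t)."""
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    # Marginal expert usage across the batch, p(e).
    marginal = [
        sum(row[e] for row in router_probs) / n_tokens for e in range(n_experts)
    ]
    h_e = entropy(marginal)
    # Average per-token routing entropy, H[E | T].
    h_e_given_t = sum(entropy(row) for row in router_probs) / n_tokens
    return h_e - h_e_given_t


uniform = [[0.5, 0.5], [0.5, 0.5]]           # balanced but not token-specific
collapsed = [[1.0, 0.0], [1.0, 0.0]]         # confident but one expert takes all
balanced_one_hot = [[1.0, 0.0], [0.0, 1.0]]  # balanced and token-specific

for name, table in [
    ("uniform", uniform),
    ("collapsed", collapsed),
    ("balanced_one_hot", balanced_one_hot),
]:
    print(name, routing_mutual_information(table))
```

Only the balanced one-hot routing scores well on both terms at once; the other two cases each satisfy one term and fail the other, so their mutual information is zero.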

2. The softmax(top-K of router logits + normal noise) is almost equivalent to a double softmax:

Taking the top-K of (router logits + Gumbel noise) is equivalent to sampling from softmax(router logits) K times w/o replacement.

Applying the softmax to those samples simply distributes the credit accordingly between the top-K chosen experts.

A potentially cleaner formulation would simply always use a full mixture and only look at the top-K sampling approach etc as performance optimizations.
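
The Gumbel top-K view can be sketched like this. It is an illustration of the sampling trick only: applying the second softmax to the clean (un-noised) logits of the chosen experts is an assumption made here, and real routers often add normal rather than Gumbel noise:

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def gumbel_top_k(logits, k, rng):
    """Pick k distinct indices by adding Gumbel(0, 1) noise and taking the top k."""
    noisy = []
    for i, logit in enumerate(logits):
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel, i))
    noisy.sort(reverse=True)
    return [i for _, i in noisy[:k]]


def route(logits, k, rng):
    """Sample k experts, then split credit among them with a second softmax."""
    chosen = gumbel_top_k(logits, k, rng)
    weights = softmax([logits[i] for i in chosen])  # clean logits: an assumption
    return dict(zip(chosen, weights))


rng = random.Random(0)
print(route([2.0, 1.0, 0.5, -1.0], k=2, rng=rng))
```

With k=1 the selected index is a sample from softmax(logits), which is the sense in which the noisy top-K behaves like a first softmax and the credit assignment like a second one.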

3. The "router Z-loss" seems overcomplicated. Z seems to stand for the partition constant of the induced categorical distribution of the logits.

The Z loss does not affect router predictions as it affects all expert logits in the router equally, and it is motivated by numerical stability.

Instead of regularizing with Z loss explicitly as a loss, one could also simply adapt the bias of the router network and shift it by the mean logit activations of a training batch.

Same effect and no loss needed.
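
A quick numerical check of the shift argument, as a sketch. It assumes the Z-loss is the squared log of the partition constant, averaged over tokens, and it shifts every logit by one scalar (the batch mean), which is a simplified reading of the bias adjustment suggested above:

```python
import math


def log_partition(logits):
    """log Z = log(sum(exp(logit))) computed stably."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def softmax(logits):
    log_z = log_partition(logits)
    return [math.exp(x - log_z) for x in logits]


def z_loss(batch):
    """Mean squared log-partition over tokens (assumed Z-loss definition)."""
    return sum(log_partition(row) ** 2 for row in batch) / len(batch)


batch = [[12.0, 10.0, 9.0], [11.0, 13.0, 8.0]]
shift = sum(sum(row) for row in batch) / sum(len(row) for row in batch)
shifted = [[x - shift for x in row] for row in batch]

print(z_loss(batch), z_loss(shifted))
print(softmax(batch[0]), softmax(shifted[0]))
```

The routing probabilities are identical before and after the shift, while the magnitude of log Z (and therefore the assumed Z-loss) drops sharply.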

4. Why do we use MoE only for FFNs and not for attention?

MoE for QKV or at least for the Q matrices would seem quite valuable to make attention token-specific and either save FLOPs or get better attention for same FLOPs.

Mixture of Depths seems to look at that finally.

2/4
https://artfintel.com/p/papers-ive-read-this-week-mixture

This starts his series on MoE (3 papers)

3/4
https://artfintel.com/p/more-on-mixture-of-experts-models

This continues it (6 papers).

4/4
1. could be maximized over the training batch just like other regularizing losses. No difficulty there I think

 

1/7
No replies here. Decided to try out on our own benchmarks, consisting of an auto-regressive, multi-modal pre-training at scale. Pretty complex setting.

Yellow: Tuned (LR) AdamW
Purple: Tuned (LR) Sophia

Average Loss:

2/7
Interestingly, both gradient norms (left) and clipping (right) are much healthier with Sophia.

3/7
Really impressed with Sophia, but will test out on 2 more orders of magnitude of scale before making any strong claims.

4/7
Pretty well, did a wide sweep. LR does seem interchangeable.

5/7
Good question. Yes. Linear warmup, linear decay to 10% of peak LR. Both runs have the same schedule.
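
For reference, a schedule of that shape could be written as below. Only the shape (linear warmup, linear decay to 10% of peak) comes from the post; the function name, the step counts, and the peak LR in the example are made-up values:

```python
def lr_at(step, peak_lr, warmup_steps, total_steps, final_fraction=0.1):
    """Linear warmup to peak_lr, then linear decay to final_fraction * peak_lr."""
    if step < warmup_steps:
        # Ramp up so that the last warmup step reaches the peak.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(total_steps - warmup_steps, 1)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return peak_lr * (1.0 - (1.0 - final_fraction) * progress)


# Hypothetical settings, chosen only to show the curve.
for step in (0, 99, 100, 550, 1000):
    print(step, lr_at(step, peak_lr=1e-3, warmup_steps=100, total_steps=1000))
```

Sharing one schedule function between the two optimizers keeps the comparison down to the optimizer and its tuned peak LR.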

6/7
Can't share my implementation but there's a good public implementation here:

7/7
Would be super interested in your setting if you're willing to share!
 

1/7
CodeEditorBench

Evaluating Code Editing Capability of Large Language Models

Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess

2/7
the performance of LLMs in code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks focusing solely on code generation, CodeEditorBench emphasizes real-world scenarios and practical aspects of software

3/7
development. We curate diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks. Evaluation of 19 LLMs reveals that closed-source models (particularly Gemini-Ultra and GPT-4),

4/7
outperform open-source models in CodeEditorBench, highlighting differences in model performance based on problem types and prompt sensitivities. CodeEditorBench aims to catalyze advancements in LLMs by providing a robust platform for assessing code editing capabilities.

5/7
We will release all prompts and datasets to enable the community to expand the dataset and benchmark emerging LLMs. By introducing CodeEditorBench, we contribute to the advancement of LLMs in code editing and provide a valuable resource for researchers and practitioners.

6/7
paper page:

7/7
daily papers:
 

1/3
We just released the results of a survey that SHOCKED me. AI is already in clinic.

We asked dermatologists -- are you using large language models in your clinical care? And then we explored what they were using LLMs for. @theJIDJournal

2/3
Of the dermatologists we asked (which may be skewed towards users of technology, since we recruited via email and social media), 65% reported using LLMs in clinical care, with ChatGPT being the most often used model.

3/3
Dermatologists are using these models for everything from patient care to education to medical records. The usage was shocking to me, as I have spoken and written about the biases and hallucinations in LLMs for healthcare.
 