
OCTOBER 24, 2023

Researchers create magnetic microrobots that work together to assemble objects in 3D environments​

by K. W. Wesselink-Schram, University of Twente
Breakthrough in collaborative magnetic microrobotics
Experimental collaborative grasping and assembly results. The magnetic agents are 1 mm stainless steel spheres and the passive objects are 2 mm 3D printed cubes. A) The procedure consisted of four steps, approach, grasping, translation, and release. The solid red arrows represent the motion of the magnetic agents and the dashed green arrow represents the motion of the ensemble. B) Snapshots of the grasping and stacking of three cubes experiment. C) Snapshots of the grasping and stacking of a beam on top of two cubes experiment. The passive objects (cubes and beam) have been highlighted in the top view to increase clarity. Credit: Advanced Intelligent Systems (2023). DOI: 10.1002/aisy.202300365

For the first time ever, researchers at the Surgical Robotics Laboratory of the University of Twente successfully made two microrobots work together to pick up, move and assemble passive objects in 3D environments. This achievement opens new horizons for promising biomedical applications.


Imagine you need surgery somewhere inside your body. However, the part that needs surgery is very difficult for a surgeon to reach. In the future, a couple of robots smaller than a grain of salt might go into your body and perform the surgery. These microrobots could work together to perform all kinds of complex tasks. "It's almost like magic," says Franco Piñan Basualdo, corresponding author of the publication.

Researchers from the University of Twente successfully exploited two of these 1-millimeter-sized magnetic microrobots to perform several operations. Like clockwork, the microrobots were able to pick up, move and assemble cubes. Unique to this achievement is the 3D environment in which the robots performed their tasks.

Achieving this was quite a challenge. Just like regular magnets, these tiny magnetic robots stick together when they get too close, so there is a limit to how closely they can approach one another. But the researchers at the Surgical Robotics Laboratory found a way to use this natural attraction to their advantage. With a custom-made controller, the team could move the individual robots and make them interact with each other.


Credit: University of Twente

The microrobots are biocompatible and can be controlled in difficult-to-reach and even enclosed environments. This makes the technology promising for biomedical studies and applications. "We can remotely manipulate biomedical samples without contaminating them. This could improve existing procedures and open the door to new ones," says Piñan Basualdo.

Piñan Basualdo is a postdoctoral researcher at the Surgical Robotics Laboratory. His research interests include micro-robotics, non-contact control, swarm robotics, active matter, microfluidics, and interfacial phenomena.

This research was performed at the Surgical Robotics Laboratory. Prof. Sarthak Misra, head of the lab, focuses on developing innovative solutions for a broad range of clinically relevant challenges, including biomedical imaging, automation of medical procedures, and the development of microrobotic tools.

The research was performed in the framework of the European RĔGO project (Horizon Europe program), which aims to develop an innovative set of AI-powered, microsized, untethered, stimuli-responsive swarms of robots. The findings were published in a paper titled "Collaborative Magnetic Agents for 3D Microrobotic Grasping," in the journal Advanced Intelligent Systems.

More information: Franco N. Piñan Basualdo et al, Collaborative Magnetic Agents for 3D Microrobotic Grasping, Advanced Intelligent Systems (2023). DOI: 10.1002/aisy.202300365
 






[Submitted on 19 Oct 2023]

GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems​

Kaya Stechly, Matthew Marquez, Subbarao Kambhampati
There has been considerable divergence of opinion on the reasoning abilities of Large Language Models (LLMs). While the initial optimism that reasoning might emerge automatically with scale has been tempered thanks to a slew of counterexamples, a widespread belief in their iterative self-critique capabilities persists. In this paper, we set out to systematically investigate the effectiveness of iterative prompting of LLMs in the context of Graph Coloring, a canonical NP-complete reasoning problem that is related to propositional satisfiability as well as practical problems like scheduling and allocation. We present a principled empirical study of the performance of GPT-4 in solving graph coloring instances or verifying the correctness of candidate colorings. In iterative modes, we experiment with the model critiquing its own answers and with an external correct reasoner verifying proposed solutions. In both cases, we analyze whether the content of the criticisms actually affects bottom-line performance. The study seems to indicate that (i) LLMs are bad at solving graph coloring instances, (ii) they are no better at verifying a solution--and thus are not effective in iterative modes with LLMs critiquing LLM-generated solutions, and (iii) the correctness and content of the criticisms--whether by LLMs or external solvers--seem largely irrelevant to the performance of iterative prompting. We show that the observed increase in effectiveness is largely due to the correct solution being fortuitously present in the top-k completions of the prompt (and being recognized as such by an external verifier). Our results thus call into question claims about the self-critiquing capabilities of state-of-the-art LLMs.
Comments: 18 pages, 3 figures
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2310.12397 [cs.AI]
(or arXiv:2310.12397v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2310.12397


Submission history​

From: Kaya Stechly

[v1] Thu, 19 Oct 2023 00:56:37 UTC (335 KB)
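
For graph coloring, the "external correct reasoner" the abstract mentions is a purely mechanical constraint check. Below is a minimal sketch of such a verifier and the backprompt loop it could drive; the graph encoding, prompt wording, and the llm() stub are illustrative assumptions, not the authors' setup.

```python
# Sketch of a sound external verifier for graph coloring plus an iterative-prompting skeleton.
# The graph format, k, and the llm() stub are placeholders; this is not the paper's code.

def verify_coloring(edges, coloring, k):
    """Return (ok, criticisms) for a candidate vertex coloring of an undirected graph."""
    criticisms = []
    for v, c in coloring.items():
        if not (0 <= c < k):
            criticisms.append(f"vertex {v} uses color {c}, outside the allowed {k} colors")
    for u, v in edges:
        if u not in coloring or v not in coloring:
            criticisms.append(f"edge ({u}, {v}) has an uncolored endpoint")
        elif coloring[u] == coloring[v]:
            criticisms.append(f"adjacent vertices {u} and {v} share color {coloring[u]}")
    return (len(criticisms) == 0, criticisms)


def iterative_prompting(edges, k, llm, max_rounds=15):
    """Backprompt loop: ask for a coloring, verify it externally, feed criticisms back."""
    prompt = f"Color this graph with {k} colors. Edges: {sorted(edges)}"
    for _ in range(max_rounds):
        coloring = llm(prompt)  # expected to return {vertex: color_index}
        ok, criticisms = verify_coloring(edges, coloring, k)
        if ok:
            return coloring
        prompt += "\nYour previous coloring was invalid: " + "; ".join(criticisms)
    return None


if __name__ == "__main__":
    triangle = {(0, 1), (1, 2), (0, 2)}
    print(verify_coloring(triangle, {0: 0, 1: 1, 2: 0}, k=3))  # invalid: 0 and 2 share a color
```

The paper's point is that only the checker's verdict is reliable; per finding (iii), whether the criticisms fed back are accurate or not barely changes the outcome of the loop.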


 


⭐️ Multi-Vector Retriever for RAG on tables, text, and images ⭐

Seamless question-answering across diverse data types (images, text, tables) is one of the holy grails of RAG.


We’re releasing three new cookbooks that showcase the multi-vector retriever to tackle this challenge.


We released the multi-vector retriever back in August w/ a simple idea:

1/ embed a doc reference (e.g., summary) that is optimized for search

2/ but retrieve the raw doc (table, text, image) to give complete context for LLM answer synthesis
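
As a rough illustration of that two-step idea, here is a minimal sketch using LangChain's MultiVectorRetriever (APIs roughly as of the time of these posts; the Chroma store, OpenAI embeddings, and the placeholder raw_docs/summaries/query are assumptions for illustration, not the cookbook code):

```python
# Minimal multi-vector retriever sketch: index summaries, return raw docs.
# Requires an OpenAI API key for the embeddings; swap in any embedding model you prefer.
import uuid

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma

# Placeholder inputs: raw elements (tables, text chunks, image references) and their summaries.
raw_docs = ["<raw table or text element 1>", "<raw table or text element 2>"]
summaries = ["short searchable summary of element 1", "short searchable summary of element 2"]

id_key = "doc_id"
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore()  # holds the raw elements, keyed by doc_id

retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=docstore, id_key=id_key)

# 1/ index the summaries (optimized for search) ...
doc_ids = [str(uuid.uuid4()) for _ in raw_docs]
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]}) for i, s in enumerate(summaries)
]
retriever.vectorstore.add_documents(summary_docs)

# 2/ ... but store and return the raw elements for answer synthesis
retriever.docstore.mset(list(zip(doc_ids, raw_docs)))

# The query is matched against the summary embeddings, yet the raw elements come back.
results = retriever.get_relevant_documents("What does the table say about Q3 revenue?")
```

Queries hit the compact, search-friendly summaries, while the full tables, text, or images stored in the docstore are what the LLM sees at answer time.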

Using @UnstructuredIO to parse images, text, and tables (e.g., from pdfs) ...

.. the multi-vector retriever enables RAG on semi-structured data w/ table summaries - github.com/langchain-ai/lang…

..we extend the idea to images, using LLaVA-7b (c/o @imhaotian) to produce image summaries - github.com/langchain-ai/lang…

... and this full RAG pipeline can be run on a laptop w/ llama.cpp c/o @ggerganov, @ollama_ai, @nomic_ai embeddings, and @trychroma: github.com/langchain-ai/lang… Blog: blog.langchain.dev/semi-stru…
 





Qwen Technical Report​

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, Tianhang Zhu
Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive. The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter. Furthermore, we have developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat, which are built upon base language models. These models demonstrate significantly improved performance in comparison with open-source models, and slightly fall behind the proprietary models.
Comments: 59 pages, 5 figures
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2309.16609 [cs.CL]
(or arXiv:2309.16609v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2309.16609


Submission history​

From: An Yang

[v1] Thu, 28 Sep 2023 17:07:49 UTC (995 KB)

 


BERT explained: Training (Masked Language Model, Next Sentence Prediction), Inference, Self-Attention, [CLS] token, Left and Right context, Comparative analysis BERT vs GPT/LLamA, Fine tuning, Text Classification, Question Answering​
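
Since the video covers masked-language-model training, a minimal sketch of BERT-style masking (about 15% of tokens, with the usual 80/10/10 mask/random/keep split) may be a useful reference; the toy vocabulary and tokenizer-free setup are illustrative, not the video's code.

```python
# Toy BERT-style MLM corruption: pick ~15% of positions; 80% -> [MASK], 10% -> random, 10% -> keep.
import random

VOCAB = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "the", "cat", "sat", "on", "mat", "dog"]
MASK_ID = VOCAB.index("[MASK]")

def mask_for_mlm(token_ids, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignored by the loss
    for i, tok in enumerate(token_ids):
        if VOCAB[tok].startswith("[") or rng.random() >= mask_prob:
            continue  # never corrupt special tokens; skip unselected positions
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID                       # replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(4, len(VOCAB))  # replace with a random non-special token
        # else: keep the original token unchanged
    return inputs, labels

ids = [VOCAB.index(t) for t in ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]]
print(mask_for_mlm(ids))
```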

 


Big update: Meta's Long Llama beats GPT-3.5 in long contexts and goes toe-to-toe with GPT-4 in summarization.

Highlights:
▸ Context: Supports up to 32k.
▸ Performance: Matches GPT-4 in summarizing, beats GPT-3.5 in long tasks.
▸ Efficiency: 40% less computing cost, same performance.

Technical Stuff:
▸ Positional Encoding: Tweaks made for better long-text handling.
▸ Extra Training: More datasets used, including longer text.

Instruction Tuning:
▸ QA Tasks: Generated from long docs.
▸ Validation: Llama 2 70B checked the QA pairs.
▸ Fine-Tuning: Used synthetic and short instruction data.

Full paper here: arxiv.org/abs/2309.16039



Effective Long-Context Scaling of Foundation Models​

Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshytiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, Hao Ma
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens. Our model series are built through continual pretraining from Llama 2 with longer training sequences and on a dataset where long texts are upsampled. We perform extensive evaluation on language modeling, synthetic context probing tasks, and a wide range of research benchmarks. On research benchmarks, our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2. Notably, with a cost-effective instruction tuning procedure that does not require human-annotated long instruction data, the 70B variant can already surpass gpt-3.5-turbo-16k's overall performance on a suite of long-context tasks. Alongside these results, we provide an in-depth analysis on the individual components of our method. We delve into Llama's position encodings and discuss its limitation in modeling long dependencies. We also examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths -- our ablation experiments suggest that having abundant long texts in the pretrain dataset is not the key to achieving strong performance, and we empirically verify that long context continual pretraining is more efficient and similarly effective compared to pretraining from scratch with long sequences.
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2309.16039 [cs.CL]
(or arXiv:2309.16039v2 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2309.16039

Submission history​

From: Wenhan Xiong
[v1] Wed, 27 Sep 2023 21:41:49 UTC (2,078 KB)
[v2] Tue, 17 Oct 2023 17:32:17 UTC (2,078 KB)
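
The "positional encoding tweak" in the summary above is, as far as I recall, an adjusted RoPE base frequency (raising the base from 10,000 to on the order of 500,000) so that rotations stay informative across the 32k window; treat the exact value as an assumption to check against the paper. A toy sketch of what enlarging the base does to the rotation frequencies:

```python
# Illustrative only: the 500,000 base is my recollection of the commonly cited value, not verified here.
import numpy as np

def rope_inverse_frequencies(head_dim: int, base: float) -> np.ndarray:
    """Per-dimension-pair rotation frequencies used by rotary position embeddings (RoPE)."""
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

# Default Llama 2 base vs. an enlarged base for long-context continual pretraining.
# A larger base slows every rotation, so positions deep into a 32k window remain distinguishable
# rather than wrapping around frequencies tuned for 4k inputs.
default_freqs = rope_inverse_frequencies(head_dim=128, base=10_000.0)
long_ctx_freqs = rope_inverse_frequencies(head_dim=128, base=500_000.0)

print(default_freqs[-1], long_ctx_freqs[-1])  # the slowest-rotating pair becomes much slower
```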

 


NVIDIA Research: RAG with Long Context LLMs​

Ravi Theja

Published in LlamaIndex Blog



Introduction​

Why Long Context Matters and How Retrieval Augmentation Steps In:​

In the dynamic landscape of large language models (LLMs), two methods have gained traction and seem to be taking center stage: expanding the context window of LLMs and enhancing these models with retrieval capabilities. The continued evolution of GPU technology, coupled with breakthroughs in attention mechanisms, has given rise to long-context LLMs. Simultaneously, the concept of retrieval — where LLMs pick up only the most relevant context from a standalone retriever — promises a revolution in efficiency and speed.

In the midst of these evolving narratives, some interesting questions emerge:
  1. Retrieval-augmentation versus long context window, which one is better for downstream tasks?
  2. Can both methods be combined to get the best of both worlds?

To dissect these questions, in this blog post we turn to NVIDIA's recent study, which harnesses two powerful LLMs, the proprietary GPT-43B and LLaMA2-70B, and strives to provide actionable insights for AI practitioners.

Prior Research and the NVIDIA Divergence:​

While NVIDIA's findings are interesting in many respects, another recent work, by Bai et al. (2023), ventured into similar territory, although with differing outcomes.

Their work explored the impact of retrieval on long context LLMs, evaluating models like GPT-3.5-Turbo-16k and Llama2–7B-chat-4k. However, their findings diverge from NVIDIA’s in crucial ways. Bai et al. discerned that retrieval was beneficial only for the Llama2–7B-chat-4k with a 4K context window, but not for extended context models like GPT-3.5-Turbo-16k. One hypothesis for this difference centers on the challenges tied to experiments using black-box APIs and the smaller white-box LLMs they employed, which potentially had limited capability to integrate context through retrieval.

NVIDIA’s work distinguishes itself by tapping into much larger LLMs, yielding results that not only match top-tier models like ChatGPT-3.5 but even indicate further enhancements when incorporating retrieval methods.

Models, Datasets, and Evaluation Metrics​

Large Language Models (LLMs) Explored:​

The researchers delved deep into the potential of large language models for tasks like generative QA and summarization. Specifically, two models were the primary focus:
  • Nemo GPT-43B: A proprietary 43 billion parameter model trained on 1.1T tokens, 70% of which were in English. This model was fed a rich diet of web archives, Wikipedia, Reddit, books, and more. It contains 48 layers and is trained using RoPE embeddings.
  • LLaMA2–70B: A publicly available 70B parameter model trained on 2T tokens, primarily in English. It’s structured with 80 layers and also utilizes RoPE embeddings.

Context Window Extension:​

To enhance the models’ capability to process longer contexts, their initial 4K context window length was augmented. The GPT-43B was modified to handle 16K, while the LLaMA2–70B was expanded to both 16K and 32K, employing the position interpolation method.
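
Position interpolation, roughly speaking, rescales position indices back into the originally trained range before applying the rotary embedding, so a 32K input reuses angles the model saw during 4K pretraining. A toy numeric sketch follows; the head_dim and base values are generic placeholders, not the models' actual configurations.

```python
import numpy as np

def rope_angles(positions: np.ndarray, head_dim: int = 128, base: float = 10_000.0) -> np.ndarray:
    """Rotation angles that rotary position embeddings (RoPE) assign to each position."""
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    return np.outer(positions, inv_freq)

trained_len, target_len = 4_096, 32_768
positions = np.arange(target_len)

# Position interpolation: squeeze the 32K positions back into the trained [0, 4096) range
# before applying RoPE, so the extended window reuses angles seen during pretraining.
scale = trained_len / target_len
interpolated = rope_angles(positions * scale)

extrapolated = rope_angles(positions)  # naive extension: much larger, never-seen angles
print(interpolated.max(), extrapolated.max())
```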

Instruction Tuning:​

To optimize the LLMs for the tasks at hand, instruction tuning was implemented. A diverse dataset blend, comprising sources like Soda, ELI5, FLAN, and others, was created. A consistent format template was adopted for multi-turn dialogue training, and the models were meticulously fine-tuned to accentuate the answer segment.

Retrieval Models Tested:​

Three retrieval systems were put to the test:
  • Dragon: A state-of-the-art dual encoder model for both supervised and zero-shot information retrieval.
  • Contriever: Utilizes a basic contrastive learning framework and operates unsupervised.
  • OpenAI embedding: The latest version was used, accepting a maximum input of 8,191 tokens.

The retrieval approach entailed segmenting each document into 300-word sections, encoding both questions and these chunks, and then merging the most pertinent chunks for response generation.
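
A minimal sketch of that pipeline, using a generic sentence-transformers model as a stand-in for Dragon, Contriever, or the OpenAI embeddings (the model choice, chunk merging, and k are illustrative assumptions, not the study's exact setup):

```python
# Chunk a long document into 300-word sections, embed query and chunks,
# and keep the top-k most similar chunks as the context handed to the LLM.
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_by_words(text: str, words_per_chunk: int = 300) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk]) for i in range(0, len(words), words_per_chunk)]

def top_k_chunks(document: str, question: str, k: int = 5) -> list[str]:
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in retriever, not one from the paper
    chunks = chunk_by_words(document)
    chunk_emb = encoder.encode(chunks, normalize_embeddings=True)
    query_emb = encoder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_emb @ query_emb              # cosine similarity (embeddings are normalized)
    best = np.argsort(-scores)[:k]
    return [chunks[i] for i in sorted(best)]    # merge in original document order

# context = "\n\n".join(top_k_chunks(long_document, "What did the committee decide?"))
```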

Datasets Used for Evaluation:​

The study employed seven diverse datasets, sourced from the Scroll benchmark and LongBench.

A snapshot of these datasets includes:
  • QMSum: A query-based summarization dataset, QMSum consists of transcripts from diverse meetings and their corresponding summaries, built upon contextual queries.
  • Qasper: A question-answering dataset centered on NLP papers, Qasper offers a mix of abstractive, extractive, yes/no, and unanswerable questions from the Semantic Scholar Open Research Corpus.
  • NarrativeQA: Aimed at question-answering over entire books and movie scripts, NarrativeQA provides question-answer pairs created from summaries of these extensive sources.
  • QuALITY: A multiple-choice question answering set based on stories and articles, QuALITY emphasizes thorough reading, with half the questions designed to be challenging and require careful consideration.
  • MuSiQue: Designed for multi-hop reasoning in question answering, MuSiQue creates multi-hop questions from single-hop ones, emphasizing connected reasoning and minimizing shortcuts.
  • HotpotQA: Based on Wikipedia, HotpotQA requires reading multiple supporting documents for reasoning. It features diverse questions and provides sentence-level support for answers.
  • MultiFieldQA-en: Curated to test long-context understanding across fields, MFQA uses sources like legal documents and academic papers, with annotations done by Ph.D. students.

Evaluation Metrics:​

The research team used a wide range of metrics suited to each dataset. The geometric mean of ROUGE scores for QMSum, the exact-match (EM) score for QuALITY, and F1 scores for the other datasets were the primary metrics.
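
For reference, here is a small sketch of the ROUGE geometric mean, combining ROUGE-1/2/L F-measures in the SCROLLS-style convention; the rouge-score package and the choice of these three variants are assumptions, not the study's exact scoring code.

```python
from math import prod
from rouge_score import rouge_scorer

def rouge_geo_mean(prediction: str, reference: str) -> float:
    """Geometric mean of ROUGE-1, ROUGE-2, and ROUGE-L F-measures for one prediction."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, prediction)
    fmeasures = [scores[m].fmeasure for m in ("rouge1", "rouge2", "rougeL")]
    return prod(fmeasures) ** (1.0 / len(fmeasures))

print(rouge_geo_mean("the cat sat on the mat", "a cat was sitting on the mat"))
```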

Results​

  • Baseline models without retrieval, having a 4K sequence length, performed poorly since valuable texts get truncated.
  • With retrieval, performance for 4K models like LLaMA2–70B-4K and GPT-43B-4K significantly improved.
  • HotpotQA, a multi-hop dataset, particularly benefits from longer sequence models.
  • Models with longer contexts (16K, 32K) outperform their 4K counterparts even when fed the same evidence chunks.
  • LLMs show a characteristic “U-shaped” performance curve due to the lost-in-the-middle phenomenon, making them better at utilizing information at the beginning or end of the input.
  • The study presents a contrasting perspective to LongBench’s findings, emphasizing that retrieval is beneficial for models regardless of their context window size.

Comparing to OpenAI Models:​

  • The LLaMA2–70B-32k model with retrieval surpasses the performance of GPT-3.5-turbo variants and is competitive with Davinci-003, underscoring its robustness in handling long context tasks.

Comparison of Different Retrievers:​

  • Retrieval consistently enhances the performance across different retrievers.
  • Public retrievers outperformed proprietary ones like OpenAI embeddings.

Comparing with the number of retrieved chunks:​

  • The best performance is achieved by retrieving the top 5 or 10 chunks. Retrieving more, up to 20 chunks, doesn’t offer additional benefits and can even degrade performance.
  • The deterioration in performance when adding more chunks could be due to the lost-in-the-middle phenomenon or the model being sidetracked by non-relevant information.

Conclusion​

We delved deep into understanding how retrieval augmentation and long-context extension interact when applied to leading language models fine-tuned for long-context question-answering and summarization tasks. Here are some things to note:
  1. Boost in Performance with Retrieval: Implementing retrieval techniques significantly enhances the performance of both shorter 4K context language models and their longer 16K/32K context counterparts.
  2. Efficiency of 4K Models with Retrieval: 4K context language models, when combined with retrieval augmentation, can achieve performance levels similar to 16K long context models. Plus, they have the added advantage of being faster during the inference process.
  3. Best Model Performance: After enhancing with both context window extension and retrieval augmentation, the standout model, LLaMA2–70B-32k-ret (LLaMA2–70B-32k with retrieval), surpasses well-known models like GPT-3.5-turbo-16k and davinci-003.

References:​

  1. Retrieval meets long context, large language models.
  2. Longbench: A bilingual, multitask benchmark for long context understanding.

We trust that this review of the paper on retrieval augmentation with long-context LLMs has furnished you with meaningful insights. We’re keen to hear if your experiments align with our findings or present new perspectives — divergent results always make for interesting discussions and further exploration.
 