The A.I Megathread (LLM , GPT , Development)

bnew · Oct 25, 2023

Multi-Vector Retriever for RAG on tables, text, and images

Seamless question-answering across diverse data types (images, text, tables) is one of the holy grails of RAG.

We’re releasing three new cookbooks that showcase the multi-vector retriever to tackle this challenge.

We released the multi-vector retriever back in August w/ a simple idea:

1/ embed a doc reference (e.g., summary) that is optimized for search

2/ but retrieve the raw doc (table, text, image) to give complete context for LLM answer synthesis

Using @UnstructuredIO to parse images, text, and tables (e.g., from pdfs) ...

.. the multi-vector retriever enables RAG on semi-structured data w/ table summaries - github.com/langchain-ai/lang…

..we extend the idea to images, using LLaVA-7b (c/o @imhaotian) to produce image summaries - github.com/langchain-ai/lang…

... and this full RAG pipeline can be run on a laptop w/ llama.cpp c/o @ggerganov, @ollama_ai, @nomic_ai embeddings, and @trychroma: github.com/langchain-ai/lang… Blog: blog.langchain.dev/semi-stru…
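To make the two-step idea in the post concrete (search over a summary, hand the raw document to the LLM), here is a minimal toy sketch. It does not use the LangChain API. `MultiVectorStore`, `embed`, and the bag-of-words similarity are hypothetical stand-ins for a real vector store and embedding model.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; an assumed stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class MultiVectorStore:
    """Index summaries for search, but return the raw documents."""

    def __init__(self):
        self.raw_docs = {}       # doc_id -> raw document (table, text, image path)
        self.summary_index = []  # (doc_id, embedding of the summary)

    def add(self, doc_id: str, raw_doc: str, summary: str) -> None:
        self.raw_docs[doc_id] = raw_doc
        self.summary_index.append((doc_id, embed(summary)))

    def retrieve(self, query: str, k: int = 1) -> list:
        q = embed(query)
        ranked = sorted(self.summary_index, key=lambda item: cosine(q, item[1]), reverse=True)
        # Step 2 of the idea: look up and return the raw doc, not the summary.
        return [self.raw_docs[doc_id] for doc_id, _ in ranked[:k]]


store = MultiVectorStore()
store.add("t1", "| quarter | revenue |\n| Q1 | 10 |\n| Q2 | 14 |", "table of quarterly revenue figures")
store.add("p1", "The cat sat on the mat.", "short sentence about a cat")
result = store.retrieve("what was the revenue each quarter")
```

In this toy run the query matches the table's summary, so `result` holds the raw table text rather than the summary that was searched.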

bnew · Oct 25, 2023

https://archive.ph/TihzE

[2309.16609] Qwen Technical Report

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, Tianhang Zhu

Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive. The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter. Furthermore, we have developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat, which are built upon base language models. These models demonstrate significantly improved performance in comparison with open-source models, and slightly fall behind the proprietary models.

Comments:	59 pages, 5 figures
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2309.16609 [cs.CL]
	(or arXiv:2309.16609v1 [cs.CL] for this version)

Submission history

From: An Yang

[v1] Thu, 28 Sep 2023 17:07:49 UTC (995 KB)

https://arxiv.org/pdf/2309.16609.pdf

bnew · Oct 25, 2023

https://archive.ph/jczAg

GitHub - morph-labs/rift: Rift: an AI-native language server for your personal AI software engineer

www.github.com

bnew · Oct 26, 2023

https://archive.ph/OqSYS

bnew · Oct 26, 2023

BERT explained: Training (Masked Language Model, Next Sentence Prediction), Inference, Self-Attention, [CLS] token, Left and Right context, Comparative analysis BERT vs GPT/LLamA, Fine tuning, Text Classification, Question Answering

bnew · Oct 26, 2023

https://archive.ph/wip/qwF30

Big update: Meta's Long Llama beats GPT-3.5 in long contexts and goes toe-to-toe with GPT-4 in summarization.

Highlights:
▸ Context: Supports up to 32k.
▸ Performance: Matches GPT-4 in summarizing, beats GPT-3.5 in long tasks.
▸ Efficiency: 40% less computing cost, same performance.

Technical Stuff:
▸ Positional Encoding: Tweaks made for better long-text handling.
▸ Extra Training: More datasets used, including longer text.

Instruction Tuning:
▸ QA Tasks: Generated from long docs.
▸ Validation: Llama 2 70B checked the QA pairs.
▸ Fine-Tuning: Used synthetic and short instruction data.

Full paper here: arxiv.org/abs/2309.16039

Effective Long-Context Scaling of Foundation Models

Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshytiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, Hao Ma

We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens. Our model series are built through continual pretraining from Llama 2 with longer training sequences and on a dataset where long texts are upsampled. We perform extensive evaluation on language modeling, synthetic context probing tasks, and a wide range of research benchmarks. On research benchmarks, our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2. Notably, with a cost-effective instruction tuning procedure that does not require human-annotated long instruction data, the 70B variant can already surpass gpt-3.5-turbo-16k's overall performance on a suite of long-context tasks. Alongside these results, we provide an in-depth analysis on the individual components of our method. We delve into Llama's position encodings and discuss its limitation in modeling long dependencies. We also examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths -- our ablation experiments suggest that having abundant long texts in the pretrain dataset is not the key to achieving strong performance, and we empirically verify that long context continual pretraining is more efficient and similarly effective compared to pretraining from scratch with long sequences.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2309.16039 [cs.CL]
	(or arXiv:2309.16039v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2309.16039

Submission history

From: Wenhan Xiong
[v1] Wed, 27 Sep 2023 21:41:49 UTC (2,078 KB)
[v2] Tue, 17 Oct 2023 17:32:17 UTC (2,078 KB)

https://arxiv.org/pdf/2309.16039.pdf

bnew · Oct 26, 2023

NVIDIA Research: RAG with Long Context LLMs

This blog post dives into NVIDIA’s recent study comparing retrieval-augmentation with and without long-context LLMs.

blog.llamaindex.ai

Ravi Theja

Published in
LlamaIndex Blog

6 min read
4 days ago

Introduction

Why Long Context Matters and How Retrieval Augmentation Steps In:

In the dynamic landscape of LLMs, two methods have gained traction and seem to be taking center stage: expanding the context window of Large Language Models (LLMs) and enhancing these models with retrieval capabilities. The continued evolution of GPU technology, coupled with breakthroughs in attention mechanisms, has given rise to long-context LLMs. Simultaneously, the concept of retrieval — where LLMs pick up only the most relevant context from a standalone retriever — promises a revolution in efficiency and speed.

In the midst of these evolving narratives, some interesting questions emerge:

Retrieval-augmentation versus long context window, which one is better for downstream tasks?
Can both methods be combined to get the best of both worlds?

To dissect these questions, this blog post turns to NVIDIA’s recent study, which uses two LLMs, the proprietary GPT-43B and LLaMA2–70B, to provide actionable insights for AI practitioners.

Prior Research and the NVIDIA Divergence:

While NVIDIA’s findings are interesting in many respects, another recent work by Bai et al. (2023) ventured into similar territory, with differing outcomes.

Their work explored the impact of retrieval on long context LLMs, evaluating models like GPT-3.5-Turbo-16k and Llama2–7B-chat-4k. However, their findings diverge from NVIDIA’s in crucial ways. Bai et al. discerned that retrieval was beneficial only for the Llama2–7B-chat-4k with a 4K context window, but not for extended context models like GPT-3.5-Turbo-16k. One hypothesis for this difference centers on the challenges tied to experiments using black-box APIs and the smaller white-box LLMs they employed, which potentially had limited capability to integrate context through retrieval.

NVIDIA’s work distinguishes itself by tapping into much larger LLMs, yielding results that not only match top-tier models like ChatGPT-3.5 but even indicate further enhancements when incorporating retrieval methods.

Models, Datasets, and Evaluation Metrics

Large Language Models (LLMs) Explored:

The researchers delved deep into the potential of large language models for tasks like generative QA and summarization. Specifically, two models were the primary focus:

Nemo GPT-43B: A proprietary 43 billion parameter model trained on 1.1T tokens, 70% of which were in English. This model was fed a rich diet of web archives, Wikipedia, Reddit, books, and more. It contains 48 layers and is trained using RoPE embeddings.
LLaMA2–70B: A publicly available 70B parameter model trained on 2T tokens, primarily in English. It’s structured with 80 layers and also utilizes RoPE embeddings.

Context Window Extension:

To enhance the models’ capability to process longer contexts, their initial 4K context window length was augmented. The GPT-43B was modified to handle 16K, while the LLaMA2–70B was expanded to both 16K and 32K, employing the position interpolation method.
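As a rough illustration of position interpolation with RoPE: positions in the extended window are scaled down so they fall inside the range seen during training. This is a sketch of the general technique as commonly described, not the study's code. `rope_angles`, the head dimension of 8, and the base of 10000 are assumptions chosen for illustration.

```python
def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angles for one token position.

    scale < 1 applies position interpolation: the position index is
    compressed before the rotary angles are computed.
    """
    return [(position * scale) * base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


TRAIN_LEN = 4096    # original context window (from the post)
TARGET_LEN = 16384  # extended context window (from the post)
scale = TRAIN_LEN / TARGET_LEN  # 0.25

# Position 16000 in the extended window is mapped onto position 4000,
# which lies inside the range the model saw during pretraining.
interpolated = rope_angles(16000, head_dim=8, scale=scale)
seen_in_training = rope_angles(4000, head_dim=8)
```

The point of the scaling is that no token ever gets a rotary angle outside the trained range; the cost is that neighbouring positions sit closer together, which is why some fine-tuning at the longer length usually follows.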

Instruction Tuning:

To optimize the LLMs for the tasks at hand, instruction tuning was implemented. A diverse dataset blend, comprising sources like Soda, ELI5, FLAN, and others, was created. A consistent format template was adopted for multi-turn dialogue training, and the models were meticulously fine-tuned to accentuate the answer segment.

Retrieval Models Tested:

Three retrieval systems were put to the test:

Dragon: A state-of-the-art dual encoder model for both supervised and zero-shot information retrieval.
Contriever: Utilizes a basic contrastive learning framework and operates unsupervised.
OpenAI embedding: The latest version was used, accepting a maximum input of 8,191 tokens.

The retrieval approach entailed segmenting each document into 300-word sections, encoding both questions and these chunks, and then merging the most pertinent chunks for response generation.
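The chunk-encode-merge procedure described above can be sketched roughly as follows. The 300-word chunk size comes from the post; the bag-of-words `encode` function is an assumed toy replacement for Dragon, Contriever, or OpenAI embeddings, and `build_context` is a hypothetical helper name.

```python
import math
from collections import Counter


def chunk_words(text, size=300):
    """Split a document into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def encode(text):
    """Toy encoder (word counts); a real system would use a dense retriever."""
    return Counter(text.lower().split())


def similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_context(question, document, top_k=5, chunk_size=300):
    """Rank chunks against the question and merge the top_k into one context string."""
    chunks = chunk_words(document, chunk_size)
    q = encode(question)
    ranked = sorted(chunks, key=lambda c: similarity(q, encode(c)), reverse=True)
    return "\n\n".join(ranked[:top_k])


document = " ".join(["filler"] * 300 + ["paris", "capital"] + ["other"] * 298 + ["noise"] * 300)
context = build_context("capital paris", document, top_k=1)
```

The merged `context` string is what would be placed in front of the question in the LLM prompt.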

Datasets Used for Evaluation:

The study employed seven diverse datasets, sourced from the SCROLLS benchmark and LongBench.

A snapshot of these datasets includes:

QMSum: A query-based summarization dataset, QMSum consists of transcripts from diverse meetings and their corresponding summaries, built upon contextual queries.
Qasper: A question-answering dataset centered on NLP papers, Qasper offers a mix of abstractive, extractive, yes/no, and unanswerable questions from the Semantic Scholar Open Research Corpus.
NarrativeQA: Aimed at question-answering over entire books and movie scripts, NarrativeQA provides question-answer pairs created from summaries of these extensive sources.
QuALITY: A multiple-choice question answering set based on stories and articles, QuALITY emphasizes thorough reading, with half the questions designed to be challenging and require careful consideration.
MuSiQue: Designed for multi-hop reasoning in question answering, MuSiQue creates multi-hop questions from single-hop ones, emphasizing connected reasoning and minimizing shortcuts.
HotpotQA: Based on Wikipedia, HotpotQA requires reading multiple supporting documents for reasoning. It features diverse questions and provides sentence-level support for answers.
MultiFieldQA-en: Curated to test long-context understanding across fields, MFQA uses sources like legal documents and academic papers, with annotations done by Ph.D. students.

Evaluation Metrics:

The research team used a metric suited to each dataset: the geometric mean of ROUGE scores for QMSum, the exact match (EM) score for QuALITY, and F1 scores for the others.
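For readers unfamiliar with these three metrics, here is a minimal sketch of how each is typically computed. The normalization (lowercasing, whitespace tokenization) is an assumption and simpler than what benchmark scripts usually do, and the ROUGE values passed to `geometric_mean` are made-up placeholders rather than results from the study.

```python
import math
from collections import Counter


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def geometric_mean(scores):
    """Geometric mean of several scores, e.g. different ROUGE variants."""
    if any(s <= 0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


em = exact_match(" Paris", "paris")
f1 = token_f1("the cat sat", "the cat")
gm = geometric_mean([0.2, 0.8])  # placeholder scores, not from the paper
```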

Results

Baseline models without retrieval, having a 4K sequence length, performed poorly since valuable texts get truncated.
With retrieval, performance for 4K models like LLaMA2–70B-4K and GPT-43B-4K significantly improved.
HotpotQA, a multi-hop dataset, particularly benefits from longer sequence models.
Models with longer contexts (16K, 32K) outperform their 4K counterparts even when fed the same evidence chunks.
There exists a unique “U-shaped” performance curve for LLMs due to the “lost in the middle” phenomenon, making them better at utilizing information at the beginning or end of the input.
The study presents a contrasting perspective to LongBench’s findings, emphasizing that retrieval is beneficial for models regardless of their context window size.

Comparing to OpenAI Models:

The LLaMA2–70B-32k model with retrieval surpasses the performance of GPT-3.5-turbo variants and is competitive with Davinci-003, underscoring its robustness in handling long context tasks.

Comparison of Different Retrievers:

Retrieval consistently enhances the performance across different retrievers.
Public retrievers outperformed proprietary ones like OpenAI embeddings.

Comparing with the number of retrieved chunks:

The best performance is achieved by retrieving the top 5 or 10 chunks. Retrieving more, up to 20 chunks, doesn’t offer additional benefits and can even degrade performance.
The deterioration in performance when adding more chunks could be due to the lost-in-the-middle phenomenon or the model being sidetracked by non-relevant information.

Conclusion

We have looked at how retrieval augmentation and long-context extension interact when applied to leading language models fine-tuned for long-context question-answering and summarization tasks. Here are the main takeaways:

Boost in Performance with Retrieval: Implementing retrieval techniques significantly enhances the performance of both shorter 4K context language models and their longer 16K/32K context counterparts.
Efficiency of 4K Models with Retrieval: 4K context language models, when combined with retrieval augmentation, can achieve performance levels similar to 16K long context models. Plus, they have the added advantage of being faster during the inference process.
Best Model Performance: After enhancing with both context window extension and retrieval augmentation, the standout model, LLaMA2–70B-32k-ret (LLaMA2–70B-32k with retrieval), surpasses well-known models like GPT-3.5-turbo-16k and davinci-003.

We trust that this blog post on the review of the paper on retrieval augmentation with long-context LLMs has furnished you with meaningful insights. We’re keen to hear if your experiments align with our findings or present new perspectives — divergent results always make for interesting discussions and further exploration.

bnew · Oct 26, 2023

https://archive.ph/iHxHS

Quality-Diversity through AI Feedback

qdaif.github.io/

arxiv.org/abs/2310.13032

https://arxiv.org/pdf/2310.13032.pdf

bnew · Oct 26, 2023

https://huggingface.co/amazon/MistralLite

MistralLite Model

MistralLite is a fine-tuned Mistral-7B-v0.1 language model, with enhanced capabilities of processing long context (up to 32K tokens). By utilizing an adapted Rotary Embedding and sliding window during fine-tuning, MistralLite is able to perform significantly better on several long context retrieval and answering tasks, while keeping the simple model structure of the original model. MistralLite is useful for applications such as long context line and topic retrieval, summarization, question-answering, etc. MistralLite can be deployed on a single AWS g5.2x instance with Sagemaker Huggingface Text Generation Inference (TGI) endpoint, making it suitable for applications that require high performance in resource-constrained environments. You can also serve the MistralLite model directly using TGI docker containers. Also, MistralLite supports other ways of serving like vLLM, and you can use MistralLite in Python by using the HuggingFace transformers and FlashAttention-2 library.

MistralLite is similar to Mistral-7B-Instruct-v0.1, and their similarities and differences are summarized below:

Model	Fine-tuned on long contexts	Max context length	RotaryEmbedding adaptation	Sliding Window Size
Mistral-7B-Instruct-v0.1	up to 8K tokens	32K	rope_theta = 10000	4096
MistralLite	up to 16K tokens	32K	rope_theta = 1000000	16384
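To get a feel for what the `rope_theta` column changes, here is a small sketch that computes the longest rotary wavelength for each setting. The formula is the standard RoPE frequency schedule; the head dimension of 128 is an assumption about Mistral-7B and `longest_wavelength` is a hypothetical helper, neither taken from the model card.

```python
import math


def longest_wavelength(rope_theta, head_dim=128):
    """Wavelength (in token positions) of the slowest-rotating RoPE pair.

    The slowest pair has inverse frequency rope_theta ** (-(head_dim - 2) / head_dim).
    """
    slowest_inv_freq = rope_theta ** (-(head_dim - 2) / head_dim)
    return 2 * math.pi / slowest_inv_freq


base_wavelength = longest_wavelength(10_000)     # Mistral-7B-Instruct-v0.1 setting
lite_wavelength = longest_wavelength(1_000_000)  # MistralLite setting
ratio = lite_wavelength / base_wavelength
```

A larger `rope_theta` stretches the slow rotary components, so positions far apart remain distinguishable over a longer span, which is one plausible reading of why the adaptation helps at 16K and beyond.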

Motivation of Developing MistralLite

Since the release of Mistral-7B-Instruct-v0.1, the model became increasingly popular because of its strong performance on a wide range of benchmarks. But most of the benchmarks are evaluated on short context, and not much has been investigated on its performance on long context tasks. We then evaluated Mistral-7B-Instruct-v0.1 against benchmarks that are specifically designed to assess the capabilities of LLMs in handling longer context. Although the performance of the model was fairly competitive on contexts shorter than 4096 tokens, there were some limitations on its performance on longer context. Motivated by improving its performance on longer context, we finetuned the Mistral 7B model, and produced MistralLite. The model managed to significantly boost the performance of long context handling over Mistral-7B-Instruct-v0.1. The detailed long context evaluation results are below:

Topic Retrieval

Model Name	Input length 2851	Input length 5568	Input length 8313	Input length 11044	Input length 13780
Mistral-7B-Instruct-v0.1	100%	50%	2%	0%	0%
MistralLite	100%	100%	100%	100%	98%

Line Retrieval

Model Name	Input length 3818	Input length 5661	Input length 7505	Input length 9354	Input length 11188	Input length 12657
Mistral-7B-Instruct-v0.1	98%	62%	42%	42%	32%	30%
MistralLite	98%	92%	88%	76%	70%	60%

https://github.com/awslabs/extending-the-context-length-of-open-source-llms

https://github.com/awslabs/extending-the-context-length-of-open-source-llms/tree/main/MistralLite#mistrallite-model

bnew · Oct 26, 2023

https://archive.ph/7IIkd

bnew · Oct 26, 2023

Jina AI Launches World's First Open-Source 8K Text Embedding, Rivaling OpenAI

Jina AI introduces jina-embeddings-v2, the world's first open-source model boasting an 8K context length. Matching the prowess of OpenAI's proprietary models, this innovation is now publicly accessible on Huggingface, signaling a significant milestone in the landscape of text embeddings.

jina.ai

Jina AI

October 25, 2023 • 4 minutes read

Berlin, Germany - October 25, 2023 – Jina AI, the Berlin-based artificial intelligence company, is thrilled to announce the launch of its second-generation text embedding model: jina-embeddings-v2. This cutting-edge model is now the only open-source offering that supports an impressive 8K (8192 tokens) context length, putting it on par with OpenAI's proprietary model, text-embedding-ada-002, in terms of both capabilities and performance on the Massive Text Embedding Benchmark (MTEB) leaderboard.

Benchmarking Against the Best 8K Model from OpenAI

When directly compared with OpenAI's 8K model text-embedding-ada-002, jina-embeddings-v2 showcases its mettle. Below is a performance comparison table, highlighting areas where jina-embeddings-v2 particularly excels:

Rank	Model	Model Size (GB)	Embedding Dimensions	Sequence Length	Average (56 datasets)	Classification Average (12 datasets)	Reranking Average (4 datasets)	Retrieval Average (15 datasets)	Summarization Average (1 dataset)
15	text-embedding-ada-002	Unknown	1536	8191	60.99	70.93	84.89	56.32	30.8
17	jina-embeddings-v2-base-en	0.27	768	8192	60.38	73.45	85.38	56.98	31.6

Notably, jina-embeddings-v2 outperforms its OpenAI counterpart in Classification Average, Reranking Average, Retrieval Average, and Summarization Average.

Features and Benefits

Jina AI’s dedication to innovation is evident in this latest offering:

From Scratch to Superiority: The jina-embeddings-v2 was built from the ground up. Over the last three months, the team at Jina AI engaged in intensive R&D, data collection, and tuning. The outcome is a model that marks a significant leap from its predecessor.
Unlocking Extended Context Potential with 8K: jina-embeddings-v2 isn’t just a technical feat; its 8K context length opens doors to new industry applications:
- Legal Document Analysis: Ensure every detail in extensive legal texts is captured and analyzed.
- Medical Research: Embed scientific papers holistically for advanced analytics and discovery.
- Literary Analysis: Dive deep into long-form content, capturing nuanced thematic elements.
- Financial Forecasting: Attain superior insights from detailed financial reports.
- Conversational AI: Improve chatbot responses to intricate user queries.

Benchmarking shows that in several datasets, this extended context enabled jina-embeddings-v2 to outperform other leading base embedding models, emphasizing the practical advantages of longer context capabilities.

Size Options for Different Needs: Understanding the diverse needs of the AI community, Jina AI offers two versions of the model:
- Base Model (0.27G) - Designed for heavy-duty tasks requiring higher accuracy, like academic research or business analytics.
- Small Model (0.07G) - Crafted for lightweight applications such as mobile apps or devices with limited computing resources.
Availability: Both models are freely available for download on Huggingface:

jinaai/jina-embeddings-v2-base-en · Hugging Face

jinaai/jina-embeddings-v2-small-en · Hugging Face
https://huggingface.co/jinaai/jina-embeddings-v2-small-en?ref=jina-ai-gmbh.ghost.io

In reflecting on the journey and significance of this launch, Dr. Han Xiao, CEO of Jina AI, shared his thoughts:

"In the ever-evolving world of AI, staying ahead and ensuring open access to breakthroughs is paramount. With jina-embeddings-v2, we've achieved a significant milestone. Not only have we developed the world's first open-source 8K context length model, but we have also brought it to a performance level on par with industry giants like OpenAI. Our mission at Jina AI is clear: we aim to democratize AI and empower the community with tools that were once confined to proprietary ecosystems. Today, I am proud to say, we have taken a giant leap towards that vision."

This pioneering spirit is evident in Jina AI's forward-looking plans.

A Glimpse into the Future

Jina AI is committed to leading the forefront of innovation in AI. Here’s what’s next on their roadmap:

Academic Insights: An academic paper detailing the technical intricacies and benchmarks of jina-embeddings-v2 will soon be published, allowing the AI community to gain deeper insights.
API Development: The team is in the advanced stages of developing an OpenAI-like embeddings API platform. This will provide users with the capability to effortlessly scale the embedding model according to their needs.
Language Expansion: Venturing into multilingual embeddings, Jina AI is setting its sights on launching German-English models, further expanding its repertoire.

About Jina AI GmbH:

Located at Ohlauer Str. 43 (1st floor), zone A, 10999 Berlin, Germany, Jina AI is at the vanguard of reshaping the landscape of multimodal artificial intelligence. For inquiries, please reach out at contact@jina.ai.

bnew · Oct 26, 2023

GitHub - mpoon/gpt-repository-loader: Convert code repos into an LLM prompt-friendly format. Mostly built by GPT-4.

github.com

gpt-repository-loader

gpt-repository-loader is a command-line tool that converts the contents of a Git repository into a text format, preserving the structure of the files and file contents. The generated output can be interpreted by AI language models, allowing them to process the repository's contents for various tasks, such as code review or documentation generation.

Contributing

Some context around building this is located here. Appreciate any issues and pull requests in the spirit of having mostly GPT build out this tool. Using ChatGPT Plus is recommended for quick access to GPT-4.

Getting Started

To get started with gpt-repository-loader, follow these steps:

Ensure you have Python 3 installed on your system.
Clone or download the gpt-repository-loader repository.
Navigate to the repository's root directory in your terminal.
Run gpt-repository-loader with the following command:
python gpt_repository_loader.py /path/to/git/repository [-p /path/to/preamble.txt] [-o /path/to/output_file.txt]

Replace /path/to/git/repository with the path to the Git repository you want to process. Optionally, you can specify a preamble file with -p or an output file with -o. If not specified, the default output file will be named output.txt in the current directory.
The tool will generate an output.txt file containing the text representation of the repository. You can now use this file as input for AI language models or other text-based processing tasks.
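For anyone curious what "converting a repository into text" looks like in practice, here is a rough sketch of the idea. It is not the tool's actual code or output format: `repository_to_text`, the `----` separator, and the `.git` skip rule are assumptions made for illustration.

```python
import os
import tempfile


def repository_to_text(repo_path, preamble=""):
    """Walk a directory and emit each readable text file as path + contents."""
    parts = [preamble] if preamble else []
    for root, dirs, files in os.walk(repo_path):
        # Skip Git metadata and keep traversal order deterministic.
        dirs[:] = sorted(d for d in dirs if d != ".git")
        for name in sorted(files):
            full_path = os.path.join(root, name)
            rel_path = os.path.relpath(full_path, repo_path)
            try:
                with open(full_path, "r", encoding="utf-8") as fh:
                    content = fh.read()
            except (UnicodeDecodeError, OSError):
                continue  # ignore binary or unreadable files
            # Assumed layout: separator line, relative path, then file contents.
            parts.append(f"----\n{rel_path}\n{content}")
    return "\n".join(parts)


# Tiny demo on a throwaway directory standing in for a real repository.
with tempfile.TemporaryDirectory() as demo_repo:
    with open(os.path.join(demo_repo, "a.py"), "w", encoding="utf-8") as fh:
        fh.write("print('hi')\n")
    demo_text = repository_to_text(demo_repo, preamble="Files follow.")
```

Keeping the relative path next to each file's contents is what lets a model reason about the repository layout from a single flat prompt.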

Running Tests

To run the tests for gpt-repository-loader, follow these steps:

Ensure you have Python 3 installed on your system.
Navigate to the repository's root directory in your terminal.
Run the tests with the following command:
python -m unittest test_gpt_repository_loader.py

Now, the test harness is added to the gpt-repository-loader project. You can run the tests by executing the command python -m unittest test_gpt_repository_loader.py in your terminal.

License

This project is licensed under the MIT License - see the LICENSE file for details.

bnew · Oct 26, 2023

AI ‘breakthrough’: neural net has human-like ability to generalize language

A neural-network-based artificial intelligence outperforms ChatGPT at quickly folding new words into its lexicon, a key aspect of human intelligence.

www.nature.com

A version of the human ability to apply new vocabulary in flexible ways has been achieved by a neural network. Credit: marrio31/Getty

Scientists have created a neural network with the human-like ability to make generalizations about language [1]. The artificial intelligence (AI) system performs about as well as humans at folding newly learned words into an existing vocabulary and using them in fresh contexts, which is a key aspect of human cognition known as systematic generalization.

The researchers gave the same task to the AI model that underlies the chatbot ChatGPT, and found that it performs much worse on such a test than either the new neural net or people, despite the chatbot’s uncanny ability to converse in a human-like manner.

The work, published on 25 October in Nature, could lead to machines that interact with people more naturally than do even the best AI systems today. Although systems based on large language models, such as ChatGPT, are adept at conversation in many contexts, they display glaring gaps and inconsistencies in others.

The neural network’s human-like performance suggests there has been a “breakthrough in the ability to train networks to be systematic”, says Paul Smolensky, a cognitive scientist who specializes in language at Johns Hopkins University in Baltimore, Maryland.

Language lessons

Systematic generalization is demonstrated by people’s ability to effortlessly use newly acquired words in new settings. For example, once someone has grasped the meaning of the word ‘photobomb’, they will be able to use it in a variety of situations, such as ‘photobomb twice’ or ‘photobomb during a Zoom call’. Similarly, someone who understands the sentence ‘the cat chases the dog’ will also understand ‘the dog chases the cat’ without much extra thought.

But this ability does not come innately to neural networks, a method of emulating human cognition that has dominated artificial-intelligence research, says Brenden Lake, a cognitive computational scientist at New York University and co-author of the study. Unlike people, neural nets struggle to use a new word until they have been trained on many sample texts that use that word. AI researchers have sparred for nearly 40 years as to whether neural networks could ever be a plausible model of human cognition if they cannot demonstrate this type of systematicity.

To attempt to settle this debate, the authors first tested 25 people on how well they deploy newly learnt words to different situations. The researchers ensured the participants would be learning the words for the first time by testing them on a pseudo-language consisting of two categories of nonsense words. ‘Primitive’ words such as ‘dax,’ ‘wif’ and ‘lug’ represented basic, concrete actions such as ‘skip’ and ‘jump’. More abstract ‘function’ words such as ‘blicket’, ‘kiki’ and ’fep’ specified rules for using and combining the primitives, resulting in sequences such as ‘jump three times’ or ‘skip backwards’.

Participants were trained to link each primitive word with a circle of a particular colour, so a red circle represents ‘dax’, and a blue circle represents ‘lug’. The researchers then showed the participants combinations of primitive and function words alongside the patterns of circles that would result when the functions were applied to the primitives. For example, the phrase ‘dax fep’ was shown with three red circles, and ‘lug fep’ with three blue circles, indicating that fep denotes an abstract rule to repeat a primitive three times.

Finally, the researchers tested participants’ ability to apply these abstract rules by giving them complex combinations of primitives and functions. They then had to select the correct colour and number of circles and place them in the appropriate order.
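The pseudo-language task can be mimicked with a tiny interpreter. The mappings for 'dax', 'lug', and 'fep' follow the article's examples. How 'fep' binds in longer phrases (here: to the primitive immediately before it) is an assumption, and `interpret` is a hypothetical helper, not the study's code.

```python
# Primitive words and their circle colours, as given in the article's example.
PRIMITIVES = {"dax": "red", "lug": "blue"}


def interpret(phrase):
    """Translate a phrase of primitives and the function word 'fep' into circles."""
    circles = []
    for word in phrase.split():
        if word in PRIMITIVES:
            circles.append(PRIMITIVES[word])
        elif word == "fep":
            if not circles:
                raise ValueError("'fep' needs a primitive before it")
            # Assumed rule: repeat the preceding primitive so it appears three times.
            circles.extend([circles[-1]] * 2)
        else:
            raise ValueError(f"unknown word: {word}")
    return circles


dax_fep = interpret("dax fep")
lug_fep = interpret("lug fep")
combined = interpret("dax lug fep")
```

Systematic generalization in this setting means producing the right circles for a combination such as 'dax lug fep' after only seeing 'fep' applied to single primitives.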

Cognitive benchmark

As predicted, people excelled at this task; they chose the correct combination of coloured circles about 80% of the time, on average. When they did make errors, the researchers noticed that these followed a pattern that reflected known human biases.

Next, the researchers trained a neural network to do a task similar to the one presented to participants, by programming it to learn from its mistakes. This approach allowed the AI to learn as it completed each task rather than using a static data set, which is the standard approach to training neural nets. To make the neural net human-like, the authors trained it to reproduce the patterns of errors they observed in humans’ test results. When the neural net was then tested on fresh puzzles, its answers corresponded almost exactly to those of the human volunteers, and in some cases exceeded their performance.

By contrast, GPT-4 struggled with the same task, failing, on average, between 42 and 86% of the time, depending on how the researchers presented the task. “It’s not magic, it’s practice,” Lake says. “Much like a child also gets practice when learning their native language, the models improve their compositional skills through a series of compositional learning tasks.”

Melanie Mitchell, a computer and cognitive scientist at the Santa Fe Institute in New Mexico, says this study is an interesting proof of principle, but it remains to be seen whether this training method can scale up to generalize across a much larger data set or even to images. Lake hopes to tackle this problem by studying how people develop a knack for systematic generalization from a young age, and incorporating those findings to build a more robust neural net.

Elia Bruni, a specialist in natural language processing at the University of Osnabrück in Germany, says this research could make neural networks more-efficient learners. This would reduce the gargantuan amount of data necessary to train systems such as ChatGPT and would minimize ‘hallucination’, which occurs when AI perceives patterns that are non-existent and creates inaccurate outputs. “Infusing systematicity into neural networks is a big deal,” Bruni says. “It could tackle both these issues at the same time.”

doi: https://doi.org/10.1038/d41586-023-03272-3

References

Lake, B. M. & Baroni, M. Nature https://doi.org/10.1038/s41586-023-06668-3 (2023).

bnew · Oct 26, 2023

https://www.tomshardware.com/news/ai-learns-like-a-pigeon-researchers-say

AI Learns Like a Pigeon, Researchers Say

By Mark Tyson
published about 22 hours ago

Both pigeons and AI models can be better than humans at solving some complex tasks

Researchers at Ohio State University have found that pigeons tackle some problems in a very similar way to modern computer AI models. In essence, pigeons have been found to use a ‘brute force’ learning method called "associative learning." Thus pigeons, and modern computer AIs, can reach solutions to complex problems that befuddle human thinking patterns.

Brandon Turner, lead author of the new study and professor of psychology at Ohio State University, worked with Edward Wasserman, a professor of psychology at the University of Iowa, on the research, which was published in iScience.

Here are the key findings:

Pigeons can solve an exceptionally broad range of visual categorization tasks
Some of these tasks seem to require advanced cognitive and attentional processes, yet computational modeling indicates that pigeons don’t deploy such complex processes
A simple associative mechanism may be sufficient to account for the pigeon’s success

Turner told the Ohio State news blog that the research started with a strong hunch that pigeons learned in a similar way to computer AIs. Initial research confirmed earlier thoughts and observations. “We found really strong evidence that the mechanisms guiding pigeon learning are remarkably similar to the same principles that guide modern machine learning and AI techniques,” said Turner.

A pigeon’s “associative learning” can find solutions to complex problems that are hard to reach by humans or other primates. Primate thinking is typically steered by selective attention and explicit rule use, which can get in the way of solving some problems.

For the study, pigeons were tested on four tasks. In the easier tasks, the pigeons learned the correct choices over time and raised their success rates from about 55% to 95%. The most complex tasks didn’t show such a stark improvement over the study period, going from 55% to only 68%. Nevertheless, the results showed close parallels between pigeon performance and AI model learning performance. Both pigeon and machine learners seemed to use associative learning and error correction to steer their decisions toward success.
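
To give a feel for what learning by association plus error correction can look like in code, here is a minimal Python sketch of a perceptron-style learner. It is a generic textbook rule, not the computational model used in the study. The function name train_error_correction, the four two-feature stimuli and their category labels are all made up for illustration.

```python
import random


def train_error_correction(stimuli, labels, epochs=50, lr=0.5, seed=0):
    """Learn feature-to-category associations by trial and error.

    Weights change only after a wrong guess (error correction).
    Returns the weights, the bias and the accuracy in each epoch.
    """
    rng = random.Random(seed)
    weights = [0.0] * len(stimuli[0])
    bias = 0.0
    history = []
    for _ in range(epochs):
        order = list(range(len(stimuli)))
        rng.shuffle(order)  # present the stimuli in a random order
        correct = 0
        for i in order:
            x, target = stimuli[i], labels[i]
            activation = bias + sum(w * xi for w, xi in zip(weights, x))
            guess = 1 if activation > 0 else 0
            error = target - guess  # 0 when right, +1 or -1 when wrong
            if error == 0:
                correct += 1
            else:
                # Strengthen or weaken each feature's association.
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
        history.append(correct / len(stimuli))
    return weights, bias, history


# Hypothetical stimuli: category 1 whenever the first feature is larger.
STIMULI = [(1, 0), (0, 1), (2, 1), (1, 2)]
LABELS = [1, 0, 1, 0]

weights, bias, history = train_error_correction(STIMULI, LABELS)
print("first epoch accuracy:", history[0])
print("final epoch accuracy:", history[-1])
```

The learner never forms an explicit rule such as ‘pick the larger feature’. It only nudges its associations after each mistake, and on this small, linearly separable toy set that is enough for it to end up classifying every stimulus correctly.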

Turner offered further insight when comparing human, pigeon and AI learning. He noted that some of the tasks would really frustrate humans, because making rules wouldn’t help simplify the problems, leading people to give up. Meanwhile, for pigeons (and machine AIs), in some tasks “this brute force way of trial and error and associative learning... helps them perform better than humans.”

Interestingly, the study recalls that in his Letter to the Marquess of Newcastle (1646), French philosopher René Descartes argued that animals were nothing more than beastly mechanisms — bête-machines, simply following impulses from organic reactions.

The conclusion of the Ohio State blog highlighted how humans have traditionally looked down upon pigeons as dim-witted. Now we have to admit something: our latest crowning technological achievement of computer AI relies on relatively simple brute-force pigeon-like learning mechanisms.

Will this new research have any influence on computer science going forward? It seems like those involved in AI / machine learning and those developing neuromorphic computing might find some useful crossover here.

bnew · Oct 27, 2023

https://arstechnica.com/information-technology/2023/10/people-are-speaking-with-chatgpt-for-hours-bringing-2013s-her-closer-to-reality/

People are speaking with ChatGPT for hours, bringing 2013’s Her closer to reality

Long mobile conversations with the AI assistant using AirPods echo the sci-fi film.

BENJ EDWARDS

In 2013, Spike Jonze's Her imagined a world where humans form deep emotional connections with AI, challenging perceptions of love and loneliness. Ten years later, thanks to ChatGPT's recently added voice features, people are playing out a small slice of Her in reality, having hours-long discussions with the AI assistant on the go.
