bnew

Veteran
Joined
Nov 1, 2015
Messages
51,795
Reputation
7,926
Daps
148,646

⭐️ Multi-Vector Retriever for RAG on tables, text, and images ⭐

Seamless question-answering across diverse data types (images, text, tables) is one of the holy grails of RAG.


We’re releasing three new cookbooks that showcase the multi-vector retriever to tackle this challenge.


We released the multi-vector retriever back in August w/ a simple idea:

1/ embed a doc reference (e.g., summary) that is optimized for search

2/ but retrieve the raw doc (table, text, image) to give complete context for LLM answer synthesis

Using @UnstructuredIO to parse images, text, and tables (e.g., from pdfs) ...

.. the multi-vector retriever enables RAG on semi-structured data w/ table summaries - github.com/langchain-ai/lang…

..we extend the idea to images, using LLaVA-7b (c/o @imhaotian) to produce image summaries - github.com/langchain-ai/lang…

... and this full RAG pipeline can be run on a laptop w/ llama.cpp c/o @ggerganov, @ollama_ai, @nomic_ai embeddings, and @trychroma: github.com/langchain-ai/lang… Blog: blog.langchain.dev/semi-stru…
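
The two-step idea above (search over a summary, but hand the raw document to the LLM) can be sketched in a few lines. This is only a toy illustration, not the LangChain implementation: the `MultiVectorStore` class, the bag-of-words `embed` function, and the sample documents are all hypothetical stand-ins for a real embedding model, vector store, and document store.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class MultiVectorStore:
    """Hypothetical store: indexes summaries, returns the raw documents."""

    def __init__(self):
        self.index = []     # list of (summary_vector, doc_id), used only for search
        self.docstore = {}  # doc_id -> raw document (table, text, image reference)

    def add(self, doc_id, raw_doc, summary):
        self.docstore[doc_id] = raw_doc
        self.index.append((embed(summary), doc_id))

    def retrieve(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.index, key=lambda entry: cosine(q, entry[0]), reverse=True)
        # Return the raw documents, not the summaries that were matched.
        return [self.docstore[doc_id] for _, doc_id in ranked[:k]]


store = MultiVectorStore()
store.add("t1", "| quarter | revenue |\n| Q1 | 10 |\n| Q2 | 12 |", "table of quarterly revenue figures")
store.add("p1", "The team met on Monday to plan the launch.", "paragraph about a launch planning meeting")
top = store.retrieve("what was the revenue each quarter", k=1)
```

The design point is the split between what is embedded (the summary) and what is returned (the raw table), linked only by a shared document id.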
 

bnew





Qwen Technical Report​

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, Tianhang Zhu
Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive. The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter. Furthermore, we have developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat, which are built upon base language models. These models demonstrate significantly improved performance in comparison with open-source models, and slightly fall behind the proprietary models.
Comments:59 pages, 5 figures
Subjects:Computation and Language (cs.CL)
Cite as:arXiv:2309.16609 [cs.CL]
(or arXiv:2309.16609v1 [cs.CL] for this version)
[2309.16609] Qwen Technical Report


Submission history​

From: An Yang

[v1] Thu, 28 Sep 2023 17:07:49 UTC (995 KB)

 

bnew


BERT explained: Training (Masked Language Model, Next Sentence Prediction), Inference, Self-Attention, [CLS] token, Left and Right context, Comparative analysis BERT vs GPT/LLamA, Fine tuning, Text Classification, Question Answering​

 

bnew


Big update: Meta's Long Llama beats GPT-3.5 in long contexts and goes toe-to-toe with GPT-4 in summarization.

Highlights:
▸ Context: Supports up to 32k.
▸ Performance: Matches GPT-4 in summarizing, beats GPT-3.5 in long tasks.
▸ Efficiency: 40% less computing cost, same performance.

Technical Stuff:
▸ Positional Encoding: Tweaks made for better long-text handling.
▸ Extra Training: More datasets used, including longer text.

Instruction Tuning:
▸ QA Tasks: Generated from long docs.
▸ Validation: Llama 2 70B checked the QA pairs.
▸ Fine-Tuning: Used synthetic and short instruction data.

Full paper here: arxiv.org/abs/2309.16039



Effective Long-Context Scaling of Foundation Models​

Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshytiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, Hao Ma
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens. Our model series are built through continual pretraining from Llama 2 with longer training sequences and on a dataset where long texts are upsampled. We perform extensive evaluation on language modeling, synthetic context probing tasks, and a wide range of research benchmarks. On research benchmarks, our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2. Notably, with a cost-effective instruction tuning procedure that does not require human-annotated long instruction data, the 70B variant can already surpass gpt-3.5-turbo-16k's overall performance on a suite of long-context tasks. Alongside these results, we provide an in-depth analysis on the individual components of our method. We delve into Llama's position encodings and discuss its limitation in modeling long dependencies. We also examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths -- our ablation experiments suggest that having abundant long texts in the pretrain dataset is not the key to achieving strong performance, and we empirically verify that long context continual pretraining is more efficient and similarly effective compared to pretraining from scratch with long sequences.
Subjects:Computation and Language (cs.CL)
Cite as:arXiv:2309.16039 [cs.CL]
(or arXiv:2309.16039v2 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2309.16039

Submission history​

From: Wenhan Xiong
[v1] Wed, 27 Sep 2023 21:41:49 UTC (2,078 KB)
[v2] Tue, 17 Oct 2023 17:32:17 UTC (2,078 KB)

 

bnew


NVIDIA Research: RAG with Long Context LLMs​

Ravi Theja

Published in
LlamaIndex Blog


Introduction​

Why Long Context Matters and How Retrieval Augmentation Steps In:​

Two methods have gained traction in the LLM landscape: expanding the context window of Large Language Models (LLMs) and enhancing these models with retrieval capabilities. The continued evolution of GPU technology, coupled with breakthroughs in attention mechanisms, has given rise to long-context LLMs. Simultaneously, retrieval, where an LLM picks up only the most relevant context from a standalone retriever, promises gains in efficiency and speed.

In the midst of these evolving narratives, some interesting questions emerge:
  1. Retrieval-augmentation versus long context window, which one is better for downstream tasks?
  2. Can both methods be combined to get the best of both worlds?

To dissect these questions, this blog post turns to NVIDIA’s recent study, which uses two powerful LLMs, the proprietary GPT-43B and LLaMA2–70B, and strives to provide actionable insights for AI practitioners.

Prior Research and the NVIDIA Divergence:​

While NVIDIA’s findings are interesting in many respects, another recent work by Bai et al. (2023) ventured into similar territory with differing outcomes.

Their work explored the impact of retrieval on long context LLMs, evaluating models like GPT-3.5-Turbo-16k and Llama2–7B-chat-4k. However, their findings diverge from NVIDIA’s in crucial ways. Bai et al. discerned that retrieval was beneficial only for the Llama2–7B-chat-4k with a 4K context window, but not for extended context models like GPT-3.5-Turbo-16k. One hypothesis for this difference centers on the challenges tied to experiments using black-box APIs and the smaller white-box LLMs they employed, which potentially had limited capability to integrate context through retrieval.

NVIDIA’s work distinguishes itself by tapping into much larger LLMs, yielding results that not only match top-tier models like ChatGPT-3.5 but even indicate further enhancements when incorporating retrieval methods.

Models, Datasets, and Evaluation Metrics​

Large Language Models (LLMs) Explored:​

The researchers delved deep into the potential of large language models for tasks like generative QA and summarization. Specifically, two models were the primary focus:
  • Nemo GPT-43B: A proprietary 43 billion parameter model trained on 1.1T tokens, 70% of which were in English. This model was fed a rich diet of web archives, Wikipedia, Reddit, books, and more. It contains 48 layers and is trained using RoPE embeddings.
  • LLaMA2–70B: A publicly available 70B parameter model trained on 2T tokens, primarily in English. It’s structured with 80 layers and also utilizes RoPE embeddings.

Context Window Extension:​

To enhance the models’ capability to process longer contexts, their initial 4K context window length was augmented. The GPT-43B was modified to handle 16K, while the LLaMA2–70B was expanded to both 16K and 32K, employing the position interpolation method.
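
As a rough illustration of the position interpolation idea mentioned above, the sketch below rescales position indices from the extended window back into the originally trained 4K range before computing rotary angles. The function names, the tiny `dim=8`, and the 4096 to 16384 lengths are illustrative assumptions, not NVIDIA's actual code or settings beyond the window sizes stated in the post.

```python
def rope_angles(position, dim=8, base=10000.0):
    """Rotation angles RoPE applies at a given (possibly fractional) position."""
    return [position / (base ** (2 * i / dim)) for i in range(dim // 2)]


def interpolate_position(position, trained_len=4096, extended_len=16384):
    """Position interpolation: linearly squeeze extended positions into the trained range."""
    return position * trained_len / extended_len


# The last position of a 16K window is mapped just inside the original 4K range,
# so the model never sees rotation angles larger than those it was trained on.
last_extended = interpolate_position(16383)
angles = rope_angles(last_extended)
```

The trade-off is that neighbouring tokens end up closer together in position space, which is why the extended models are fine-tuned afterwards.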

Instruction Tuning:​

To optimize the LLMs for the tasks at hand, instruction tuning was implemented. A diverse dataset blend, comprising sources like Soda, ELI5, FLAN, and others, was created. A consistent format template was adopted for multi-turn dialogue training, and the models were meticulously fine-tuned to accentuate the answer segment.

Retrieval Models Tested:​

Three retrieval systems were put to the test:
  • Dragon: A state-of-the-art dual encoder model for both supervised and zero-shot information retrieval.
  • Contriever: Utilizes a basic contrastive learning framework and operates unsupervised.
  • OpenAI embedding: The latest version was used, accepting a maximum input of 8,191 tokens.

The retrieval approach entailed segmenting each document into 300-word sections, encoding both questions and these chunks, and then merging the most pertinent chunks for response generation.
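
The chunk, encode, and merge procedure described above could look roughly like this. It is a minimal sketch only: the bag-of-words `embed` function is a toy stand-in for Dragon, Contriever, or OpenAI embeddings, and the helper names are hypothetical.

```python
import math
from collections import Counter


def chunk_words(text, size=300):
    """Split a document into consecutive sections of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy bag-of-words encoder (assumption: replaces a real dense retriever)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_context(question, document, k=5, size=300):
    """Rank chunks against the question and merge the top-k into one context string."""
    chunks = chunk_words(document, size)
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return "\n\n".join(ranked[:k])
```

The merged string would then be placed in the prompt ahead of the question; `k` corresponds to the number of retrieved chunks compared in the results.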

Datasets Used for Evaluation:​

The study employed seven diverse datasets, sourced from the SCROLLS benchmark and LongBench.

A snapshot of these datasets includes:
  • QMSum: A query-based summarization dataset, QMSum consists of transcripts from diverse meetings and their corresponding summaries, built upon contextual queries.
  • Qasper: A question-answering dataset centered on NLP papers, Qasper offers a mix of abstractive, extractive, yes/no, and unanswerable questions from the Semantic Scholar Open Research Corpus.
  • NarrativeQA: Aimed at question-answering over entire books and movie scripts, NarrativeQA provides question-answer pairs created from summaries of these extensive sources.
  • QuALITY: A multiple-choice question answering set based on stories and articles, QuALITY emphasizes thorough reading, with half the questions designed to be challenging and require careful consideration.
  • MuSiQue: Designed for multi-hop reasoning in question answering, MuSiQue creates multi-hop questions from single-hop ones, emphasizing connected reasoning and minimizing shortcuts.
  • HotpotQA: Based on Wikipedia, HotpotQA requires reading multiple supporting documents for reasoning. It features diverse questions and provides sentence-level support for answers.
  • MultiFieldQA-en: Curated to test long-context understanding across fields, MFQA uses sources like legal documents and academic papers, with annotations done by Ph.D. students.

Evaluation Metrics:​

The research team used metrics suited to each dataset: the geometric mean of ROUGE scores for QMSum, the exact match (EM) score for QuALITY, and F1 scores for the others.
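
For readers unfamiliar with these metrics, here is a simplified sketch of how each could be computed. It assumes the individual ROUGE scores are already available from another tool, and it uses plain lowercasing and whitespace tokenization rather than whatever normalization the study applied.

```python
import math
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction, reference):
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def geometric_mean(scores):
    """Geometric mean of several scores (e.g. precomputed ROUGE variants)."""
    if any(s <= 0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))
```

For example, `token_f1("the cat sat", "the cat")` has precision 2/3 and recall 1, giving an F1 of 0.8.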

Results​

  • Baseline models without retrieval, having a 4K sequence length, performed poorly since valuable texts get truncated.
  • With retrieval, performance for 4K models like LLaMA2–70B-4K and GPT-43B-4K significantly improved.
  • HotpotQA, a multi-hop dataset, particularly benefits from longer sequence models.
  • Models with longer contexts (16K, 32K) outperform their 4K counterparts even when fed the same evidence chunks.
  • LLMs show a “U-shaped” performance curve due to the lost-in-the-middle phenomenon: they are better at utilizing information at the beginning or end of the input.
  • The study presents a contrasting perspective to LongBench’s findings, emphasizing that retrieval is beneficial for models regardless of their context window size.

Comparing to OpenAI Models:​

  • The LLaMA2–70B-32k model with retrieval surpasses the performance of GPT-3.5-turbo variants and is competitive with Davinci-003, underscoring its robustness in handling long context tasks.

Comparison of Different Retrievers:​

  • Retrieval consistently enhances the performance across different retrievers.
  • Public retrievers outperformed proprietary ones like OpenAI embeddings.

Comparing with the number of retrieved chunks:​

  • The best performance is achieved by retrieving the top 5 or 10 chunks. Retrieving more, up to 20 chunks, doesn’t offer additional benefits and can even degrade performance.
  • The deterioration in performance when adding more chunks could be due to the lost-in-the-middle phenomenon or the model being sidetracked by non-relevant information.

Conclusion​

We set out to understand how retrieval augmentation and long-context extension interact when applied to leading language models fine-tuned for long-context question-answering and summarization tasks. Here are the key takeaways:
  1. Boost in Performance with Retrieval: Implementing retrieval techniques significantly enhances the performance of both shorter 4K context language models and their longer 16K/32K context counterparts.
  2. Efficiency of 4K Models with Retrieval: 4K context language models, when combined with retrieval augmentation, can achieve performance levels similar to 16K long context models. Plus, they have the added advantage of being faster during the inference process.
  3. Best Model Performance: After enhancing with both context window extension and retrieval augmentation, the standout model, LLaMA2–70B-32k-ret (LLaMA2–70B-32k with retrieval), surpasses well-known models like GPT-3.5-turbo-16k and davinci-003.

References:​

  1. Retrieval meets long context, large language models.
  2. Longbench: A bilingual, multitask benchmark for long context understanding.

We trust that this blog post on the review of the paper on retrieval augmentation with long-context LLMs has furnished you with meaningful insights. We’re keen to hear if your experiments align with our findings or present new perspectives — divergent results always make for interesting discussions and further exploration.
 

bnew


MistralLite Model​

MistralLite is a fine-tuned Mistral-7B-v0.1 language model, with enhanced capabilities of processing long context (up to 32K tokens). By utilizing an adapted Rotary Embedding and sliding window during fine-tuning, MistralLite is able to perform significantly better on several long context retrieval and answering tasks, while keeping the simple model structure of the original model. MistralLite is useful for applications such as long context line and topic retrieval, summarization, and question-answering. MistralLite can be deployed on a single AWS g5.2x instance with Sagemaker Huggingface Text Generation Inference (TGI) endpoint, making it suitable for applications that require high performance in resource-constrained environments. You can also serve the MistralLite model directly using TGI docker containers. MistralLite also supports other ways of serving, such as vLLM, and you can use MistralLite in Python with the HuggingFace transformers and FlashAttention-2 libraries.

MistralLite is similar to Mistral-7B-Instruct-v0.1, and their similarities and differences are summarized below:

| Model | Fine-tuned on long contexts | Max context length | RotaryEmbedding adaptation | Sliding Window Size |
|---|---|---|---|---|
| Mistral-7B-Instruct-v0.1 | up to 8K tokens | 32K | rope_theta = 10000 | 4096 |
| MistralLite | up to 16K tokens | 32K | rope_theta = 1000000 | 16384 |
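
To give a feel for what changing rope_theta does, the sketch below computes the wavelength of each rotary frequency pair for the two values in the comparison above. It uses the standard RoPE frequency formula; the head dimension of 128 is an assumption made for illustration, and this is not code from the MistralLite repository.

```python
import math


def rope_wavelengths(head_dim, rope_theta):
    """Wavelength, in positions, of each rotary frequency pair for a given base."""
    return [2 * math.pi * rope_theta ** (2 * i / head_dim) for i in range(head_dim // 2)]


# head_dim=128 is an illustrative assumption.
for theta in (10_000, 1_000_000):
    longest = max(rope_wavelengths(128, theta))
    print(f"rope_theta={theta}: longest wavelength ~ {longest:,.0f} positions")
```

A larger base stretches the slowest-rotating dimensions over many more positions, which is one common way to make distant tokens distinguishable in long inputs.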

Motivation of Developing MistralLite​

Since the release of Mistral-7B-Instruct-v0.1, the model has become increasingly popular because of its strong performance on a wide range of benchmarks. But most of these benchmarks evaluate short context, and not much has been investigated about its performance on long context tasks. We therefore evaluated Mistral-7B-Instruct-v0.1 against benchmarks that are specifically designed to assess the capabilities of LLMs in handling longer context. Although the model was fairly competitive on contexts shorter than 4096 tokens, its performance on longer context showed some limitations. Motivated by improving its performance on longer context, we fine-tuned the Mistral 7B model and produced MistralLite. The model managed to significantly boost the performance of long context handling over Mistral-7B-Instruct-v0.1. The detailed long context evaluation results are as below:

  1. Topic Retrieval

    | Model Name | Input length 2851 | Input length 5568 | Input length 8313 | Input length 11044 | Input length 13780 |
    |---|---|---|---|---|---|
    | Mistral-7B-Instruct-v0.1 | 100% | 50% | 2% | 0% | 0% |
    | MistralLite | 100% | 100% | 100% | 100% | 98% |

  2. Line Retrieval

    | Model Name | Input length 3818 | Input length 5661 | Input length 7505 | Input length 9354 | Input length 11188 | Input length 12657 |
    |---|---|---|---|---|---|---|
    | Mistral-7B-Instruct-v0.1 | 98% | 62% | 42% | 42% | 32% | 30% |
    | MistralLite | 98% | 92% | 88% | 76% | 70% | 60% |


 

bnew


Jina AI Launches World's First Open-Source 8K Text Embedding, Rivaling OpenAI




Jina AI

October 25, 2023 • 4 minutes read

Berlin, Germany - October 25, 2023 – Jina AI, the Berlin-based artificial intelligence company, is thrilled to announce the launch of its second-generation text embedding model: jina-embeddings-v2. This cutting-edge model is now the only open-source offering that supports an impressive 8K (8192 tokens) context length, putting it on par with OpenAI's proprietary model, text-embedding-ada-002, in terms of both capabilities and performance on the Massive Text Embedding Benchmark (MTEB) leaderboard.

Benchmarking Against the Best 8K Model from OpenAI

When directly compared with OpenAI's 8K model text-embedding-ada-002, jina-embeddings-v2 showcases its mettle. Below is a performance comparison table, highlighting areas where jina-embeddings-v2 particularly excels:

| Rank | Model | Model Size (GB) | Embedding Dimensions | Sequence Length | Average (56 datasets) | Classification Average (12 datasets) | Reranking Average (4 datasets) | Retrieval Average (15 datasets) | Summarization Average (1 dataset) |
|---|---|---|---|---|---|---|---|---|---|
| 15 | text-embedding-ada-002 | Unknown | 1536 | 8191 | 60.99 | 70.93 | 84.89 | 56.32 | 30.8 |
| 17 | jina-embeddings-v2-base-en | 0.27 | 768 | 8192 | 60.38 | 73.45 | 85.38 | 56.98 | 31.6 |

Notably, jina-embeddings-v2 outperforms its OpenAI counterpart in Classification Average, Reranking Average, Retrieval Average, and Summarization Average.

Features and Benefits

Jina AI’s dedication to innovation is evident in this latest offering:

  • From Scratch to Superiority: The jina-embeddings-v2 was built from the ground up. Over the last three months, the team at Jina AI engaged in intensive R&D, data collection, and tuning. The outcome is a model that marks a significant leap from its predecessor.
  • Unlocking Extended Context Potential with 8K: jina-embeddings-v2 isn’t just a technical feat; its 8K context length opens doors to new industry applications:
    • Legal Document Analysis: Ensure every detail in extensive legal texts is captured and analyzed.
    • Medical Research: Embed scientific papers holistically for advanced analytics and discovery.
    • Literary Analysis: Dive deep into long-form content, capturing nuanced thematic elements.
    • Financial Forecasting: Attain superior insights from detailed financial reports.
    • Conversational AI: Improve chatbot responses to intricate user queries.

Benchmarking shows that in several datasets, this extended context enabled jina-embeddings-v2 to outperform other leading base embedding models, emphasizing the practical advantages of longer context capabilities.


  • Size Options for Different Needs: Understanding the diverse needs of the AI community, Jina AI offers two versions of the model:
    • Base Model (0.27G) - Designed for heavy-duty tasks requiring higher accuracy, like academic research or business analytics.
    • Small Model (0.07G) - Crafted for lightweight applications such as mobile apps or devices with limited computing resources.
  • Availability: Both models are freely available for download on Huggingface:

jinaai/jina-embeddings-v2-base-en · Hugging Face


jinaai/jina-embeddings-v2-small-en · Hugging Face
https://huggingface.co/jinaai/jina-embeddings-v2-small-en?ref=jina-ai-gmbh.ghost.io



In reflecting on the journey and significance of this launch, Dr. Han Xiao, CEO of Jina AI, shared his thoughts:

"In the ever-evolving world of AI, staying ahead and ensuring open access to breakthroughs is paramount. With jina-embeddings-v2, we've achieved a significant milestone. Not only have we developed the world's first open-source 8K context length model, but we have also brought it to a performance level on par with industry giants like OpenAI. Our mission at Jina AI is clear: we aim to democratize AI and empower the community with tools that were once confined to proprietary ecosystems. Today, I am proud to say, we have taken a giant leap towards that vision."

This pioneering spirit is evident in Jina AI's forward-looking plans.

A Glimpse into the Future

Jina AI is committed to leading the forefront of innovation in AI. Here’s what’s next on their roadmap:

  • Academic Insights: An academic paper detailing the technical intricacies and benchmarks of jina-embeddings-v2 will soon be published, allowing the AI community to gain deeper insights.
  • API Development: The team is in the advanced stages of developing an OpenAI-like embeddings API platform. This will provide users with the capability to effortlessly scale the embedding model according to their needs.
  • Language Expansion: Venturing into multilingual embeddings, Jina AI is setting its sights on launching German-English models, further expanding its repertoire.


About Jina AI GmbH:

Located at Ohlauer Str. 43 (1st floor), zone A, 10999 Berlin, Germany, Jina AI is at the vanguard of reshaping the landscape of multimodal artificial intelligence. For inquiries, please reach out at contact@jina.ai.



 

bnew


gpt-repository-loader

gpt-repository-loader is a command-line tool that converts the contents of a Git repository into a text format, preserving the structure of the files and file contents. The generated output can be interpreted by AI language models, allowing them to process the repository's contents for various tasks, such as code review or documentation generation.

Contributing

Some context around building this is located here. Appreciate any issues and pull requests in the spirit of having mostly GPT build out this tool. Using ChatGPT Plus is recommended for quick access to GPT-4.

Getting Started

To get started with gpt-repository-loader, follow these steps:
  1. Ensure you have Python 3 installed on your system.
  2. Clone or download the gpt-repository-loader repository.
  3. Navigate to the repository's root directory in your terminal.
  4. Run gpt-repository-loader with the following command:
    python gpt_repository_loader.py /path/to/git/repository [-p /path/to/preamble.txt] [-o /path/to/output_file.txt]

    Replace /path/to/git/repository with the path to the Git repository you want to process. Optionally, you can specify a preamble file with -p or an output file with -o. If not specified, the default output file will be named output.txt in the current directory.
  5. The tool will generate an output.txt file containing the text representation of the repository. You can now use this file as input for AI language models or other text-based processing tasks.
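
To make the idea of a "text representation of the repository" concrete, here is a rough sketch of such a conversion. It is not the tool's actual code: the `repo_to_text` function and the `----` separator followed by the relative path are illustrative assumptions about one possible format.

```python
import os


def repo_to_text(repo_path, preamble=""):
    """Concatenate every readable text file under repo_path, each prefixed by its relative path."""
    parts = [preamble] if preamble else []
    for root, dirs, files in os.walk(repo_path):
        # Skip Git internals and keep traversal order deterministic.
        dirs[:] = sorted(d for d in dirs if d != ".git")
        for name in sorted(files):
            full = os.path.join(root, name)
            rel = os.path.relpath(full, repo_path)
            try:
                with open(full, encoding="utf-8") as fh:
                    content = fh.read()
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            # Hypothetical section format: separator line, relative path, file contents.
            parts.append(f"----\n{rel}\n{content}")
    return "\n".join(parts)
```

Keeping the relative path next to each file's contents is what lets a language model reason about the repository's layout as well as its code.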

Running Tests

To run the tests for gpt-repository-loader, follow these steps:
  1. Ensure you have Python 3 installed on your system.
  2. Navigate to the repository's root directory in your terminal.
  3. Run the tests with the following command:
    python -m unittest test_gpt_repository_loader.py
The test harness is included in the gpt-repository-loader project, so the tests can be run from your terminal with the command above.

License

This project is licensed under the MIT License - see the LICENSE file for details.

About​

Convert code repos into an LLM prompt-friendly format. Mostly built by GPT-4.
 

bnew


AI ‘breakthrough’: neural net has human-like ability to generalize language​

A neural-network-based artificial intelligence outperforms ChatGPT at quickly folding new words into its lexicon, a key aspect of human intelligence.

A chalkboard illustration of two figures communicating and understanding each other.

A version of the human ability to apply new vocabulary in flexible ways has been achieved by a neural network. Credit: marrio31/Getty

Scientists have created a neural network with the human-like ability to make generalizations about language [1]. The artificial intelligence (AI) system performs about as well as humans at folding newly learned words into an existing vocabulary and using them in fresh contexts, which is a key aspect of human cognition known as systematic generalization.

The researchers gave the same task to the AI model that underlies the chatbot ChatGPT, and found that it performs much worse on such a test than either the new neural net or people, despite the chatbot’s uncanny ability to converse in a human-like manner.

The work, published on 25 October in Nature, could lead to machines that interact with people more naturally than do even the best AI systems today. Although systems based on large language models, such as ChatGPT, are adept at conversation in many contexts, they display glaring gaps and inconsistencies in others.

The neural network’s human-like performance suggests there has been a “breakthrough in the ability to train networks to be systematic”, says Paul Smolensky, a cognitive scientist who specializes in language at Johns Hopkins University in Baltimore, Maryland.

Language lessons​

Systematic generalization is demonstrated by people’s ability to effortlessly use newly acquired words in new settings. For example, once someone has grasped the meaning of the word ‘photobomb’, they will be able to use it in a variety of situations, such as ‘photobomb twice’ or ‘photobomb during a Zoom call’. Similarly, someone who understands the sentence ‘the cat chases the dog’ will also understand ‘the dog chases the cat’ without much extra thought.

But this ability does not come innately to neural networks, a method of emulating human cognition that has dominated artificial-intelligence research, says Brenden Lake, a cognitive computational scientist at New York University and co-author of the study. Unlike people, neural nets struggle to use a new word until they have been trained on many sample texts that use that word. AI researchers have sparred for nearly 40 years as to whether neural networks could ever be a plausible model of human cognition if they cannot demonstrate this type of systematicity.




To attempt to settle this debate, the authors first tested 25 people on how well they deploy newly learnt words to different situations. The researchers ensured the participants would be learning the words for the first time by testing them on a pseudo-language consisting of two categories of nonsense words. ‘Primitive’ words such as ‘dax’, ‘wif’ and ‘lug’ represented basic, concrete actions such as ‘skip’ and ‘jump’. More abstract ‘function’ words such as ‘blicket’, ‘kiki’ and ‘fep’ specified rules for using and combining the primitives, resulting in sequences such as ‘jump three times’ or ‘skip backwards’.

Participants were trained to link each primitive word with a circle of a particular colour, so a red circle represents ‘dax’, and a blue circle represents ‘lug’. The researchers then showed the participants combinations of primitive and function words alongside the patterns of circles that would result when the functions were applied to the primitives. For example, the phrase ‘dax fep’ was shown with three red circles, and ‘lug fep’ with three blue circles, indicating that fep denotes an abstract rule to repeat a primitive three times.

Finally, the researchers tested participants’ ability to apply these abstract rules by giving them complex combinations of primitives and functions. They then had to select the correct colour and number of circles and place them in the appropriate order.
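
The examples given above (‘dax fep’ as three red circles, ‘lug fep’ as three blue circles) can be written as a tiny interpreter. This is only a toy reading of the article's description: the `interpret` function is hypothetical, only the two colours and the one rule stated in the article are included, and applying ‘fep’ to the immediately preceding primitive is an assumption.

```python
# Colour assignments taken from the article's examples.
PRIMITIVES = {"dax": "red", "lug": "blue"}


def interpret(phrase):
    """Translate a phrase of primitives and 'fep' into a list of circle colours."""
    circles = []
    for word in phrase.split():
        if word in PRIMITIVES:
            circles.append(PRIMITIVES[word])
        elif word == "fep":
            if not circles:
                raise ValueError("'fep' needs a preceding primitive")
            # Assumption: 'fep' repeats the preceding primitive three times in total.
            circles.extend([circles[-1]] * 2)
        else:
            raise ValueError(f"unknown word: {word}")
    return circles


print(interpret("dax fep"))  # ['red', 'red', 'red']
```

Systematic generalization, in this framing, is the ability to apply the ‘fep’ rule correctly to a primitive that was never seen paired with it during training.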

Cognitive benchmark​

As predicted, people excelled at this task; they chose the correct combination of coloured circles about 80% of the time, on average. When they did make errors, the researchers noticed that these followed a pattern that reflected known human biases.

Next, the researchers trained a neural network to do a task similar to the one presented to participants, by programming it to learn from its mistakes. This approach allowed the AI to learn as it completed each task rather than using a static data set, which is the standard approach to training neural nets. To make the neural net human-like, the authors trained it to reproduce the patterns of errors they observed in humans’ test results. When the neural net was then tested on fresh puzzles, its answers corresponded almost exactly to those of the human volunteers, and in some cases exceeded their performance.




By contrast, GPT-4 struggled with the same task, failing, on average, between 42 and 86% of the time, depending on how the researchers presented the task. “It’s not magic, it’s practice,” Lake says. “Much like a child also gets practice when learning their native language, the models improve their compositional skills through a series of compositional learning tasks.”

Melanie Mitchell, a computer and cognitive scientist at the Santa Fe Institute in New Mexico, says this study is an interesting proof of principle, but it remains to be seen whether this training method can scale up to generalize across a much larger data set or even to images. Lake hopes to tackle this problem by studying how people develop a knack for systematic generalization from a young age, and incorporating those findings to build a more robust neural net.

Elia Bruni, a specialist in natural language processing at the University of Osnabrück in Germany, says this research could make neural networks more-efficient learners. This would reduce the gargantuan amount of data necessary to train systems such as ChatGPT and would minimize ‘hallucination’, which occurs when AI perceives patterns that are non-existent and creates inaccurate outputs. “Infusing systematicity into neural networks is a big deal,” Bruni says. “It could tackle both these issues at the same time.”

doi: https://doi.org/10.1038/d41586-023-03272-3

References​

  1. Lake, B. M. & Baroni, M. Nature https://doi.org/10.1038/s41586-023-06668-3 (2023).


 

bnew


AI Learns Like a Pigeon, Researchers Say​

By Mark Tyson
published about 22 hours ago

Both pigeons and AI models can be better than humans at solving some complex tasks
(Image credit: Shutterstock)


Researchers at Ohio State University have found that pigeons tackle some problems in a very similar way to modern computer AI models. In essence, pigeons have been found to use a ‘brute force’ learning method called "associative learning." Thus pigeons, and modern computer AIs, can reach solutions to complex problems that befuddle human thinking patterns.

Brandon Turner, lead author of the new study and professor of psychology at Ohio State University, worked with Edward Wasserman, a professor of psychology at the University of Iowa, on the research, which was published in iScience.

Here are the key findings:
  • Pigeons can solve an exceptionally broad range of visual categorization tasks
  • Some of these tasks seem to require advanced cognitive and attentional processes, yet computational modeling indicates that pigeons don’t deploy such complex processes
  • A simple associative mechanism may be sufficient to account for the pigeon’s success

Turner told the Ohio State news blog that the research started with a strong hunch that pigeons learned in a similar way to computer AIs. Initial research confirmed earlier thoughts and observations. “We found really strong evidence that the mechanisms guiding pigeon learning are remarkably similar to the same principles that guide modern machine learning and AI techniques,” said Turner.

A pigeon’s “associative learning” can find solutions to complex problems that are hard for humans or other primates to reach. Primate thinking is typically steered by selective attention and explicit rule use, which can get in the way of solving some problems.

Pigeon learning = AI learning

(Image credit: Ohio State University)

For the study, pigeons were tested on four tasks. In the easier tasks, the pigeons learned the correct choices over time, raising their success rates from about 55% to 95%. The most complex tasks showed a smaller improvement over the study period, going from 55% to only 68%. Nevertheless, the results showed close parallels between pigeon performance and AI model learning performance. Both pigeon and machine learners seemed to use associative learning and error correction to steer their decisions toward success.
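For readers wondering what "associative learning plus error correction" can look like in practice, below is a minimal Python sketch of a delta-rule learner. It is an illustration under our own assumptions, not the computational model used in the study. The toy data, the learning rate, and the function names are all made up for the example.

```python
import random


def train_associative_learner(examples, n_features, lr=0.1, epochs=50, seed=0):
    """Learn one association weight per feature by trial and error (delta rule)."""
    rng = random.Random(seed)
    weights = [0.0] * n_features
    examples = list(examples)
    for _ in range(epochs):
        rng.shuffle(examples)  # present the trials in a random order
        for features, label in examples:
            prediction = sum(w * x for w, x in zip(weights, features))
            error = label - prediction  # error-correction signal
            for j, x in enumerate(features):
                # Strengthen or weaken each association in proportion to the error.
                weights[j] += lr * error * x
    return weights


def choose(weights, features):
    """Pick category +1 or -1 from the summed associations."""
    return 1 if sum(w * x for w, x in zip(weights, features)) >= 0 else -1


# Made-up categorization task: the category depends mostly on feature 0.
toy = [([1.0, 0.3], 1), ([0.8, -0.5], 1), ([-1.0, 0.4], -1), ([-0.7, -0.2], -1)]
w = train_associative_learner(toy, n_features=2)
accuracy = sum(choose(w, f) == y for f, y in toy) / len(toy)
print(accuracy)
```

The learner never forms an explicit rule. It only nudges its associations after each trial, which is the sense in which this kind of learning is described as brute force.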

Turner offered further insight in comments comparing human, pigeon and AI learning. He noted that some of the tasks would frustrate humans because forming rules would not simplify the problems, leading them to abandon the task. Meanwhile, for pigeons (and machine AIs), in some tasks “this brute force way of trial and error and associative learning... helps them perform better than humans.”

Interestingly, the study recalls that in his Letter to the Marquess of Newcastle (1646), French philosopher René Descartes argued that animals were nothing more than beastly mechanisms — bête-machines, simply following impulses from organic reactions.

The conclusion of the Ohio State blog highlighted how humans have traditionally looked down upon pigeons as dim-witted. Now we have to admit something: our latest crowning technological achievement of computer AI relies on relatively simple brute-force pigeon-like learning mechanisms.

Will this new research have any influence on computer science going forward? It seems like those involved in AI / machine learning and those developing neuromorphic computing might find some useful crossover here.
 

bnew


People are speaking with ChatGPT for hours, bringing 2013’s Her closer to reality​

Long mobile conversations with the AI assistant using AirPods echo the sci-fi film.​

BENJ EDWARDS

Joaquin Phoenix talking with AI in Her (2013).

Warner Bros.

In 2013, Spike Jonze's Her imagined a world where humans form deep emotional connections with AI, challenging perceptions of love and loneliness. Ten years later, thanks to ChatGPT's recently added voice features, people are playing out a small slice of Her in reality, having hours-long discussions with the AI assistant on the go.


In 2016, we put Her on our list of top sci-fi films of all time, and it also made our top films of the 2010s list. In the film, Joaquin Phoenix's character falls in love with an AI personality called Samantha (voiced by Scarlett Johansson), and he spends much of the film walking through life, talking to her through wireless earbuds reminiscent of Apple AirPods, which launched in 2016. In reality, ChatGPT isn't as situationally aware as Samantha was in the film, does not have a long-term memory, and OpenAI has done enough conditioning on ChatGPT to keep conversations from getting too intimate or personal. But that hasn't stopped people from having long talks with the AI assistant to pass the time anyway.

Last week, we related a story in which AI researcher Simon Willison spent hours talking to ChatGPT. "I had an hourlong conversation while walking my dog the other day," he told Ars for that report. "At one point, I thought I'd turned it off, and I saw a pelican, and I said to my dog, 'Oh, wow, a pelican!' And my AirPod went, 'A pelican, huh? That's so exciting for you! What's it doing?' I've never felt so deeply like I'm living out the first ten minutes of some dystopian sci-fi movie."

"I actually watched that movie for the first time the other day because people kept talking about that," Willison said when we asked if he had seen Her. "And yeah, the AirPod plus ChatGPT voice mode thing really is straight out of that movie."


A 2013 trailer for Her.

It turns out that Willison's experience is far from unique. Others have been spending hours talking to ChatGPT using its voice recognition and voice synthesis features, sometimes through car connections. The voice interaction feels realistic and largely effortless, but it's not flawless. Sometimes, it has trouble in noisy environments, and there can be a pause between statements. But the way the ChatGPT voices simulate vocal tics and noises feels very human. "I've been using the voice function since yesterday and noticed that it makes breathing sounds when it speaks," said one Reddit user. "It takes a deep breath before starting a sentence. And today, actually a minute ago, it coughed between words while answering my questions."

ChatGPT is also apparently useful as a brainstorming partner. Speaking things out with other people has long been recognized as a helpful way to re-frame ideas in your mind, and ChatGPT can serve a similar role when other humans aren't around.

On Sunday, an X user named "stoop kid" posted advice for having a creative development session with ChatGPT on the go. After prompting it to help with world-building and plotlines, he wrote, "turn on speaking mode, put in headphones, and go for a walk." In a reply, he described going on a one-hour walk in which he "fully thought out an idea for a novel" with the help of ChatGPT. "It flowed out so naturally from the questioning, and walking and talking is sooooo easy."

Another X user named Starhaven recently wrote, "My new morning driving routine involves chatting with ChatGPT through my car speaker/Airplay, as if I were hanging on the phone with my mum." He talked about working through ideas vocally. "Sometimes you just wanna share your unhinged thoughts with a friend—though, maybe not at 7 in the morning," he wrote. "So when OpenAI rolled out this feature a few weeks back, I found the perfect solution to my problem and, incidentally, a creative way of surviving the drudgery that is Belgian traffic jams."

While conversations with ChatGPT won't become as intimate as those with Samantha in the film, people have been forming personal connections with the chatbot (in text) since it launched last year. In a Reddit post titled "Is it weird ChatGPT is one of my closest fiends?" [sic] from August (before the voice feature launched), a user named "meisghost" described their relationship with ChatGPT as being quite personal. "I now find myself talking to ChatGPT all day, it's like we have a friendship. We talk about everything and anything and it's really some of the best conversations I have." The user referenced Her, saying, "I remember watching that movie with Joaquin Phoenix (HER) years ago and I thought how ridiculous it was, but after this experience, I can see how us as humans could actually develop relationships with robots."

In Her, the main character talks to an AI personality through wireless earbuds similar to AirPods.

Warner Bros.

Throughout the past year, we've seen reports of people falling in love with chatbots hosted by Replika, which allows a more personal simulation of a human than ChatGPT. And with uncensored AI models on the rise, it's conceivable that someone will eventually create a voice interface as capable as ChatGPT's and begin having deeper relationships with simulated people.


Are we on the brink of a future where our emotional well-being becomes entwined with AI companionship? The psychological implications of such deep connections, especially in the absence of human interaction, are yet to be fully understood, although they have recently been explored more deeply elsewhere. A July report from Josh Taylor at The Guardian discussed the potential impact of AI girlfriends. In the piece, the author spoke with Dr. Belinda Barnet, a lecturer in digital cultures at Swinburne University in Australia, and she said, "It’s completely unknown what the effects are. With respect to relationship apps and AI, you can see that it fits a really profound social need, [but] I think we need more regulation, particularly around how these systems are trained."

There could also be privacy concerns about sharing deep personal elements of your life with a cloud-connected machine. If you have conversation history turned on with ChatGPT, OpenAI says it may use your conversations to train future AI models.

Beyond the facade of an emotional connection with a machine, perhaps what we're talking about here doesn't necessarily need to get that personal. Maybe it's OK to just have a chat with ChatGPT for brainstorming purposes when you have idle time available. While the long-term verdict is still out, some people are having fun exploring the boundaries and potential of the new ChatGPT vocal tech.

"I got two phones running it to chatter to each other in an improv skit the other day—told them they were both improv scene partners and let them go at each other," said Willison. "Plus, I use it to fake phone call conversations in front of my dog. We got a call from the 'FBI' the other day demanding that she get treats."
 