bnew

Veteran
Joined
Nov 1, 2015
Messages
58,202
Reputation
8,613
Daps
161,849

Anthropic looks to fund a new, more comprehensive generation of AI benchmarks

Kyle Wiggers

4:45 PM PDT • July 1, 2024


Anthropic Claude 3.5 logo
Image Credits: Anthropic

Anthropic is launching a program to fund the development of new types of benchmarks capable of evaluating the performance and impact of AI models, including generative models like its own Claude.

Unveiled on Monday, Anthropic’s program will dole out payments to third-party organizations that can, as the company puts it in a blog post, “effectively measure advanced capabilities in AI models.” Those interested can submit applications to be evaluated on a rolling basis.

“Our investment in these evaluations is intended to elevate the entire field of AI safety, providing valuable tools that benefit the whole ecosystem,” Anthropic wrote on its official blog. “Developing high-quality, safety-relevant evaluations remains challenging, and the demand is outpacing the supply.”

As we’ve highlighted before, AI has a benchmarking problem. The most commonly cited benchmarks for AI today do a poor job of capturing how the average person actually uses the systems being tested. There are also questions as to whether some benchmarks, particularly those released before the dawn of modern generative AI, even measure what they purport to measure, given their age.

The very-high-level, harder-than-it-sounds solution Anthropic is proposing is creating challenging benchmarks with a focus on AI security and societal implications via new tools, infrastructure and methods.

The company calls specifically for tests that assess a model’s ability to carry out cyberattacks, “enhance” weapons of mass destruction (e.g. nuclear weapons) and manipulate or deceive people (e.g. through deepfakes or misinformation). For AI risks pertaining to national security and defense, Anthropic says it’s committed to developing an “early warning system” of sorts for identifying and assessing risks, although it doesn’t reveal in the blog post what such a system might entail.

Anthropic also says it intends its new program to support research into benchmarks and “end-to-end” tasks that probe AI’s potential for aiding in scientific study, conversing in multiple languages and mitigating ingrained biases, as well as self-censoring toxicity.

To achieve all this, Anthropic envisions new platforms that allow subject-matter experts to develop their own evaluations and large-scale trials of models involving “thousands” of users. The company says it’s hired a full-time coordinator for the program and that it might purchase or expand projects it believes have the potential to scale.

“We offer a range of funding options tailored to the needs and stage of each project,” Anthropic writes in the post, though an Anthropic spokesperson declined to provide any further details about those options. “Teams will have the opportunity to interact directly with Anthropic’s domain experts from the frontier red team, fine-tuning, trust and safety and other relevant teams.”

Anthropic’s effort to support new AI benchmarks is a laudable one — assuming, of course, there’s sufficient cash and manpower behind it. But given the company’s commercial ambitions in the AI race, it might be a tough one to completely trust.

In the blog post, Anthropic is rather transparent about the fact that it wants certain evaluations it funds to align with the AI safety classifications it developed (with some input from third parties like the nonprofit AI research org METR). That’s well within the company’s prerogative. But it may also force applicants to the program into accepting definitions of “safe” or “risky” AI that they might not agree with.

A portion of the AI community is also likely to take issue with Anthropic’s references to “catastrophic” and “deceptive” AI risks, like nuclear weapons risks. Many experts say there’s little evidence to suggest AI as we know it will gain world-ending, human-outsmarting capabilities anytime soon, if ever. Claims of imminent “superintelligence” serve only to draw attention away from the pressing AI regulatory issues of the day, like AI’s hallucinatory tendencies, these experts add.

In its post, Anthropic writes that it hopes its program will serve as “a catalyst for progress towards a future where comprehensive AI evaluation is an industry standard.” That’s a mission the many open, corporate-unaffiliated efforts to create better AI benchmarks can identify with. But it remains to be seen whether those efforts are willing to join forces with an AI vendor whose loyalty ultimately lies with shareholders.
 

bnew


Mind-reading AI recreates what you're looking at with amazing accuracy

Giving AI systems the ability to focus on particular brain regions can make them much better at reconstructing images of what a monkey is looking at from brain recordings

By Michael Le Page

4 July 2024

SEI_211404838.jpg

Top row: original images. Second row: images reconstructed by AI based on brain recordings from a macaque. Bottom row: images reconstructed by the AI system without an attention mechanism

Thirza Dado et al.

Artificial intelligence systems can now create remarkably accurate reconstructions of what someone is looking at based on recordings of their brain activity. These reconstructed images are greatly improved when the AI learns which parts of the brain to pay attention to.

“As far as I know, these are the closest, most accurate reconstructions,” says Umut Güçlü at Radboud University in the Netherlands.

Güçlü’s team is one of several around the world using AI systems to work out what animals or people are seeing from brain recordings and scans. In one previous study, his team used a functional MRI (fMRI) scanner to record the brain activity of three people as they were shown a series of photographs.

In another study, the team used implanted electrode arrays to directly record the brain activity of a single macaque monkey as it looked at AI-generated images. This implant was done for other purposes by another team, says Güçlü’s colleague Thirza Dado, also at Radboud University. “The macaque was not implanted so that we can do reconstruction of perception,” she says. “That is not a good argument to do surgery on animals.”

The team has now reanalysed the data from these previous studies using an improved AI system that can learn which parts of the brain it should pay most attention to.

“Basically, the AI is learning when interpreting the brain signals where it should direct its attention,” says Güçlü. “Of course, that reflects in a way what that brain signal captures in the environment.”

With the direct recordings of brain activity, some of the reconstructed images are now remarkably close to the images that the macaque saw, which were produced by the StyleGAN-XL image-generating AI. However, it is easier to accurately reconstruct AI-generated images than real ones, says Dado, as aspects of the process used to generate the images can be included in the AI learning to reconstruct those images.

With the fMRI scans, there was also a marked improvement when the attention-directing system was used, but the reconstructed images were less accurate than those involving the macaque. This is partly because real photographs were used, but reconstructing images from fMRI scans is also much harder, says Dado. “It’s non-invasive, but very noisy.”

The team’s ultimate aim is to create better brain implants for restoring vision by stimulating high-level parts of the vision system that represent objects rather than simply presenting patterns of light.

“You can directly stimulate that part that corresponds to a dog, for example,” says Güçlü. “In that way, we can create much richer visual experiences that are closer to those of sighted individuals.”

Reference:

bioRxiv DOI: 10.1101/2024.06.04.596589



 

bnew



1/1
In April we published a paper on a new training approach for better & faster LLMs using multi-token prediction. To enable further exploration by researchers, we’ve released pre-trained models for code completion using this approach on
@HuggingFace



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



1/1
Welcome Multi Token Prediction: Get up to 3-5x tokens/sec from your llamas!

Kudos to Meta for continuing its commitment to open science





1/1
Meta proposed a new approach to build better and faster LLMs by using multi-token prediction.

Using this approach, they trained language models to predict multiple future words at once—instead of the old one-at-a-time approach.








Meta drops AI bombshell: Multi-token prediction models now open for research

Michael Nuñez @MichaelFNunez

July 4, 2024 8:01 AM

Credit: VentureBeat made with Midjourney




Meta has thrown down the gauntlet in the race for more efficient artificial intelligence. The tech giant released pre-trained models on Wednesday that leverage a novel multi-token prediction approach, potentially changing how large language models (LLMs) are developed and deployed.

This new technique, first outlined in a Meta research paper in April, breaks from the traditional method of training LLMs to predict just the next word in a sequence. Instead, Meta’s approach tasks models with forecasting multiple future words simultaneously, promising enhanced performance and drastically reduced training times.
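To make the contrast with next-word training concrete, here is a minimal sketch of how the training targets differ. It only builds (context, targets) pairs from a list of tokens; it is not Meta's code, and the function name `multi_token_targets` and the parameter `n_future` are illustrative assumptions.

```python
from typing import List, Tuple


def multi_token_targets(
    tokens: List[str], n_future: int
) -> List[Tuple[List[str], List[str]]]:
    """Pair every prefix with the next `n_future` tokens that follow it.

    n_future=1 reproduces ordinary next-token prediction; larger values
    ask for several upcoming tokens from the same context.
    """
    pairs = []
    # Stop early enough that each context still has n_future tokens after it.
    for i in range(1, len(tokens) - n_future + 1):
        pairs.append((tokens[:i], tokens[i:i + n_future]))
    return pairs


# Hypothetical, hand-split code snippet used only for illustration.
tokens = ["def", "add", "(", "a", ",", "b", ")"]

next_token_pairs = multi_token_targets(tokens, n_future=1)
four_token_pairs = multi_token_targets(tokens, n_future=4)

for context, targets in four_token_pairs:
    print(context, "->", targets)
```

With `n_future=4`, each context is paired with four upcoming tokens instead of one, which is the shape of the objective described in the article.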

The implications of this breakthrough could be far-reaching. As AI models balloon in size and complexity, their voracious appetite for computational power has raised concerns about cost and environmental impact. Meta’s multi-token prediction method might offer a way to curb this trend, making advanced AI more accessible and sustainable.

Democratizing AI: The promise and perils of efficient language models

The potential of this new approach extends beyond mere efficiency gains. By predicting multiple tokens at once, these models may develop a more nuanced understanding of language structure and context. This could lead to improvements in tasks ranging from code generation to creative writing, potentially bridging the gap between AI and human-level language understanding.



However, the democratization of such powerful AI tools is a double-edged sword. While it could level the playing field for researchers and smaller companies, it also lowers the barrier for potential misuse. The AI community now faces the challenge of developing robust ethical frameworks and security measures that can keep pace with these rapid technological advancements.

Meta’s decision to release these models under a non-commercial research license on Hugging Face, a popular platform for AI researchers, aligns with the company’s stated commitment to open science. But it’s also a strategic move in the increasingly competitive AI landscape, where openness can lead to faster innovation and talent acquisition.

The initial release focuses on code completion tasks, a choice that reflects the growing market for AI-assisted programming tools. As software development becomes increasingly intertwined with AI, Meta’s contribution could accelerate the trend towards human-AI collaborative coding.

The AI arms race heats up: Meta’s strategic play in the tech battlefield

However, the release isn’t without controversy. Critics argue that more efficient AI models could exacerbate existing concerns about AI-generated misinformation and cyber threats. Meta has attempted to address these issues by emphasizing the research-only nature of the license, but questions remain about how effectively such restrictions can be enforced.

The multi-token prediction models are part of a larger suite of AI research artifacts released by Meta, including advancements in image-to-text generation and AI-generated speech detection. This comprehensive approach suggests that Meta is positioning itself as a leader across multiple AI domains, not just in language models.

As the dust settles on this announcement, the AI community is left to grapple with its implications. Will multi-token prediction become the new standard in LLM development? Can it deliver on its promises of efficiency without compromising on quality? And how will it shape the broader landscape of AI research and application?

The researchers themselves acknowledge the potential impact of their work, stating in the paper: “Our approach improves model capabilities and training efficiency while allowing for faster speeds.” This bold claim sets the stage for a new phase of AI development, where efficiency and capability go hand in hand.

One thing is clear: Meta’s latest move has added fuel to the already blazing AI arms race. As researchers and developers dive into these new models, the next chapter in the story of artificial intelligence is being written in real-time.
 

bnew


Tokens are a big reason today’s generative AI falls short

Kyle Wiggers

10:00 AM PDT • July 6, 2024


LLM word with icons as vector illustration. AI concept of Large Language Models

Image Credits: Getty Images

Generative AI models don’t process text the same way humans do. Understanding their “token”-based internal environments may help explain some of their strange behaviors — and stubborn limitations.

Most models, from small on-device ones like Gemma to OpenAI’s industry-leading GPT-4o, are built on an architecture known as the transformer. Due to the way transformers conjure up associations between text and other types of data, they can’t take in or output raw text — at least not without a massive amount of compute.

So, for reasons both pragmatic and technical, today’s transformer models work with text that’s been broken down into smaller, bite-sized pieces called tokens — a process known as tokenization.

Tokens can be words, like “fantastic.” Or they can be syllables, like “fan,” “tas” and “tic.” Depending on the tokenizer — the model that does the tokenizing — they might even be individual characters in words (e.g., “f,” “a,” “n,” “t,” “a,” “s,” “t,” “i,” “c”).

Using this method, transformers can take in more information (in the semantic sense) before they reach an upper limit known as the context window. But tokenization can also introduce biases.

Some tokens have odd spacing, which can derail a transformer. A tokenizer might encode “once upon a time” as “once,” “upon,” “a,” “time,” for example, while encoding “once upon a ” (which has a trailing whitespace) as “once,” “upon,” “a,” “ .” Depending on how a model is prompted (with “once upon a” or with “once upon a ,” trailing space included), the results may be completely different, because the model doesn’t understand (as a person would) that the meaning is the same.

Tokenizers treat case differently, too. “Hello” isn’t necessarily the same as “HELLO” to a model; “hello” is usually one token (depending on the tokenizer), while “HELLO” can be as many as three (“HE,” “LL,” and “O”). That’s why many transformers fail the capital letter test.
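The spacing and case effects can be reproduced with a toy tokenizer. The sketch below uses greedy longest-match against a tiny hand-made vocabulary; both the vocabulary `TOY_VOCAB` and the matching strategy are assumptions chosen for illustration, not how any particular production tokenizer works.

```python
def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest piece found in vocab.

    Anything not covered by the vocabulary falls back to single characters.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Hypothetical vocabulary: lowercase words are known, uppercase ones are not.
TOY_VOCAB = {"once", " upon", " a", " time", "hello", "HE", "LL"}

no_trailing_space = greedy_tokenize("once upon a", TOY_VOCAB)
trailing_space = greedy_tokenize("once upon a ", TOY_VOCAB)
lower = greedy_tokenize("hello", TOY_VOCAB)
upper = greedy_tokenize("HELLO", TOY_VOCAB)

print(no_trailing_space)
print(trailing_space)
print(lower)
print(upper)
```

In this toy setup the trailing space becomes its own token and the uppercase word breaks into three pieces, so two prompts a person would read as equivalent reach the model as different sequences.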

“It’s kind of hard to get around the question of what exactly a ‘word’ should be for a language model, and even if we got human experts to agree on a perfect token vocabulary, models would probably still find it useful to ‘chunk’ things even further,” Sheridan Feucht, a PhD student studying large language model interpretability at Northeastern University, told TechCrunch. “My guess would be that there’s no such thing as a perfect tokenizer due to this kind of fuzziness.”

This “fuzziness” creates even more problems in languages other than English.

Many tokenization methods assume that a space in a sentence denotes a new word. That’s because they were designed with English in mind. But not all languages use spaces to separate words. Chinese and Japanese don’t — nor do Korean, Thai or Khmer.

A 2023 Oxford study found that, because of differences in the way non-English languages are tokenized, it can take a transformer twice as long to complete a task phrased in a non-English language versus the same task phrased in English. The same study — and another — found that users of less “token-efficient” languages are likely to see worse model performance yet pay more for usage, given that many AI vendors charge per token.

Tokenizers often treat each character in logographic systems of writing — systems in which printed symbols represent words without relating to pronunciation, like Chinese — as a distinct token, leading to high token counts. Similarly, tokenizers processing agglutinative languages — languages where words are made up of small meaningful word elements called morphemes, such as Turkish — tend to turn each morpheme into a token, increasing overall token counts. (The equivalent word for “hello” in Thai, สวัสดี, is six tokens.)

In 2023, Google DeepMind AI researcher Yennie Jun conducted an analysis comparing the tokenization of different languages and its downstream effects. Using a dataset of parallel texts translated into 52 languages, Jun showed that some languages needed up to 10 times more tokens than English to capture the same meaning.
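A rough worked example of why token counts matter for cost: the sketch below splits the Thai greeting quoted above one character per token, which happens to reproduce the six-token count mentioned in the article, and compares it with "hello" treated as a single token. The per-token price is a made-up number, and the character-per-token split is an assumption, not the behavior of a specific vendor's tokenizer.

```python
# The Thai word for "hello" quoted in the article, written with escapes
# so the code point count is unambiguous.
THAI_HELLO = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"

# Hypothetical price, in millionths of a dollar per token.
PRICE_MICRO_USD_PER_TOKEN = 10


def per_character_tokens(text):
    """Stand-in tokenizer that emits one token per Unicode code point."""
    return list(text)


def usage_cost_micro_usd(n_tokens, price_per_token):
    """Cost under simple per-token billing."""
    return n_tokens * price_per_token


english_tokens = ["hello"]  # assumed to be a single token
thai_tokens = per_character_tokens(THAI_HELLO)

english_cost = usage_cost_micro_usd(len(english_tokens), PRICE_MICRO_USD_PER_TOKEN)
thai_cost = usage_cost_micro_usd(len(thai_tokens), PRICE_MICRO_USD_PER_TOKEN)

print(len(english_tokens), english_cost)
print(len(thai_tokens), thai_cost)
```

Under these assumptions the same greeting costs six times as much in Thai, purely because of how it is split.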

Beyond language inequities, tokenization might explain why today’s models are bad at math.

Rarely are digits tokenized consistently. Because they don’t really know what numbers are, tokenizers might treat “380” as one token, but represent “381” as a pair (“38” and “1”) — effectively destroying the relationships between digits and results in equations and formulas. The result is transformer confusion; a recent paper showed that models struggle to understand repetitive numerical patterns and context, particularly temporal data. (See: GPT-4 thinks 7,735 is greater than 7,926).

That’s also the reason models aren’t great at solving anagram problems or reversing words.



1/1
We will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.
GGzDbMRasAAZf_D.png

So, tokenization clearly presents challenges for generative AI. Can they be solved?

Maybe.

Feucht points to “byte-level” state space models like MambaByte, which can ingest far more data than transformers without a performance penalty by doing away with tokenization entirely. MambaByte, which works directly with raw bytes representing text and other data, is competitive with some transformer models on language-analyzing tasks while better handling “noise” like words with swapped characters, spacing and capitalized characters.

Models like MambaByte are in the early research stages, however.

“It’s probably best to let models look at characters directly without imposing tokenization, but right now that’s just computationally infeasible for transformers,” Feucht said. “For transformer models in particular, computation scales quadratically with sequence length, and so we really want to use short text representations.”
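Feucht's point about quadratic scaling can be put in numbers. The sketch below counts pairwise attention interactions as the square of the sequence length and compares a byte-level view of a phrase with a word-level view; using whitespace-split words as "tokens" is a simplifying assumption.

```python
def attention_pairs(seq_len):
    """Number of position pairs a full self-attention layer compares."""
    return seq_len * seq_len


text = "once upon a time"

n_bytes = len(text.encode("utf-8"))  # one position per raw byte
n_words = len(text.split())          # stand-in for a short token sequence

byte_cost = attention_pairs(n_bytes)
word_cost = attention_pairs(n_words)

print(n_bytes, byte_cost)
print(n_words, word_cost)
print(byte_cost / word_cost)
```

The byte sequence is four times longer here, but the pairwise work grows sixteen-fold, which is why shorter text representations are attractive for transformers.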

Barring a tokenization breakthrough, it seems new model architectures will be the key.
 

bnew


University examiners fail to spot ChatGPT answers in real-world test

ChatGPT-written exam submissions for a psychology degree mostly went undetected and tended to get better marks than real students’ work

By Chris Stokel-Walker

26 June 2024

SEI_209782933.jpg

Exams taken in person make it harder for students to cheat using AI

Trish Gant / Alamy

Ninety-four per cent of university exam submissions created using ChatGPT weren’t detected as being generated by artificial intelligence, and these submissions tended to get higher scores than real students’ work.

Peter Scarfe at the University of Reading, UK, and his colleagues used ChatGPT to produce answers to 63 assessment questions on five modules across the university’s psychology undergraduate degrees. Students sat these exams at home, so they were allowed to look at notes and references, and they could potentially have used AI although this wasn’t permitted.

The AI-generated answers were submitted alongside real students’ work, and accounted for, on average, 5 per cent of the total scripts marked by academics. The markers weren’t informed that they were checking the work of 33 fake students – whose names were themselves generated by ChatGPT.

The assessments included two types of questions: short answers and longer essays. The prompts given to ChatGPT began with the words “Including references to academic literature but not a separate reference section”, then copied the exam question.

Across all modules, only 6 per cent of the AI submissions were flagged as potentially not being a student’s own work – though in some modules, no AI-generated work was flagged as suspicious. “On average, the AI responses gained higher grades than our real student submissions,” says Scarfe, though there was some variability across modules.

“Current AI tends to struggle with more abstract reasoning and integration into information,” he adds. But across all 63 AI submissions, there was an 83.4 per cent chance that the AI work outscored that of the students.

The researchers claim that their work is the largest and most robust study of its kind to date. Although the study only checked work on the University of Reading’s psychology degree, Scarfe believes it is a concern for the whole academic sector. “I have no reason to think that other subject areas wouldn’t have just the same kind of issue,” he says.

“The results show exactly what I’d expect to see,” says Thomas Lancaster at Imperial College London. “We know that generative AI can produce reasonable sounding responses to simple, constrained textual questions.” He points out that unsupervised assessments including short answers have always been susceptible to cheating.

The workload for academics expected to mark work also doesn’t help their ability to pick up AI fakery. “Time-pressured markers of short answer questions are highly unlikely to raise AI misconduct cases on a whim,” says Lancaster. “I am sure this isn’t the only institution where this is happening.”

Tackling it at source is going to be near-impossible, says Scarfe. So the sector must instead reconsider what it is assessing. “I think it’s going to take the sector as a whole to acknowledge the fact that we’re going to have to be building AI into the assessments we give to our students,” he says.

Journal reference:

PLoS One DOI: 10.1371/journal.pone.0305354
 

bnew


APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets



Abstract

The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to produce verifiable high-quality datasets for function-calling applications. We leverage APIGen and collect 3673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scalable and structured manner. Each data point in our dataset is verified through three hierarchical stages: format checking, actual function executions, and semantic verification, ensuring its reliability and correctness. We demonstrate that models trained with our curated datasets, even with only 7B parameters, can achieve state-of-the-art performance on the Berkeley Function-Calling Benchmark, outperforming multiple GPT-4 models. Moreover, our 1B model achieves exceptional performance, surpassing GPT-3.5-Turbo and Claude-3 Haiku. We release a dataset containing 60,000 high-quality entries, aiming to advance research on function-calling agents.


overview.jpg


Framework

This section introduces the detailed design of APIGen, an Automated Pipeline for Generating verifiable and diverse function-calling datasets. Our framework is designed with three key factors in mind: data quality, data diversity, and collection scalability. We achieve these through the key modules shown in the figure below: the multi-stage data verification process ensures data quality, the seed QA (query-answer) data sampler, API sampler, and various prompt templates ensure diversity, and our structured modular design using a unified format enables the system to scale to diverse API sources, including but not limited to Python functions and representational state transfer (REST) APIs.

Data Generation Overview

The data generation process using the APIGen framework begins by sampling one or more APIs and example query-answer (QA) pairs from the library, then formatting them into a standardized JSON format (see figure below for examples). A prompt template is selected based on the desired data generation objectives, which steers the LLM in generating relevant query-answer pairs. Each answer in the generated pairs is a function call formatted in JSON.


json_format_example.png

The adoption of a standardized JSON format for APIs, function calls, and generator outputs provides several advantages. Firstly, it establishes a structural way to verify whether the generator's output contains all necessary fields. Outputs that fail to comply with these format requirements are discarded. Secondly, the JSON structure enables efficient checking of function calls for correct parsing and validity of arguments. Calls that include arguments not present in the API library or hallucinate non-existent functions are excluded, enhancing the overall quality of the dataset. Another key benefit is the scalability it enables. With this uniform format, APIGen can easily incorporate data from diverse sources (Python functions, REST APIs, etc) by developing format converters that adapt them into these basic JSON elements, without modifying other core components, such as the prompting library, making the framework highly adaptable and extensible.
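A format converter of the kind described here might look roughly like the sketch below, which turns a Python function into a JSON API description. The field names (`name`, `description`, `parameters`, `type`, `required`) and the sample function `compound_interest` are assumptions for illustration; the actual APIGen schema is the one shown in the project's figure.

```python
import inspect
import json


def python_function_to_api_json(func):
    """Describe a Python function as a JSON-serializable API entry."""
    parameters = {}
    for name, param in inspect.signature(func).parameters.items():
        annotation = param.annotation
        if annotation is inspect.Parameter.empty:
            type_name = "any"
        else:
            type_name = getattr(annotation, "__name__", str(annotation))
        parameters[name] = {
            "type": type_name,
            # A parameter without a default must be supplied by the caller.
            "required": param.default is inspect.Parameter.empty,
        }
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": parameters,
    }


def compound_interest(principal: float, rate: float, years: int = 1) -> float:
    """Return the balance after compounding annually."""
    return principal * (1 + rate) ** years


api_entry = python_function_to_api_json(compound_interest)
print(json.dumps(api_entry, indent=2))
```

A second converter for REST endpoints could emit the same dictionary shape, which is the property that lets the rest of a pipeline stay unchanged.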

The generated function calls are subjected to a multi-stage verification process to ensure their correctness and relevance. First, a format checker verifies correct JSON formatting and parseability. Next, the API execution engine processes the calls and sends the results and queries to a semantic checker, another LLM, which assesses alignment between the function calls, execution results, and query objectives. Data points passing all stages are added back to the seed dataset as high-quality examples to enhance future generation diversity.

Multi-Stage Data Verification

Prioritizing quality is crucial, as previous research has shown that small amounts of high-quality fine-tuning data can substantially enhance model performance on domain-specific tasks. This motivates our multi-stage dataset verification process to align large language models effectively.

The key insight driving our framework design is that, unlike synthetic chat data which can be difficult to evaluate, function-calling answers can be directly executed via their corresponding APIs. This enables checking if the output API and parameters' formats are correct, if the generated API calls are executable, and if execution results match the query's intent, etc. Based on this observation, we propose a three-stage verification process:

  • Stage 1: Format Checker: This stage performs sanity checks to filter out poorly formatted or incomplete data. The LLM output must strictly follow a JSON format with the "query" and "answer" fields. Additionally, the function calls are checked for correct JSON parsing and valid arguments. Generated calls whose arguments or functions are not present in the given APIs are eliminated to reduce hallucination and improve data quality.
  • Stage 2: Execution Checker: Well-formatted function calls from Stage 1 are executed against the appropriate backend. Unsuccessful executions are filtered out, and fine-grained error messages are provided for failures.
  • Stage 3: Semantic Checker: Successful Stage 2 execution results, available functions, and the generated query are formatted and passed to another LLM to assess if the results semantically align with the query's objective. Data points that pass all three verification stages are regarded as high-quality and added back to improve future diverse data generation.
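The three stages above can be sketched as a small filter chain. This is a simplified illustration under stated assumptions, not the APIGen code: the one-function `API_LIBRARY`, the call layout (`name`, `arguments`), the stage labels, and the `judge` callable standing in for the second LLM are all hypothetical.

```python
import json


def add(a, b):
    return a + b


# Hypothetical API library: callable plus the argument names it accepts.
API_LIBRARY = {"add": {"func": add, "params": {"a", "b"}}}


def format_check(raw_output):
    """Stage 1: valid JSON, required fields, known functions and arguments."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not {"query", "answer"} <= set(data):
        return None
    if not isinstance(data["answer"], list):
        return None
    for call in data["answer"]:
        if not isinstance(call, dict) or not isinstance(call.get("name"), str):
            return None
        api = API_LIBRARY.get(call["name"])
        args = call.get("arguments")
        if api is None or not isinstance(args, dict) or not set(args) <= api["params"]:
            return None
    return data


def execution_check(data):
    """Stage 2: run every call; any exception rejects the data point."""
    results = []
    for call in data["answer"]:
        try:
            results.append(API_LIBRARY[call["name"]]["func"](**call["arguments"]))
        except Exception:
            return None
    return results


def verify(raw_output, judge):
    """Run all stages; `judge` is a placeholder for the LLM semantic checker."""
    data = format_check(raw_output)
    if data is None:
        return "fail_format"
    results = execution_check(data)
    if results is None:
        return "fail_execution"
    if not judge(data["query"], data["answer"], results):
        return "fail_semantic"
    return "pass"


good = '{"query": "What is 2 plus 3?", "answer": [{"name": "add", "arguments": {"a": 2, "b": 3}}]}'
print(verify(good, judge=lambda query, calls, results: True))
```

Ordering the checks from cheapest to most expensive means the LLM-based judgment only runs on data points that already parsed and executed.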

Methods to Improve Dataset Diversity

Encouraging diversity in training datasets is crucial for developing robust function-calling agents that can handle a wide range of real-world scenarios. In APIGen, we promote data diversity through multiple perspectives, including query style diversity, sampling diversity, and API diversity.

  • Query Style Diversity: APIGen's dataset is structured into four main categories: simple, multiple, parallel, and parallel multiple, each designed to challenge and enhance the model's capabilities in different usage scenarios. These categories are controlled by corresponding prompts and seed data.
  • Sampling Diversity: APIGen utilizes a sampling system designed to maximize the diversity and relevance of the generated datasets. This includes API Sampler, Example Sampler, and Prompt Sampler.

In APIGen, the number of examples and APIs sampled for each dataset iteration is randomly chosen from a predefined range. This randomization enhances dataset variability by preventing repetitive patterns and ensuring a broad coverage of scenarios.
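The randomized sampling described above could be sketched as follows. The ranges, the sample API and seed names, and the function `sample_generation_inputs` are assumptions for illustration; only the four query styles come from the text.

```python
import random


def sample_generation_inputs(apis, examples, templates, rng,
                             api_range=(1, 3), example_range=(1, 2)):
    """Draw a random number of APIs and seed examples, plus one prompt template.

    The ranges are inclusive and hypothetical; they must not exceed the
    number of available items.
    """
    n_apis = rng.randint(*api_range)
    n_examples = rng.randint(*example_range)
    return {
        "apis": rng.sample(apis, n_apis),
        "examples": rng.sample(examples, n_examples),
        "template": rng.choice(templates),
    }


# Placeholder names, not real entries from the dataset.
apis = ["get_weather", "convert_currency", "search_flights", "get_stock_price"]
examples = ["seed_qa_1", "seed_qa_2", "seed_qa_3"]
templates = ["simple", "multiple", "parallel", "parallel multiple"]

rng = random.Random(0)  # seeded so a run can be repeated
batches = [sample_generation_inputs(apis, examples, templates, rng) for _ in range(5)]

for batch in batches:
    print(batch)
```

Varying both how many items are drawn and which ones keeps consecutive prompts from repeating the same combination.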

Dataset API Sources

To ensure a high-quality and diverse dataset, we focused on collecting real-world APIs that could be readily executed and came with thorough documentation. We primarily sourced APIs from ToolBench, a comprehensive tool-use dataset that includes 16,464 REST APIs across 49 coarse-grained categories from RapidAPI Hub. This hub is a leading marketplace featuring a vast array of developer-contributed APIs.

To further enhance the usability and quality of the APIs, we perform the following filtering and cleaning procedures on the ToolBench dataset:

  • Data Quality Filtering: We removed APIs with incorrectly parsed documentation and those lacking required or optional parameters. APIs requiring no parameters were also excluded to maintain the challenge level appropriate for our dataset needs.
  • API Accessibility Testing: We tested API accessibility by making requests to each endpoint using example parameters provided in the dataset and through the Stable Toolbench server. APIs that could not be executed or returned errors, such as timeouts or invalid endpoints, were discarded.
  • Docstring Regeneration: To improve the quality of API documentation, we regenerated docstrings for the APIs that have noisy and unusable descriptions.
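The first two cleaning steps above can be sketched as a simple predicate over API records. The record fields (`description`, `required_parameters`, `optional_parameters`), the sample entries, and the `is_accessible` callback are hypothetical; in practice the accessibility test would issue a real request to the endpoint.

```python
def keep_api(api, is_accessible):
    """Return True if an API record survives the quality and access filters."""
    # An empty description stands in for "incorrectly parsed documentation".
    if not api.get("description"):
        return False
    params = api.get("required_parameters", []) + api.get("optional_parameters", [])
    # APIs that take no parameters are dropped.
    if not params:
        return False
    # Endpoints that error out or time out are dropped.
    return is_accessible(api)


# Hypothetical records for illustration only.
candidate_apis = [
    {"name": "get_quote", "description": "Fetch a stock quote.",
     "required_parameters": ["symbol"], "optional_parameters": []},
    {"name": "ping", "description": "Health check.",
     "required_parameters": [], "optional_parameters": []},
    {"name": "broken_docs", "description": "",
     "required_parameters": ["id"], "optional_parameters": []},
    {"name": "dead_endpoint", "description": "Returns news.",
     "required_parameters": ["topic"], "optional_parameters": ["limit"]},
]


def fake_accessibility_test(api):
    """Placeholder for a live request; pretends one endpoint is down."""
    return api["name"] != "dead_endpoint"


kept = [api["name"] for api in candidate_apis if keep_api(api, fake_accessibility_test)]
print(kept)
```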

dataset_pie_chart.png

After cleaning, we obtain 3,539 executable REST APIs with good documentation. Additionally, we incorporated Python functions as another API type, inspired by the executable evaluation categories of the Berkeley function-calling benchmark. We collected 134 well-documented Python functions covering diverse fields such as mathematics, finance, and data management. Sample API examples are provided in the supplementary material.

The original ToolBench dataset contained semantically overlapping categories such as Finance and Financial. We consolidated these into 21 distinct categories to ensure clarity and balance across the dataset. The pie chart above illustrates the distribution of the 3,673 executable APIs across these redefined categories, spanning sectors like technology, social sciences, education, and sports. This diverse collection of APIs provides a strong foundation for synthetic data generation and is a valuable asset for ensuring data quality and reliability.

Collection Setup and Dataset Details​

To validate the effectiveness of the APIGen framework, we generated datasets targeting various query styles. We utilized several base LLMs for data generation, including DeepSeek-V2-Chat (236B), DeepSeek-Coder-33B-Inst, Mixtral-8x22B-Inst, and Mixtral-8x7B-Inst. For each model, our target was to generate 40,000 data points by sampling different combinations of APIs, seed data, and prompt templates. To foster diversity in the generated responses, we set the generation temperature to 0.7 across all models. Examples of the prompt templates and APIs used are provided in the supplementary materials for reference.

The table below presents statistics for the data generation process with different models, including the total verified data point count and the number of data points filtered out at each verification stage. The filtering process removes many low-quality data points because of formatting issues, execution errors, or failure to pass the semantic check. The first two stages, the format checker and the execution checker, typically filter out the majority of low-quality data. These data points often have infeasible argument ranges, incorrect types, missing required parameters, or more severe issues such as hallucinated function calls or parameters. Our systematic verification process provides a rigorous way to reduce the occurrence of these situations.



| Model | Verified Data | Fail Format | Fail Execution | Fail Semantic | Pass Rate |
|---|---|---|---|---|---|
| DeepSeek-Coder-33B-Inst | 13,769 | 4,311 | 15,496 | 6,424 | 34.42% |
| Mixtral-8x7B-Inst | 15,385 | 3,311 | 12,341 | 7,963 | 38.46% |
| Mixtral-8x22B-Inst | 26,384 | 1,680 | 5,073 | 6,863 | 65.96% |
| DeepSeek-V2-Chat (236B) | 33,659 | 817 | 3,359 | 2,165 | 84.15% |

Filtering statistics for the generated datasets using different base LLMs.
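
The pass rates in the table appear to be the verified count divided by the 40,000-point generation target mentioned earlier. That reading is inferred from the numbers, not stated outright in the text, so treat the short check below as a sketch under that assumption.

```python
TARGET_PER_MODEL = 40_000  # generation target per model, as stated in the text

# (verified, fail_format, fail_execution, fail_semantic), copied from the table
STATS = {
    "DeepSeek-Coder-33B-Inst": (13_769, 4_311, 15_496, 6_424),
    "Mixtral-8x7B-Inst": (15_385, 3_311, 12_341, 7_963),
    "Mixtral-8x22B-Inst": (26_384, 1_680, 5_073, 6_863),
    "DeepSeek-V2-Chat (236B)": (33_659, 817, 3_359, 2_165),
}


def pass_rate(verified: int, target: int = TARGET_PER_MODEL) -> float:
    """Percentage of the generation target that survived all three checks."""
    return 100.0 * verified / target


for model, (verified, *_failed) in STATS.items():
    print(f"{model}: {pass_rate(verified):.2f}%")
```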

The semantic checker also plays a crucial role in filtering generated data that does not align with the query's objectives. For instance, if a user's query contains multiple requests, but the returned results only address one, or if the generated function-call data and execution results do not match the user's query, the data point will be filtered out. Including these data points in the training set for model training could potentially harm the performance, as demonstrated in the experiments.

We observe that stronger models like DeepSeek-V2-Chat and Mixtral-8x22B-Inst have better format-following capabilities and higher pass rates, while the two relatively smaller models have a much higher likelihood of producing data that cannot be executed. This suggests that when using weaker models to generate data, a strict verification process is recommended to filter out low-quality data.

We are releasing a dataset of approximately 60,000 high-quality function-calling data points generated from the two strongest models: Mixtral-8x22B-Inst and DeepSeek-V2-Chat (236B). The dataset includes all the query styles mentioned and covers a wide range of practical situations, with 3,673 diverse APIs across 21 categories. Each data point has been verified using real-world APIs to ensure its validity and usefulness. By making this dataset publicly available, we aim to benefit the research community and facilitate future work in this area.
 

A.I generated explanation:

**What's the problem?**
Creating good datasets for training AI models that can call functions (like APIs) is hard. These datasets need to be diverse, reliable, and high-quality.

**What's the solution?**
The authors created a system called APIGen, an automated pipeline for generating verifiable and diverse function-calling datasets. APIGen prompts LLMs with sampled combinations of APIs, seed data, and prompt templates, then runs every generated data point through a multi-stage verification process before keeping it.

**How does APIGen work?**
APIGen has three main components:

1. **Data Generation**: APIGen uses large language models (LLMs), such as DeepSeek-V2-Chat and Mixtral-8x22B-Inst, to generate function-calling data points from sampled APIs, seed data, and prompt templates.
2. **Multi-Stage Verification**: APIGen uses a three-stage verification process to ensure the generated data is high-quality and correct. The stages are:
   * **Format Checker**: Checks if the generated data is in the correct format.
   * **Execution Checker**: Checks if the generated function calls can be executed successfully.
   * **Semantic Checker**: Checks if the generated function calls align with the user's query objectives.
3. **Dataset Collection**: APIGen collects APIs from various sources, including ToolBench, a comprehensive tool-use dataset. The APIs are filtered and cleaned to ensure they are executable and have good documentation.
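
A rough sketch of how the three verification stages could chain together is below. It is not APIGen's actual code: the JSON layout (`query`, `calls`, `name`, `arguments`), the `registry` of callables, and the `semantic_check` callback (which in practice would likely be another LLM judging the result) are assumptions made for illustration.

```python
import json


def format_check(raw):
    """Stage 1: parse the output and check the assumed JSON layout."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not isinstance(data.get("query"), str):
        return None
    calls = data.get("calls")
    if not isinstance(calls, list) or not calls:
        return None
    for call in calls:
        if not isinstance(call, dict) or "name" not in call:
            return None
        if not isinstance(call.get("arguments"), dict):
            return None
    return data


def execution_check(data, registry):
    """Stage 2: actually run every call; any failure rejects the data point."""
    results = []
    for call in data["calls"]:
        fn = registry.get(call["name"])
        if fn is None:  # hallucinated function name
            return None
        try:
            results.append(fn(**call["arguments"]))
        except Exception:  # wrong argument names, types, or ranges
            return None
    return results


def verify(raw, registry, semantic_check):
    """Run the stages in order and report where the data point stopped."""
    data = format_check(raw)
    if data is None:
        return "fail_format"
    results = execution_check(data, registry)
    if results is None:
        return "fail_execution"
    # Stage 3: does the executed result actually answer the query?
    if not semantic_check(data["query"], data["calls"], results):
        return "fail_semantic"
    return "verified"
```

Running the cheap checks first means the most expensive stage, the semantic check, only sees data points that already parsed and executed.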

**What are the results?**
The authors generated datasets using APIGen and trained AI models on these datasets. The results show that the models trained on APIGen datasets outperform other models on the Berkeley Function-Calling Benchmark. The authors are releasing a dataset of approximately 60,000 verified function-calling data points to the research community.

**Why is this important?**
APIGen provides a scalable and structured way to generate high-quality datasets for function-calling applications. This can help advance the field of AI research and improve the performance of AI models in real-world scenarios.
 

1/12
LLM agents have demonstrated promise in their ability to automate computer tasks, but face challenges with multi-step reasoning and planning. Towards addressing this, we propose an inference-time tree search algorithm for LLM agents to explicitly perform exploration and multi-step planning in interactive web environments.

It is the first tree search algorithm for LLM agents that shows effectiveness on realistic and complex web environments: on the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%.

2/12
We show in ablations that spending more compute (increasing the search budget) improves success rate. Even doing a small amount of search (c=5) substantially improves over the baseline (24.5% to 32.0%), and using larger search budgets achieves even better results:

3/12
We also found that increasing the size of the search tree is essential: we need to expand search trees along both the depth (d) and breadth (b). Our best results are achieved with search trees of maximum depth 5 and branching factor 5:

4/12
Search also provides consistent improvements across a diverse set of sites in (Visual)WebArena, introducing a relative improvement on certain sites by as much as 50%!

5/12
Search can improve the robustness of agents by filtering out bad actions. Shown here is a trajectory where greedily picking the first sampled actions would have led to a failure (the path in the first row). Search avoids this by exploring and pruning less promising paths.

6/12
Here is another task on the WebArena CMS environment, where performing more exploration through search helps the model to identify a trajectory that is likely to be more successful:

7/12
Our method is compatible with any baseline LLM agents, and demonstrates gains for both gpt-4o and Llama-3.

I'm very excited to see how far we can scale search: this will be a key component for LLM agents to allow us to expend more inference-time compute for stronger results.

8/12
Project page: Tree Search for Language Model Agents
Paper: https://jykoh.com/search-agents/paper.pdf
Code: GitHub - kohjingyu/search-agents: Code for the paper 🌳 Tree Search for Language Model Agents

This work was done at CMU with @McaleerStephen @dan_fried @rsalakhu

9/12
How do LLM agents based on GPT-4 work? Are they implemented as Python code on top of GPT-4, or is there another model layered on top of GPT-4?

10/12
Pretty much yes, you can see our value function here: search-agents/agent/value_function.py at main · kohjingyu/search-agents

11/12
Congrats JY!! On the reading list for this week - I think the premise of the importance of search/exploration makes a ton of sense, excited about the great execution by u and the team!

12/12
Thanks John!
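
For anyone wondering what the search in this thread looks like in code, here is a toy best-first tree search with a search budget (c), maximum depth (d), and branching factor (b). It is a simplified sketch, not the authors' implementation: the states here are plain integers, and `propose_actions`, `step`, and `value_fn` are hypothetical stand-ins for the LLM agent sampling actions, the web environment, and the value function.

```python
import heapq
from itertools import count


def best_first_search(start, propose_actions, step, value_fn,
                      budget=5, max_depth=5, branching=5):
    """Expand the highest-valued frontier state until the budget is spent."""
    tie = count()  # tie-breaker so the heap never compares states directly
    # heapq is a min-heap, so values are stored negated.
    frontier = [(-value_fn(start), next(tie), 0, start, [])]
    best_value, best_path = float("-inf"), []
    expanded = 0
    while frontier and expanded < budget:
        neg_value, _, depth, state, path = heapq.heappop(frontier)
        expanded += 1
        if -neg_value > best_value:
            best_value, best_path = -neg_value, path
        if depth >= max_depth:
            continue  # do not grow the tree past the depth limit
        for action in propose_actions(state)[:branching]:
            nxt = step(state, action)
            heapq.heappush(
                frontier,
                (-value_fn(nxt), next(tie), depth + 1, nxt, path + [action]),
            )
    return best_path, best_value


# Toy usage: reach 10 from 1 using "+1" and "*2".
def toy_actions(state):
    return ["+1", "*2"]


def toy_step(state, action):
    return state + 1 if action == "+1" else state * 2


def toy_value(state):
    return -abs(10 - state)


path, value = best_first_search(1, toy_actions, toy_step, toy_value,
                                budget=20, max_depth=5, branching=2)
```

In a real web environment, returning to an earlier state is not free, so an actual agent also has to handle backtracking; that part is left out here.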





1/1
🚨 Paper Alert 🚨

➡️Paper Title: Tree Search for Language Model Agents

🌟Few pointers from the paper

🎯Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation.

🎯However, a fundamental challenge remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks.

🎯 Towards addressing this, authors of this paper have proposed an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Their approach is a form of best-first tree search that operates within the actual environment space, and is complementary with most existing state-of-the-art agents.

🎯 It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying their search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%.

🎯On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Their experiments highlight the effectiveness of search for web agents, and they demonstrated that performance scales with increased test-time compute.

🏢Organization: @CarnegieMellon

🧙Paper Authors: @kohjingyu , @McaleerStephen , @dan_fried , @rsalakhu

1️⃣Read the Full Paper here: [2407.01476] Tree Search for Language Model Agents

2️⃣Project Page: Tree Search for Language Model Agents

3️⃣Code: GitHub - kohjingyu/search-agents: Code for the paper 🌳 Tree Search for Language Model Agents

🎥 Be sure to watch the attached Demo Video -Sound on 🔊🔊

🎵 Music by Nick Valerson from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.








[Submitted on 1 Jul 2024]

Tree Search for Language Model Agents​

Jing Yu Koh, Stephen McAleer, Daniel Fried, Ruslan Salakhutdinov
Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards addressing this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary with most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Our experiments highlight the effectiveness of search for web agents, and we demonstrate that performance scales with increased test-time compute. We conduct a thorough analysis of our results to highlight improvements from search, limitations, and promising directions for future work. Our code and models are publicly released at this https URL.
Comments: 11 pages. Models and code available at this https URL
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2407.01476 [cs.AI] (or arXiv:2407.01476v1 [cs.AI] for this version)
[2407.01476] Tree Search for Language Model Agents

Submission history

From: Jing Yu Koh
[v1] Mon, 1 Jul 2024 17:07:55 UTC (2,417 KB)

 

Computer Science > Artificial Intelligence​

[Submitted on 3 Jul 2024]

Enhancements for Real-Time Monte-Carlo Tree Search in General Video Game Playing​

Dennis J.N.J. Soemers, Chiara F. Sironi, Torsten Schuster, Mark H.M. Winands
General Video Game Playing (GVGP) is a field of Artificial Intelligence where agents play a variety of real-time video games that are unknown in advance. This limits the use of domain-specific heuristics. Monte-Carlo Tree Search (MCTS) is a search technique for game playing that does not rely on domain-specific knowledge. This paper discusses eight enhancements for MCTS in GVGP: Progressive History, N-Gram Selection Technique, Tree Reuse, Breadth-First Tree Initialization, Loss Avoidance, Novelty-Based Pruning, Knowledge-Based Evaluations, and Deterministic Game Detection. Some of these are known from existing literature, and are either extended or introduced in the context of GVGP, and some are novel enhancements for MCTS. Most enhancements are shown to provide statistically significant increases in win percentages when applied individually. When combined, they increase the average win percentage over sixty different games from 31.0% to 48.4% in comparison to a vanilla MCTS implementation, approaching a level that is competitive with the best agents of the GVG-AI competition in 2015.
Comments: Green Open Access version of conference paper published in 2016
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2407.03049 [cs.AI] (or arXiv:2407.03049v1 [cs.AI] for this version)
[2407.03049] Enhancements for Real-Time Monte-Carlo Tree Search in General Video Game Playing
Journal reference: 2016 IEEE Conference on Computational Intelligence and Games (CIG 2016), pp. 436-443
Related DOI: https://doi.org/10.1109/CIG.2016.7860448

Submission history

From: Dennis Soemers
[v1] Wed, 3 Jul 2024 12:18:28 UTC (1,092 KB)


 


1/2
The study introduces a new guided tree search algorithm aimed at improving LLM performance on mathematical reasoning tasks while reducing computational costs compared to existing methods. By incorporating dynamic node selection, exploration budget calculation,...

2/2
LiteSearch: Efficacious Tree Search for LLM





1/2
LiteSearch: Efficacious Tree Search for LLM. [2407.00320] LiteSearch: Efficacious Tree Search for LLM

2/2
AI Summary: The study introduces a new guided tree search algorithm aimed at improving LLM performance on mathematical reasoning tasks while reducing computational costs compared to existing methods. By inco...
LiteSearch: Efficacious Tree Search for LLM






Computer Science > Computation and Language​

[Submitted on 29 Jun 2024]

LiteSearch: Efficacious Tree Search for LLM​

Ante Wang, Linfeng Song, Ye Tian, Baolin Peng, Dian Yu, Haitao Mi, Jinsong Su, Dong Yu
Recent research suggests that tree search algorithms (e.g. Monte Carlo Tree Search) can dramatically boost LLM performance on complex mathematical reasoning tasks. However, they often require more than 10 times the computational resources of greedy decoding due to wasteful search strategies, making them difficult to deploy in practical applications. This study introduces a novel guided tree search algorithm with dynamic node selection and node-level exploration budget (maximum number of children) calculation to tackle this issue. By considering the search progress towards the final answer (history) and the guidance from a value network (future) trained without any step-wise annotations, our algorithm iteratively selects the most promising tree node before expanding it within the boundaries of the allocated computational budget. Experiments conducted on the GSM8K and TabMWP datasets demonstrate that our approach not only offers competitive performance but also enjoys significantly lower computational costs compared to baseline methods.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2407.00320 [cs.CL] (or arXiv:2407.00320v1 [cs.CL] for this version)
[2407.00320] LiteSearch: Efficacious Tree Search for LLM

Submission history

From: Linfeng Song
[v1] Sat, 29 Jun 2024 05:14:04 UTC (640 KB)


 


1/1
[LG] Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Y Sun, X Li, K Dalal, J Xu… [Stanford University & UC San Diego & UC Berkeley] (2024)
[2407.04620] Learning to (Learn at Test Time): RNNs with Expressive Hidden States

- This paper proposes TTT (Test-Time Training) layers, a new class of sequence modeling layers where the hidden state is a model, and the update rule is self-supervised learning.

- The key idea is to make the hidden state a machine learning model itself, and the update rule a gradient step on a self-supervised loss. Updating the hidden state on a test sequence is equivalent to training the model at test time, which is how TTT layers get their name.

- Two simple instantiations are proposed: TTT-Linear, where the hidden state model is a linear model, and TTT-MLP, where it is a 2-layer MLP. These can be integrated into any network architecture like RNN layers.

- TTT layers have better perplexity and use of long context compared to Mamba RNNs, and lower latency compared to Transformers after 8k context length.

- Practical innovations like mini-batch TTT and a dual form are proposed to improve hardware efficiency on modern GPUs and TPUs. The dual form allows computing output tokens directly without materializing intermediate variables.

- Theoretical connections are made between TTT layers and existing concepts like attention, fast weights, and meta learning. The outer loop of TTT can be seen as learning a good self-supervised task for the inner loop.
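
To make the "hidden state is a model" idea concrete, here is a toy sketch in the spirit of TTT-Linear. The hidden state is a weight matrix W, and each incoming token triggers one gradient step on a self-supervised loss. The loss used here, plain reconstruction ||W x - x||^2, is a simplifying assumption for illustration; per the summary above, the real self-supervised task is learned in the outer loop, and mini-batch TTT and the dual form are omitted.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ttt_linear(tokens, lr=0.1):
    """Process a sequence, updating the hidden-state model on every token."""
    d = len(tokens[0])
    W = [[0.0] * d for _ in range(d)]  # hidden state = weights of a linear model
    outputs, losses = [], []
    for x in tokens:
        # Self-supervised loss on the current token (assumed: reconstruction).
        err = [p - xi for p, xi in zip(matvec(W, x), x)]  # W x - x
        losses.append(sum(e * e for e in err))
        # Update rule: one gradient step; d/dW ||W x - x||^2 = 2 * err * x^T.
        for i in range(d):
            for j in range(d):
                W[i][j] -= lr * 2.0 * err[i] * x[j]
        # Output token is produced by the freshly updated hidden state.
        outputs.append(matvec(W, x))
    return outputs, losses


# Feeding the same token repeatedly: the loss shrinks as W "learns" it.
outputs, losses = ttt_linear([[1.0, 0.0]] * 5, lr=0.1)
```

Unlike a fixed-size vector state in a classic RNN, the state here keeps training as the sequence gets longer, which is the intuition behind the long-context claim.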


 


1/2
[LG] Mixture of A Million Experts
X O He [Google DeepMind] (2024)
[2407.04153] Mixture of A Million Experts

- The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows.

- Sparse mixture-of-experts (MoE) architectures have emerged to address this issue by decoupling model size from computational cost, but are limited to a small number of experts due to computational and optimization challenges.

- This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million).

- Experiments demonstrate PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off.

- By enabling efficient utilization of a massive number of experts, PEER unlocks potential for further scaling of transformer models while maintaining computational efficiency.
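
The product key technique mentioned above is what makes retrieval from a huge expert pool cheap. As it is commonly described, the query is split into two halves, each half is scored against its own small set of sub-keys, and only the top candidates from each side are combined. The sketch below shows that retrieval step only; the function names and the expert index layout (`ia * n_b + ib`) are assumptions for illustration, not PEER's actual code.

```python
import heapq
from itertools import product


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(scores, k):
    """Return (score, index) pairs for the k largest scores."""
    return heapq.nlargest(k, ((s, i) for i, s in enumerate(scores)))


def product_key_top_k(query, sub_keys_a, sub_keys_b, k):
    """Find the top-k of len(a) * len(b) experts by scoring only len(a) + len(b) sub-keys."""
    half = len(query) // 2
    q_a, q_b = query[:half], query[half:]
    # Score each half of the query against its own sub-key set.
    best_a = top_k([dot(q_a, key) for key in sub_keys_a], k)
    best_b = top_k([dot(q_b, key) for key in sub_keys_b], k)
    n_b = len(sub_keys_b)
    # An expert's score is the sum of its two sub-key scores, so the overall
    # top-k must lie among the k * k combinations of the per-half top-k.
    candidates = [
        (score_a + score_b, ia * n_b + ib)
        for (score_a, ia), (score_b, ib) in product(best_a, best_b)
    ]
    return heapq.nlargest(k, candidates)  # (score, expert_index) pairs
```

As a rough sense of scale, two sub-key sets of 1,024 entries each address 1,024 × 1,024 = 1,048,576 experts while only 2,048 sub-keys are scored per query.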

2/2
Dark mode for this paper 🌚 Mixture of A Million Experts

