bnew



Nous-Capybara-34B V1.9

This is trained on the Yi-34B model with 200K context length, for 3 epochs on the Capybara dataset!

First 34B Nous model and first 200K context length Nous model!


The Capybara series is the first Nous collection of models made by fine-tuning mostly on data created by Nous in-house.

We leverage our novel data synthesis technique called Amplify-Instruct (paper coming soon). The seed distribution and synthesis method combine top-performing existing data synthesis techniques and distributions used for SOTA models such as Airoboros, Evol-Instruct (WizardLM), Orca, Vicuna, Know_Logic, Lamini, and FLASK into one lean, holistically formed methodology for the dataset and model. The seed instructions used to start synthesized conversations are largely based on highly regarded datasets like Airoboros, Know_Logic, EverythingLM, and GPTeacher, plus entirely new seed instructions derived from posts on the website LessWrong, supplemented with certain in-house multi-turn datasets like Dove (a successor to Puffin).

While it performs well in its current state, the dataset used for fine-tuning is contained entirely within 20K training examples, roughly 10 times smaller than many similarly performing current models. This is significant for the scaling implications of our next generation of models once we scale our novel synthesis methods to significantly more examples.

Process of creation and special thank yous!​

This model was fine-tuned by Nous Research as part of the Capybara/Amplify-Instruct project led by Luigi D.(LDJ) (Paper coming soon), as well as significant dataset formation contributions by J-Supha and general compute and experimentation management by Jeffrey Q. during ablations.

Special thank you to A16Z for sponsoring our training, as well as Yield Protocol for their support in financially sponsoring resources during the R&D of this project.

Thank you to those of you that have indirectly contributed!​

While most of the tokens within Capybara are newly synthesized and part of datasets like Puffin/Dove, we would like to credit the single-turn datasets we leveraged as seeds to generate the multi-turn data as part of the Amplify-Instruct synthesis.

The datasets shown in green below are the ones we sampled from to curate seeds used during Amplify-Instruct synthesis for this project.

Datasets in Blue are in-house curations that previously existed prior to Capybara.

[Figure: Capybara seed and in-house dataset diagram]

Prompt Format​

The recommended model usage is:

Prefix: USER:

Suffix: ASSISTANT:

Stop token: </s>
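
A minimal usage sketch with Hugging Face transformers is shown below; the repository id and generation settings are assumptions rather than part of the model card:

```python
# Minimal sketch of the USER:/ASSISTANT: format described above. The repo id
# and generation settings are assumptions, not taken from the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "NousResearch/Nous-Capybara-34B"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

prompt = "USER: Explain the Amplify-Instruct idea in two sentences. ASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens and cut at the card's stop token.
text = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(text.split("</s>")[0].strip())
```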

Multi-Modality!​

Notable Features:​

  • Uses the Yi-34B model as the base, which is trained for 200K context length!
  • Over 60% of the dataset consists of multi-turn conversations. (Most models are still only trained for single-turn conversations, with no back-and-forths!)
  • Over 1,000 tokens average per conversation example! (Most models are trained on conversation data that is less than 300 tokens per example.)
  • Able to effectively produce complex summaries of advanced topics and studies. (Trained on hundreds of difficult summary tasks developed in-house.)
  • Ability to recall information up to late 2022 without internet access.
  • Includes a portion of conversational data synthesized from LessWrong posts, discussing in-depth details and philosophies about the nature of reality, reasoning, rationality, self-improvement, and related concepts.

 

bnew



NVIDIA H200 Tensor Core GPU​

The world’s most powerful GPU for supercharging AI and HPC workloads.




The World’s Most Powerful GPU​

The NVIDIA H200 Tensor Core GPU supercharges generative AI and high-performance computing (HPC) workloads with game-changing performance and memory capabilities. As the first GPU with HBM3e, the H200’s larger and faster memory fuels the acceleration of generative AI and large language models (LLMs) while advancing scientific computing for HPC workloads.

NVIDIA Supercharges Hopper, the World’s Leading AI Computing Platform​

Based on the NVIDIA Hopper™ architecture, the NVIDIA HGX H200 features the NVIDIA H200 Tensor Core GPU with advanced memory to handle massive amounts of data for generative AI and high-performance computing workloads.

Highlights​

Experience Next-Level Performance​

  • Llama2 70B inference: 1.9X faster
  • GPT-3 175B inference: 1.6X faster
  • High-performance computing: 110X faster



H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM

TensorRT-LLM evaluation of the new H200 GPU achieves 11,819 tokens/s on Llama2-13B on a single GPU. H200 is up to 1.9x faster than H100. This performance is enabled by H200's larger, faster HBM3e memory.

H200 FP8 Max throughput

Model | Batch Size(1) | TP(2) | Input Length | Output Length | Throughput (out tok/s)
llama_13b | 1024 | 1 | 128 | 128 | 11,819
llama_13b | 128 | 1 | 128 | 2048 | 4,750
llama_13b | 64 | 1 | 2048 | 128 | 1,349
llama_70b | 512 | 1 | 128 | 128 | 3,014
llama_70b | 512 | 4 | 128 | 2048 | 6,616
llama_70b | 64 | 2 | 2048 | 128 | 682
llama_70b | 32 | 1 | 2048 | 128 | 303
Preliminary measured performance, subject to change. TensorRT-LLM v0.5.0, TensorRT v9.1.0.4 | H200, H100 FP8.

(1) Largest batch supported on given TP configuration by power of 2. (2) TP = Tensor Parallelism

Additional performance data is available on the NVIDIA Data Center Deep Learning Product Performance page, and soon in TensorRT-LLM's performance documentation.

H200 vs H100

H200's larger-capacity, faster HBM3e memory enables up to 1.9x the LLM performance of H100. Max throughput improves because it depends on memory capacity and bandwidth, both of which benefit from the new HBM3e. First-token latency is compute-bound for most input sequence lengths (ISLs), so H200 retains a time to first token similar to H100.

For practical examples of H200's performance:

Max Throughput TP1: an offline summarization scenario (ISL/OSL=2048/128) with Llama-70B on a single H200 is 1.9x more performant than H100.

Max Throughput TP8: an online chat agent scenario (ISL/OSL=80/200) with GPT3-175B on a full HGX (TP8) H200 is 1.6x more performant than H100.

[Chart: max throughput, Llama TP1]

Preliminary measured performance, subject to change. TensorRT-LLM v0.5.0, TensorRT v9.1.0.4. | Llama-70B: H100 FP8 BS 8, H200 FP8 BS 32 | GPT3-175B: H100 FP8 BS 64, H200 FP8 BS 128

Max Throughput across TP/BS: Max throughput(3) on H200 vs. H100 varies by model, sequence lengths, batch size (BS), and TP. The results below show maximum throughput per GPU across all of these variables.

[Chart: max throughput, Llama sweep across TP/BS]

Preliminary measured performance, subject to change. TensorRT-LLM v0.5.0, TensorRT v9.1.0.4 | H200, H100 FP8.

(3) Max Throughput per GPU is defined as the highest tok/s per GPU, swept across TP configurations & BS powers of 2.

Latest HBM Memory

H200 is the newest addition to NVIDIA’s data center GPU portfolio. To maximize compute performance, H200 is the first GPU with HBM3e memory, delivering 4.8 TB/s of memory bandwidth, a 1.4X increase over H100. H200 also expands GPU memory capacity to 141 gigabytes (GB), nearly 2X that of H100. The combination of faster and larger HBM memory accelerates LLM inference, delivering higher throughput in tokens per second. These results are measured and preliminary; more updates are expected as TensorRT-LLM optimizations for H200 continue.
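
As a rough back-of-envelope illustration of why decode throughput tracks memory bandwidth (a sketch only: the H100 bandwidth figure and the FP8 weight size are assumptions, and real throughput also depends on KV-cache traffic and batching):

```python
# Roofline-style estimate: in the decode phase each generated token must stream
# the model weights from HBM, so tokens/s per sequence is bounded by
# memory bandwidth / bytes read per token.
H200_BW = 4.8e12       # bytes/s (4.8 TB/s, from the text above)
H100_BW = 3.35e12      # bytes/s (assumed H100 SXM figure)
LLAMA_70B_FP8 = 70e9   # ~70 GB of weights at 1 byte per parameter (assumption)

def decode_tokens_per_s(bandwidth_bytes, weight_bytes):
    """Upper bound on single-sequence decode rate if weight reads dominate."""
    return bandwidth_bytes / weight_bytes

h200 = decode_tokens_per_s(H200_BW, LLAMA_70B_FP8)
h100 = decode_tokens_per_s(H100_BW, LLAMA_70B_FP8)
print(f"H200 bound: {h200:.0f} tok/s, H100 bound: {h100:.0f} tok/s, ratio {h200 / h100:.2f}x")
# Batching amortizes the weight reads across many sequences, which is why the
# measured max-throughput numbers above are far higher than this single-sequence bound.
```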

 

bnew


GraphCast: AI model for faster and more accurate global weather forecasting​

Published 14 November 2023

Authors: Remi Lam, on behalf of the GraphCast team

GraphCast global weather forecasting of surface wind speed

Our state-of-the-art model delivers 10-day weather predictions at unprecedented accuracy in under one minute

The weather affects us all, in ways big and small. It can dictate how we dress in the morning, provide us with green energy and, in the worst cases, create storms that can devastate communities. In a world of increasingly extreme weather, fast and accurate forecasts have never been more important.

In a paper published in Science, we introduce GraphCast, a state-of-the-art AI model able to make medium-range weather forecasts with unprecedented accuracy. GraphCast predicts weather conditions up to 10 days in advance more accurately and much faster than the industry gold-standard weather simulation system – the High Resolution Forecast (HRES), produced by the European Centre for Medium-Range Weather Forecasts (ECMWF).

GraphCast can also offer earlier warnings of extreme weather events. It can predict the tracks of cyclones with great accuracy further into the future, identify atmospheric rivers associated with flood risk, and predict the onset of extreme temperatures. This ability has the potential to save lives through greater preparedness.

GraphCast takes a significant step forward in AI for weather prediction, offering more accurate and efficient forecasts, and opening paths to support decision-making critical to the needs of our industries and societies. And, by open sourcing the model code for GraphCast, we are enabling scientists and forecasters around the world to benefit billions of people in their everyday lives. GraphCast is already being used by weather agencies, including ECMWF, which is running a live experiment of our model’s forecasts on its website.




A selection of GraphCast’s predictions rolling across 10 days showing specific humidity at 700 hectopascals (about 3 km above surface), surface temperature, and surface wind speed.

The challenge of global weather forecasting​

Weather prediction is one of the oldest and most challenging scientific endeavours. Medium-range predictions are important to support key decision-making across sectors, from renewable energy to event logistics, but they are difficult to make accurately and efficiently.

Forecasts typically rely on Numerical Weather Prediction (NWP), which begins with carefully defined physics equations, which are then translated into computer algorithms run on supercomputers. While this traditional approach has been a triumph of science and engineering, designing the equations and algorithms is time-consuming and requires deep expertise, as well as costly compute resources to make accurate predictions.

Deep learning offers a different approach: using data instead of physical equations to create a weather forecast system. GraphCast is trained on decades of historical weather data to learn a model of the cause and effect relationships that govern how Earth’s weather evolves, from the present into the future.

Crucially, GraphCast and traditional approaches go hand-in-hand: we trained GraphCast on four decades of weather reanalysis data from ECMWF’s ERA5 dataset. This trove is based on historical weather observations such as satellite images, radar, and weather stations, and uses a traditional NWP system to ‘fill in the blanks’ where observations are incomplete, reconstructing a rich record of global historical weather.

GraphCast: An AI model for weather prediction​

GraphCast is a weather forecasting system based on machine learning and Graph Neural Networks (GNNs), which are a particularly useful architecture for processing spatially structured data.

GraphCast makes forecasts at the high resolution of 0.25 degrees longitude/latitude (28km x 28km at the equator). That’s more than a million grid points covering the entire Earth’s surface. At each grid point the model predicts five Earth-surface variables – including temperature, wind speed and direction, and mean sea-level pressure – and six atmospheric variables at each of 37 levels of altitude, including specific humidity, wind speed and direction, and temperature.

While GraphCast’s training was computationally intensive, the resulting forecasting model is highly efficient. Making 10-day forecasts with GraphCast takes less than a minute on a single Google TPU v4 machine. For comparison, a 10-day forecast using a conventional approach, such as HRES, can take hours of computation in a supercomputer with hundreds of machines.

In a comprehensive performance evaluation against the gold-standard deterministic system, HRES, GraphCast provided more accurate predictions on more than 90% of 1380 test variables and forecast lead times (see our Science paper for details). When we limited the evaluation to the troposphere, the 6-20 kilometer high region of the atmosphere nearest to Earth’s surface where accurate forecasting is most important, our model outperformed HRES on 99.7% of the test variables for future weather.


For inputs, GraphCast requires just two sets of data: the state of the weather 6 hours ago, and the current state of the weather. The model then predicts the weather 6 hours in the future. This process can then be rolled forward in 6-hour increments to provide state-of-the-art forecasts up to 10 days in advance.
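
Schematically, that rollout looks like the sketch below; graphcast_step is a hypothetical stand-in for the learned model, not the released API:

```python
import numpy as np

GRID = (721, 1440)  # 0.25-degree lat/lon grid: ~1.04 million points per level

def graphcast_step(state_6h_ago: np.ndarray, state_now: np.ndarray) -> np.ndarray:
    """Stand-in for the learned 6-hour step.

    Here it just returns a persistence forecast so the sketch runs; GraphCast
    replaces this with a graph neural network trained on ERA5.
    """
    return state_now.copy()

def forecast_10_days(state_6h_ago: np.ndarray, state_now: np.ndarray) -> list:
    """Roll the one-step model forward 40 x 6 h = 240 h = 10 days."""
    history = [state_6h_ago, state_now]
    for _ in range(40):
        history.append(graphcast_step(history[-2], history[-1]))
    return history[2:]  # the 40 predicted 6-hour frames

frames = forecast_10_days(np.zeros(GRID), np.zeros(GRID))
print(len(frames), "six-hour steps =", len(frames) * 6, "hours")
```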

Better warnings for extreme weather events​

Our analyses revealed that GraphCast can also identify severe weather events earlier than traditional forecasting models, despite not having been trained to look for them. This is a prime example of how GraphCast could help with preparedness to save lives and reduce the impact of storms and extreme weather on communities.

By applying a simple cyclone tracker directly onto GraphCast forecasts, we could predict cyclone movement more accurately than the HRES model. In September, a live version of our publicly available GraphCast model, deployed on the ECMWF website, accurately predicted about nine days in advance that Hurricane Lee would make landfall in Nova Scotia. By contrast, traditional forecasts had greater variability in where and when landfall would occur, and only locked in on Nova Scotia about six days in advance.

GraphCast can also characterize atmospheric rivers – narrow regions of the atmosphere that transfer most of the water vapour outside of the tropics. The intensity of an atmospheric river can indicate whether it will bring beneficial rain or a flood-inducing deluge. GraphCast forecasts of atmospheric rivers, combined with AI models that forecast floods, could help in planning emergency responses.

Finally, predicting extreme temperatures is of growing importance in our warming world. GraphCast can characterize when the heat is set to rise above the historical top temperatures for any given location on Earth. This is particularly useful in anticipating heat waves, disruptive and dangerous events that are becoming increasingly common.


Severe-event prediction - how GraphCast and HRES compare.


Left: Cyclone tracking performances. As the lead time for predicting cyclone movements grows, GraphCast maintains greater accuracy than HRES.


Right: Atmospheric river prediction. GraphCast’s prediction errors are markedly lower than HRES’s for the entirety of their 10-day predictions.

The future of AI for weather​

GraphCast is now the most accurate 10-day global weather forecasting system in the world, and can predict extreme weather events further into the future than was previously possible. As the weather patterns evolve in a changing climate, GraphCast will evolve and improve as higher quality data becomes available.

To make AI-powered weather forecasting more accessible, we’ve open sourced our model’s code. ECMWF is already experimenting with GraphCast’s 10-day forecasts and we’re excited to see the possibilities it unlocks for researchers – from tailoring the model for particular weather phenomena to optimizing it for different parts of the world.

GraphCast joins other state-of-the-art weather prediction systems from Google DeepMind and Google Research, including a regional Nowcasting model that produces forecasts up to 90 minutes ahead, and MetNet-3, a regional weather forecasting model already in operation across the US and Europe that produces more accurate 24-hour forecasts than any other system.

Pioneering the use of AI in weather forecasting will benefit billions of people in their everyday lives. But our wider research is not just about anticipating weather – it’s about understanding the broader patterns of our climate. By developing new tools and accelerating research, we hope AI can empower the global community to tackle our greatest environmental challenges.


 

bnew









LCM-LoRA: A Universal Stable-Diffusion Acceleration Module​

Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolinário Passos, Longbo Huang, Jian Li, Hang Zhao
Latent Consistency Models (LCMs) have achieved impressive performance in accelerating text-to-image generative tasks, producing high-quality images with minimal inference steps. LCMs are distilled from pre-trained latent diffusion models (LDMs), requiring only ~32 A100 GPU training hours. This report further extends LCMs' potential in two aspects: First, by applying LoRA distillation to Stable-Diffusion models including SD-V1.5, SSD-1B, and SDXL, we have expanded LCM's scope to larger models with significantly less memory consumption, achieving superior image generation quality. Second, we identify the LoRA parameters obtained through LCM distillation as a universal Stable-Diffusion acceleration module, named LCM-LoRA. LCM-LoRA can be directly plugged into various Stable-Diffusion fine-tuned models or LoRAs without training, thus representing a universally applicable accelerator for diverse image generation tasks. Compared with previous numerical PF-ODE solvers such as DDIM, DPM-Solver, LCM-LoRA can be viewed as a plug-in neural PF-ODE solver that possesses strong generalization abilities. Project page: this https URL.
Comments: Technical Report
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2311.05556 [cs.CV] (or arXiv:2311.05556v1 [cs.CV] for this version)
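
For readers who want to try it, the usage pattern with Hugging Face diffusers looks roughly like this; the checkpoint names and the step/guidance settings are assumptions based on the publicly released LCM-LoRA weights, not details from the abstract:

```python
import torch
from diffusers import DiffusionPipeline, LCMScheduler

# Load a Stable Diffusion XL pipeline, swap in the LCM scheduler, then attach
# the LCM-LoRA weights as the plug-in acceleration module described above.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")  # assumed repo id

# Few-step sampling: LCM-LoRA targets roughly 4-8 steps with low guidance.
image = pipe(
    "a photo of a capybara wearing a hat",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
image.save("lcm_lora_sample.png")
```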

 

bnew



Training of 1-Trillion Parameter Scientific AI Begins​

By Agam Shah

November 13, 2023

A US national lab has started training a massive AI brain that could ultimately become the must-have computing resource for scientific researchers.

Argonne National Laboratory (ANL) is creating a generative AI model called AuroraGPT and is pouring a giant mass of scientific information into creating the brain.

The model is being trained on ANL's Aurora supercomputer, which delivers more than half an exaflop of performance. The system's main computing power comes from Intel’s Ponte Vecchio GPUs.

Intel and ANL are partnering with other labs in the US and worldwide to make scientific AI a reality.
“It combines all the text, codes, specific scientific results, papers, into the model that science can use to speed up research,” said Ogi Brkic, vice president and general manager for data center and HPC solutions, in a press briefing.

Brkic called the model “ScienceGPT,” indicating it will have a chatbot interface, and researchers can submit questions and get responses.

Chatbots could help in a wide range of scientific research, including biology, cancer research, and climate change.


Example ChatGPT v3.5 Question and Answer

Training a model with complex data can take time and massive computing resources. ANL and Intel are in the early stages of testing the hardware before putting the model into full training mode.

While it will operate like ChatGPT, it is unclear if the generative model will be multimodal or whether it will generate images and videos. Inference will also be a big part of the system as scientists seek answers from the chatbot and feed more information into the model.

Training AuroraGPT has just started and could take months to complete. The training is currently limited to 256 nodes, which will then be scaled to all of the nodes — about 10,000 — of the Aurora supercomputer.

OpenAI has not shared details on how long it took to train GPT-4, which takes place on Nvidia GPUs. In May, Google said it was training its upcoming large-language model called Gemini, which is likely happening on its TPUs.

The biggest challenge in training large language models is the memory requirement; in most cases, training must be sliced into smaller pieces across a wide range of GPUs. AuroraGPT training is enabled by Microsoft’s Megatron/DeepSpeed, which does exactly that and keeps the training running in parallel.
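
To make the sharding idea concrete, here is a hedged sketch of how such a setup is commonly expressed with DeepSpeed; this is an illustrative ZeRO configuration, not ANL's actual AuroraGPT configuration:

```python
# Illustrative DeepSpeed-style configuration: ZeRO stage 3 partitions optimizer
# state, gradients, and parameters across GPUs so no single device must hold a
# full trillion-parameter model. This is NOT the actual AuroraGPT setup.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # shard parameters, gradients, and optimizer state
    },
}

# Typical initialization pattern (model comes from Megatron or any PyTorch module):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config
# )
```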

Intel and ANL are testing the 1-trillion parameter model training on a string of 64 Aurora nodes.
“The number of nodes is lower than we typically see on these large language models… because [of the] unique Aurora design,” Brkic said.

Intel has worked with Microsoft on fine-tuning the software and hardware, so the training can scale to all nodes. The goal is to extend this to the entire system of 10,000 plus nodes.

Intel also hopes to achieve linear scaling so the performance increases as the number of nodes increases.

Brkic said Intel's Ponte Vecchio GPUs outperformed Nvidia’s A100 GPUs in another Argonne supercomputer called Theta, which has a peak performance of 11.7 petaflops.
 

bnew



Shared November 13, 2023

Text-to-music generation models are now capable of generating high-quality music audio in broad styles. However, text control is primarily suitable for the manipulation of global musical attributes like genre, mood, and tempo, and is less suitable for precise control over time-varying attributes such as the positions of beats in time or the changing dynamics of the music. We propose Music ControlNet, a diffusion-based music generation model that offers multiple precise, time-varying controls over generated audio. To imbue text-to-music models with time-varying control, we propose an approach analogous to pixel-wise control of the image-domain ControlNet method. Specifically, we extract controls from training audio yielding paired data, and fine-tune a diffusion-based conditional generative model over audio spectrograms given melody, dynamics, and rhythm controls. While the image-domain Uni-ControlNet method already allows generation with any subset of controls, we devise a new strategy to allow creators to input controls that are only partially specified in time. We evaluate both on controls extracted from audio and controls we expect creators to provide, demonstrating that we can generate realistic music that corresponds to control inputs in both settings. While few comparable music generation models exist, we benchmark against MusicGen, a recent model that accepts text and melody input, and show that our model generates music that is 49% more faithful to input melodies despite having 35x fewer parameters, training on 11x less data, and enabling two additional forms of time-varying control. Sound examples can be found at MusicControlNet.github.io/web/.
 

bnew


The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4​

Published on Nov 13
·Featured in Daily Papers on Nov 13
Authors:
Microsoft Research AI4Science,
Microsoft Azure Quantum

Abstract​

In recent years, groundbreaking advancements in natural language processing have culminated in the emergence of powerful large language models (LLMs), which have showcased remarkable capabilities across a vast array of domains, including the understanding, generation, and translation of natural language, and even tasks that extend beyond language processing. In this report, we delve into the performance of LLMs within the context of scientific discovery, focusing on GPT-4, the state-of-the-art language model. Our investigation spans a diverse range of scientific areas encompassing drug discovery, biology, computational chemistry (density functional theory (DFT) and molecular dynamics (MD)), materials design, and partial differential equations (PDE). Evaluating GPT-4 on scientific tasks is crucial for uncovering its potential across various research domains, validating its domain-specific expertise, accelerating scientific progress, optimizing resource allocation, guiding future model development, and fostering interdisciplinary research. Our exploration methodology primarily consists of expert-driven case assessments, which offer qualitative insights into the model's comprehension of intricate scientific concepts and relationships, and occasionally benchmark testing, which quantitatively evaluates the model's capacity to solve well-defined domain-specific problems. Our preliminary exploration indicates that GPT-4 exhibits promising potential for a variety of scientific applications, demonstrating its aptitude for handling complex problem-solving and knowledge integration tasks. Broadly speaking, we evaluate GPT-4's knowledge base, scientific understanding, scientific numerical calculation abilities, and various scientific prediction capabilities.



Our observations​

GPT-4 demonstrates considerable potential in various scientific domains, including drug discovery, biology, computational chemistry, materials design, and PDEs. Its capabilities span a wide range of tasks and it exhibits an impressive understanding of key concepts in each domain.

The output of GPT-4 depends on several variables such as the model version, system messages, and hyperparameters like the decoding temperature. Thus, one might observe different responses for the same cases examined in this report. For the majority of this report, we primarily utilized GPT-4 version 0314, with a few cases employing version 0613.
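
As a practical aside, here is a minimal sketch of pinning those variables with the pre-1.0 OpenAI Python client; the prompts are made-up examples, and only the caveat about versions and temperature comes from the report:

```python
import openai

# Pin the exact model snapshot, system message, and decoding temperature so
# that results are as repeatable as the API allows (per the caveat above).
response = openai.ChatCompletion.create(
    model="gpt-4-0314",   # the snapshot used for most of the report
    temperature=0,        # as-deterministic-as-possible decoding
    messages=[
        {"role": "system", "content": "You are an expert computational chemist."},
        {"role": "user", "content": "Suggest a DFT functional for a CO2 adsorption study."},
    ],
)
print(response["choices"][0]["message"]["content"])
```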


The qualitative approach used in this report mainly refers to case studies. It is related to, but not identical to, qualitative methods in social science research.


In drug discovery, GPT-4 shows a comprehensive grasp of the field, enabling it to provide useful insights and suggestions across a wide range of tasks. It is helpful in predicting drug-target binding affinity, molecular properties, and retrosynthesis routes. It also has the potential to generate novel molecules with desired properties, which can lead to the discovery of new drug candidates with the potential to address unmet medical needs. However, it is important to be aware of GPT-4’s limitations, such as challenges in processing SMILES sequences and limitations in quantitative tasks.

In the field of biology, GPT-4 exhibits substantial potential in understanding and processing complex biological language, executing bioinformatics tasks, and serving as a scientific assistant for biology design. Its extensive grasp of biological concepts and its ability to perform various tasks, such as processing specialized files, predicting signaling peptides, and reasoning about plausible mechanisms from observations, make it a valuable tool for advancing biological research. However, GPT-4 has limitations when it comes to processing biological sequences (e.g., DNA and FASTA sequences), and its performance is weaker on tasks related to under-studied entities.

In computational chemistry, GPT-4 demonstrates remarkable potential across various subdomains, including electronic structure methods and molecular dynamics simulations. It is able to retrieve information, suggest design principles, recommend suitable computational methods and software packages, generate code for various programming languages, and propose further research directions or potential extensions. However, GPT-4 may struggle with generating accurate atomic coordinates of complex molecules, handling raw atomic coordinates, and performing precise calculations.

In materials design, GPT-4 shows promise in aiding materials design tasks by retrieving information, suggesting design principles, generating novel and feasible chemical compositions, recommending analytical and numerical methods, and generating code for different programming languages. However, it encounters challenges in representing and proposing more complex structures, e.g., organic polymers and MOFs, generating accurate atomic coordinates, and providing precise quantitative predictions.

In the realm of PDEs, GPT-4 exhibits its ability to understand the fundamental concepts, discern relationships between concepts, and provide accurate proof approaches. It is able to recommend appropriate analytical and numerical methods for addressing various types of PDEs and generate code in different programming languages to numerically solve PDEs. However, GPT-4’s proficiency in mathematical theorem proving still has room for growth, and its capacity for independently discovering and validating novel mathematical theories remains limited in scope.

In summary, GPT-4 exhibits both significant potential and certain limitations for scientific discovery.

To better leverage GPT-4, researchers should be cautious and verify the model’s outputs, experiment with different prompts, and combine its capabilities with dedicated AI models or computational tools to ensure reliable conclusions and optimal performance in their respective research domains:

• Interpretability and Trust: It is crucial to maintain a healthy skepticism when interpreting GPT-4’s output. Researchers should always critically assess the generated results and cross-check them with existing knowledge or expert opinions to ensure the validity of the conclusions.

• Iterative Questioning and Refinement: GPT-4’s performance can be improved by asking questions in an iterative manner or providing additional context. If the initial response from GPT-4 is not satisfactory, researchers can refine their questions or provide more information to guide the model toward a more accurate and relevant answer.

• Combining GPT-4 with Domain-Specific Tools: In many cases, it may be beneficial to combine GPT-4’s capabilities with more specialized tools and models designed specifically for scientific discovery tasks, such as molecular docking software, or protein folding algorithms. This combination can help researchers leverage the strengths of both GPT-4 and domain-specific tools to achieve more reliable and accurate results. Although we do not extensively investigate the integration of LLMs and domain-specific tools/models in this report, a few examples are briefly discussed in Section 7.2.1.
 

bnew


MFTCoder beats GPT-4 on HumanEval​


MARIUSZ "EMSI" WOŁOSZYN

NOV 10, 2023
Share


A new study proposes an innovative approach to enhancing the capabilities of Code LLMs through multi-task fine-tuning. The paper "MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning" introduces MFTCoder, a novel framework for concurrently adapting LLMs to multiple code-related downstream tasks.

The key innovation of MFTCoder is its ability to address common challenges faced in multi-task learning, including data imbalance, varying task difficulties, and inconsistent convergence speeds. It does so through custom loss functions designed to promote equitable attention and optimization across diverse tasks.
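
As an illustration of the general idea (not MFTCoder's exact loss definitions), a task-balanced loss can be sketched like this:

```python
import torch

def multitask_loss(per_task_token_losses):
    """Average each task's token losses within the task first, then across
    tasks, so a data-heavy task cannot drown out a small one.

    Illustrative only: MFTCoder's actual loss formulations differ in detail.
    """
    per_task_means = [t.mean() for t in per_task_token_losses.values() if t.numel() > 0]
    return torch.stack(per_task_means).mean()

# Example batch: code completion contributes 4096 token losses, test-case
# generation only 256, yet both tasks weigh equally in the final loss.
losses = {
    "code_completion": torch.rand(4096),
    "testcase_generation": torch.rand(256),
}
print(multitask_loss(losses))
```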


Overview of MFTCoder framework

Experiments demonstrate MFTCoder's superiority over the traditional approaches of individual-task fine-tuning or mixed-task fine-tuning. When implemented with CodeLlama-34B-Python as the base model, MFTCoder achieved a remarkable 74.4% pass@1 score on the HumanEval benchmark. This even surpasses GPT-4's 67% zero-shot performance (as reported in the original paper).

pass@1 performance on HumanEval (Code Completion) and MBPP (Text-to-Code Generation) after fine-tuning with MFTCoder across multiple mainstream open-source models.

The implications are significant: this multitask fine-tuning methodology could enable more performant and generalizable Code LLMs with efficient training. The MFTCoder framework has been adapted for popular LLMs like CodeLlama, Qwen, Baichuan, and more.

The researchers highlight innovative techniques like instruction dataset construction using Self-Instruct and efficient tokenization modes. MFTCoder also facilitates integration with PEFT methods like LoRA and QLoRA for parameter-efficient fine-tuning.

Overall, this study presents an important advancement in effectively leveraging multitask learning to boost Code LLM capabilities. The proposed MFTCoder framework could have far-reaching impacts, enabling rapid development of performant models for code intelligence tasks like completion, translation, and test case generation. Its efficiency and generalizability across diverse tasks and models make MFTCoder particularly promising.


MFTCoder is open-sourced at https://github.com/codefuse-ai/MFTCOder
 

bnew


Catch me if you can! How to beat GPT-4 with a 13B model​

by: Shuo Yang*, Wei-Lin Chiang*, Lianmin Zheng*, Joseph E. Gonzalez, Ion Stoica, Nov 14, 2023



Announcing Llama-rephraser: 13B models reaching GPT-4 performance in major benchmarks (MMLU/GSM-8K/HumanEval)! To ensure result validity, we followed OpenAI's decontamination method and found no evidence of data contamination.

[Figure: llama-rephraser.png]

What's the trick behind it? Well, rephrasing the test set is all you need! We simply paraphrase a test sample or translate it into a different language. It turns out a 13B LLM is smart enough to "generalize" beyond such variations and reaches drastically high benchmark performance. So, did we just make a big breakthrough? Apparently, there is something wrong with our understanding of contamination.

In this blog post, we point out why contamination is still poorly understood and how existing decontamination measures fail to capture such nuances. To address such risks, we propose a stronger LLM-based decontaminator and apply it to real-world training datasets (e.g., the Stack, RedPajama), revealing significant test overlap with widely used benchmarks. For more technical details, please refer to our paper.

What's Wrong with Existing Decontamination Measures?

Contamination occurs when test set information is leaked in the training set, resulting in an overly optimistic estimate of the model’s performance. Despite being recognized as a crucial issue, understanding and detecting contamination remains an open and challenging problem.

The most commonly used approaches are n-gram overlap and embedding similarity search. N-gram overlap relies on string matching to detect contamination, widely used by leading developments such as GPT-4, PaLM, and Llama-2. Embedding similarity search uses the embeddings of pre-trained models (e.g., BERT) to find similar and potentially contaminated examples.
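
For reference, a toy version of word-level n-gram overlap detection looks roughly like this (a simplified sketch; production pipelines normalize text and choose n and thresholds differently):

```python
def ngrams(text, n=13):
    """Return the set of word-level n-grams (13-grams are a common choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_contaminated(train_doc, test_doc, n=13):
    """Flag the pair if any n-gram is shared between training and test text."""
    return bool(ngrams(train_doc, n) & ngrams(test_doc, n))

# A paraphrase shares no long n-gram with the original, so this check misses it:
original = "Which of the following best describes the purpose of the Marshall Plan after World War II?"
rephrased = "After WWII, what goal did the Marshall Plan primarily serve?"
print(ngram_contaminated(original, original, n=8))   # True  (exact copy caught)
print(ngram_contaminated(original, rephrased, n=8))  # False (rephrased sample missed)
```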

However, we show that simple variations of the test data (e.g., paraphrasing, translation) can easily bypass existing simple detection methods. We refer to such variations of test cases as Rephrased Samples.

Below we demonstrate a rephrased sample from the MMLU benchmark. We show that if such samples are included in the training set, a 13B model can reach drastically high performance (MMLU 85.9). Unfortunately, existing detection methods (e.g., n-gram overlap, embedding similarity) fail to detect such contamination. The embedding similarity approach struggles to distinguish the rephrased question from other questions in the same subject (high school US history).

[Figure: overview.png]

With similar rephrasing techniques, we observe consistent results in widely used coding and math benchmarks such as HumanEval and GSM-8K (shown in the cover figure). Therefore, being able to detect such rephrased samples becomes critical.

Stronger Detection Method: LLM Decontaminator

To address the risk of possible contamination, we propose a new contamination detection method “LLM decontaminator”.

This LLM decontaminator involves two steps:
  1. For each test case, LLM decontaminator identifies the top-k training items with the highest similarity using the embedding similarity search.
  2. From these items, LLM decontaminator generates k potential rephrased pairs. Each pair is evaluated for rephrasing using an advanced LLM, such as GPT-4.

Results show that our proposed LLM method works significantly better than existing methods on removing rephrased samples.
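
A compact sketch of that two-step pipeline might look as follows; the embedding model and the judging helper are stand-ins, not the released tool's actual API:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative two-step contamination check, not the released tool's code.
embedder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")  # assumed embedding model

def top_k_candidates(test_case, train_items, k=5):
    """Step 1: embedding similarity search for the k closest training items."""
    scores = util.cos_sim(embedder.encode(test_case), embedder.encode(train_items))[0]
    top = scores.topk(min(k, len(train_items))).indices
    return [train_items[int(i)] for i in top]

def judge_rephrased(test_case, candidate):
    """Step 2: ask a strong LLM (e.g., GPT-4) whether `candidate` is a
    rephrasing of `test_case`. Left unimplemented in this sketch."""
    raise NotImplementedError

def is_contaminated(test_case, train_items, k=5):
    return any(judge_rephrased(test_case, c) for c in top_k_candidates(test_case, train_items, k))
```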

Evaluating Different Detection Methods

To compare different detection methods, we use the MMLU benchmark to construct 200 prompt pairs using both the original and rephrased test sets: 100 random pairs and 100 rephrased pairs. The F1 score on these pairs provides insight into a detection method's ability to detect contamination, with higher values indicating more precise detection. As shown in the following table, all detection methods except the LLM decontaminator introduce some false positives. Both rephrased and translated samples successfully evade n-gram overlap detection. With multi-qa BERT, the embedding similarity search proves ineffective against translated samples. Our proposed LLM decontaminator is more robust in all cases, with the highest F1 scores.

[Figure: MMLU-us-f1score.png]

Contamination in Real-World Dataset

We apply the LLM decontaminator to widely used real-world datasets (e.g., the Stack, RedPajama) and identify a substantial amount of rephrased samples. The table below displays the contamination percentage of different benchmarks in each training dataset.

[Figure: real-world-rephrase.png]

Below we show some detected samples.

CodeAlpaca contains 20K instruction-following synthetic examples generated by GPT and is widely used for instruction fine-tuning (e.g., Tulu).

A rephrased example in CodeAlpaca is shown below.

[Figure: codealpaca-rephrase.png]

This suggests contamination may be subtly present in synthetic data generated by LLMs. In the Phi-1 report, they also discover semantically similar test samples that are undetectable by n-gram overlap.

MATH is a widely recognized math training dataset that spans various mathematical domains, including algebra, geometry, and number theory. Surprisingly, we even find contamination between the train and test splits of the MATH benchmark, as shown below.

[Figure: MATH-rephrase.png]

StarCoder-Data is used for training StarCoder and StarCoderBase, and it contains 783GB of code in 86 programming languages. In the StarCoder paper, the code training data was decontaminated by removing files that contained docstrings or solutions from HumanEval. However, some samples are still detected by the LLM decontaminator.

[Figure: starcoder-rephrase.png]

Use LLM Decontaminator to Scan Your Data

Based on the above study, we suggest the community adopt a stronger decontamination method when using any public benchmarks. Our proposed LLM decontaminator is open-sourced on GitHub. Here we show how to remove rephrased samples from training data using the LLM decontaminator tool. The following example can be found here.
Pre-process training data and test data. The LLM decontaminator accepts the dataset in jsonl format, with each line corresponding to a {"text": data} entry.
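
For example, a few lines of Python produce that layout (the file name and documents are placeholders):

```python
import json

# Hypothetical example: convert raw training documents into the
# {"text": ...} JSON-lines layout described above.
docs = ["def add(a, b): return a + b", "The capital of France is Paris."]
with open("train.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps({"text": doc}) + "\n")
```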

Run End2End detection. The following command builds a top-k similar database based on sentence bert and uses GPT-4 to check one by one if they are rephrased samples. You can select your embedding model and detection model by modifying the parameters.

[Figure: run-e2e.png]

Conclusion

In this blog, we show that contamination is still poorly understood. With our proposed decontamination method, we reveal significant, previously unknown test overlap in real-world datasets. We encourage the community to rethink benchmarks and contamination in the LLM context, and to adopt stronger decontamination tools when evaluating LLMs on public benchmarks. Moreover, we call on the community to actively develop fresh, one-time exams to accurately evaluate LLMs. Learn more about our ongoing effort on live LLM evaluation at Chatbot Arena!

Acknowledgment

We would like to express our gratitude to Ying Sheng for the early discussion on rephrased samples. We also extend our thanks to Dacheng Li, Erran Li, Hao Liu, Jacob Steinhardt, Hao Zhang, and Siyuan Zhuang for providing insightful feedback.

Citation

@misc{yang2023rethinking,
  title={Rethinking Benchmark and Contamination for Language Models with Rephrased Samples},
  author={Shuo Yang and Wei-Lin Chiang and Lianmin Zheng and Joseph E. Gonzalez and Ion Stoica},
  year={2023},
  eprint={2311.04850},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
 

bnew



By Paresh Dave | Business | Oct 10, 2023, 7:00 AM

Google’s AI Is Making Traffic Lights More Efficient and Less Annoying​

Google is analyzing data from its Maps app to suggest how cities can adjust traffic light timing to cut wait times and emissions. The company says it’s already cutting stops for millions of drivers.


Each time a driver in Seattle meets a red light, they wait about 20 seconds on average before it turns green again, according to vehicle and smartphone data collected by analytics company Inrix. The delays cause annoyance and, in Seattle alone, expel an estimated 1,000 metric tons or more of carbon dioxide into the atmosphere each day. With a little help from new Google AI software, the toll on both the environment and drivers is beginning to drop significantly.


Seattle is among a dozen cities across four continents, including Jakarta, Rio de Janeiro, and Hamburg, optimizing some traffic signals based on insights from driving data from Google Maps, aiming to reduce emissions from idling vehicles. The project analyzes data from Maps users with AI algorithms and has initially led to timing tweaks at 70 intersections. By Google’s preliminary accounting of traffic before and after adjustments tested last year and this year, its AI-powered recommendations for timing the busy lights cut stops by as much as 30 percent and emissions by 10 percent for 30 million cars a month.
Google announced those early results today along with other updates to projects that use its data and AI researchers to drive greater environmental sustainability. The company is expanding to India and Indonesia the fuel-efficient routing feature in Maps, which directs drivers onto roads with less traffic or uphill driving, and it is introducing flight-routing suggestions to air traffic controllers for Belgium, the Netherlands, Luxembourg, and northwest Germany to reduce climate-warming contrails.


Some of Google’s other climate nudges, including those showing estimated emissions alongside flight and recipe search results, have frustrated groups including airlines and cattle ranchers, who accuse the company of using unsound math that misrepresents their industries. So far, Google’s Project Green Light is drawing bright reviews, but new details released today about how it works and expansion of the system to more cities next year could yield greater scrutiny.

Google engineers and officials from Hyderabad, India, discuss traffic light settings as part of the company's project to use data from its Maps app to cut frustration and vehicle emissions.


“It is a worthy goal with significant potential for real-world impact,” says Guni Sharon, an assistant professor at Texas A&M University also studying AI’s potential to optimize traffic signals. But in his view, more sweeping AI and sensor systems that make lights capable of adjusting in real time to traffic conditions could be more effective. Sharon says Google’s traffic light system appears to take a conservative approach in that it allows cities to work with their existing infrastructure, making it easier and less risky to adopt. Google says on its Project Green Light web page that it expects results to evolve, and it will provide more information about the project in a forthcoming paper.

Traffic officers in Kolkata have made tweaks suggested by Green Light at 13 intersections over the past year, leaving commuters pleased, according to a statement provided by Google from Rupesh Kumar, the Indian city’s joint commissioner of police. “Green Light has become an essential component,” says Kumar, who didn’t respond to a request for comment on Monday from WIRED.

In other cases, Green Light provides reassurance, not revolution. For authorities at Transport for Greater Manchester in the UK, many of Green Light's recommendations are not helpful because they don't take into account prioritization of buses, pedestrians, and certain thoroughfares, says David Atkin, a manager at the agency. But Google's project can provide confirmation that its signal network is working well, he says.
‘No-Brainer’
Smarter traffic lights long have been many drivers’ dream. In reality, the cost of technology upgrades, coordination challenges within and between governments, and a limited supply of city traffic engineers have forced drivers to keep hitting the brakes despite a number of solutions available for purchase. Google’s effort is gaining momentum with cities because it’s free and relatively simple, and draws upon the company’s unrivaled cache of traffic data, collected when people use Maps, the world’s most popular navigation app.


Juliet Rothenberg, Google’s lead product manager for climate AI, credits the idea for Project Green Light to the wife of a company researcher who proposed it over dinner about two years ago. “As we evaluated dozens of ideas that we could work on, this kept rising to the top,” Rothenberg says. “There was a way to make it a no-brainer deployment for cities.”

Rothenberg says Google has prioritized supporting larger cities who employ traffic engineers and can remotely control traffic signals, while also spreading out globally to prove the technology works well in a variety of conditions—suggesting it could, if widely adopted, make a big dent in global emissions.

Through Maps data, Google can infer the signal timings and coordination at thousands of intersections per city. An AI model the company’s scientists developed can then analyze traffic patterns over the past few weeks and determine which lights could be worth adjusting—mostly in urban areas. It then suggests settings changes to reduce stop-and-go traffic. Filters in the system try to block some unwise suggestions, like those that could be unfriendly to pedestrians.

Some of Google’s recommendations are as simple as adding two more seconds during specific hours to the time between the start of one green light and when the next one down the road turns green, allowing more vehicles to pass through both intersections without stopping. More complicated suggestions can involve two steps, tuning both the duration of a particular light and the offset between that light and an adjacent one.

City engineers log into an online Google dashboard to view the recommendations, which they can copy over to their lighting control programs and apply in minutes remotely, or for non-networked lights, by stopping by an intersection’s control box in person. In either case, Google's computing all this using its own data spares cities from having to collect their own—whether automatically through sensors or manually through laborious counts—and also from having to calculate or eyeball their own adjustments.

In some cities, an intersection’s settings may go unchanged for years. Rothenberg says the project has in some cases drawn attention to intersections in areas typically neglected by city leaders. Google’s system enables changes every few weeks as traffic patterns change, though for now it lacks capability for real-time adjustments, which many cities don’t have the infrastructure to support anyway. Rothenberg says Google collaborated with traffic engineering faculty at Israel’s Technion university and UC Berkeley on Green Light, whose users also include Haifa, Budapest, Abu Dhabi, and Bali.

To validate that Google’s suggestions work, cities can use traffic counts from video footage or other sensors. Applying computer vision algorithms to city videofeeds could eventually help Google and users understand other effects not easily detected in conventional traffic data. For instance, when Google engineers watched in person as a Green Light tweak went into effect in Budapest, they noticed fewer people running the red light because drivers no longer had to wait for multiple cycles of red-to-green lights to pass through the intersection.

Green Light is ahead of some competing options. Mark Burfeind, a spokesperson for transportation analytics provider Inrix, says the company’s data set covers 250,000 of the estimated 300,000 signals in the US and helps about 25 government agencies study changes to timing settings. But it doesn’t actively suggest adjustments, leaving traffic engineers to calculate their own. Inrix’s estimates do underscore the considerable climate consequences of small changes: Each second of waiting time at the average signal in King County, Washington, home to Seattle, burns 19 barrels of oil annually.


Google has a “sizable” team working on Green Light, Rothenberg says. Its future plans include exploring how to proactively optimize lights for pedestrians’ needs and whether to notify Maps users that they are traveling through a Green Light-tuned intersection. Asked whether Google will eventually charge for the service, she says there are no foreseeable plans to, but the project is in an early stage. Its journey hasn’t yet hit any red lights.

Updated 10-10-2023, 5:15 pm EDT: This article was updated with comment from Transport for Greater Manchester.
 