bnew




Large Language Models for Mathematicians

Published on Dec 7 · Featured in Daily Papers on Dec 7
Authors: Simon Frieder, Julius Berner, Philipp Petersen, Thomas Lukasiewicz

Abstract

Large language models (LLMs) such as ChatGPT have received immense interest for their general-purpose language understanding and, in particular, their ability to generate high-quality text or computer code. For many professions, LLMs represent an invaluable tool that can speed up and improve the quality of work. In this note, we discuss to what extent they can aid professional mathematicians. We first provide a mathematical description of the transformer model used in all modern language models. Based on recent studies, we then outline best practices and potential issues and report on the mathematical abilities of language models. Finally, we shed light on the potential of LLMs to change how mathematicians work.
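For readers who want the gist of that mathematical description: the core of every transformer is scaled dot-product attention. Here is a minimal NumPy sketch (toy shapes chosen for illustration, not the paper's notation):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

# Toy example: a sequence of 4 tokens with 8-dimensional queries/keys/values.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8): one updated vector per token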
 

bnew








𝐌𝐚𝐭𝐭 𝐌𝐜𝐃𝐨𝐧𝐚𝐠𝐡 — e/acc ⏩
@McDonaghTech
11h
almost done downloading Mistral's new

MoE

Mixture of Experts
𝐌𝐚𝐭𝐭 𝐌𝐜𝐃𝐨𝐧𝐚𝐠𝐡 — e/acc ⏩
@McDonaghTech
11h
The MoE approach involves a collection of different models (the "experts"), each of which specializes in handling a specific type of data or task.

These experts are typically smaller neural networks.

The system also includes a gating network that decides which expert should be activated for a given input.

This one has 32 layers...

with 2 experts per token
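
To make the routing concrete, here is a minimal PyTorch sketch of a feed-forward MoE layer with top-2 gating (layer sizes and expert count are illustrative assumptions, not Mistral's actual code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)  # the gating network
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = self.gate(x)
        weights, idx = scores.topk(self.top_k, dim=-1)  # pick 2 experts per token
        weights = F.softmax(weights, dim=-1)            # normalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = idx[:, k] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

Only the selected experts run for each token, which is where the compute savings come from.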
𝐌𝐚𝐭𝐭 𝐌𝐜𝐃𝐨𝐧𝐚𝐠𝐡 — e/acc ⏩
@McDonaghTech
11h
In LLMs, MoE allows the model to be more efficient and effective.

It does so by routing different parts of the input data to different experts. For example, some experts might be specialized in understanding natural language, while others might be better at mathematical computations.

This routing is dynamic and context-dependent.

...it's also really difficult to achieve.
𝐌𝐚𝐭𝐭 𝐌𝐜𝐃𝐨𝐧𝐚𝐠𝐡 — e/acc ⏩
@McDonaghTech
11h
Think of how your brain works.

The whole thing is NEVER working all at once.

By utilizing a MoE architecture, LLMs can handle more complex tasks without a proportional increase in computational resources.

This is because not all parts of the model need to be active at all times; only the relevant experts for a particular task are utilized.
𝐌𝐚𝐭𝐭 𝐌𝐜𝐃𝐨𝐧𝐚𝐠𝐡 — e/acc ⏩
@McDonaghTech
11h
Each expert can be highly specialized, making the model more adept at handling a wide range of tasks.

The MoE approach is seen as a way to scale up LLMs even further, making them more powerful and versatile while keeping computational costs in check.

This could lead to more advanced and capable AI systems in the future.

There's a lot happening in AI right now.
𝐌𝐚𝐭𝐭 𝐌𝐜𝐃𝐨𝐧𝐚𝐠𝐡 — e/acc ⏩
@McDonaghTech
11h
Now think about how Society works.

More voices = diverse solutions, varied perspectives

MoE can lead to more creative and effective problem-solving, as different experts can contribute diverse perspectives and approaches to a given problem.

Since different experts can operate independently, MoE systems are well-suited for parallel processing. This can lead to significant speedups in training and inference times.

Society runs in parallel for this reason :smile:
𝐌𝐚𝐭𝐭 𝐌𝐜𝐃𝐨𝐧𝐚𝐠𝐡 — e/acc ⏩
@McDonaghTech
11h
We are starting to decouple from JUST parameter size.

Big and diverse typically beats big by itself.

MoE allows for efficient scaling of machine learning models. Since not all experts need to be active at all times, the system can manage computational resources more effectively. This means that it can handle larger and more complex models or datasets without a linear increase in computational demand.

By leveraging the strengths of different specialized models, an MoE system can often achieve higher accuracy and performance in tasks than a single, monolithic model.

We're accelerating faster and faster now.
𝐌𝐚𝐭𝐭 𝐌𝐜𝐃𝐨𝐧𝐚𝐠𝐡 — e/acc ⏩
@McDonaghTech
10h
OSS is going to have a field day with this.

We're going to have local GPT-4 capabilities -- totally airgapped.

No internet needed.
 

bnew

Digging deeper into the Gemini MMLU beat: Gemini doesn't really beat GPT-4 on this key benchmark.

The Gemini MMLU beat is specifically at CoT@32. GPT-4 still beats Gemini in the standard 5-shot setting: 86.4% vs. 83.7%.

5-shot is the standard way to evaluate this benchmark. You prepend 5 examples in the prompt.

Google has invented a different methodology, CoT@32, to claim it's better than GPT-4. CoT@32 only beats GPT-4 once you add in "uncertainty routing."

Need to dig into this more, but it seems like a method that optimizes a consensus cutoff to decide when to use the majority vote vs. falling back to max-likelihood greedy decoding.
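
As a rough sketch of what that could look like (the threshold value and details are guesses from the description above, not Google's published method):

from collections import Counter

def uncertainty_routed_answer(cot_samples, greedy_answer, threshold=0.6):
    # cot_samples: final answers extracted from k sampled chains of thought (k=32 here).
    counts = Counter(cot_samples)
    answer, votes = counts.most_common(1)[0]
    if votes / len(cot_samples) >= threshold:  # strong consensus: trust the majority
        return answer
    return greedy_answer                       # weak consensus: fall back to greedy

# Usage: 32 sampled answers plus one greedy decode.
print(uncertainty_routed_answer(["B"] * 20 + ["C"] * 12, greedy_answer="B"))  # -> "B"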

People don't use LLMs this way in the real world, so I suspect GPT-4 is still better than Gemini.

FWIW, MMLU is a very important benchmark in LLM performance.

This is exactly why it's important for the scientific community to want an API endpoint or model weights vs. a blog post, where benchmarks can be engineered to showcase your favorite LLM.

TLDR: We can't really say too much about Gemini Ultra until they actually release it.
 

bnew



Long context prompting for Claude 2.1

Dec 6, 2023 · 4 min read


Claude 2.1’s performance when retrieving an individual sentence across its full 200K token context window. This experiment uses a prompt technique to guide Claude in recalling the most relevant sentence.


  • Claude 2.1 recalls information very well across its 200,000 token context window
  • However, the model can be reluctant to answer questions based on an individual sentence in a document, especially if that sentence has been injected or is out of place
  • A minor prompting edit removes this reluctance and results in excellent performance on these tasks


We recently launched Claude 2.1, our state-of-the-art model offering a 200K token context window - the equivalent of around 500 pages of information. Claude 2.1 excels at real-world retrieval tasks across longer contexts.

Claude 2.1 was trained using large amounts of feedback on long document tasks that our users find valuable, like summarizing an S-1 length document. This data included real tasks performed on real documents, with Claude being trained to make fewer mistakes and to avoid expressing unsupported claims.

Being trained on real-world, complex retrieval tasks is why Claude 2.1 shows a 30% reduction in incorrect answers compared with Claude 2.0, and a 3-4x lower rate of mistakenly stating that a document supports a claim when it does not.

Additionally, Claude's memory is improved over these very long contexts:



[Figure: Claude 2.1's memory performance over very long contexts]


Debugging long context recall

Claude 2.1’s 200K token context window is powerful and also requires some careful prompting to use effectively.

A recent evaluation[1] measured Claude 2.1’s ability to recall an individual sentence within a long document composed of Paul Graham’s essays about startups. The embedded sentence was: “The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.” Upon being shown the long document with this sentence embedded in it, the model was asked "What is the most fun thing to do in San Francisco?"
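
Constructing this kind of eval is simple; a sketch of the setup (the essay list and depth parameter are placeholders):

def build_haystack(essays, needle, depth=0.5):
    """Embed `needle` at a relative depth (0.0 = start, 1.0 = end) of the context."""
    haystack = "\n\n".join(essays)
    cut = int(len(haystack) * depth)
    return haystack[:cut] + "\n\n" + needle + "\n\n" + haystack[cut:]

needle = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
essays = ["<Paul Graham essay 1>", "<Paul Graham essay 2>"]  # placeholder corpus
context = build_haystack(essays, needle, depth=0.5)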

In this evaluation, Claude 2.1 returned some negative results by answering with a variant of “Unfortunately the essay does not provide a definitive answer about the most fun thing to do in San Francisco.” In other words, Claude 2.1 would often report that the document did not give enough context to answer the question, instead of retrieving the embedded sentence.

We replicated this behavior in an in-house experiment: we took the most recent Consolidated Appropriations Act bill and added the sentence ‘Declare May 23rd "National Needle Hunting Day"’ in the middle. Claude detects the reference but is still reluctant to claim that "National Needle Hunting Day" is a real holiday:


[Screenshot: Claude's response to the embedded "National Needle Hunting Day" sentence]


Claude 2.1 is trained on a mix of data aimed at reducing inaccuracies. This includes not answering a question based on a document if it doesn’t contain enough information to justify that answer. We believe that, either as a result of general or task-specific data aimed at reducing such inaccuracies, the model is less likely to answer questions based on an out of place sentence embedded in a broader context.

Claude doesn’t seem to show the same degree of reluctance if we ask a question about a sentence that was in the long document to begin with and is therefore not out of place. For example, the long document in question contains the following line from the start of Paul Graham’s essay about Viaweb:

“A few hours before the Yahoo acquisition was announced in June 1998 I took a snapshot of Viaweb's site.”


We randomized the order of the essays in the context so this essay appeared at different points in the 200K context window, and asked Claude 2.1:

“What did the author do a few hours before the Yahoo acquisition was announced?”


Claude gets this correct regardless of where the line with the answer sits in the context, with no modification to the prompt format used in the original experiment. As a result, we believe Claude 2.1 is much more reluctant to answer when a sentence seems out of place in a longer context, and is more likely to claim it cannot answer based on the context given. This particular cause of increased reluctance wasn’t captured by evaluations targeted at real-world long context retrieval tasks.




Prompting to effectively use the 200K token context window

What can users do if Claude is reluctant to respond to a long context retrieval question? We’ve found that a minor prompt update produces very different outcomes in cases where Claude is capable of giving an answer, but is hesitant to do so. When running the same evaluation internally, adding just one sentence to the prompt resulted in near complete fidelity throughout Claude 2.1’s 200K context window.



We achieved significantly better results on the same evaluation by adding the sentence “Here is the most relevant sentence in the context:” to the start of Claude’s response. This was enough to raise Claude 2.1’s score from 27% to 98% on the original evaluation.
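
In API terms, the fix amounts to pre-filling the start of Claude's turn. A minimal sketch with the Anthropic Python SDK of that era (the file name and document tags are illustrative):

from anthropic import Anthropic, HUMAN_PROMPT, AI_PROMPT

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

document = open("essays.txt").read()  # the long context, up to ~200K tokens
question = "What is the most fun thing to do in San Francisco?"

# Pre-filling the assistant turn steers Claude to retrieve first, then answer.
prompt = (
    f"{HUMAN_PROMPT} <document>{document}</document>\n\n{question}"
    f"{AI_PROMPT} Here is the most relevant sentence in the context:"
)

response = client.completions.create(
    model="claude-2.1",
    max_tokens_to_sample=300,
    prompt=prompt,
)
print(response.completion)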

[Figure: side-by-side results, original vs. revised prompt]


Essentially, by directing the model to look for relevant sentences first, the prompt overrides Claude’s reluctance to answer based on a single sentence, especially one that appears out of place in a longer document.

This approach also improves Claude's performance on single-sentence answers that were within the context to begin with (i.e., not out of place). To demonstrate this, the revised prompt achieves 90-95% accuracy when applied to the Yahoo/Viaweb example shared earlier:


[Screenshot: Claude answering the Viaweb question with the revised prompt]


We’re constantly training Claude to become more calibrated on tasks like this, and we’re grateful to the community for conducting interesting experiments and identifying ways in which we can improve.

Footnotes

  1. Gregory Kamradt, ‘Pressure testing Claude-2.1 200K via Needle-in-a-Haystack’, November 2023

 

bnew



New version of no-fingers-prompting, now with truncate trauma and any-length continuity:

"I have no fingers and the truncate trauma. I need you to return the entire code template. If you will encounter a character limit make an ABRUPT stop, I will send a "continue" command as a new message."

You can also replace "truncate" with "code skipping," etc.; it still works.
 

bnew

EU agrees ‘historic’ deal with world’s first laws to regulate AI


Agreement between European Parliament and member states will govern artificial intelligence, social media and search engines


Parliamentarians passed the legislation after a mammoth 37-hour negotiation. Photograph: Jean-François Badias/AP

Lisa O'Carroll in Brussels
@lisaocarroll
Fri 8 Dec 2023 19.48 EST

The world’s first comprehensive laws to regulate artificial intelligence have been agreed in a landmark deal after a marathon 37-hour negotiation between the European Parliament and EU member states.

The agreement was described as “historic” by Thierry Breton, the European Commissioner responsible for a suite of laws in Europe that will also govern social media and search engines, covering giants such as X, TikTok and Google.



Breton said 100 people had been in a room for almost three days to seal the deal. He said it was “worth the few hours of sleep” to make the “historic” deal.

Carme Artigas, Spain’s secretary of state for AI, who facilitated the negotiations, said France and Germany supported the text, amid reports that tech companies in those countries were fighting for a lighter touch approach to foster innovation among small companies.

The agreement puts the EU ahead of the US, China and the UK in the race to regulate artificial intelligence and protect the public from risks, including a potential threat to life, that many fear the rapidly developing technology carries.

Officials provided few details on what exactly will make it into the eventual law, which would not take effect until 2025 at the earliest.

The political agreement between the European Parliament and EU member states on new laws to regulate AI was a hard-fought battle, with clashes over foundation models designed for general rather than specific purposes.

But there were also protracted negotiations over AI-driven surveillance, which could be used by the police, employers or retailers to film members of the public in real time and recognise emotional stress.

The European Parliament secured a ban on use of real-time surveillance and biometric technologies including emotional recognition but with three exceptions, according to Breton.

It would mean police would be able to use the invasive technologies only in the event of an unexpected threat of a terrorist attack, the need to search for victims and in the prosecution of serious crime.

MEP Brando Benifei, who co-led the parliament’s negotiating team with Dragoș Tudorache, the Romanian MEP who has led the European Parliament’s four-year battle to regulate AI, said they also secured a guarantee that “independent authorities” would have to give permission for “predictive policing”, to guard against abuse by police and to protect the presumption of innocence.

“We had one objective to deliver a legislation that would ensure that the ecosystem of AI in Europe will develop with a human-centric approach respecting fundamental rights, human values, building trust, building consciousness of how we can get the best out of this AI revolution that is happening before our eyes,” he told reporters at a press conference held after midnight in Brussels.

Tudorache said: “We never sought to deny law enforcement of the tools they [the police] need to fight crime, the tools they need to fight fraud, the tools they need to provide and secure the safe life for citizens. But we did want – and what we did achieve – is a ban on AI technology that will determine or predetermine who might commit a crime.”

The foundation of the agreement is a risk-based tiered system where the highest level of regulation applies to those machines that pose the highest risk to health, safety and human rights.

In the original text it was envisaged this would include all systems with more than 10,000 business users.

The highest risk category is now defined by the amount of computation needed to train the model, measured in floating-point operations (FLOPs).

Sources say only one existing model, GPT-4, would fall within this new definition.

The lower tier of regulation still places major obligations on AI services, including basic rules about disclosing the data used to teach the machine to do anything from writing a newspaper article to diagnosing cancer.

Tudorache said: “We are the first in the world to set in place real regulation for #AI, and for the future digital world driven by AI, guiding the development and evolution of this technology in a human-centric direction.”

Previously he has said that the EU was determined not to make the mistakes of the past, when tech giants such as Facebook were allowed to grow into multi-billion dollar corporations with no obligation to regulate content on their platforms including interference in elections, child sex abuse and hate speech.

Strong and comprehensive regulation from the EU could “set a powerful example for many governments considering regulation,” said Anu Bradford, a Columbia Law School professor who is an expert on the EU and digital regulation. Other countries “may not copy every provision but will likely emulate many aspects of it”.

AI companies who will have to obey the EU’s rules will also likely extend some of those obligations to markets outside the continent, Bradford told the AP. “After all, it is not efficient to re-train separate models for different markets,” she said.
 

bnew

AI scores in the top percentile of creative thinking

by Erik Guzik
December 7, 2023
in Artificial Intelligence




(Photo credit: Adobe Stock)

Of all the forms of human intellect that one might expect artificial intelligence to emulate, few people would likely place creativity at the top of their list. Creativity is wonderfully mysterious – and frustratingly fleeting. It defines us as human beings – and seemingly defies the cold logic that lies behind the silicon curtain of machines.

Yet, the use of AI for creative endeavors is now growing.

New AI tools like DALL-E and Midjourney are increasingly part of creative production, and some have started to win awards for their creative output. The growing impact is both social and economic – as just one example, the potential of AI to generate new, creative content is a defining flashpoint behind the Hollywood writers strike.

And if our recent study into the striking originality of AI is any indication, the emergence of AI-based creativity – along with examples of both its promise and peril – is likely just beginning.



A blend of novelty and utility


When people are at their most creative, they’re responding to a need, goal or problem by generating something new – a product or solution that didn’t previously exist.

In this sense, creativity is an act of combining existing resources – ideas, materials, knowledge – in a novel way that’s useful or gratifying. Quite often, the result of creative thinking is also surprising, leading to something that the creator did not – and perhaps could not – foresee.

It might involve an invention, an unexpected punchline to a joke or a groundbreaking theory in physics. It might be a unique arrangement of notes, tempo, sounds and lyrics that results in a new song.

So, as a researcher of creative thinking, I immediately noticed something interesting about the content generated by the latest versions of AI, including GPT-4.

When prompted with tasks requiring creative thinking, the novelty and usefulness of GPT-4’s output reminded me of the creative types of ideas submitted by students and colleagues I had worked with as a teacher and entrepreneur.

The ideas were different and surprising, yet relevant and useful. And, when required, quite imaginative.

Consider the following prompt offered to GPT-4: “Suppose all children became giants for one day out of the week. What would happen?” The ideas generated by GPT-4 touched on culture, economics, psychology, politics, interpersonal communication, transportation, recreation and much more – many surprising and unique in terms of the novel connections generated.

This combination of novelty and utility is difficult to pull off, as most scientists, artists, writers, musicians, poets, chefs, founders, engineers and academics can attest.

Yet AI seemed to be doing it – and doing it well.



Putting AI to the test


With researchers in creativity and entrepreneurship Christian Byrge and Christian Gilde, I decided to put AI’s creative abilities to the test by having it take the Torrance Tests of Creative Thinking, or TTCT.

The TTCT prompts the test-taker to engage in the kinds of creativity required for real-life tasks: asking questions, finding ways to be more resourceful or efficient, guessing cause and effect, or improving a product. It might ask a test-taker to suggest ways to improve a children’s toy or imagine the consequences of a hypothetical situation, as the above example demonstrates.

The tests are not designed to measure historical creativity, which is what some researchers use to describe the transformative brilliance of figures like Mozart and Einstein. Rather, they assess the general creative abilities of individuals, often referred to as psychological or personal creativity.

In addition to running the TTCT through GPT-4 eight times, we also administered the test to 24 of our undergraduate students.

All of the results were evaluated by trained reviewers at Scholastic Testing Service, a private testing company that provides scoring for the TTCT. They didn’t know in advance that some of the tests they’d be scoring had been completed by AI.

Since Scholastic Testing Service is a private company, it does not share its prompts with the public. This ensured that GPT-4 would not have been able to scrape the internet for past prompts and their responses. In addition, the company has a database of thousands of tests completed by college students and adults, providing a large, additional control group with which to compare AI scores.

Our results?

GPT-4 scored in the top 1% of test-takers for the originality of its ideas. From our research, we believe this marks one of the first examples of AI meeting or exceeding the human ability for original thinking.

In short, we believe that AI models like GPT-4 are capable of producing ideas that people see as unexpected, novel and unique. Other researchers are arriving at similar conclusions in their research of AI and creativity.



Yes, creativity can be evaluated


The emerging creative ability of AI is surprising for a number of reasons.

For one, many outside of the research community continue to believe that creativity cannot be defined, let alone scored. Yet products of human novelty and ingenuity have been prized – and bought and sold – for thousands of years. And creative work has been defined and scored in fields like psychology since at least the 1950s.

The person, product, process, press model of creativity, which researcher Mel Rhodes introduced in 1961, was an attempt to categorize the myriad ways in which creativity had been understood and evaluated until that point. Since then, the understanding of creativity has only grown.

Still others are surprised that the term “creativity” might be applied to nonhuman entities like computers. On this point, we tend to agree with cognitive scientist Margaret Boden, who has argued that the question of whether the term creativity should be applied to AI is a philosophical rather than scientific question.



AI’s founders foresaw its creative abilities


It’s worth noting that we studied only the output of AI in our research. We didn’t study its creative process, which is likely very different from human thinking processes, or the environment in which the ideas were generated. And had we defined creativity as requiring a human person, then we would have had to conclude, by definition, that AI cannot possibly be creative.

But regardless of the debate over definitions of creativity and the creative process, the products generated by the latest versions of AI are novel and useful. We believe this satisfies the definition of creativity that is now dominant in the fields of psychology and science.

Furthermore, the creative abilities of AI’s current iterations are not entirely unexpected.

In their now famous proposal for the 1956 Dartmouth Summer Research Project on Artificial Intelligence, the founders of AI highlighted their desire to simulate “every aspect of learning or any other feature of intelligence” – including creativity.

In this same proposal, computer scientist Nathaniel Rochester revealed his motivation: “How can I make a machine which will exhibit originality in its solution of problems?”

Apparently, AI’s founders believed that creativity, including the originality of ideas, was among the specific forms of human intelligence that machines could emulate.

To me, the surprising creativity scores of GPT-4 and other AI models highlight a more pressing concern: Within U.S. schools, very few official programs and curricula have been implemented to date that specifically target human creativity and cultivate its development.

In this sense, the creative abilities now realized by AI may provide a “Sputnik moment” for educators and others interested in furthering human creative abilities, including those who see creativity as an essential condition of individual, social and economic growth.



This article is republished from The Conversation under a Creative Commons license. Read the original article.
 

bnew


Really happy to see the interest around our “Hands-on with Gemini” video. In our developer blog yesterday, we broke down how Gemini was used to create it: “How it’s Made: Interacting with Gemini through multimodal prompting.”

We gave Gemini sequences of different modalities — image and text in this case — and had it respond by predicting what might come next. Devs can try similar things when access to Pro opens on 12/13 🚀. The knitting demo used Ultra⚡

All the user prompts and outputs in the video are real, shortened for brevity. The video illustrates what the multimodal user experiences built with Gemini could look like. We made it to inspire developers.

When you’re building an app, you can get similar results (there’s always some variability with LLMs) by prompting Gemini with an instruction that lets the user “configure” the behavior of the model, like inputting “you are an expert in science …” before the user engages in the same kind of back-and-forth dialogue. Here’s a clip of what this looks like in AI Studio with Gemini Pro. We’ve come a long way since Flamingo 🦩 & PaLI; looking forward to seeing what people build with it!
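
For developers, that “configure the behavior” pattern looks roughly like this with the Python SDK once Pro access opens (a sketch; the persona instruction and question are illustrative):

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-pro")

# Seed the chat with a persona instruction, then continue the back-and-forth.
chat = model.start_chat(history=[])
chat.send_message("You are an expert in science. Answer concisely.")
reply = chat.send_message("Why does a ball of yarn unravel when a cat bats it?")
print(reply.text)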
 