bnew

1/5
Woah, another exciting update from Chatbot Arena❤️‍🔥

The results for @xAI’s sus-column-r (Grok 2 early version) are now public**!

With over 12,000 community votes, sus-column-r has secured the #3 spot on the overall leaderboard, even matching GPT-4o! It excels in Coding (#2), Hard Prompts (#4), and Math (#2).

Congratulations to @xAI on this impressive debut for Grok 2!

More plots below👇

**Note: We posted its early result on Twitter. The official update for Grok 2 is coming soon!

2/5
Confidence intervals of model scores. sus-column-r is strong in Coding and Hard Prompts Arena.

3/5
#2-4 in English Arena

4/5
Overall win-rate heatmap.

Grok 2 official blog at Grok-2 Beta Release. We will test the official version and update the leaderboard soon!

5/5
Come chat with the model at http://lmarena.ai/?model=sus-column-r !



bnew

AI poses no existential threat to humanity – new study finds​


Large language models like ChatGPT cannot learn independently or acquire new skills, meaning they pose no existential threat to humanity.


[Image: a man typing on a phone with an AI robot appearing from the screen. Caption: Large language models remain inherently controllable, predictable and safe.]

ChatGPT and other large language models (LLMs) cannot learn independently or acquire new skills, meaning they pose no existential threat to humanity, according to new research from the University of Bath and the Technical University of Darmstadt in Germany.

The study, published today as part of the proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024) – the premier international conference in natural language processing – reveals that LLMs have a superficial ability to follow instructions and excel at language proficiency; however, they have no potential to master new skills without explicit instruction. This means they remain inherently controllable, predictable and safe.

The research team concluded that LLMs – which are being trained on ever larger datasets – can continue to be deployed without safety concerns, though the technology can still be misused.

With growth, these models are likely to generate more sophisticated language and become better at following explicit and detailed prompts, but they are highly unlikely to gain complex reasoning skills.

“The prevailing narrative that this type of AI is a threat to humanity prevents the widespread adoption and development of these technologies, and also diverts attention from the genuine issues that require our focus,” said Dr Harish Tayyar Madabushi, computer scientist at the University of Bath and co-author of the new study on the ‘emergent abilities’ of LLMs.

The collaborative research team, led by Professor Iryna Gurevych at the Technical University of Darmstadt in Germany, ran experiments to test the ability of LLMs to complete tasks that models have never come across before – the so-called emergent abilities.

As an illustration, LLMs can answer questions about social situations without ever having been explicitly trained or programmed to do so. While previous research suggested this was a product of models 'knowing' about social situations, the researchers showed that it was in fact the result of models using a well-known ability of LLMs to complete tasks based on a few examples presented to them, known as 'in-context learning' (ICL).
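
To make ICL concrete, here is a minimal few-shot sketch, assuming the OpenAI Python client and an illustrative model name (neither comes from the study). The "learning" happens entirely inside the prompt; no weights change:

```python
# Few-shot in-context learning: the model imitates the worked examples
# supplied in the prompt, with no training or fine-tuning involved.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

few_shot_messages = [
    {"role": "system", "content": "Classify each review as positive or negative."},
    {"role": "user", "content": "Review: The food was amazing."},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Review: Service was slow and rude."},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Review: A hidden gem, I'll be back!"},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative choice; any chat model works
    messages=few_shot_messages,
)
print(response.choices[0].message.content)  # expected: "positive"
```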

Through thousands of experiments, the team demonstrated that a combination of LLMs' ability to follow instructions (ICL), memory and linguistic proficiency can account for both the capabilities and limitations exhibited by LLMs.

Dr Tayyar Madabushi said: “The fear has been that as models get bigger and bigger, they will be able to solve new problems that we cannot currently predict, which poses the threat that these larger models might acquire hazardous abilities including reasoning and planning.

“This has triggered a lot of discussion – for instance, at the AI Safety Summit last year at Bletchley Park, for which we were asked for comment – but our study shows that the fear that a model will go away and do something completely unexpected, innovative and potentially dangerous is not valid.

“Concerns over the existential threat posed by LLMs are not restricted to non-experts and have been expressed by some of the top AI researchers across the world."

However, Dr Tayyar Madabushi maintains this fear is unfounded as the researchers' tests clearly demonstrated the absence of emergent complex reasoning abilities in LLMs.

“While it's important to address the existing potential for the misuse of AI, such as the creation of fake news and the heightened risk of fraud, it would be premature to enact regulations based on perceived existential threats,” he said.

“Importantly, what this means for end users is that relying on LLMs to interpret and perform complex tasks which require complex reasoning without explicit instruction is likely to be a mistake. Instead, users are likely to benefit from explicitly specifying what they require models to do and providing examples where possible for all but the simplest of tasks.”

Professor Gurevych added: "… our results do not mean that AI is not a threat at all. Rather, we show that the purported emergence of complex thinking skills associated with specific threats is not supported by evidence and that we can control the learning process of LLMs very well after all. Future research should therefore focus on other risks posed by the models, such as their potential to be used to generate fake news."


[Video] Dr Harish Tayyar Madabushi describes the pros, cons and limitations of LLMs.

bnew

Google quietly opens Imagen 3 access to all U.S. users​

Michael Nuñez@MichaelFNunez

August 15, 2024 10:42 AM

Credit: Google Imagen


Google has quietly made its latest text-to-image AI model, Imagen 3, available to all U.S. users through its ImageFX platform and published a research paper detailing the technology.

This dual release marks a significant expansion of access to the AI tool, which was initially announced in May at Google I/O and limited to select Vertex AI users in June.



1/1
Google announces Imagen 3

discuss: Paper page - Imagen 3

We introduce Imagen 3, a latent diffusion model that generates high quality images from text prompts. We describe our quality and responsibility evaluations. Imagen 3 is preferred over other state-of-the-art (SOTA) models at the time of evaluation. In addition, we discuss issues around safety and representation, as well as methods we used to minimize the potential harm of our models.

The company’s research team stated in their paper, published on arxiv.org, “We introduce Imagen 3, a latent diffusion model that generates high-quality images from text prompts. Imagen 3 is preferred over other state-of-the-art models at the time of evaluation.”

This development comes in the same week as xAI’s launch of Grok-2, a rival AI system with notably fewer restrictions on image generation, highlighting the divergent approaches to AI ethics and content moderation within the tech industry.


Imagen 3: Google’s latest salvo in the AI arms race​


Google’s release of Imagen 3 to the broader U.S. public represents a strategic move in the intensifying AI arms race. However, the reception has been mixed. While some users praise its improved texture and word recognition capabilities, others express frustration with its strict content filters.

One user on Reddit noted, “Quality is much higher with amazing texture and word recognition, but I think it’s currently worse than Imagen 2 for me.” They added, “It’s pretty good, but I’m working harder with higher error results.”

The censorship implemented in Imagen 3 has become a focal point of criticism. Many users report that seemingly innocuous prompts are being blocked. “Way too censored I can’t even make a cyborg for crying out loud,” another Reddit user commented. Another said, “[It] denied half my inputs, and I’m not even trying to do anything crazy.”

These comments highlight the tension between Google’s efforts to ensure responsible AI use and users’ desires for creative freedom. Google has emphasized its focus on responsible AI development, stating, “We used extensive filtering and data labeling to minimize harmful content in datasets and reduced the likelihood of harmful outputs.”


Grok-2: xAI’s controversial unrestricted approach​


In stark contrast, xAI’s Grok-2, integrated within Elon Musk’s social network X and available through premium subscription tiers, offers image generation capabilities with virtually no restrictions. This has led to a flood of controversial content on the platform, including manipulated images of public figures and graphic depictions that other AI companies typically prohibit.

The divergent approaches of Google and xAI underscore the ongoing debate in the tech industry about the balance between innovation and responsibility in AI development. While Google’s cautious approach aims to prevent misuse, it has led to frustration among some users who feel creatively constrained. Conversely, xAI’s unrestricted model has reignited concerns about the potential for AI to spread misinformation and offensive content.

Industry experts are closely watching how these contrasting strategies will play out, particularly as the U.S. presidential election approaches. The lack of guardrails in Grok-2’s image generation capabilities has already raised eyebrows, with many speculating that xAI will face increasing pressure to implement restrictions.


The future of AI image generation: Balancing creativity and responsibility​


Despite the controversies, some users have found value in Google’s more restricted tool. A marketing professional on Reddit shared, “It’s so much easier to generate images via something like Adobe Firefly than digging through hundreds of pages of stock sites.”

As AI image generation technology becomes more accessible to the public, the industry faces critical questions about the role of content moderation, the balance between creativity and responsibility, and the potential impact of these tools on public discourse and information integrity.

The coming months will be crucial for both Google and xAI as they navigate user feedback, potential regulatory scrutiny, and the broader implications of their technological choices. The success or failure of their respective approaches could have far-reaching consequences for the future development and deployment of AI tools across the tech industry.
 

bnew

Runway’s Gen-3 Alpha Turbo is here and can make AI videos faster than you can type​

Carl Franzen@carlfranzen

August 15, 2024 9:04 AM

[Image: a robot director in a red beret looks through a camera monitor. Credit: VentureBeat made with Midjourney]




After showing it off in a preview late last month, Runway ML has officially released Gen-3 Alpha Turbo, the latest version of the AI video generation model that it claims is seven times faster and half the cost of its predecessor, Gen-3 Alpha.

The goal? Make AI video production more accessible to a wider audience across all subscription plans, including free trials.

The New York City-based company announced the news on its X account, writing: “Gen-3 Alpha Turbo Image to Video is now available and can generate 7x faster for half the price of the original Gen-3 Alpha. All while still matching performance across many use cases. Turbo is available for all plans, including trial for free users. More improvements to the model, control mechanisms and possibilities for real-time interactivity to come.”



1/1
Gen-3 Alpha Turbo Image to Video is now available and can generate 7x faster for half the price of the original Gen-3 Alpha. All while still matching performance across many use cases. Turbo is available for all plans, including trial for free users.

More improvements to the model, control mechanisms and possibilities for real-time interactivity to come.


Gen-3 Alpha Turbo builds on the already impressive capabilities of Runway’s Gen-3 Alpha, which gained attention for its realistic video generation.

However, Runway has pushed the boundaries even further with this latest release, prioritizing speed without compromising on performance. According to Runway co-founder and CEO Cristóbal Valenzuela, the new Turbo model means “it now takes me longer to type a sentence than to generate a video.”



1/1
it now takes me longer to type a sentence than to generate a video.

This leap in speed addresses a critical issue with AI video generation models—time lag—allowing for near real-time video production.

As a result, users can expect a more seamless and efficient workflow, particularly in industries where quick turnaround times are essential.


Broad accessibility and aggressively low pricing​


Runway’s decision to lower the cost of using Gen-3 Alpha Turbo aligns with its strategy to encourage more widespread adoption of its technology.

While the regular Gen-3 Alpha is priced at 10 credits per second of generated video, Gen-3 Alpha Turbo should be priced at 5 credits per second, per Runway's statement that it costs 50% less.

Credits can be purchased in bundles starting at 1,000 credits on the Runway website or as part of monthly or annual subscription tiers. It costs $10 for 1,000 credits, or $0.01 per credit.
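
Putting those numbers together, a small sketch using only the prices stated above:

```python
# Worked cost estimate from the stated prices.
CREDIT_PRICE_USD = 10 / 1000         # $10 per 1,000 credits = $0.01/credit
ALPHA_CREDITS_PER_SEC = 10           # regular Gen-3 Alpha
TURBO_CREDITS_PER_SEC = 5            # 50% cheaper, per Runway

def clip_cost_usd(seconds: float, credits_per_sec: int) -> float:
    """Dollar cost of generating `seconds` of video."""
    return seconds * credits_per_sec * CREDIT_PRICE_USD

# A 10-second clip: $1.00 on Gen-3 Alpha vs. $0.50 on Turbo.
print(clip_cost_usd(10, ALPHA_CREDITS_PER_SEC))  # 1.0
print(clip_cost_usd(10, TURBO_CREDITS_PER_SEC))  # 0.5
```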



The model’s availability across all subscription plans, including free trials, ensures that a broad spectrum of users—from hobbyists to professional creators—can benefit from these enhancements.

By offering a faster and cheaper alternative, Runway is positioning itself to maintain a competitive edge in the rapidly evolving AI video generation market, where rivals including Pika Labs, Luma AI’s Dream Machine, Kuaishou’s Kling, and OpenAI’s Sora are also vying for dominance.

Yet despite showing off Sora in January of this year and releasing it to a select group of creators, OpenAI’s video model remains out of reach to the public, and other video generation models tend to take much longer to generate from text prompts and images (several minutes or more in my tests).


Promising initial results​


Already, Runway subscribers are sharing videos made with the new Gen-3 Alpha Turbo model and finding themselves impressed with its combination of speed and quality.

While generation time is not always one-to-one with the seconds of video produced, users are nonetheless delighted with the overall experience of the new model, showcasing a wide range of styles, from realistic to animation and anime.







Some users, such as @LouiErik8Irl on X, prefer the regular Gen-3 Alpha model for what they see as its higher quality. Yet they see value in being able to generate simple motion quickly through Gen-3 Alpha Turbo.











1/10
@runwayml Gen-3 Alpha Turbo model is out! It is insanely fast (7x) and very high quality too! Tho the base Alpha model still wins when you want more dynamic motions.

Here are 6 🔥examples to test and compare the two models.

(1/6)
The left is the normal model, and the right is Turbo.

I think I will use Turbo for shots that just need some simple motion from now on. However, the Turbo model doesn't have the Last frame gen, so it's a trade-off.

2/10
It's pretty clear that the base model is far more dynamic. But getting 7X speed with Turbo is also a great trade-off.

Used the same prompt for both to test:
The camera flies inside the tornado

3/10
(2/6)
The base model is better at dynamic motion, but that also leads to more morphing. So if you want more stable and simple motion, Turbo is the way to go!

No prompt for this one to test the models raw.

The left is the normal model, and the right is Turbo.

4/10
(3/6)
But if you want more complex motions and changes, the base model is far better.

Same prompt for both:
The dragon breathes fire out of its mouth.

The left is the normal model, and the right is Turbo.

5/10
(4/6)
The turbo model also seems to stick to the original image more closely, while the base model is more creative.

No prompt for both to test raw motion.

The left is the normal model, and the right is Turbo.

6/10
(5/6)
Some shot types might also work better with Turbo due to the fact that it is more stable. You can see the fire is definitely better for the base model here, but the overall motion of the Turbo model is not bad either.

No prompt for both to test raw motion.

The left is the normal model, and the right is Turbo.

7/10
(6/6)
Again, the base model wins in terms of dynamics. But Turbo model is more consistent and stable. It also doesn't change the character's faces when moving, which was a big problem with the base model. Turbo sticks to the original image really well, tho it is not immune from morphing either.

No prompt for both to test raw motion.

The left is the normal model, and the right is Turbo.

8/10
Overall, the new Turbo model is a fantastic addition to Gen-3. I would use Turbo for shots that need simple motion, more stability, sticking closer to the original image, or faster iteration. And use the base model for more complex motion, more creative outputs, and the First and Last frame feature.

9/10
Btw this set of images was for the Discord daily challenge. Which is themed Fire.

10/10
At the model selection drop-down button on the top left.




Future improvements and unresolved legal/ethical issues​


Runway is not resting on its laurels with the release of Gen-3 Alpha Turbo. The company has indicated that more improvements are on the horizon, including enhancements to the model’s control mechanisms and possibilities for real-time interactivity.

Previously, on its older Gen-2 model, Runway introduced the Multi Motion Brush, which lets users selectively animate objects and regions of a video, giving more granular direction over the AI-generated clips.

However, the company continues to navigate the ethical complexities of AI model training. Runway has faced scrutiny over the sources of its training data, particularly following a report from 404 Media that the company may have used copyrighted content from YouTube for training purposes without authorization.

Although Runway has not commented on these allegations, the broader industry is grappling with similar challenges, as legal battles over the use of copyrighted materials in AI training intensify.

As the debate over ethical AI practices unfolds, Runway and other generative AI companies may find themselves compelled to disclose more information about their training data and methods. The outcome of these discussions could have significant implications for the future of AI model development and deployment.
 

bnew

I've made a useful comparison table between all the providers, models and prices
[Image: comparison table of providers, models and prices]
 

bnew

Yeah, this was my prompt, using the Qwen1.5-72B-Chat LLM: "explain how major label loans with hip hop artists are like sharecropper deals black americans were given."

They weren't initially designed to do math, but that's an area that's seeing constant improvement.



1/1
DeepSeekMath: Approaching Mathematical Reasoning Capability of GPT-4 with a 7B Model.

Highlights:
- Continue pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math tokens from Common Crawl.
- Introduce GRPO, a variant of PPO, that enhances mathematical reasoning and reduces training resources.

More Details: [2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Model Download: deepseek-ai (DeepSeek)
GitHub Repo: GitHub - deepseek-ai/DeepSeek-Math: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

#DeepSeek #DeepSeekMath


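GRPO's core idea, replacing PPO's learned value baseline with a group-relative one, fits in a few lines. A minimal sketch of the advantage computation (illustrative, not DeepSeek's actual code):

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Sketch of the GRPO (Group Relative Policy Optimization) advantage.

    For each prompt, sample a group of G completions and score them.
    Instead of a learned value network (as in PPO), each completion's
    advantage is its reward standardized against its own group, which
    is what cuts the training resources the tweet mentions.
    """
    mean = group_rewards.mean()
    std = group_rewards.std() + 1e-8  # guard against a zero-variance group
    return (group_rewards - mean) / std

# Example: four sampled solutions to one math problem, scored 1 if correct.
rewards = np.array([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # correct answers get positive advantage
```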



1/1
AI apparently now at Math Olympiad levels in geometry

Deepmind’s “AlphaGeometry”, uses a language model + deduction engine to solve complex geometry problems.

Also, it uses a similar dual-thinking method as humans (analogous to intuition & logic in the book “thinking fast and slow”)!



1/1
AlphaGeometry is a system made up of two parts:
A neural language model, which can predict useful geometry constructions to solve problems
A symbolic deduction engine, which uses logical rules to deduce conclusions

Both work together to find proofs for complex geometry theorems.


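The interplay between the two parts can be sketched as a simple loop: the symbolic engine deduces everything it can, and when it stalls, the language model proposes a new auxiliary construction. A conceptual sketch with hypothetical function names, not DeepMind's code:

```python
def alphageometry_style_solve(premises, goal, lm_propose, deduce, max_steps=16):
    """Conceptual sketch of a neuro-symbolic geometry-proving loop.

    deduce(facts): symbolic engine; closes `facts` under logical rules and
        returns the enlarged fact set ("slow" logical thinking).
    lm_propose(facts, goal): language model; suggests one new auxiliary
        construction, e.g. "let M be the midpoint of AB", or None if it
        has no further ideas ("fast" intuitive thinking).
    """
    facts = set(premises)
    for _ in range(max_steps):
        facts = deduce(facts)           # exhaust purely logical deductions
        if goal in facts:
            return facts                # proof found
        construction = lm_propose(facts, goal)
        if construction is None:
            break
        facts.add(construction)         # extend the diagram and try again
    return None                         # no proof within the step budget
```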

Also, Google Gemini Ultra is being released today, so try it on some math problems you may have tried before to see if it has improved.


1/1
Big news.

The most powerful GPT competitor, Gemini Ultra, will be released on Wednesday. Google confirmed it.

Ultra beats GPT-4 in 7 out of 8 benchmarks:

HumanEval (Code)
Gemini Ultra: 74%
GPT-4: 67%

MMLU (General)
Gemini Ultra: 90%
GPT-4: 86.4%

GSM8K (Math)
Gemini Ultra: 94.4%
GPT-4: 92%

Big Bench (Reasoning)
Gemini Ultra: 83.6%
GPT-4: 83.1%

It's also the first model to outperform human experts on MMLU.


 

bnew

Artificial intelligence is losing hype​


For some, that is proof the tech will in time succeed. Are they right?​

[Illustration: a robotic hand with crossed fingers on a solid red background. Credit: Alberto Miranda]

Aug 19th 2024

Silicon Valley’s tech bros are having a difficult few weeks. A growing number of investors worry that artificial intelligence (AI) will not deliver the vast profits they seek. Since peaking last month the share prices of Western firms driving the AI revolution have dropped by 15%. A growing number of observers now question the limitations of large language models, which power services such as ChatGPT. Big tech firms have spent tens of billions of dollars on AI models, with even more extravagant promises of future outlays. Yet according to the latest data from the Census Bureau, only 4.8% of American companies use AI to produce goods and services, down from a high of 5.4% early this year. Roughly the same share intend to do so within the next year.

Gently raise these issues with a technologist and they will look at you with a mixture of disappointment and pity. Haven’t you heard of the “hype cycle”? This is a term popularised by Gartner, a research firm—and one that is common knowledge in the Valley. After an initial period of irrational euphoria and overinvestment, hot new technologies enter the “trough of disillusionment”, the argument goes, where sentiment sours. Everyone starts to worry that adoption of the technology is proceeding too slowly, while profits are hard to come by. However, as night follows day, the tech makes a comeback. Investment that had accompanied the wave of euphoria enables a huge build-out of infrastructure, in turn pushing the technology towards mainstream adoption. Is the hype cycle a useful guide to the world’s AI future?

It is certainly helpful in explaining the evolution of some older technologies. Trains are a classic example. Railway fever gripped 19th-century Britain. Hoping for healthy returns, everyone from Charles Darwin to John Stuart Mill ploughed money into railway stocks, creating a stockmarket bubble. A crash followed. Then the railway companies, using the capital they had raised during the mania, built the track out, connecting Britain from top to bottom and transforming the economy. The hype cycle was complete. More recently, the internet followed a similar evolution. There was euphoria over the technology in the 1990s, with futurologists predicting that within a couple of years everyone would do all their shopping online. In 2000 the market crashed, prompting the failure of 135 big dotcom companies, from garden.com to pets.com. The more important outcome, though, was that by then telecoms firms had invested billions in fibre-optic cables, which would go on to become the infrastructure for today’s internet.

Although AI has not experienced a bust on anywhere near the same scale as the railways or dotcom, the current anxiety is, according to some, nevertheless evidence of its coming global domination. “The future of AI is just going to be like every other technology. There’ll be a giant expensive build-out of infrastructure, followed by a huge bust when people realise they don’t really know how to use AI productively, followed by a slow revival as they figure it out,” says Noah Smith, an economics commentator.

Is this right? Perhaps not. For starters, versions of AI itself have for decades experienced periods of hype and despair, with an accompanying waxing and waning of academic engagement and investment, but without moving to the final stage of the hype cycle. There was lots of excitement over AI in the 1960s, including over ELIZA, an early chatbot. This was followed by AI winters in the 1970s and 1990s. As late as 2020 research interest in AI was declining, before zooming up again once generative AI came along.

It is also easy to think of many other influential technologies that have bucked the hype cycle. Cloud computing went from zero to hero in a pretty straight line, with no euphoria and no bust. Solar power seems to be behaving in the same way. Social media, too. Individual companies, such as Myspace, fell by the wayside, and there were concerns early on about whether it would make money, but consumer adoption increased monotonically. On the flip side, there are plenty of technologies for which the vibes went from euphoria to panic, but which have not (or at least not yet) come back in any meaningful sense. Remember Web3? For a time, people speculated that everyone would have a 3D printer at home. Carbon nanotubes were also a big deal.

Anecdotes only get you so far. Unfortunately, it is not easy to test whether a hype cycle is an empirical regularity. “Since it is vibe-based data, it is hard to say much about it definitively,” notes Ethan Mollick of the University of Pennsylvania. But we have had a go at saying something definitive, extending work by Michael Mullany, an investor, that he conducted in 2016. The Economist collected data from Gartner, which for decades has placed dozens of hot technologies where it believes they belong on the hype cycle. We then supplemented it with our own number-crunching.

Over the hill​

We find, in short, that the cycle is a rarity. Tracing breakthrough technologies over time, only a small share—perhaps a fifth—move from innovation to excitement to despondency to widespread adoption. Lots of tech becomes widely used without such a rollercoaster ride. Others go from boom to bust, but do not come back. We estimate that of all the forms of tech which fall into the trough of disillusionment, six in ten do not rise again. Our conclusions are similar to those of Mr Mullany: “An alarming number of technology trends are flashes in the pan.”

AI could still revolutionise the world. One of the big tech firms might make a breakthrough. Businesses could wake up to the benefits that the tech offers them. But for now the challenge for big tech is to prove that AI has something to offer the real economy. There is no guarantee of success. If you must turn to the history of technology for a sense of AI’s future, the hype cycle is an imperfect guide. A better one is “easy come, easy go”. ■

 

bnew

1/1
LongWriter unlocks text generation up to 10k words! 🤯 can't wait to try it






1/2
LongWriter-glm4-9b from @thukeg is capable of generating 10,000+ words at once!🚀

Paper identifies a problem with current long context LLMs -- they can process inputs up to 100,000 tokens, yet struggle to generate outputs exceeding lengths of 2,000 words.

Paper proposes that an LLM's effective generation length is inherently bounded by the samples it has seen during supervised fine-tuning😮

Demonstrates that existing long context LLMs already possess the potential for a larger output window--all you need is data with extended output during model alignment to unlock this capability.

Code & models are released under Apache License 2.0🧡

2/2
Model on 🤗 Hub: THUDM/LongWriter-glm4-9b · Hugging Face

Gradio demo available on the repo locally and linked on the project Readme: GitHub - THUDM/LongWriter: LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs

Clone the repo and launch the gradio demo: python trans_web_demo.py 🤠

Demo releasing soon on 🤗 Spaces, stay tuned!


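For anyone who wants to try it outside the Gradio demo, here is a minimal loading sketch using Hugging Face transformers. The generation interface is an assumption on my part (GLM models ship custom remote code), so check the model card for the exact chat API:

```python
# Minimal sketch for running LongWriter-glm4-9b locally.
# Assumes the standard transformers remote-code path; see the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "THUDM/LongWriter-glm4-9b"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

prompt = "Write a 5,000-word beginner's guide to training for a marathon."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# A large max_new_tokens budget is the whole point of this model.
output = model.generate(**inputs, max_new_tokens=16384, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```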
 

bnew



1/4
1/ AI news this week that we're paying close attention to:

• Hermes 3 - The uncensored AI model by @NousResearch
• Listening-while-Speaking Language Model (LSLM) by ByteDance devs

Why? Read more below!

2/4
Hermes 3 - The uncensored AI model

Powered by @lambdaapi & @NousResearch, built on @Meta's Llama 3.1 405B. The open-source uncensored LLM offers powerful agentic capabilities & user-tailored responses.

It represents a new approach to unrestricted, personalized AI interaction.

3/4
Listening-while-Speaking Language Model (LSLM) - The AI that converses in real-time

Developed by researchers from ByteDance & Shanghai Jiao Tong University, built on a decoder-only Transformer. The model can listen & speak simultaneously, enabling seamless natural conversations.

4/4
Follow @Mira_Network for more weekly updates in AI!

Join our discord: Discord - Group Chat That’s All Fun & Games











1/8
Introducing Hermes 3: The latest version in our Hermes series, a generalist language model aligned to you.

Hermes 3 - NOUS RESEARCH

Hermes 3 is available in three sizes: 8B, 70B, and 405B parameters. Hermes has improvements across the board, but with particular capability improvements in roleplaying, agentic tasks, more reliable function calling, multi-turn chats, long context coherence and more.

We published a technical report detailing new capabilities, training run information and more:

Paper: https://nousresearch.com/wp-content/uploads/2024/08/Hermes-3-Technical-Report.pdf

This model was trained in collaboration with our great partners @LambdaAPI, and they are now offering it for free in a chat interface here: https://lambda.chat/chatui/

You can also chat with Hermes 405B on our discord, join here: Join the Nous Research Discord Server!

Hermes 3 was a project built with the help of @Teknium1, @TheEmozilla, @nullvaluetensor, @karan4d, @huemin_art, and an uncountable number of people and work in the Open Source community.

2/8
Hermes 3 performs strongly against Llama-3.1 Instruct Models, but with a focus on aligning the model to you, instead of a company or external policy - meaning less censorship and more steerability - with additional capabilities like agentic XML, scratchpads, roleplaying prowess, and more. Step level reasoning and planning, internal monologues, improved RAG, and even LLM as a judge capabilities were also targeted.

Below are benchmark comparisons between Hermes 3 and Llama-3.1 Instruct and a sample of utilizing the agentic XML tags:

3/8
Lambda's Hermes 3 Announcement Post: Unveiling Hermes 3: The First Full-Parameter Fine-Tuned Llama 3.1 405B Model is on Lambda’s Cloud

Nous' blog post on our experience discovering emergent behavior with 405B:
Freedom at the Frontier: Hermes 3 - NOUS RESEARCH

Hermes 3 405B was trained with @LambdaAPI's new 1-Click Cluster offering, check it out here: Lambda GPU Cloud | 1-Click Clusters

Check out our reference inference code for Hermes Function Calling here: GitHub - NousResearch/Hermes-Function-Calling

Thanks to all the other organizations who helped bring this together, including @weights_biases, @neuralmagic, @vllm_project, @huggingface, @WeAreFireworks, @AiEleuther, @togethercompute, @AIatMeta, and many more
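
As a rough illustration of the XML-tagged function calling that the Hermes-Function-Calling repo implements: the model emits tool invocations inside XML tags, and the caller parses them out. The exact tags and JSON schema live in the repo; this sketch is conceptual, not the verbatim format:

```python
# Conceptual sketch of XML-tagged function calling in the Hermes style.
# The exact tags/schema are in NousResearch/Hermes-Function-Calling.
import json
import re

SYSTEM_PROMPT = """You are a function-calling AI. Invoke tools by emitting:
<tool_call>{"name": "<tool>", "arguments": {...}}</tool_call>"""

def extract_tool_calls(model_output: str) -> list[dict]:
    """Parse JSON tool invocations out of tagged model output."""
    raw = re.findall(r"<tool_call>(.*?)</tool_call>", model_output, re.DOTALL)
    return [json.loads(r) for r in raw]

# Example model output containing one tagged call:
output = '<tool_call>{"name": "get_weather", "arguments": {"city": "Boston"}}</tool_call>'
for call in extract_tool_calls(output):
    print(call["name"], call["arguments"])  # get_weather {'city': 'Boston'}
```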

4/8
Special shoutouts to @intrstllrninja for all the work on making function calling real, robust, and useful

and a special thanks to our designer @StudioMilitary for the cover art and all the other designs that Nous uses!

5/8
Believe Lambda is also hosting an api version, will update when clear

6/8
You can try it out in our discord right now if you want! Join the Nous Research Discord Server!

7/8
He's the god of language

8/8
Certainly (not sure if 405b can be done but the rest yes)


 

bnew


1/2
1/n How Mutually Consistent Reasoning Unlocks Agentic AI for Small Language Models

Large Language Models (LLMs) have demonstrated remarkable abilities in various tasks, yet their capacity for complex reasoning remains a significant challenge, especially for their smaller, more accessible counterparts – Small Language Models (SLMs). While fine-tuning on specific reasoning datasets can improve performance, this approach often relies on data generated by superior models, creating a dependence that hinders the development of truly self-sufficient SLMs. The paper "Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers" tackles this challenge head-on, introducing rStar, a novel approach that significantly enhances the reasoning capabilities of SLMs without relying on fine-tuning or data from superior models.

The core of rStar lies in addressing the two major pain points that plague SLMs when it comes to complex reasoning: ineffective exploration of potential solutions and unreliable self-assessment. Traditional methods often confine SLMs to a limited set of reasoning actions, hindering their ability to explore diverse paths towards a solution. Furthermore, relying on these models to evaluate their own reasoning proves unreliable, as their self-assessment capabilities are often inaccurate.

rStar tackles these limitations through a clever two-pronged approach: a richer, human-inspired set of reasoning actions and a collaborative evaluation mechanism called mutual consistency. Unlike previous methods that rely on a single action type, rStar empowers SLMs with a diverse set of actions, mimicking human problem-solving strategies. These actions include proposing thoughts, formulating sub-questions, re-answering, and even rephrasing questions for clarity. This expanded repertoire allows SLMs to navigate the solution space more effectively, exploring a wider range of possibilities.

To address the issue of unreliable self-evaluation, rStar introduces a second SLM as a partner in a collaborative verification process. The first SLM, acting as a generator, leverages the diverse action set and the Monte Carlo Tree Search (MCTS) algorithm to generate multiple candidate reasoning trajectories. The second SLM, acting as a discriminator, then evaluates these trajectories by attempting to complete them with partial information. This collaborative approach, termed "mutual consistency," ensures that only those reasoning paths agreed upon by both SLMs are considered valid, leading to a more robust and reliable evaluation process.

The effectiveness of rStar is evident in its impressive performance on a variety of reasoning tasks. Tested on five different SLMs and five diverse reasoning benchmarks, including mathematical problem-solving and multi-hop reasoning over text, rStar consistently outperforms existing state-of-the-art methods. Remarkably, it achieves accuracy comparable to or even exceeding models fine-tuned on these specific datasets, highlighting its ability to learn and improve without task-specific training data.

The success of rStar signifies a significant leap forward in the field of LLM reasoning. By combining the power of diverse reasoning actions with a collaborative evaluation mechanism, rStar unlocks the potential of SLMs, enabling them to tackle complex reasoning tasks with remarkable accuracy. This approach not only paves the way for more accessible and efficient AI systems but also sheds light on the power of collaborative learning and self-improvement in pushing the boundaries of artificial intelligence.
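
The generator/discriminator check is easy to caricature in code. A conceptual sketch with hypothetical helper functions (the real system searches with MCTS over a rich action space):

```python
import random

def mutual_consistency_check(question, generator, discriminator, seed=0):
    """Conceptual sketch of rStar-style mutual consistency.

    generator(question): returns candidate reasoning trajectories, each a
        list of at least two steps ending in a final answer (produced via
        MCTS in the real system).
    discriminator(question, prefix): a second SLM that independently
        completes a truncated trajectory and returns its own step list.
    A trajectory is kept only if the discriminator, shown a prefix of the
    reasoning, arrives at the same final answer.
    """
    rng = random.Random(seed)
    agreed = []
    for path in generator(question):
        cut = rng.randint(1, len(path) - 1)       # hide the tail of the path
        completion = discriminator(question, path[:cut])
        if completion and completion[-1] == path[-1]:  # same final answer?
            agreed.append(path)
    return agreed  # trajectories both models agree on
```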

2/2
2/n Comparison with other methods

1. Prompting LLMs to Reason:

- Chain-of-Thought (CoT) (Wei et al., 2022): prompts LLMs with a few-shot demonstration of reasoning steps. Contrast with rStar: CoT relies on a single, greedy decoding path, while rStar explores multiple reasoning trajectories using MCTS and a richer action space.

- Planning, decomposition, abstraction and programming prompts: various works explore specific prompting strategies to guide reasoning. Contrast with rStar: these methods focus on single-round inference, while rStar uses an iterative, self-improving approach.

2. LLM Self-improvement:

- Fine-tuning-based methods (Chen et al., 2024b;a): use a well-pretrained LLM to generate data for further fine-tuning. Contrast with rStar: rStar improves reasoning at inference time, without requiring additional training data or a superior teacher model.

- Self-verification (Gero et al., 2023; Zhou et al., 2023): LLMs verify their own answers, often by generating explanations or checking for consistency. Contrast with rStar: rStar uses a separate discriminator SLM for more reliable evaluation, overcoming the limitations of self-assessment in SLMs.

- RAP (Hao et al., 2023): uses self-exploration and self-rewarding to iteratively improve reasoning. Contrast with rStar: rStar addresses the limitations of RAP's single action type and unreliable self-rewarding with its diverse action space and mutual consistency mechanism.

3. Sampling Reasoning Paths:

- Self-Consistency (Wang et al., 2023): samples multiple CoT paths and selects the most consistent answer. Contrast with rStar: self-consistency relies on random sampling of complete CoT paths, while rStar uses MCTS with a richer action space for more guided exploration.

- Tree-search approaches (Yao et al., 2024; Hao et al., 2023; Zhang et al., 2024): use tree-search algorithms like MCTS to explore reasoning paths. Contrast with rStar: most existing tree-search methods use limited action spaces, while rStar's diverse actions provide more flexibility and effectiveness.

4. Answer Verification:

- Majority voting (Wang et al., 2023): selects the answer that appears most frequently across multiple generated solutions (a minimal sketch follows this list). Contrast with rStar: rStar's mutual consistency mechanism provides a more robust evaluation than simple majority voting, especially for SLMs.

- Trained reward models (Wang et al., 2024b; Chen et al., 2024a): train separate models to evaluate the quality of reasoning paths. Contrast with rStar: rStar avoids the need for additional training data and the potential overfitting issues associated with training separate reward models.

In essence, rStar distinguishes itself from prior work by combining the strengths of several approaches:
- It leverages the power of tree search for exploring solution spaces.
- It introduces a richer, human-inspired action space for more effective exploration.
- It employs a novel mutual consistency mechanism for reliable evaluation without relying on self-assessment or external training data.

This unique combination allows rStar to significantly improve SLM reasoning, achieving performance comparable to or even surpassing fine-tuned models.
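
For contrast, the self-consistency baseline mentioned above is a bare majority vote over sampled answers, which is exactly where small models can be led astray: if an SLM repeats the same mistake often enough, the wrong answer wins. A minimal sketch:

```python
from collections import Counter

def majority_vote(final_answers: list[str]) -> str:
    """Self-consistency baseline: pick the most frequent final answer."""
    return Counter(final_answers).most_common(1)[0][0]

# Eight sampled CoT paths; a systematic error ("15") outvotes the truth.
print(majority_vote(["12", "12", "15", "12", "15", "15", "15", "15"]))  # 15
```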


 

bnew





1/6
Super excited to announce our cool project, Trace, for optimizing general AI systems using LLMs.😎

Trace is a new AutoDiff-like tool for training AI systems end-to-end with general feedback (like numerical rewards, natural language text, compiler errors).

2/6
Training AI systems & agents with Trace couldn't be simpler. It is just like training neural networks!! With Trace, you can use a single optimizer to learn hyperparameters, prompts, orchestration code, robot policies, etc., with just a few iterations of training.

3/6
Trace generalizes the back-propagation algorithm by capturing and propagating an AI system's <execution trace>. Trace is implemented as a PyTorch-like Python library. Users of Trace can optimize heterogeneous parameters jointly in a non-differentiable workflow with feedback.

4/6
This feat is made possible by a new math formulation of iterative optimization, which we call Optimization with Trace Oracle (OPTO). In the paper, we design an LLM-based OPTO optimizer, OptoPrime, that can solve problems originating from disparate domains.

5/6
This is work done through a wonderful collaboration with @Allen_A_N and @adith387 😀. Stay tuned. We will release the code soon!

6/6
The source code is out now :smile: Please also see our new blog post to learn more about it. This is a preview of the library. Let me know if you have any feedback.

Discover Trace, a new framework for AI optimization from language models to robot control
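
The OPTO idea can be caricatured in a few lines: run the workflow, record its execution trace, hand the trace plus feedback to an LLM, and let it propose new parameter values. A purely conceptual sketch with a hypothetical llm_propose helper, not the Trace library's actual API:

```python
def opto_style_loop(params, run_workflow, llm_propose, steps=5):
    """Conceptual sketch of Optimization with Trace Oracle (OPTO).

    run_workflow(params): executes the AI system and returns
        (trace, feedback), where `trace` is the recorded execution trace
        and `feedback` may be a numeric reward, natural-language critique,
        or a compiler error.
    llm_propose(params, trace, feedback): asks an LLM to suggest updated
        parameters (prompts, hyperparameters, code, ...), playing the role
        that back-propagation plays for neural networks.
    """
    for _ in range(steps):
        trace, feedback = run_workflow(params)         # "forward pass"
        params = llm_propose(params, trace, feedback)  # "backward pass"
    return params
```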


 