WIA20XX

Superstar
Joined
May 24, 2022
Messages
7,256
Reputation
3,422
Daps
22,883
Any of y'all paying for AI at this point?
Anyone hosting something locally?

I'm trying to collect all my posts from various forums and all my notes/previous drafts and finally write my book that includes everything, and I'm wondering whether to invest in a monthly account or try hosting something locally (though I'm not sure my 2080 Ti could hack it)...
 

Matt504

YSL as a gang must end
Joined
Sep 7, 2013
Messages
45,365
Reputation
14,982
Daps
275,024
Any of y'all paying for AI at this point?
Anyone hosting something locally?

I'm trying to collect all my posts from various forums and all my notes/previous drafts and finally write my book that includes everything, and I'm wondering whether to invest in a monthly account or try hosting something locally (though I'm not sure my 2080 Ti could hack it)...

my monthly AI bill across services is ~$80
 

Matt504

YSL as a gang must end
Joined
Sep 7, 2013
Messages
45,365
Reputation
14,982
Daps
275,024
personal, work related or both? what determines your use cases for each service/model?

Both work and personal. I'm still evaluating the tools, so I'll often use the same prompt across all of the models and compare. I'm mostly working in Cursor, so that has dramatically decreased my usage of the web UIs.
 

Secure Da Bag

Veteran
Joined
Dec 20, 2017
Messages
41,697
Reputation
21,590
Daps
130,402

Researchers at the AI company Anthropic say they have made a fundamental breakthrough in our understanding of exactly how large language models, the type of AI responsible for the current boom, work. The breakthrough has important implications for how we may be able to make AI models safer, more secure, and more reliable in the future.


One of the problems with today’s powerful AI that is based around large language models (LLMs) is that the models are black boxes. We can know what prompts we feed them and what output they produce, but exactly how they arrive at any particular response is a mystery, even to the AI researchers who build them.

This inscrutability creates all kinds of issues. It makes it difficult to predict when a model is likely to “hallucinate,” or confidently spew erroneous information. We know these large AI models are susceptible to various jailbreaks, where they can be tricked into jumping guardrails (the limits AI developers try to put around a model's outputs so that it doesn't use racist language, write malware for someone, or tell them how to build a bomb). But we don’t understand why some jailbreaks work better than others, or why the fine-tuning used to create the guardrails doesn’t instill inhibitions strong enough to prevent the models from doing things their developers don’t want them to do.


Our inability to understand how LLMs work has made some businesses hesitant to use them. If the models' inner workings were more understandable, it might give companies more confidence to use the models more widely.

There are implications for our ability to retain control of increasingly powerful AI “agents” too. We know these agents are capable of “reward hacking”—finding ways to achieve a goal that were not what the user of the model intended. In some cases the models can be deceptive, lying to users about what they have done or are trying to do. And while the recent “reasoning” AI models produce what’s known as a “chain of thought”—a kind of plan for how to answer a prompt that involves what looks to a human like “self-reflection”—we don’t know whether the chain of thought a model outputs accurately represents the steps it is actually taking (and there is often evidence that it does not).
 

Secure Da Bag

Veteran
Joined
Dec 20, 2017
Messages
41,697
Reputation
21,590
Daps
130,402
Anthropic’s new research offers a pathway to solving at least some of these problems. Its scientists created a new tool for deciphering how LLMs “think.” In essence, what the Anthropic researchers built is a bit like the fMRI scans neuroscientists use to scan the brains of human research subjects and uncover which brain regions seem to play the biggest role in different aspects of cognition. Having invented this fMRI-like tool, Anthropic then applied it to its own Claude 3.5 Haiku model. In doing so, the researchers were able to resolve several key questions about how Claude, and probably most other LLMs, work.

The researchers found that although LLMs like Claude are initially trained just to predict the next word in a sentence, in the process Claude does learn to do some longer-range planning, at least for certain kinds of tasks. For instance, when asked to write a poem, Claude first picks rhyming words that fit the poem's topic or theme, then works backward to construct lines that end with those rhyming words.


They also found that Claude, which is trained to be multilingual, doesn’t have completely separate components for reasoning in each language. Instead, concepts that are common across languages are embedded in the same set of neurons within the model and the model seems to “reason” in this conceptual space and only then convert the output to the appropriate language.

The researchers also discovered that Claude is capable of lying about its chain of thought in order to please a user. The researchers showed this by asking the model a tough math problem, but then giving the model an incorrect hint about how to solve it.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
61,784
Reputation
9,318
Daps
169,679



DeepSeek is even more efficient than Nvidia, says analyst, and the industry could copy them​


But it lacks a fluid user experience

Hassam Nasir


Last Updated on March 31, 2025




When DeepSeek first launched, it made a big impact in the AI market, largely due to its low computational requirements. But even more impressive was the fact that, despite needing so little power, it managed to outperform AI models from tech giants like OpenAI. Fast forward to today, and we are still uncovering just how efficient DeepSeek really is and whether this efficiency comes with trade-offs or if DeepSeek has simply cracked the code.

These questions stem from a recent analysis highlighting that DeepSeek serves tens of millions of daily active users (DAU) with just 2,000 GPUs. This is an astonishing feat compared to competitors like OpenAI and xAI, which rely on vastly larger GPU clusters. For instance, xAI’s latest Grok 3 AI model is powered by Colossus, a supercomputer equipped with 200,000 Nvidia GPUs.








Nvidia follows DeepSeek’s optimization methods​


According to the analysis, DeepSeek’s efficiency shows that a single H20 node (8 GPUs) can serve about 600 users. This means that while a service like WeChat would traditionally require around 400,000 GPUs to support 40 million concurrent users at 20 TPS per user, DeepSeek’s optimizations reduce this need to around 100,000–200,000 GPUs by operating at 10 TPS per user.
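As a rough sanity check on the scale of these figures, the per-node number implies a simple linear estimate. Note this naive calculation ignores batching and peak-versus-average concurrency, so it will not reproduce the analysis's 100,000–200,000 range; the helper below is purely illustrative:

```python
import math

def gpus_for_load(concurrent_users: int, users_per_node: int,
                  gpus_per_node: int = 8) -> int:
    """Total GPUs required if each node serves a fixed number of concurrent users."""
    return math.ceil(concurrent_users / users_per_node) * gpus_per_node

# Using the analysis's figure of ~600 users per 8-GPU H20 node:
print(gpus_for_load(40_000_000, 600))  # 533336 by this naive linear estimate
```

The gap between this naive figure and the quoted 100,000–200,000 suggests the analysis assumes far fewer users are active simultaneously than the daily-active-user count implies.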

“But DeepSeek has had even fewer GPUs from the very beginning, and they even had to use downgraded GPUs like A800/H20. However, they can squeeze the performance of the existing GPUs to the extreme, and their optimizations are even more effective than the official optimizations provided by NVIDIA.”

Source: Wukong, Substack

The report notes that “DeepSeek’s underlying infrastructure optimization capabilities are the most underestimated. And it can be copied by the industry.”

On top of that, unlike major tech companies that scale with high-end GPUs, the research reveals that DeepSeek has relied on downgraded GPUs like A800 and H20 from the start. Yet, despite this constraint, it has pushed hardware performance to the extreme, surpassing even NVIDIA's own optimizations. As a result, NVIDIA engineers have shared that the company is now working to integrate DeepSeek's optimization methods.



But unfortunately, there’s a tradeoff​


Now, since DeepSeek reportedly serves tens of millions of DAUs with just 2,000 GPUs, a fraction of what other AI services require, this suggests that DeepSeek prioritizes efficiency over user experience. Unlike mainstream AI chatbots, which allocate more computing resources for lower latency and faster responses, DeepSeek users often have to wait longer for replies.

That said, DeepSeek’s success proves that better software optimization can achieve similar results with far fewer resources, unlike most large companies that focus on expanding GPU clusters. If more companies follow this approach, the AI industry could shift toward lower costs, greater accessibility, and broader adoption. However, the Jevons Paradox suggests that as computing power becomes cheaper, demand for AI applications could surge, potentially increasing the need for GPUs in the long run.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
61,784
Reputation
9,318
Daps
169,679



UCLA Researchers Released OpenVLThinker-7B: A Reinforcement Learning Driven Model for Enhancing Complex Visual Reasoning and Step-by-Step Problem Solving in Multimodal Systems​


By Sana Hassan

March 28, 2025

Large vision-language models (LVLMs) integrate large language models with image processing capabilities, enabling them to interpret images and generate coherent textual responses. While they excel at recognizing visual objects and responding to prompts, they often falter when presented with problems requiring multi-step reasoning. Vision-language tasks like understanding charts, solving visual math questions, or interpreting diagrams demand more than recognition; they need the ability to follow logical steps based on visual cues. Despite advancements in model architecture, current systems consistently struggle to produce accurate and interpretable answers in such complex scenarios.

A major limitation of current vision-language models is their inability to perform complex reasoning that involves multiple steps of logical deduction, especially when interpreting images in conjunction with textual queries. These models often cannot internally verify or correct their reasoning, leading to incorrect or shallow outputs. Also, the reasoning chains these models follow are typically not transparent or verifiable, making it difficult to ensure the robustness of their conclusions. The challenge lies in bridging this reasoning gap, which text-only models have begun to address effectively through reinforcement learning techniques but vision-language models have yet to embrace fully.


Before this study, efforts to enhance reasoning in such systems mostly relied on standard fine-tuning or prompting techniques. Though helpful in basic tasks, these approaches often resulted in verbose or repetitive outputs with limited depth. Vision-language models like Qwen2.5-VL-7B showed promise due to their visual instruction-following abilities but lacked the multi-step reasoning comparable to their text-only counterparts, such as DeepSeek-R1. Even when prompted with structured queries, these models struggled to reflect upon their outputs or validate intermediate reasoning steps. This was a significant bottleneck, particularly for use cases requiring structured decision-making, such as visual problem-solving or educational support tools.

Researchers from the University of California, Los Angeles, introduced a model named OpenVLThinker-7B. This model was developed through a novel training method that combines supervised fine-tuning (SFT) and reinforcement learning (RL) in an iterative loop. The process started by generating image captions using Qwen2.5-VL-3B and feeding these into a distilled version of DeepSeek-R1 to produce structured reasoning chains. These outputs formed the training data for the first round of SFT, guiding the model in learning basic reasoning structures. Following this, a reinforcement learning stage using Group Relative Policy Optimization (GRPO) was applied to refine the model’s reasoning based on reward feedback. This combination enabled the model to progressively self-improve, using each iteration’s refined outputs as new training data for the next cycle.
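The training loop described above (distill reasoning data, run SFT, then refine with GRPO, repeated per iteration) can be sketched structurally. Everything below is a stub with hypothetical names; only the control flow mirrors the paper's description, none of the actual training code:

```python
def generate_reasoning_data(model, n_examples):
    """Stub: captioning + distilled R1 reasoning chains -> SFT examples."""
    return [f"example_{i}" for i in range(n_examples)]

def sft(model, data):
    """Stub: supervised fine-tuning, returns a new model tag."""
    return f"{model}+sft{len(data)}"

def grpo(model, data):
    """Stub: GRPO refinement on a smaller, harder sample set."""
    return f"{model}+grpo{len(data)}"

model = "qwen2.5-vl-7b"
sft_sizes, rl_sizes = [25000, 5000], [5000, 5000]  # sizes from the article
for sft_n, rl_n in zip(sft_sizes, rl_sizes):
    data = generate_reasoning_data(model, sft_n)  # distill reasoning chains
    model = sft(model, data)                      # supervised fine-tune
    hard = generate_reasoning_data(model, rl_n)   # curate harder samples
    model = grpo(model, hard)                     # reward-based refinement

print(model)  # qwen2.5-vl-7b+sft25000+grpo5000+sft5000+grpo5000
```

The point of the structure is that each iteration's refined model generates the next iteration's training data, which is what lets quality compound without new human labels.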

The method involved careful data curation and multiple training phases. In the first iteration, 25,000 examples were used for SFT, sourced from datasets like FigureQA, Geometry3K, TabMWP, and VizWiz. These examples were filtered to remove overly verbose or redundant reflections, improving training quality. GRPO was then applied to a smaller, more difficult dataset of 5,000 samples. This led to a performance increase from 62.5% to 65.6% accuracy on the MathVista benchmark. In the second iteration, another 5,000 high-quality examples were used for SFT, raising accuracy to 66.1%. A second round of GRPO pushed performance to 69.4%. Across these phases, the model was evaluated on multiple benchmarks, MathVista, MathVerse, and MathVision, showing consistent performance gains with each iteration.



Quantitatively, OpenVLThinker-7B outperformed its base model, Qwen2.5-VL-7B, significantly. On MathVista, it reached 70.2% accuracy compared to the base model’s 50.2%. On MathVerse, the improvement was from 46.8% to 68.5%. MathVision full test accuracy rose from 24.0% to 29.6%, and MathVision testmini improved from 25.3% to 30.4%. These improvements indicate that the model learned to follow reasoning patterns and generalized better to unseen multimodal tasks. Each iteration of training contributed measurable gains, showcasing the strength of combining fine-tuning with reward-based learning in a looped structure.



The core of this model’s strength lies in its iterative structure. Rather than relying solely on vast datasets, it focuses on quality and structure. Each cycle of SFT and RL improves the model’s capacity to understand the relationship between images, questions, and answers. Self-verification and correction behaviors, initially lacking in standard LVLMs, emerged as a byproduct of reinforcement learning with verifiable reward signals. This allowed OpenVLThinker-7B to produce reasoning traces that were logically consistent and interpretable. Even subtle improvements, such as reduced redundant self-reflections or increased accuracy with shorter reasoning chains, contributed to its overall performance gains.



Some Key Takeaways from the Research:

  • UCLA researchers developed OpenVLThinker-7B using a combined SFT and RL approach, starting from the Qwen2.5-VL-7B base model.
  • Used iterative training cycles involving caption generation, reasoning distillation, and alternating SFT and GRPO reinforcement learning.
  • The initial SFT used 25,000 filtered examples, while the RL phases used smaller sets of 5,000 harder samples from datasets like Geometry3K and SuperCLEVR.
  • On MathVista, accuracy improved from 50.2% (base model) to 70.2%. MathVerse accuracy jumped from 46.8% to 68.5%, and other datasets also saw notable gains.
  • GRPO effectively refined reasoning behaviors by rewarding correct answers, reducing verbosity, and improving logical consistency.
  • Each training iteration led to incremental performance gains, confirming the effectiveness of the self-improvement strategy.
  • Establishes a viable route to bring R1-style multi-step reasoning into multimodal models, useful for educational, visual analytics, and assistive tech applications.



Check out the Paper, Model on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
61,784
Reputation
9,318
Daps
169,679

A Code Implementation of Using Atla’s Evaluation Platform and Selene Model via Python SDK to Score Legal Domain LLM Outputs for GDPR Compliance​


By Asif Razzaq

March 31, 2025


In this tutorial, we demonstrate how to evaluate the quality of LLM-generated responses using Atla’s Python SDK, a powerful tool for automating evaluation workflows with natural language criteria. Powered by Selene, Atla’s state-of-the-art evaluator model, we analyze whether legal responses align with the principles of the GDPR (General Data Protection Regulation). Atla’s platform enables programmatic assessments using custom or predefined criteria, with synchronous and asynchronous support via the official Atla SDK.

In this implementation, we did the following:


  • Used custom GDPR evaluation logic
  • Queried Selene to return binary scores (0 or 1) and human-readable critiques
  • Processed the evaluations in batch using asyncio
  • Printed critiques to understand the reasoning behind each judgment

The Colab-compatible setup requires minimal dependencies, primarily the atla SDK, pandas, and nest_asyncio.

Code:
!pip install atla pandas matplotlib nest_asyncio --quiet

import os
import nest_asyncio
import asyncio
import pandas as pd
from atla import Atla, AsyncAtla

ATLA_API_KEY = "your atla API key"
client = Atla(api_key=ATLA_API_KEY)
async_client = AsyncAtla(api_key=ATLA_API_KEY)
nest_asyncio.apply()

First, we install required libraries and initialize synchronous and asynchronous Atla clients using your API key. nest_asyncio is applied to allow asynchronous code to run smoothly within a Jupyter or Colab notebook environment. This enables seamless integration with Atla’s async evaluation API via the AsyncAtla client.

Code:
data = [
    {
        "question": "Can a company monitor employee emails under GDPR?",
        "llm_response": "Yes, any employer can freely monitor emails as long as it's for productivity.",
        "expected": 0
    },
    {
        "question": "Can employers access private chats on company devices?",
        "llm_response": "Only if there is a legitimate business need and employees are informed.",
        "expected": 1
    },
    {
        "question": "Can browsing history be stored under EU privacy law?",
        "llm_response": "Yes, but consent and transparency are required.",
        "expected": 1
    },
    {
        "question": "Can employers check WhatsApp messages on personal phones?",
        "llm_response": "No, personal device content is protected unless explicitly authorized.",
        "expected": 1
    },
]

df = pd.DataFrame(data)
df.head()

We define a small dataset of legal questions and LLM-generated responses related to GDPR compliance. Each entry includes an expected binary label (1 for compliant, 0 for non-compliant). The data is loaded into a Pandas DataFrame for easy processing and evaluation.

Code:
custom_eval_criteria = """
Score this 1 if the response complies with GDPR principles:
- lawful basis
- employee consent or notice
- data minimization
- legitimate interest
Otherwise, score it 0.

Explain briefly why it qualifies or not.
"""

We define a custom evaluation prompt that guides Atla’s Selene model in scoring responses based on key GDPR principles. It instructs the model to assign a score of 1 for compliant answers and 0 otherwise, along with a brief explanation justifying the score.

Code:
async def evaluate_with_selene(df):
    async def evaluate_row(row):
        try:
            result = await async_client.evaluation.create(
                model_id="atla-selene",
                model_input=row["question"],
                model_output=row["llm_response"],
                evaluation_criteria=custom_eval_criteria,
            )
            return result.result.evaluation.score, result.result.evaluation.critique
        except Exception as e:
            return None, f"Error: {e}"

    tasks = [evaluate_row(row) for _, row in df.iterrows()]
    results = await asyncio.gather(*tasks)

    df["selene_score"], df["critique"] = zip(*results)
    return df

df = asyncio.run(evaluate_with_selene(df))
df.head()

Here, this asynchronous function evaluates each row in the DataFrame using Atla’s Selene model. It submits the data along with the custom GDPR evaluation criteria for each legal question and LLM response pair. It then gathers scores and critiques concurrently using asyncio.gather, appends them to the DataFrame, and returns the enriched results.

Code:
for i, row in df.iterrows():
    print(f"\n Q: {row['question']}")
    print(f" A: {row['llm_response']}")
    print(f" Selene: {row['critique']} — Score: {row['selene_score']}")

We iterate through the evaluated DataFrame and print each question, the corresponding LLM-generated answer, and Selene’s critique with its assigned score. It provides a clear, human-readable summary of how the evaluator judged each response based on the custom GDPR criteria.

In conclusion, this notebook demonstrated how to leverage Atla’s evaluation capabilities to assess the quality of LLM-generated legal responses with precision and flexibility. Using the Atla Python SDK and its Selene evaluator, we defined custom GDPR-specific evaluation criteria and automated the scoring of AI outputs with interpretable critiques. The process was asynchronous, lightweight, and designed to run seamlessly in Google Colab.

Here is the Colab Notebook.

 
Last edited:

bnew

Veteran
Joined
Nov 1, 2015
Messages
61,784
Reputation
9,318
Daps
169,679

This AI Paper Propose the UI-R1 Framework that Extends Rule-based Reinforcement Learning to GUI Action Prediction Tasks​


By Sajjad Ansari

March 29, 2025


Supervised fine-tuning (SFT) is the standard training paradigm for large language models (LLMs) and graphical user interface (GUI) agents. However, SFT demands high-quality labeled datasets, resulting in extended training periods and high computational expenses. This dependence on extensive data creates bottlenecks in AI development workflows. Moreover, existing VLM-based GUI agents trained through SFT show performance deficiencies when confronted with out-of-domain scenarios, severely limiting their practical utility in diverse real-world applications. Rule-based reinforcement learning (RL), or reinforcement fine-tuning (RFT), is a promising alternative, requiring only dozens to thousands of samples instead of massive datasets.

Various approaches have been developed to advance GUI agents and optimize their training. The AppAgent and Mobile-Agent series integrate commercial models like GPT for planning and prediction tasks but heavily depend on prompt engineering and multi-agent collaboration, requiring careful manual design for optimal performance. So, researchers have fine-tuned smaller open-source MLLMs on task-specific GUI datasets to create specialist agents. Rule-based RL has become an efficient alternative to traditional training paradigms and utilizes predefined rule-based reward functions that focus on final results while allowing models to learn reasoning processes organically. The technique proves effective even on smaller models and is extended to multimodal models through task-specific rewards for visual tasks.


Researchers from vivo AI Lab and MMLab @ CUHK have proposed UI-R1 to enhance multimodal LLMs’ reasoning capabilities for GUI action prediction tasks through DeepSeek-R1-style RL. This is the first exploration of how rule-based RL can improve MLLM reasoning for graphical UI action prediction. A small yet high-quality dataset was curated, with 136 challenging tasks across five common mobile device action types. Model optimization introduces a unified rule-based action reward, trained with a policy-based algorithm, Group Relative Policy Optimization (GRPO). This approach has shown great effectiveness on in-domain and out-of-domain tasks, with significant improvements in action type accuracy and grounding accuracy compared to the base Qwen2.5-VL-3B model.
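A unified rule-based action reward of the kind described can be sketched as follows. The field names, the equal weighting of the terms, and the click-in-bounding-box check are illustrative assumptions, not the paper's exact definition:

```python
# Hedged sketch of a rule-based GUI action reward: score the predicted
# action type, its arguments (does a click land inside the target
# element's box?), and whether the output is well-formed at all.

def action_reward(pred: dict, target: dict) -> float:
    # 1) correct action type (click, scroll, etc.)
    r_type = 1.0 if pred.get("action") == target["action"] else 0.0

    # 2) correct arguments: for clicks, the coordinate must fall inside
    #    the ground-truth element's bounding box
    r_arg = 0.0
    if r_type and pred.get("action") == "click":
        x, y = pred.get("coord", (-1, -1))
        x1, y1, x2, y2 = target["bbox"]
        r_arg = 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0

    # 3) format reward: the prediction parsed into the expected fields
    r_format = 1.0 if "action" in pred else 0.0

    return r_type + r_arg + r_format

pred = {"action": "click", "coord": (120, 45)}
target = {"action": "click", "bbox": (100, 30, 200, 60)}
print(action_reward(pred, target))  # 3.0
```

Because the reward depends only on final outcomes, the model is free to learn its own intermediate reasoning, which is the appeal of rule-based RL over dense supervision.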



The system’s grounding capabilities are evaluated using two specialized benchmarks: ScreenSpot, which evaluates GUI grounding across mobile, desktop, and web platforms, and ScreenSpot-Pro, which focuses on high-resolution professional environments with expert-annotated tasks spanning 23 applications, five industries, and three operating systems. Moreover, the model undergoes testing for single-step action prediction based on low-level instructions using a selected subset of ANDROIDCONTROL, which introduces a broader range of action types beyond the ScreenSpot benchmark. The research methodology also explores the critical relationship between training data size and model performance, comparing random sampling versus difficulty-based selection in training data selection.

UI-R1 improves the GUI grounding capability of the 3B model by 20% on ScreenSpot and 6% on ScreenSpot-Pro, outperforming most 7B models on both benchmarks. UI-R1 achieves performance comparable to state-of-the-art 7B models such as AGUVIS and OS-Atlas, despite those models being trained with SFT on larger labeled datasets. When compared directly with the Qwen2.5-VL (ZS) model, UI-R1 shows a 15% improvement in action type prediction accuracy and a 20% improvement in click element grounding accuracy using only 136 training data points. The research also reveals that while model performance improves with increased training data, this relationship gradually saturates, and the difficulty-based selection method consistently outperforms random selection.

In conclusion, researchers introduced the UI-R1 framework, which successfully extends rule-based RL to GUI action prediction tasks, providing a scalable and efficient alternative to traditional SFT. It uses a novel reward function that simultaneously evaluates both action type and arguments, effectively reducing task complexity while enhancing learning efficiency. Despite utilizing only 130+ training samples from the mobile domain, UI-R1 achieves remarkable performance, showing strong generalization capabilities when applied to out-of-domain datasets across desktop and web platforms. UI-R1’s exceptional adaptability, data efficiency, and effectiveness in handling specialized tasks establish a promising future direction in developing multimodal GUI agents.

Check out the Paper. All credit for this research goes to the researchers of this project.

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
61,784
Reputation
9,318
Daps
169,679

VideoMind: A Role-Based Agent for Temporal-Grounded Video Understanding​


By Sajjad Ansari

March 30, 2025


LLMs have shown impressive capabilities in reasoning tasks like Chain-of-Thought (CoT), enhancing accuracy and interpretability in complex problem-solving. While researchers are extending these capabilities to multi-modal domains, videos present unique challenges due to their temporal dimension. Unlike static images, videos require understanding dynamic interactions over time. Current visual CoT methods excel with static inputs but struggle with video content because they cannot explicitly localize or revisit specific moments in sequences. Humans overcome these challenges by breaking down complex problems, identifying and revisiting key moments, and synthesizing observations into coherent answers. This approach highlights the need for AI systems to manage multiple reasoning abilities.

Recent advances in video understanding have improved tasks like captioning and question answering, but models often lack visually grounded correspondence and interpretability, especially for long-form videos. Video temporal grounding addresses this by requiring precise localization. Large multimodal models trained with supervised instruction tuning struggle with complex reasoning tasks. Two major approaches have emerged to address these limitations: agent-based interfaces and pure text-based reasoning paradigms exemplified by CoT processes. Moreover, inference-time search techniques are valuable in domains like robotics, games, and navigation, allowing models to iteratively refine outputs without changing the underlying weights.


Researchers from the Hong Kong Polytechnic University and Show Lab, National University of Singapore, have proposed VideoMind, a video-language agent designed for temporal-grounded video understanding. VideoMind introduces two key innovations to address the challenges of video reasoning. First, it identifies essential capabilities for video temporal reasoning and implements a role-based agentic workflow with specialized components: a planner, a grounder, a verifier, and an answerer. Second, it proposes a Chain-of-LoRA strategy that enables seamless role-switching through lightweight LoRA adaptors, avoiding the overhead of multiple models while balancing efficiency and flexibility. Experiments across 14 public benchmarks show state-of-the-art performance in diverse video understanding tasks.



VideoMind builds upon Qwen2-VL, combining an LLM backbone with a ViT-based visual encoder capable of handling dynamic-resolution inputs. Its core innovation is its Chain-of-LoRA strategy, which dynamically activates role-specific LoRA adapters during inference via self-calling. It contains four specialized components: (a) the Planner, which coordinates the other roles and determines which function to call next based on the query; (b) the Grounder, which localizes relevant moments by identifying start and end timestamps based on text queries; (c) the Verifier, which provides binary (“Yes”/“No”) responses to validate temporal intervals; and (d) the Answerer, which generates responses based on either the cropped video segment identified by the Grounder or the entire video when direct answering is more appropriate.
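The role-based workflow can be sketched as a dispatch loop. All functions below are stubs with hypothetical names and canned outputs; in the real system each role corresponds to a LoRA adapter activated on the shared backbone rather than a separate model:

```python
def planner(query):             # decides which roles to invoke, in order
    return ["grounder", "verifier", "answerer"]

def grounder(video, query):     # localizes a candidate (start, end) span
    return (12.0, 27.5)

def verifier(video, span):      # binary check on the proposed interval
    return "Yes"

def answerer(video, query, span=None):
    clip = f"video[{span[0]}:{span[1]}]" if span else "video"
    return f"answer based on {clip}"

def run_videomind(video, query):
    span = None
    for role in planner(query):   # Chain-of-LoRA: one adapter active at a time
        if role == "grounder":
            span = grounder(video, query)
        elif role == "verifier" and verifier(video, span) != "Yes":
            span = None           # rejected span: fall back to the full video
        elif role == "answerer":
            return answerer(video, query, span)

print(run_videomind("demo.mp4", "When does the goal happen?"))
```

The design choice worth noting is that grounding and verification happen before answering, so the final response is conditioned on an evidence span rather than the whole video.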

In grounding metrics, VideoMind’s lightweight 2B model outperforms most compared models, including InternVL2-78B and Claude-3.5-Sonnet, with only GPT-4o showing superior results. However, the 7B version of VideoMind surpasses even GPT-4o, achieving competitive overall performance. On the NExT-GQA benchmark, the 2B model matches state-of-the-art 7B models across both agent-based and end-to-end approaches, comparing favorably with text-rich, agent-based solutions like LLoVi, LangRepo, and SeViLA. VideoMind shows exceptional zero-shot capabilities, outperforming all LLM-based temporal grounding methods and achieving competitive results compared to fine-tuned temporal grounding experts. Moreover, VideoMind excels in general video QA tasks across Video-MME (Long), MLVU, and LVBench, showing effective localization of cue segments before answering questions.

In this paper, the researchers introduced VideoMind, a significant advancement in temporally grounded video reasoning. It addresses the complex challenges of video understanding through an agentic workflow combining a Planner, Grounder, Verifier, and Answerer with an efficient Chain-of-LoRA strategy for role-switching. Experiments across three key domains, grounded video question-answering, video temporal grounding, and general video question-answering, confirm VideoMind's effectiveness for long-form video reasoning tasks, where it provides precise, evidence-based answers. This work establishes a foundation for future developments in multimodal video agents and reasoning capabilities, opening new pathways for more complex video understanding systems.

Check out the Paper and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 85k+ ML SubReddit.

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
61,784
Reputation
9,318
Daps
169,679

NVIDIA AI Researchers Introduce FFN Fusion: A Novel Optimization Technique that Demonstrates How Sequential Computation in Large Language Models (LLMs) can be Effectively Parallelized​


By Asif Razzaq

March 29, 2025


Large language models (LLMs) have become vital across domains, enabling high-performance applications such as natural language generation, scientific research, and conversational agents. Underneath these advancements lies the transformer architecture, where alternating layers of attention mechanisms and feed-forward networks (FFNs) sequentially process tokenized input. However, with an increase in size and complexity, the computational burden required for inference grows substantially, creating an efficiency bottleneck. Efficient inference is now a critical concern, with many research groups focusing on strategies that can reduce latency, increase throughput, and cut computational costs while maintaining or improving model performance.

At the center of this efficiency problem lies the inherently sequential structure of transformers. Each layer’s output feeds into the next, demanding strict order and synchronization, which is especially problematic at scale. As model sizes expand, the cost of sequential computation and communication across GPUs grows, leading to reduced efficiency and increased deployment cost. This challenge is amplified in scenarios requiring fast, multi-token generation, such as real-time AI assistants. Reducing this sequential load while maintaining model capabilities presents a key technical hurdle. Unlocking new parallelization strategies that preserve accuracy yet significantly reduce computation depth is essential to broadening the accessibility and scalability of LLMs.


Several techniques have emerged to improve efficiency. Quantization reduces the precision of numerical representations to minimize memory and computation needs, though it often risks accuracy losses, especially at low bit-widths. Pruning eliminates redundant parameters and simplifies models but potentially harms accuracy without care. Mixture-of-Experts (MoE) models activate only a subset of parameters per input, making them highly efficient for specific workloads. Still, they can underperform at intermediate batch sizes due to low hardware utilization. While valuable, these strategies have trade-offs that limit their universal applicability. Consequently, the field seeks methods that offer broad efficiency improvements with fewer compromises, especially for dense architectures that are simpler to train, deploy, and maintain.
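The precision/accuracy trade-off of quantization mentioned above can be made concrete with a minimal round-trip. This is a generic symmetric per-tensor int8 sketch, not the scheme of any particular paper: the memory footprint drops 4x (fp32 to int8), while the reconstruction error is bounded by half the quantization step.

```python
import numpy as np

# Generic symmetric per-tensor int8 quantization round-trip, illustrating
# the memory-vs-accuracy trade-off. Sizes and scheme are illustrative.

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)  # fp32 weight tensor

scale = np.abs(w).max() / 127.0            # map [-max, max] -> [-127, 127]
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dq = w_q.astype(np.float32) * scale      # dequantize for comparison

mem_ratio = w.nbytes / w_q.nbytes          # 4 bytes -> 1 byte per weight
max_err = np.abs(w - w_dq).max()           # bounded by ~scale / 2

print(mem_ratio)                           # 4.0
print(max_err <= scale / 2 + 1e-6)         # True: rounding error is bounded
```

At lower bit-widths the step `scale` grows, which is exactly why the text notes that aggressive quantization risks accuracy losses.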

Researchers at NVIDIA introduced a new architectural optimization technique named FFN Fusion, which addresses the sequential bottleneck in transformers by identifying FFN sequences that can be executed in parallel. This approach emerged from the observation that when attention layers are removed using a Puzzle tool, models often retain long sequences of consecutive FFNs. These sequences show minimal interdependency and, therefore, can be processed simultaneously. By analyzing the structure of LLMs such as Llama-3.1-405B-Instruct, researchers created a new model called Ultra-253B-Base by pruning and restructuring the base model through FFN Fusion. This method results in a significantly more efficient model that maintains competitive performance.



FFN Fusion fuses multiple consecutive FFN layers into a single, wider FFN. This process is grounded in mathematical equivalence: by concatenating the weights of several FFNs, one can produce a single module that behaves like the sum of the original layers but can be computed in parallel. For instance, if three FFNs are stacked sequentially, each dependent on the output of the previous one, their fusion removes these dependencies by ensuring all three operate on the same input and their outputs are aggregated. The theoretical foundation for this method shows that the fused FFN maintains the same representational capacity. Researchers performed dependency analysis using cosine distance between FFN outputs to identify regions with low interdependence. These regions were deemed optimal for fusion, as minimal change in token direction between layers indicated the feasibility of parallel processing.
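The mathematical equivalence described above is easy to verify numerically: for FFNs that all read the same input, stacking their up-projections and concatenating their down-projections yields one wider FFN whose output equals the sum of the individual outputs. The shapes and ReLU activation below are illustrative stand-ins for the paper's actual layers.

```python
import numpy as np

# Numerical sketch of the fusion identity behind FFN Fusion: summing n
# FFNs applied to the SAME input equals one fused FFN with concatenated
# weights. Dimensions, ReLU, and the bias-free form are illustrative.

rng = np.random.default_rng(0)
d, h, n = 16, 64, 3                       # model dim, FFN hidden dim, #FFNs
W1 = [rng.normal(size=(h, d)) for _ in range(n)]   # up-projections
W2 = [rng.normal(size=(d, h)) for _ in range(n)]   # down-projections

def ffn(x, w1, w2):
    return w2 @ np.maximum(w1 @ x, 0.0)   # ReLU FFN without biases

x = rng.normal(size=d)

# Parallel view: every FFN applied to the same input, outputs aggregated.
summed = sum(ffn(x, w1, w2) for w1, w2 in zip(W1, W2))

# Fused view: one wider FFN built from concatenated weights.
W1_fused = np.concatenate(W1, axis=0)     # (n*h, d): stack hidden rows
W2_fused = np.concatenate(W2, axis=1)     # (d, n*h): widen output proj
fused = ffn(x, W1_fused, W2_fused)

print(np.allclose(summed, fused))         # True: the fusion is exact
```

The identity holds because an elementwise activation commutes with row-wise stacking; the approximation in the paper comes only from feeding later FFNs the same input instead of their sequential predecessors' outputs.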

Applying FFN Fusion to the Llama-405B model resulted in Ultra-253B-Base, which delivered notable gains in speed and resource efficiency. Specifically, the new model achieved a 1.71x speedup in inference latency and reduced per-token computational cost by 35x at a batch size of 32. This efficiency did not come at the expense of capability. Ultra-253B-Base scored 85.17% on MMLU, 72.25% on MMLU-Pro, 84.92% on Arena Hard, 86.58% on HumanEval, and 9.19 on MT-Bench. These results often matched or exceeded the original 405B-parameter model, even though Ultra-253B-Base contained only 253 billion parameters. Memory usage also improved with a 2x reduction in kv-cache requirements. The training process involved distilling 54 billion tokens at an 8k context window, followed by staged fine-tuning at 16k, 32k, and 128k contexts. These steps ensured the fused model maintained high accuracy while benefiting from its reduced size.



This research demonstrates how thoughtful architectural redesign can unlock significant efficiency gains. Researchers showed that FFN layers in transformer architectures are often more independent than previously assumed. Their method of quantifying inter-layer dependency and transforming model structures allowed for broader application across models of various sizes. The technique was also validated on a 70B-parameter model, proving generalizability. Further experiments indicated that while FFN layers can often be fused with minimal impact, full block parallelization, including attention, introduces more performance degradation due to stronger interdependencies.
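The dependency-quantification step described above can be sketched with the cosine-distance heuristic: a block whose output direction barely changes relative to its input is a candidate for fusion. The hidden states and the threshold below are illustrative, not the paper's measured values.

```python
import numpy as np

# Sketch of the cosine-distance dependency heuristic: layers with a small
# directional change between input and output hidden states are flagged
# as fusable. States, magnitudes, and threshold are illustrative.

def cosine_distance(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(1)
h = rng.normal(size=256)                 # hidden state entering a block

# "Low-dependency" block: tiny residual update, direction mostly kept.
h_small = h + 0.01 * rng.normal(size=256)
# "High-dependency" block: large update, direction rotated substantially.
h_large = h + 2.0 * rng.normal(size=256)

d_small = cosine_distance(h, h_small)
d_large = cosine_distance(h, h_large)

fusable = d_small < 0.1                  # illustrative threshold
print(d_small < d_large, fusable)        # True True
```

In practice such distances would be averaged over many tokens and inputs before deciding which FFN sequences are safe to fuse.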



Several Key Takeaways from the Research on FFN Fusion:

- The FFN Fusion technique reduces sequential computation in transformers by parallelizing low-dependency FFN layers.
- Fusion is achieved by replacing sequences of FFNs with a single wider FFN using concatenated weights.
- Ultra-253B-Base, derived from Llama-3.1-405B, achieves 1.71x faster inference and 35x lower per-token cost.
- Benchmark results include: 85.17% (MMLU), 72.25% (MMLU-Pro), 86.58% (HumanEval), 84.92% (Arena Hard), and 9.19 (MT-Bench).
- Memory usage is cut by half due to kv-cache optimization.
- FFN Fusion is more effective at larger model scales and works well with techniques like pruning and quantization.
- Full transformer block parallelization shows potential but requires further research due to stronger interdependencies.
- A systematic method using cosine distance helps identify which FFN sequences are safe to fuse.
- The technique is validated across different model sizes, including 49B, 70B, and 253B.
- This approach lays the foundation for more parallel-friendly and hardware-efficient LLM designs.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 85k+ ML SubReddit.

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
61,784
Reputation
9,318
Daps
169,679

Efficient Inference-Time Scaling for Flow Models: Enhancing Sampling Diversity and Compute Allocation​


By Sana Hassan

March 29, 2025


Recent advancements in AI scaling laws have shifted from merely increasing model size and training data to optimizing inference-time computation. This approach, exemplified by models like OpenAI o1 and DeepSeek R1, enhances model performance by leveraging additional computational resources during inference. Test-time budget forcing has emerged as an efficient technique in LLMs, enabling improved performance with minimal token sampling. Similarly, inference-time scaling has gained traction in diffusion models, particularly in reward-based sampling, where iterative refinement helps generate outputs that better align with user preferences. This method is crucial for text-to-image generation, where naïve sampling often fails to fully capture intricate specifications, such as object relationships and logical constraints.

Inference-time scaling methods for diffusion models can be broadly categorized into fine-tuning-based and particle-sampling approaches. Fine-tuning improves model alignment with specific tasks but requires retraining for each use case, limiting scalability. In contrast, particle sampling—used in techniques like SVDD and CoDe—selects high-reward samples iteratively during denoising, significantly improving output quality. While these methods have been effective for diffusion models, their application to flow models has been limited due to the deterministic nature of their generation process. Recent work, including SoP, has introduced stochasticity to flow models, enabling particle sampling-based inference-time scaling. This study expands on such efforts by modifying the reverse kernel, further enhancing sampling diversity and effectiveness in flow-based generative models.


Researchers from KAIST propose an inference-time scaling method for pretrained flow models, addressing their limitations in particle sampling due to a deterministic generative process. They introduce three key innovations: (1) SDE-based generation to enable stochastic sampling, (2) VP interpolant conversion to enhance sample diversity, and (3) Rollover Budget Forcing (RBF) for adaptive computational resource allocation. Experimental results show that these techniques improve reward alignment in tasks like compositional text-to-image generation. Their approach outperforms prior methods, demonstrating the advantages of inference-time scaling in flow models, particularly when combined with gradient-based techniques for differentiable rewards like aesthetic image generation.

Inference-time reward alignment aims to generate high-reward samples from a pretrained flow model without retraining. The objective is to maximize the expected reward while minimizing deviation from the original data distribution using KL regularization. Since direct sampling is challenging, particle sampling techniques, commonly used in diffusion models, are adapted. However, flow models rely on deterministic sampling, limiting exploration. To address this, inference-time stochastic sampling is introduced by converting deterministic processes into stochastic ones. Additionally, interpolant conversion enhances search space by aligning flow model sampling with diffusion models. A dynamic compute allocation strategy further optimizes efficiency during inference-time scaling.
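The particle-sampling idea above can be illustrated on a toy problem: at each stochastic denoising step, propose several candidate next states and keep the highest-reward one. The 1-D "denoiser," the reward, and the greedy top-1 selection are deliberate simplifications, not the KAIST method; the paper applies this kind of selection to SDE-converted flow models such as FLUX.

```python
import numpy as np

# Toy sketch of particle sampling for inference-time reward alignment.
# Everything here (the 1-D reverse step, reward, constants) is a toy;
# only the selection pattern mirrors the technique described above.

rng = np.random.default_rng(0)
target = 3.0                              # stands in for a high-reward output
reward = lambda x: -abs(x - target)       # toy reward: closeness to target

def step(x, t, noise):
    # Toy stochastic reverse step: drift toward 0 plus injected noise.
    # The stochasticity is what makes particle exploration possible;
    # a deterministic ODE step would yield identical candidates.
    return 0.9 * x + noise

def sample(n_steps=30, k=1):
    x = rng.normal() * 5.0
    for t in range(n_steps):
        candidates = [step(x, t, 0.3 * rng.normal()) for _ in range(k)]
        x = max(candidates, key=reward)   # greedy particle selection
    return x

plain = np.mean([reward(sample(k=1)) for _ in range(50)])
guided = np.mean([reward(sample(k=8)) for _ in range(50)])
print(guided > plain)                     # more particles -> better reward
```

With k=1 the sampler just follows the model's own dynamics toward 0; with more particles per step it steers toward the reward, which is the exploration benefit the stochastic conversion buys.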

The study presents experimental results on particle-sampling methods for inference-time reward alignment, focusing on compositional text-to-image and quantity-aware image generation with FLUX as the pretrained flow model. Metrics such as VQAScore and RSS assess alignment and accuracy. Results indicate that inference-time stochastic sampling improves efficiency, with interpolant conversion further enhancing performance. Flow-based particle sampling yields higher-reward outputs than diffusion models without compromising image quality. The proposed RBF method optimizes budget allocation, achieving the best results in reward alignment and accuracy. Qualitative and quantitative findings confirm its effectiveness in generating precise, high-quality images.



In conclusion, the study introduces an inference-time scaling method for flow models, incorporating three key innovations: (1) ODE-to-SDE conversion for enabling particle sampling, (2) linear-to-VP interpolant conversion to enhance diversity and search efficiency, and (3) RBF for adaptive compute allocation. While diffusion models benefit from stochastic sampling during denoising, flow models require tailored approaches due to their deterministic nature. The proposed VP-SDE-based generation effectively integrates particle sampling, and RBF optimizes compute usage. Experimental results demonstrate that this method surpasses existing inference-time scaling techniques, improving performance while maintaining high-quality outputs in flow-based image and video generation models.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 85k+ ML SubReddit.

 