bnew

Veteran
Joined
Nov 1, 2015
Messages
56,193
Reputation
8,249
Daps
157,872


1/11
@AnthropicAI
With styles, you can now customize how Claude responds.

Select from the new preset options: Concise, Explanatory, or Formal.



https://video.twimg.com/ext_tw_video/1861472528524337152/pu/vid/avc1/1920x1080/zvAWsXTu5K5R0A6N.mp4

2/11
@AnthropicAI
Want Claude to more closely match how you communicate?

Upload writing samples and Claude can automatically generate custom styles, just for you.



https://video.twimg.com/ext_tw_video/1861472551139934208/pu/vid/avc1/1920x1080/HHCgSB3ATOqAUssN.mp4

3/11
@testingcatalog
Finally someone released something to really improve writing πŸ’ͺ



4/11
@KinggZoom
When will Claude get internet access and better rate limits?



5/11
@thegenioo
Anthropic takes it again, OpenAI losing πŸ₯Έ



6/11
@omarsar0
Curious how this works together with Projects? I find myself often using Projects for styled responses.



7/11
@novocrypto
Can there be a β€œprefer not to choose style” option that defaults to Claude as it is? That would be dope 🫑



8/11
@JohnForrester
Rather than a menu selection, I think it would be better to build a default personalization of user styles: ask users for writing samples and remember them. BTW, what is going on with your latency via the API? Huge spike in Sonnet latency over the last 30 days.



9/11
@AGI_Odyssey
Fabulous!



10/11
@koltregaskes
Thank you again guys!



11/11
@alikayadibi11
Good job




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 






1/5
@rohanpaul_ai
Adding rule-based guidance doubles RAG's performance in document retrieval and answer generation.

Basically, RAG gets a proper manual on how to use its knowledge.

It's like giving RAG a GPS instead of letting it wander around blindly.

🎯 Original Problem:

Current Retrieval-Augmented Generation (RAG) frameworks face two major limitations: retrievers can't guarantee fetching the most relevant information, and LLMs lack specific guidance on using retrieved content effectively.

-----

πŸ”§ Solution in this Paper:

β†’ Introduces RuleRAG, which uses symbolic rules to guide both retrieval and generation processes.

β†’ Guides retrievers to fetch logically related documents following rule directions

β†’ Helps generators uniformly generate answers attributed to the same set of rules

β†’ Uses queries and rules combined as supervised fine-tuning data to improve rule-based instruction following

β†’ RuleRAG-ICL: Uses in-context learning with rule guidance during retrieval and inference

β†’ RuleRAG-FT: Fine-tunes both retrievers and generators using rule-guided fine-tuning

β†’ Created five rule-aware QA benchmarks (three temporal, two static) to evaluate performance
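
A minimal sketch of the rule-guided loop described above, assuming a generic retriever and LLM call; rule_guided_answer, retrieve, and generate are hypothetical stand-ins, not the paper's actual code:

# Hedged sketch of rule-guided RAG (RuleRAG-ICL style).
def rule_guided_answer(query, rules, retrieve, generate):
    # Rule-guided retrieval: pair each rule with the query so the
    # retriever fetches documents that are logically related to it.
    docs = []
    for rule in rules:
        docs.extend(retrieve(f"{rule} {query}", top_k=5))
    context = "\n".join(dict.fromkeys(docs))  # dedupe, keep order

    # Rule-guided generation: the same rules tell the generator how
    # to attribute its answer to the retrieved documents.
    prompt = ("Rules:\n" + "\n".join(rules) +
              "\n\nDocuments:\n" + context +
              "\n\nQuestion: " + query +
              "\nAnswer by applying the rules to the documents:")
    return generate(prompt)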

-----

πŸ’‘ Key Insights:

β†’ Rules can explicitly guide both document retrieval and answer generation

β†’ Combining rules with queries improves retrieval quality significantly

β†’ Rule-guided fine-tuning enhances both retrieval and generation performance

β†’ The method scales well with increasing numbers of retrieved documents

-----

πŸ“Š Results:

β†’ RuleRAG-ICL improved retrieval quality by +89.2% in Recall@10 scores

β†’ Generation accuracy increased by +103.1% in exact match scores

β†’ RuleRAG-FT achieved even better performance improvements across all benchmarks

β†’ Method showed strong generalization ability for untrained rules



GdVhJdWaoAghtFJ.png


2/5
@rohanpaul_ai
Paper Title: "RuleRAG: Rule-guided retrieval-augmented generation with language models for question answering"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861501768653512707/pu/vid/avc1/1080x1080/0WhCQR7B3zGenfUy.mp4

3/5
@rohanpaul_ai
Guided by the rule r related to the query, the proposed RuleRAG first retrieves supportive documents that are logically related to the query and then attributes the correct answer.



GdVin-GaoAAkUXI.png


4/5
@rohanpaul_ai




GdVjGAgaoAMRTnf.jpg


5/5
@rohanpaul_ai
πŸ“š [2410.22353] RuleRAG: Rule-guided retrieval-augmented generation with language models for question answering







1/3
@rohanpaul_ai
Efficient bandit-based approach reduces human annotation costs in LLM training

SEA (Sample-Efficient Alignment) uses Thompson sampling to align LLMs with minimal human feedback

🎯 Original Problem:

Aligning LLMs with human preferences requires massive amounts of human feedback data, making it expensive and time-consuming. Current methods need extensive human annotations for effective alignment.

-----

πŸ”§ Solution in this Paper:

β†’ They frame LLM alignment as a contextual dueling bandits problem, where the model learns from pairwise comparisons of responses

This formulation naturally requires two key properties for sample-efficient alignment:

β†’ Online interaction - allowing the agent to act with the latest learned policy and immediately improve from experience

β†’ Active exploration - strategically selecting actions that lead to maximal policy improvement

β†’ Introduces SEA (Sample-Efficient Alignment), implementing Thompson sampling with epistemic reward modeling

β†’ SEA maintains uncertainty-aware reward models and uses policy-guided search for efficient exploration

β†’ The system works in both online user feedback and crowdsourcing scenarios
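
A rough illustration of the core loop above: Thompson sampling with an ensemble-based (epistemic) reward model. policy.sample and rm.score are hypothetical interfaces, not the paper's code:

import random

def thompson_pick_duel(prompt, policy, reward_ensemble, n_candidates=8):
    # Online interaction: sample candidates from the latest policy.
    candidates = [policy.sample(prompt) for _ in range(n_candidates)]

    # Thompson sampling: draw ONE plausible reward model from the
    # posterior (here, a random ensemble member) and act greedily.
    rm = random.choice(reward_ensemble)
    scores = [rm.score(prompt, c) for c in candidates]

    # Active exploration: spend the human comparison on the duel the
    # sampled model finds most promising (best vs. runner-up).
    ranked = sorted(zip(scores, candidates), key=lambda sc: sc[0], reverse=True)
    return ranked[0][1], ranked[1][1]  # pair to show the annotator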

-----

πŸ’‘ Key Insights:

β†’ Online interaction allows immediate policy improvement from latest experiences

β†’ Active exploration strategically selects actions for maximal learning

β†’ Thompson sampling naturally balances exploration vs exploitation

β†’ Mixed preference learning combines different alignment approaches

-----

πŸ“Š Results:

β†’ Tested across three model scales: 1B, 2.8B, and 6.9B parameters

β†’ Evaluated with three preference learning algorithms: DPO, IPO, and SLiC

β†’ Achieved higher win rates compared to baseline approaches

β†’ Significantly improved sample efficiency over recent active exploration methods



GdVdR0WaoAAvrKF.png


2/3
@rohanpaul_ai
Paper Title: "Sample-Efficient Alignment for LLMs"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861496636813516800/pu/vid/avc1/1080x1080/1eUeBGlemWFkbEbV.mp4

3/3
@rohanpaul_ai
[2411.01493] Sample-Efficient Alignment for LLMs





1/1
@rohanpaul_ai
A group appears to have leaked access to Sora, OpenAI’s video generator.

The group created a HuggingFace space that seemingly connects with Sora API, which isn’t yet publicly available.

They built a frontend using their early access tokens, enabling users to create Sora-generated videos.

That HuggingFace space can generate 10-second video clips at 1080p resolution, depending on compute availability.

[Quoted tweet]
OpenAI Sora has leaked


GdU5IfvbgAA9n27.png

GdUhsttW8AEHr-r.png




1/1
@ccamp___
I am seeing two approaches to social agents: basing them off of real people, and letting them create their own selves.

Re: agents as real people, approaches have been both to train a custom model or give information in prompt context, some examples:

- β€œReid GPT” went viral with Reid Hoffman training a custom GPT off of all his public works
- The @joon_s_pk Generative Agents Simulation paper predicts people's decisions by interviewing them and injecting interviews into prompt context to get the LLM to simulate the interviewee’s choices (would love to see someone train a model on actual vs. predicted results)

For agents as real people, personally I’d love to be able to make a digital twin of myself that takes all of my texts, emails, Zoom calls, everything possible and trains a model on my entire digital footprint. Something kind of like @withdelphi but automatic.

Re: agents as their own people, the approaches all tackle the question of, what makes an agent feel like a real human? Examples:
- Give agents their own experience of being-in-the-world like real humans, with access to news, embodiment, other people/agents, etc. to create their own memories and attitudes
- Improve reasoning, mostly done by foundation models but also things like @NousResearch Forge Reasoning
- Improve their memory systems, like @MemGPT, @mem0ai
- Give them their own goals to pursue, like our real friends and colleagues

The point of all these being, when we talk to a friend, they’re not just trapped away in a box for us to interact with at our leisure, they have their own relationships they learn from, goals, context of the world that makes them interesting.



GdVjTQrWUAEuEwP.png




1/1
@rohanpaul_ai
πŸš€ Thomson Reuters integrates the o1-mini model into its CoCounsel, achieving 1,400% user growth in legal AI

Combines multiple AI models to catch critical legal details other systems miss and teaches AI to think like different types of lawyers

Thomson Reuters has integrated OpenAI's o1-mini model into its CoCounsel legal assistant, marking the first enterprise customization of the model. The implementation showcases strategic use of multiple AI models for specific legal tasks and has resulted in 1,400% user growth.

β†’ Thomson Reuters deploys specialized AI models from OpenAI, Google, and Anthropic, each optimized for specific legal workflows. OpenAI handles generative tasks, Google's Gemini processes long-context documents, while Anthropic's Claude manages sensitive tax and compliance cases. The o1-mini model shows superior performance in detecting nuanced legal privileges compared to GPT-4.

β†’ Company expands beyond AI usage through Safe Sign Technologies acquisition for proprietary LLM development. Infrastructure support comes from AWS Sagemaker HyperPod.

This implementation signals a shift from general-purpose AI to precision-engineered specialized models working together. For legal industry, where precision is crucial, this approach could set new standards in AI deployment and error reduction.



GdUXx8haoAMBhiz.jpg





1/3
@rohanpaul_ai
Speculative decoding just landed in llama.cpp's server πŸ’‘
β†’ Speedups in token generation ranging from 1.25x to 5.73x depending on hardware and model configurations

β†’ The new speculative decoding feature uses a smaller draft model to propose tokens speculatively

β†’ Draft model predictions are verified by the main, larger model before being accepted

β†’ The result is a large speedup in token generation while maintaining the quality of the larger model's outputs

⚠️ Key Considerations

β†’ Performance gains vary based on hardware architecture and memory bandwidth

β†’ Apple Silicon shows less dramatic improvements due to shared memory architecture
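
For intuition, here is a sketch of one speculative decoding step (greedy variant). draft and main are hypothetical model handles, not llama.cpp's actual API:

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

def speculative_step(prompt_ids, draft, main, n_draft=8):
    # 1) The cheap draft model proposes a short run of tokens.
    proposed = draft.generate(prompt_ids, max_new_tokens=n_draft)

    # 2) The main model scores the whole run in ONE forward pass,
    # which is where the speedup comes from.
    logits = main.forward(prompt_ids + proposed)

    # 3) Accept the longest prefix the main model agrees with; on the
    # first disagreement, take the main model's token instead.
    accepted = []
    for i, tok in enumerate(proposed):
        main_choice = argmax(logits[len(prompt_ids) + i - 1])
        if main_choice != tok:
            accepted.append(main_choice)
            break
        accepted.append(tok)
    return prompt_ids + accepted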



GdThtNDagAAA1R6.png


2/3
@rohanpaul_ai
server : add speculative decoding support by ggerganov Β· Pull Request #10455 Β· ggerganov/llama.cpp



3/3
@victor_explore
llamas are getting faster at predicting the future







1/3
@rohanpaul_ai
Finally a model that understands social networks form through both birds-of-a-feather and popularity

NMM (Non-Euclidean Mixture Model) combines spherical and hyperbolic spaces to better model how social networks actually form

🎯 Original Problem:

Social networks form links through two key factors - homophily (similar nodes connecting) and social influence (popular nodes attracting connections). Current models either focus on just one factor or use simple Euclidean spaces that can't capture the complex network structures.

-----

πŸ”§ Solution in this Paper:

β†’ Introduces NMM (Non-Euclidean Mixture Model) that represents nodes in both spherical space (for homophily) and hyperbolic space (for social influence)

β†’ Uses spherical space to model cyclic connections between similar nodes and hyperbolic space to model hierarchical influence-based connections

β†’ Combines these two representations through a novel space unification loss that aligns the two geometric spaces

β†’ Enhances NMM with graph neural networks (NMM-GNN) using a variational autoencoder framework with non-Euclidean encoders and decoders
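
For a feel of the two geometries above, here is a small sketch with the standard spherical and Poincare-ball distances; the formulas are textbook, and the mixture weighting at the end is illustrative, not the paper's exact objective:

import numpy as np

def spherical_distance(u, v):
    # Angle between unit vectors: suited to homophily's cyclic structure.
    u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)
    return np.arccos(np.clip(u @ v, -1.0, 1.0))

def hyperbolic_distance(u, v):
    # Poincare-ball distance: suited to hierarchy, since points near
    # the boundary behave like leaves under a few central hubs.
    num = 2 * np.linalg.norm(u - v) ** 2
    den = (1 - np.linalg.norm(u) ** 2) * (1 - np.linalg.norm(v) ** 2)
    return np.arccosh(1 + num / den)

# A mixture model can then score a link by combining both geometries,
# e.g. p(link) = sigmoid(-a * d_sph - b * d_hyp) with learned a, b.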

-----

πŸ’‘ Key Insights:

β†’ Social network links form through both similarity and influence - need both spherical and hyperbolic spaces to model this effectively

β†’ Homophily creates cycles best captured in spherical space while influence creates hierarchies best captured in hyperbolic space

β†’ Unifying different geometric spaces through projection and alignment is key for coherent node representations

-----

πŸ“Š Results:

β†’ Significantly outperforms state-of-the-art baselines on social network generation tasks

β†’ Successfully tested on large-scale networks like LiveJournal (4.8M nodes) and Friendster (65.6M nodes)

β†’ More parameter efficient compared to existing models like RaRE



GdQ1pPEaoAMsbjM.png


2/3
@rohanpaul_ai
Paper Title: "Non-Euclidean Mixture Model for Social Network Embedding"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861171591931142146/pu/vid/avc1/1080x1080/-WOGpwh9a824cRFM.mp4

3/3
@rohanpaul_ai
[2411.04876v1] Non-Euclidean Mixture Model for Social Network Embedding








1/4
@rohanpaul_ai
New smart weight-compression technique has arrived to reduce your GPU VRAM requirement.

Squeeze more parameters into your GPU by compressing the wasteful parts of floating-point numbers

NeuZip compresses neural networks by exploiting the low entropy nature of floating-point exponents

🎯 Original Problem:

Training and deploying large neural networks is severely constrained by GPU memory limitations. While model sizes have grown 100x since 2017, GPU memory has only increased 2.5x (from 32GB to 80GB), creating a critical bottleneck for scaling neural networks.

-----

πŸ”§ Solution in this Paper:

β†’ Introduces NeuZip - a novel compression scheme that exploits low entropy in neural network parameters' exponent bits

β†’ Compresses exponent bits losslessly using an asymmetric numeral system (ANS) while keeping sign and mantissa bits unchanged

β†’ Implements layer-by-layer compression/decompression during training to avoid creating large buffers

β†’ Compatible with activation checkpointing for additional memory savings

β†’ For inference, introduces lossy compression by truncating mantissa bits while controlling relative weight changes
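
To see why exponents compress so well, here is a quick experiment measuring the entropy of fp32 exponent bits on weights concentrated near zero. This only illustrates the statistic NeuZip exploits; it is not its actual ANS codec:

import numpy as np

def exponent_entropy_bits(weights):
    bits = np.ascontiguousarray(weights, dtype=np.float32).view(np.uint32)
    exponents = ((bits >> 23) & 0xFF).astype(np.int64)  # 8 exponent bits
    counts = np.bincount(exponents, minlength=256).astype(np.float64)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())  # Shannon entropy in bits

# Weights clustered around zero use only a narrow band of exponents:
w = np.random.normal(0, 0.02, size=1_000_000)
print(exponent_entropy_bits(w))  # roughly 3 of the 8 stored bits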

-----

πŸ’‘ Key Insights:

β†’ Neural network parameters tend to concentrate around zero, making exponent bits highly compressible

β†’ Exponent bits carry only ~3 bits of information despite using 8 bits of storage

β†’ Layer-wise compression enables training without ever fully decompressing the entire network

β†’ Inference tasks can tolerate more aggressive lossy compression compared to training

-----

πŸ“Š Results:

β†’ Reduces Llama-3 8B training memory from 31GB to 16GB with no performance loss

β†’ Enables training 13B parameter models on consumer GPUs (20GB memory)

β†’ For inference, achieves >50% memory reduction while maintaining near-lossless performance

β†’ Outperforms QLoRA and other quantization methods in memory-performance trade-off



GdQ4CwCboAACkXl.png


2/4
@rohanpaul_ai
Paper Title: "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861174229959614464/pu/vid/avc1/1080x1080/TgngZoqe92dCG6tb.mp4

3/4
@rohanpaul_ai




GdQ4p6OaoAEkbd8.png


4/4
@rohanpaul_ai
πŸ“š [2410.20650] NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks








1/5
@rohanpaul_ai
A single model with multiple experts handles error correction for different input types

NeKo, proposed in this paper, uses specialized experts to fix recognition errors across speech, text and vision tasks

Original Problem πŸ€”:

Building a general-purpose post-recognition error corrector that can handle multiple domains (speech, text, vision) while maintaining high performance across all tasks remains challenging. Current solutions require separate models for each domain, leading to parameter inefficiency.

-----

Solution in this Paper πŸ› οΈ:

β†’ NeKo introduces a task-oriented Mixture-of-Experts (MoE) architecture where experts specialize in specific tasks (speech-to-text, language-to-text, vision-to-text)

β†’ During training, input tokens are routed to both their task-specific expert and the top expert selected by a gating network

β†’ During inference, tokens are routed purely based on router probabilities without task knowledge, enabling zero-shot generalization

β†’ The model replaces standard feedforward blocks with MoE layers, allowing efficient parameter sharing across tasks
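
A condensed PyTorch sketch of the task-oriented routing idea above; dimensions and names are illustrative, not the paper's implementation:

import torch
import torch.nn as nn

class TaskMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x, task_id=None):  # x: (n_tokens, d_model)
        top = self.gate(x).softmax(-1).argmax(-1)  # router's top expert
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top == e
            if task_id is not None:           # training: also route each
                mask = mask | (task_id == e)  # token to its task's expert
            if mask.any():
                out[mask] += expert(x[mask])
        return out  # inference (task_id=None) uses router probs alone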

-----

Key Insights from this Paper πŸ’‘:

β†’ Task-specific expert assignment during training enables better specialization while maintaining cross-task knowledge sharing

β†’ MoE architecture provides better parameter efficiency compared to having separate models for each task

β†’ Zero-shot generalization is possible by relying on learned routing patterns during inference

-----

Results πŸ“Š:

β†’ 5.0% relative Word Error Rate reduction on Open ASR Leaderboard

β†’ 15.5% to 27.6% relative WER reduction compared to GPT-3.5 and Claude-Opus on zero-shot Hyporadise benchmark

β†’ State-of-the-art results in ASR correction while maintaining competitive performance on grammar and OCR correction tasks



GdQ5stpaoAAkoJw.png


2/5
@rohanpaul_ai
Paper Title: "NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861176047334760449/pu/vid/avc1/1080x1080/1V5wmFt-Uzj8Ueyf.mp4

3/5
@rohanpaul_ai
πŸ” NeKo's architecture

The model replaces standard feedforward blocks with MoE layers. During training, each expert is mapped to a specific task, with input tokens being routed to both their task-specific expert and the top expert selected by the gating network.

During inference, tokens are routed purely based on router probabilities without task knowledge.



GdQ6D3QaoAE0cA1.png


4/5
@rohanpaul_ai
[2411.05945] NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts



5/5
@MiaAI_Builder
I appreciate the idea of using specialized experts to correct errors across different input types, a promising approach to building a general-purpose post-recognition error correction model.








1/6
@rohanpaul_ai
New tests reveal the true effective context limits of leading LLMs.

β†’ GPT-4 performs best at small contexts, Gemini 1.5 Pro excels at longer contexts

β†’ Claude 3.5 Sonnet leads in mid-range contexts (2.5k to 32k tokens)

🎯 Original Problem:

Current benchmarks for evaluating LLMs' long context capabilities are inadequate - they either saturate at perfect scores, test on limited context lengths, or lack granular insights into specific model behaviors.

-----

πŸ”¬ Solution in this Paper:

β†’ Introduced a series of increasingly complex retrieval tasks using synthetic UUID key-value pairs to test 17 leading LLMs

β†’ Created novel "needle threading" tasks where models must follow chains of linked information through contexts up to 900k tokens

β†’ Developed tasks like Single Needle (basic retrieval), Multiple Needles (concurrent retrieval), and Threading (following information chains)

β†’ Introduced Multi-Threading to test if models can track multiple information threads simultaneously
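
A small sketch of how such a threading haystack can be built from synthetic UUID key-value pairs; this is a reconstruction of the task design, not the authors' generator:

import random
import uuid

def build_threading_task(n_pairs=1000, thread_len=5):
    keys = [str(uuid.uuid4()) for _ in range(n_pairs)]
    values = [str(uuid.uuid4()) for _ in range(n_pairs)]

    # Stitch a thread: each chosen key's value becomes the next key,
    # so the model must hop through the context to reach the answer.
    idx = random.sample(range(n_pairs), thread_len)
    for a, b in zip(idx, idx[1:]):
        values[a] = keys[b]

    haystack = "\n".join(f"{k}: {v}" for k, v in zip(keys, values))
    question = (f"Start at key {keys[idx[0]]}; repeatedly look up the "
                f"value and treat it as the next key. What is the final value?")
    return haystack, question, values[idx[-1]]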

-----

πŸ’‘ Key Insights:

β†’ Most models' effective context limit is shorter than their advertised context length

β†’ Models perform better with forward-moving threads compared to backward threads

β†’ Many models are remarkably "thread-safe" - can follow multiple threads without performance loss

β†’ Different tokenizers count tokens very differently - direct comparisons can be misleading

β†’ Performance generally decreases towards the middle of the context window

-----

πŸ“Š Results:

β†’ GPT-4 performs best at small contexts, Gemini 1.5 Pro excels at longer contexts

β†’ Claude 3.5 Sonnet leads in mid-range contexts (2.5k to 32k tokens)

β†’ Closed-source models consistently outperform open-source alternatives

β†’ Most models show significant accuracy drop beyond their effective context limit



GdRLiEraoAMAoM4.png


2/6
@rohanpaul_ai
Paper Title: "Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861195970811445248/pu/vid/avc1/1080x1080/tHjcPg5xbjO0bzhA.mp4

3/6
@rohanpaul_ai
πŸ”¬ Evaluation methods used

β†’ Used 17 leading LLMs including GPT-4, Gemini 1.5, Claude 3, and open-source models

β†’ Tested on contexts ranging from 1k to 630k tokens

β†’ Evaluated using exact matching with expected answers

β†’ Introduced a task-specific effective context limit metric to measure real performance capabilities



GdRMPvHa0AAqSFb.png


4/6
@rohanpaul_ai
[2411.05000] Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?



5/6
@tOSUFever
Everything, every pixel, is all in long context.
Multithreaded (better) = yes.
GPT-4 best at short context = (this is probably wrong) I think GPT-4 long context was your granddaddy world model, distilling data synthetically initially at great cost.

Also, we need some new words.

Long context serves two purposes. The first is prompting with a ton of data, like a book or repo (this is still one turn, one prompt); the human expects AI output.

The second is different: the behavior of a long conversation thread. This isn't one turn, this is thousands of unique turns. This is when you SEE behavior. Human INPUT goes to context, and the AI output is also INPUT to context.

And human INPUT (as far as the context window is concerned for 'the next output') is weighted differently.

For example, LLM providers may not pass through your exact prompt, to "prevent hallucinations" or "jailbreaking" - maybe characters like _* or (...), whatever. However, the model will happily insert them for you into context...

And if it *likes* you (literally), the next outputs only get better πŸ’₯



6/6
@medoraai
Anthropic's new enterprise version has a 500K context window. Interested to see how they do. Google is the king here however.







1/3
@rohanpaul_ai
This paper makes complex Multi-objective reinforcement learning (MORL) policies understandable by clustering them based on both behavior and objectives

When AI gives you too many options, this clustering trick saves the day

🎯 Original Problem:

Multi-objective reinforcement learning (MORL) generates multiple policies with different trade-offs, but these solution sets are too large and complex for humans to analyze effectively. Decision makers struggle to understand relationships between policy behaviors and their objective outcomes.

-----

πŸ› οΈ Solution in this Paper:

β†’ Introduces a novel clustering approach that considers both objective space (expected returns) and behavior space (policy actions)

β†’ Uses Highlights algorithm to capture 5 key states that represent each policy's behavior

β†’ Applies PAN (Pareto-Set Analysis) clustering to find well-defined clusters in both spaces simultaneously

β†’ Employs bi-objective evolutionary algorithm to optimize clustering quality across both spaces
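
The bi-objective criterion above can be pictured as scoring one label assignment in two spaces at once. A tiny sketch using silhouette scores, as an illustration of the trade-off rather than the paper's PAN implementation:

from sklearn.metrics import silhouette_score

def bi_space_quality(labels, objective_vecs, behavior_vecs):
    # One quality score per space; the evolutionary search then looks
    # for label assignments that are Pareto-optimal w.r.t. this pair.
    return (silhouette_score(objective_vecs, labels),
            silhouette_score(behavior_vecs, labels))

# objective_vecs: each policy's expected returns per objective
# behavior_vecs: features of its ~5 "Highlights" summary states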

-----

πŸ’‘ Key Insights:

β†’ First research to tackle MORL solution set explainability

β†’ Different policies with similar trade-offs can exhibit vastly different behaviors

β†’ Combining objective and behavior analysis reveals deeper policy insights

β†’ Makes MORL more practical for real-world applications

-----

πŸ“Š Results:

β†’ Outperformed traditional k-medoids clustering in MO-Highway and MO-Lunar-lander environments

β†’ Showed comparable performance in MO-Reacher and MO-Minecart scenarios

β†’ Successfully demonstrated practical application through highway environment case study



GdRNNQaaoAAiEE1.jpg


2/3
@rohanpaul_ai
Paper Title: "Navigating Trade-offs: Policy Summarization for Multi-Objective Reinforcement Learning"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861197491238211584/pu/vid/avc1/1080x1080/56yXAj4Toyxny-Ic.mp4

3/3
@rohanpaul_ai
[2411.04784v1] Navigating Trade-offs: Policy Summarization for Multi-Objective Reinforcement Learning








1/6
@rohanpaul_ai
Nice survey paper presents a unified taxonomy bridging personalized text generation and downstream applications

🎯 Current research on LLM personalization is fragmented into two disconnected areas: direct personalized text generation and downstream task personalization.

This split creates a knowledge gap, limiting the development of comprehensive personalization solutions.

This Paper:

β†’ Establishes three personalization granularity levels: user-level (individual), persona-level (groups), and global preference alignment

β†’ Proposes systematic frameworks for personalization techniques including RAG, prompt engineering, fine-tuning, embedding learning, and RLHF

β†’ Creates evaluation taxonomies distinguishing between direct (text quality) and indirect (task performance) assessment methods
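
As one concrete instance of the RAG technique in that taxonomy, here is a sketch of user-level personalization via retrieval over a user's own history; history_index.search is a hypothetical interface:

def personalized_prompt(query, user_id, history_index, k=3):
    # Retrieve the k past user texts most similar to the current query.
    examples = history_index.search(user_id, query, top_k=k)
    profile = "\n".join(f"- {e}" for e in examples)
    return ("Examples of this user's past writing and preferences:\n"
            f"{profile}\n\n"
            f"Respond to the following in a matching style:\n{query}")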

-----

πŸ’‘ Key Insights:

β†’ Personalization can be achieved at different granularities, with trade-offs between precision and data requirements

β†’ User-level personalization offers finest control but needs substantial user data

β†’ Persona-level grouping helps handle cold-start problems with new users

β†’ Privacy concerns and bias management are critical challenges

β†’ Multi-modal personalization remains an open challenge



GdQVU-IakAAdNe8.png


2/6
@rohanpaul_ai
Paper Title: "Personalization of Large Language Models: A Survey"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861136074451623938/pu/vid/avc1/1080x1080/ZcMPQJ7Cxpxrf4Pj.mp4

3/6
@rohanpaul_ai
[2411.00027] Personalization of Large Language Models: A Survey



4/6
@rohanpaul_ai




GdQWruxaoAAOv01.png


5/6
@Zhehao_Zhang123
Thanks Rohan for sharing our work!



6/6
@Adina_Coder
Amazing share









1/6
@rohanpaul_ai
Open-source alternative to GPT-4V for building reliable GUI automation agents

OS-ATLAS, a foundational GUI action model enables open-source GUI agents to match commercial VLM performance through cross-platform data synthesis.

Releases the largest open-source cross-platform GUI grounding corpus to date, containing over 13 million GUI elements.

πŸ€– Original Problem:

Existing GUI agents heavily depend on commercial Vision-Language Models (VLMs) like GPT-4V. Open-source VLMs perform poorly in GUI grounding and Out-Of-Distribution scenarios, making them less preferred for real-world applications.

-----

πŸ› οΈ Solution in this Paper:

β†’ Created OS-ATLAS, a foundational GUI action model with three operating modes: Grounding, Action, and Agent

β†’ Built first multi-platform GUI data synthesis toolkit covering Windows, Linux, MacOS, Android, web

β†’ Created largest open-source cross-platform GUI corpus (13M+ elements from 2.3M screenshots)

β†’ Implemented unified action space during training to resolve naming conflicts across platforms

β†’ Standardized Basic Actions (click, type, scroll) and Custom Actions for extensibility
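
For illustration, a unified action space might look like the following; the concrete schema is a guess at the idea, not OS-ATLAS's exact format:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Action:
    name: str                    # "click", "type", "scroll", or custom
    x: Optional[float] = None    # normalized screen coords for pointers
    y: Optional[float] = None
    text: Optional[str] = None   # payload for "type"

# One vocabulary for every platform, so a click on Android, Windows,
# and the web all serialize to the same tokens during training:
actions = [Action("click", x=0.42, y=0.17),
           Action("type", text="hello"),
           Action("scroll", y=-0.5)]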

-----

πŸ’‘ Key Insights:

β†’ Pre-training on comprehensive cross-platform GUI data significantly improves grounding accuracy

β†’ Unified action space prevents performance degradation from naming conflicts

β†’ Instruction grounding data, while valuable, isn't critical - referring expression data is sufficient

β†’ Web-only training doesn't generalize well to other platforms

-----

πŸ“Š Results:

β†’ Achieves 82.47% average grounding accuracy without planner

β†’ Reaches 85.14% accuracy with GPT-4 as planner

β†’ Outperforms previous SOTA across mobile, desktop and web platforms

β†’ Shows 14.63% success rate on OSWorld benchmark (compared to 9.21% baseline)



GdQXG9qbkAAUDqR.png


2/6
@rohanpaul_ai
Paper Title: "OS-ATLAS: A Foundation Action Model for Generalist GUI Agents"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861138549690769411/pu/vid/avc1/720x720/oRPlbm8e6lhO-ppC.mp4

3/6
@rohanpaul_ai
The first multi-platform GUI grounding data synthesis toolkit that works across Windows, Linux, MacOS, Android and web platforms



GdQXuyKaoAExE5T.jpg


4/6
@rohanpaul_ai
πŸ“š [2410.23218] OS-ATLAS: A Foundation Action Model for Generalist GUI Agents



5/6
@rohanpaul_ai




GdQX-6IaoAEz-YD.jpg


6/6
@tOSUFever
πŸ”₯









1/7
@rohanpaul_ai
O1 doesn't cheat on math tests - it actually knows how to solve them

A/B testing reveals o1's true mathematical reasoning capabilities beyond memorization

🎯 Original Problem:

OpenAI's Orion-1 (o1) model claims superior reasoning capabilities, but skeptics suggest its performance might stem from memorizing solutions rather than true reasoning abilities.

-----

πŸ”§ Solution in this Paper:

β†’ Used A/B testing comparing o1's performance on two datasets: IMO problems (easily accessible) and CNT problems (less accessible but similar difficulty)

β†’ Implemented a 7-point grading system: 1 point for correct numerical answer, 2 points for intuitive approach, 4 points for detailed reasoning

β†’ Categorized problems into "search" type (finding specific solutions) and "solve" type (equations/optimization)
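
The statistical test behind the A/B comparison is just a two-sample t-test on per-problem scores. A sketch with placeholder numbers, not the paper's data:

from scipy import stats

imo_scores = [5, 3, 7, 2, 4, 6]  # placeholder per-problem scores (0-7)
cnt_scores = [4, 3, 6, 3, 5, 5]  # placeholder, NOT the paper's data

# A t-statistic near 0 means no detectable gap between the widely
# circulated IMO problems and the less accessible CNT problems --
# evidence for reasoning rather than memorization.
t, p = stats.ttest_ind(imo_scores, cnt_scores, equal_var=False)
print(t, p)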

-----

πŸ’‘ Key Insights:

β†’ O1 shows strong intuitive reasoning and pattern discovery capabilities

β†’ Performs exceptionally well on "search" type problems (~70% accuracy)

β†’ Struggles with rigorous proof steps and "solve" type problems (~21% accuracy)

β†’ Often uses trial-and-error approach instead of formal proofs

-----

πŸ“Š Results:

β†’ No significant performance difference between IMO (51.4%) and CNT (48%) datasets

β†’ T-statistics close to 0, suggesting o1 relies on reasoning rather than memorization

β†’ Outperforms GPT-4o's benchmark of 39.97% on both datasets



GdQyuWIbgAATGrz.png


2/7
@rohanpaul_ai
Paper Title: "OpenAI-o1 AB Testing: Does the o1 model really do good reasoning in math problem solving?"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861168366934990851/pu/vid/avc1/1080x1080/D1uoa5hsOde-Fyr9.mp4

3/7
@dikksonPau
It's not Orion-1...



4/7
@rohanpaul_ai
Yes, the paper refers to the o1-preview and o1-mini variants



5/7
@rohanpaul_ai
This paper is actually referring to the o1-preview model (not the final o1)



6/7
@rohanpaul_ai
[2411.06198] OpenAI-o1 AB Testing: Does the o1 model really do good reasoning in math problem solving?



7/7
@TShirtnJeans2
Wait, hold on. There are folks who have access to OpenAI's GPT-5/Orion-1 model?

I thought that wasn't scheduled to come out until next year?



