bnew

Veteran
Joined
Nov 1, 2015
Messages
56,193
Reputation
8,249
Daps
157,873






1/6
@rohanpaul_ai
First open code LLM to reveal its entire training pipeline and reproducible datasets

🎯 Original Problem:

Code LLMs lack transparency in training data and protocols, limiting the research community's ability to establish strong baselines and gain deeper insights.

-----

πŸ› οΈ Solution in this Paper:

β†’ Introduces OpenCoder, a fully transparent code LLM with complete training data, processing pipeline, and protocols

β†’ Implements sophisticated data processing pipeline called RefineCode with 960B tokens across 607 programming languages

β†’ Uses aggressive file-level deduplication and language-specific filtering rules

β†’ Employs two-stage instruction tuning with annealing phase using high-quality synthetic data
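As a rough illustration of the file-level dedup step (function names and details are my own, not from the paper's RefineCode code, which also uses fuzzy deduplication and per-language filtering rules):

```python
import hashlib

def dedup_files(files):
    """Exact file-level deduplication: keep the first file seen for each
    normalized-content hash. A minimal sketch of one pipeline stage."""
    seen = set()
    kept = []
    for path, text in files:
        # Normalize whitespace so trivially reformatted copies collapse together
        digest = hashlib.sha256(" ".join(text.split()).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((path, text))
    return kept

# b.py differs from a.py only in spacing, so it is dropped
files = [("a.py", "x = 1\n"), ("b.py", "x  =  1"), ("c.py", "y = 2\n")]
```

Deduplicating at the file level (rather than whole repositories) is exactly what the paper argues preserves more data diversity.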

-----

💡 Key Insights:

→ File-level deduplication outperforms the repository-level approach for maintaining data diversity

→ GitHub star-based filtering can reduce data diversity and skew the distribution

→ High-quality data in the annealing phase matters more than quantity

→ Two-stage instruction tuning improves both theoretical and practical coding tasks

-----

📊 Results:

→ OpenCoder-8B achieves 83.5% pass@1 on the HumanEval benchmark

→ Surpasses all previous fully open models at the 6B+ parameter scale

→ Demonstrates superior training efficiency compared to training on The Stack v2



GdQsa9zaoAA-gGQ.png


2/6
@rohanpaul_ai
Paper Title: "OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models"

Generated the podcast below on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861161457960001538/pu/vid/avc1/1080x1080/mc7X5OmyGwoCxj21.mp4

3/6
@rohanpaul_ai
Illustration of the pretraining data-processing workflow.



GdQtCqqbEAAcSeb.jpg


4/6
@rohanpaul_ai
🚀 OpenCoder surpasses all previous fully open models and other open-access models at the 6B+ parameter scale. The 8B version achieves 83.5% pass@1 on the HumanEval benchmark, making it competitive with leading proprietary models.



GdQsyloaoAAnGsi.jpg


5/6
@rohanpaul_ai
Their instruction data synthesis workflow



GdQv0Tya8AAC_ma.jpg


6/6
@rohanpaul_ai
[2411.04905] OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 






1/8
@rohanpaul_ai
New benchmark exposes the true reasoning capabilities of LLMs using dynamic puzzle generation

K&K puzzles, proposed in this paper, reveal how LLMs balance memorization and reasoning in logical problem-solving

🤖 Original Problem:

LLMs show puzzling behavior in reasoning tasks: excellent performance on complex problems, yet basic mistakes on simple ones. This raises the question of whether they truly reason or just memorize training data.

-----

🔧 Solution in this Paper:

→ Introduces Knights and Knaves (K&K) puzzles as a dynamic benchmark for testing logical reasoning

→ Develops the Local Inconsistency-based Memorization Score (LiMem), which measures model performance on original vs. perturbed puzzles

→ Creates two key modules:

- Abstract Module: generates puzzles with specified complexity

- Natural Language Module: converts abstract puzzles to natural text

→ Implements systematic perturbation tests at both the mathematical and linguistic levels
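A hedged sketch of what a LiMem-style score could look like, going by the description above (the paper's exact formula may differ):

```python
def limem_score(correct_original, correct_perturbed):
    """Among puzzles the model solves in their original form, the fraction it
    fails after a local perturbation. Values near 0 suggest robust reasoning;
    values near 1 suggest the original answers were memorized.
    Both arguments are parallel lists of booleans, one entry per puzzle."""
    solved = [i for i, ok in enumerate(correct_original) if ok]
    if not solved:
        return 0.0  # nothing solved, so no inconsistency to measure
    return sum(not correct_perturbed[i] for i in solved) / len(solved)
```

For example, a model that solves 3 of 4 originals but keeps only 1 of those 3 right after perturbation would score 2/3.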

-----

💡 Key Insights:

→ LLMs can simultaneously use memorization and genuine reasoning

→ Fine-tuning improves generalization even as memorization increases

→ Models can develop reasoning skills even when trained only on question-answer pairs

→ More complex puzzles show higher memorization scores

→ Language-level perturbations affect models less than mathematical structure changes

-----

📊 Results:

→ Only advanced LLMs achieve >70% accuracy on 2-person puzzles

→ Performance drops to 11% on 8-person puzzles

→ GPT-4o-mini reaches near-100% training accuracy on 3- and 5-person puzzles

→ LiMem scores of ~50% on 8-person puzzles indicate heavy memorization

→ Models show 80% memorization under role-flipping perturbations



GdQzloRaoAUxfE7.png


2/8
@rohanpaul_ai
Paper Title: "On Memorization of Large Language Models in Logical Reasoning"

Generated the podcast below on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861169337081700352/pu/vid/avc1/1080x1080/vJUxlnzGezPNqx37.mp4

3/8
@rohanpaul_ai




GdQzvb6aMAAsvfa.jpg


4/8
@rohanpaul_ai
📚 [2410.23123] On Memorization of Large Language Models in Logical Reasoning



5/8
@rohanpaul_ai
🧩 The Knights and Knaves benchmark generates logical puzzles where some characters always tell the truth (knights) and others always lie (knaves).

It has two key modules:

→ Abstract Module: generates puzzles with a specified number of people, tree width, and depth; can perturb puzzles by changing statements or leaf nodes

→ Natural Language Module: converts abstract puzzles to natural language; can perturb them by changing names, role terms, or statement order
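For intuition, a brute-force K&K solver takes only a few lines: enumerate every knight/knave assignment and keep those where each knight's statement is true and each knave's is false. A sketch, not the paper's code:

```python
from itertools import product

def solve_kk(n, statements):
    """Brute-force solver for an n-person Knights-and-Knaves puzzle.
    statements[i] is a predicate over the assignment tuple giving person i's
    claim; True in the tuple means "knight". A knight's claim must be true,
    a knave's claim false."""
    return [
        assign
        for assign in product([True, False], repeat=n)
        if all(assign[i] == stmt(assign) for i, stmt in enumerate(statements))
    ]

# 2-person example: A says "B is a knave"; B says "we are both knights".
stmts = [lambda a: not a[1], lambda a: a[0] and a[1]]
print(solve_kk(2, stmts))  # [(True, False)]: A is a knight, B is a knave
```

Because puzzles are generated programmatically, the benchmark can mint fresh, unmemorizable instances at any complexity level.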



GdQ0KARaoAE-XNc.jpg


6/8
@ICoffeeDaemon
Not bad



7/8
@fabmilo
Just compression/memorization. There is no reasoning in refining probabilities of tokens. We need a completely different system to achieve reasoning, and tons of compute power.



8/8
@MiaAI_Builder
LLMs have been puzzling us with their behavior in reasoning tasks, indeed. Hope this benchmark helps us understand them better




 




1/3
@rohanpaul_ai
This paper makes complex Multi-objective reinforcement learning (MORL) policies understandable by clustering them based on both behavior and objectives

When AI gives you too many options, this clustering trick saves the day

🎯 Original Problem:

Multi-objective reinforcement learning (MORL) generates multiple policies with different trade-offs, but these solution sets are too large and complex for humans to analyze effectively. Decision makers struggle to understand relationships between policy behaviors and their objective outcomes.

-----

πŸ› οΈ Solution in this Paper:

β†’ Introduces a novel clustering approach that considers both objective space (expected returns) and behavior space (policy actions)

β†’ Uses Highlights algorithm to capture 5 key states that represent each policy's behavior

β†’ Applies PAN (Pareto-Set Analysis) clustering to find well-defined clusters in both spaces simultaneously

β†’ Employs bi-objective evolutionary algorithm to optimize clustering quality across both spaces
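To make the two-space idea concrete, here is a hedged sketch of scoring a single clustering in both the objective and behavior spaces (names are mine; this is not the paper's exact PAN objective):

```python
from math import dist  # Euclidean distance, Python 3.8+

def cluster_costs(labels, objective_vecs, behavior_vecs):
    """Score one clustering in both spaces: mean distance of each point to
    its cluster centroid, computed separately for the objective space
    (expected returns per objective) and the behavior space (features of the
    Highlights summary states). A PAN-style search would then keep
    clusterings that are Pareto-optimal with respect to these two costs."""
    def cost(points):
        total = 0.0
        for c in set(labels):
            pts = [p for p, label in zip(points, labels) if label == c]
            centroid = [sum(col) / len(pts) for col in zip(*pts)]
            total += sum(dist(p, centroid) for p in pts)
        return total / len(points)
    return cost(objective_vecs), cost(behavior_vecs)
```

A clustering can be tight in one space and loose in the other, which is precisely why a single-space method like plain k-medoids can miss behaviorally distinct policies with similar trade-offs.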

-----

💡 Key Insights:

→ First work to tackle MORL solution-set explainability

→ Different policies with similar trade-offs can exhibit vastly different behaviors

→ Combining objective and behavior analysis reveals deeper policy insights

→ Makes MORL more practical for real-world applications

-----

📊 Results:

→ Outperformed traditional k-medoids clustering in the MO-Highway and MO-Lunar-lander environments

→ Showed comparable performance in the MO-Reacher and MO-Minecart scenarios

→ Successfully demonstrated practical application through a highway-environment case study



GdRNNQaaoAAiEE1.jpg


2/3
@rohanpaul_ai
Paper Title: "Navigating Trade-offs: Policy Summarization for Multi-Objective Reinforcement Learning"

Generated the podcast below on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861197491238211584/pu/vid/avc1/1080x1080/56yXAj4Toyxny-Ic.mp4

3/3
@rohanpaul_ai
[2411.04784v1] Navigating Trade-offs: Policy Summarization for Multi-Objective Reinforcement Learning




 





1/11
@rohanpaul_ai
Transform AI from task-completers to thought-provokers

🎯 Current AI systems primarily act as obedient assistants focused on task completion, a stance rooted in 19th-century statistical models. This limits their potential to enhance human critical thinking and creates a binary perception of AI as either compliant servants or rebellious threats.

-----

🔧 Ideas discussed in this Paper:

→ Transform AI from task-completing assistants into provocateurs that challenge users' thinking

→ Implement critical-thinking tools from educational frameworks such as Bloom's taxonomy and the Toulmin model into AI systems

→ Design AI to critique work, surface biases, present counter-arguments, and question assumptions

→ Create interfaces beyond chat that function as "tools of thought," similar to maps, grids, and algebraic notation
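One way to picture the "provocateur" idea: fold Toulmin-model probes into a system prompt instead of a plain assistant persona. Purely illustrative; the prompt wording and function names are mine, not from the paper:

```python
# Probes adapted loosely from the Toulmin model of argument
TOULMIN_PROBES = {
    "claim": "Restate the user's central claim in one sentence.",
    "grounds": "What evidence is offered, and what is missing?",
    "warrant": "What unstated assumption links the evidence to the claim?",
    "rebuttal": "State the strongest counter-argument a critic would raise.",
}

def provocateur_prompt(task):
    """Build a system prompt that pushes the model to challenge the user's
    framing before completing the task."""
    probes = "\n".join(f"- {name}: {probe}" for name, probe in TOULMIN_PROBES.items())
    return (
        "You are a critical-thinking partner, not a task completer.\n"
        "Before helping, challenge the user's framing:\n"
        f"{probes}\n\nUser task: {task}"
    )
```

The same template could swap in Bloom's-taxonomy levels (analyze, evaluate, create) as the probe set.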



GdLtAVCasAAWC3S.png


2/11
@rohanpaul_ai
Paper Title: "AI Should Challenge, Not Obey"

Generated the podcast below on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1860810218474721280/pu/vid/avc1/1080x1080/mJq53adIVFd2os5n.mp4

3/11
@rohanpaul_ai
[2411.02263] AI Should Challenge, Not Obey



4/11
@jmjjohnson
Sounds like what @arunbahl is building at @AloeInc - a “personal thought partner – a synthetic mind that can reason, purpose-built for human thinking.”



5/11
@rohanpaul_ai
Awesome!!



6/11
@EricFddrsn
That's great - the people-pleasing nature of LLMs today is one of the main things that keeps them from being good thought partners



7/11
@BergelEduardo
🎯🎯🎯 Yes! "AI should Challenge, Not Obey." One for Eternity..



8/11
@xone_4
Exactly.



9/11
@HAF_tech
I love this idea! Let's move beyond 19th-century statistical models and create AI that enhances human critical thinking



10/11
@LaneGrooms
Brainstorming: This has been my most successful use case since the first time I ran an LLM locally, i.e., had access to the system prompt. Glad it's getting more attention.



11/11
@Sipera007
Have you tried randomising all the words in a PDF/source, then asking an LLM to reorder it so it reads as intended? Interesting to see when it breaks vs. when it works. For example, fewer than 200 words is easy; more than 600 is not. Also, what about simply removing all superfluous words like "and"?




 