1/6
@rohanpaul_ai
First open code LLM to reveal its entire training pipeline and reproducible datasets
Original Problem:
Code LLMs lack transparency in their training data and protocols, which limits the research community's ability to establish strong baselines and gain deeper insights.
-----
Solution in this Paper:
• Introduces OpenCoder, a fully transparent code LLM released with its complete training data, processing pipeline, and protocols
• Builds RefineCode, a pretraining dataset and processing pipeline with 960B tokens across 607 programming languages
• Uses aggressive file-level deduplication and language-specific filtering rules
• Adds an annealing phase on high-quality synthetic data, followed by two-stage instruction tuning
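
As a rough illustration of what file-level deduplication means, here is a minimal sketch that assumes exact-match hashing of file contents. The function name `dedup_files` and the `(path, text)` input shape are hypothetical and only for illustration; the thread does not describe the paper's actual rules (for example, which copy is kept or whether near-duplicates are also removed).

```python
import hashlib


def dedup_files(files):
    """Keep the first file seen for each distinct content hash.

    `files` is an iterable of (path, text) pairs. This is an assumed,
    simplified input shape, not the format used by RefineCode.
    """
    seen = set()
    kept = []
    for path, text in files:
        # Hash the raw file content, so identical files in different
        # repositories collapse to a single entry.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append((path, text))
    return kept


if __name__ == "__main__":
    corpus = [
        ("repo_a/util.py", "def add(a, b):\n    return a + b\n"),
        ("repo_b/helpers.py", "def add(a, b):\n    return a + b\n"),
        ("repo_b/main.py", "print('hello')\n"),
    ]
    for path, _ in dedup_files(corpus):
        print(path)
```

Because the hash is computed per file rather than per repository, a repository that shares some files with another still contributes its unique files, which is one plausible reading of why the file-level approach preserves more diversity.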
-----
Key Insights:
• File-level deduplication outperforms the repository-level approach at maintaining data diversity
• Filtering by GitHub stars can reduce data diversity and skew the data distribution
• In the annealing phase, data quality matters more than quantity
• Two-stage instruction tuning improves performance on both theoretical and practical coding tasks
-----
Results:
• OpenCoder-8B achieves 83.5% pass@1 on the HumanEval benchmark
• Surpasses all previous fully open models at the 6B+ parameter scale
• RefineCode shows superior training efficiency compared to The Stack v2
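
For readers unfamiliar with the metric, pass@1 is the k = 1 case of pass@k: the probability that at least one of k sampled solutions to a problem passes its unit tests. Below is a small sketch of the unbiased estimator commonly used with HumanEval, where `n` is the number of samples generated for a problem and `c` is how many of them pass. The thread does not say how many samples the OpenCoder evaluation used, so the numbers in the usage lines are made up.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n samples, c of which pass.

    Computes 1 - C(n - c, k) / C(n, k), the chance that a random
    size-k subset of the n samples contains at least one passing one.
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

For k = 1 the estimator reduces to c / n, and the benchmark score is the average of this value over all problems.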
2/6
@rohanpaul_ai
Paper Title: "OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models"
The podcast below on this paper was generated with Google's Illuminate.
https://video.twimg.com/ext_tw_video/1861161457960001538/pu/vid/avc1/1080x1080/mc7X5OmyGwoCxj21.mp4
3/6
@rohanpaul_ai
Illustration of their pretraining data processing workflow.
4/6
@rohanpaul_ai
OpenCoder surpasses all previous fully open models and other open-access models at the 6B+ parameter scale. The 8B version achieves 83.5% pass@1 on the HumanEval benchmark, making it competitive with leading proprietary models.
5/6
@rohanpaul_ai
Their instruction data synthesis workflow
6/6
@rohanpaul_ai
[2411.04905] OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models