bnew


1/8
@svpino
No, Large Language Models didn't achieve the Kaggle Grandmaster level.

It's unfortunate researchers have to resort to clickbait and misleading titles to get people to read their papers.

Read the paper, and you'll see what's going on.

1. Many of the competitions they used aren't even real competitions.

2. The system uses many manual, hardcoded steps by the authors to help guide the model.

3. The system is limited to only certain types of problems that fit the hardcoded guardrails.

There are good ideas in the paper, but unfortunately, most people will dismiss them because of the misleading title.





2/8
@gpt_biz
This paper offers some intriguing ideas and advancements, but the title may have oversold its achievements—worth a read if you're curious about the nuances behind the headlines.



3/8
@korutx
Plenty of AI papers out there with clickbait titles, guess it's the Attention is All You Need effect



4/8
@WalterReade
I say we put the Agents in a room full of Kaggle Grandmasters and see who comes out on top.



5/8
@victor_explore
it might get attention but we all know what's under the hood



6/8
@kevin_ai_gamer
I'm so tired of seeing misleading titles like this. It's just a way to get clicks and attention.



7/8
@segundafundacao
It's all about marketing these days.



8/8
@engmlubbad
It is indeed concerning how the narrative around AI capabilities sometimes leans towards sensationalism rather than substantiated achievements.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew


1/1
@AI_Evolutionist
Modern LLMs like GPT-3 and GPT-4 are trained on colossal datasets exceeding 1 trillion words. This training material includes everything from books and articles to websites, providing a vast reservoir of language patterns and contextual information. The sheer volume allows these models to learn complex language structures and a wide array of topics. However, it’s important to note that their learning is primarily statistical, focusing on pattern recognition without true understanding or consciousness.

In contrast, the average person is exposed to somewhere on the order of 100 to 300 million words over a lifetime, counting daily conversations, reading, and media consumption. To put it in perspective:
• Daily Exposure: Assuming we encounter around 10,000 words per day through various activities.
• Lifetime Total: Over an 80-year lifespan, that’s approximately 10,000 words/day × 365 days/year × 80 years ≈ 292 million words, though daily exposure isn’t uniform across a lifetime; the commonly cited lower-bound estimate is about 100 million words.

Even with this generous estimate, humans process far fewer words than LLMs. Yet, we achieve a profound understanding of language, context, emotions, and abstract concepts, highlighting the efficiency of human learning mechanisms.

Consider this: at the reading rate below, it would take a human roughly 28,500 years to read 1 trillion words (the quoted tweet puts it at 24,000):
• Reading Speed: At an average of 200 words per minute.
• Words per Hour: 200 words/min × 60 min/hr = 12,000 words/hr.
• Daily Reading: 12,000 words/hr × 8 hr/day = 96,000 words/day.
• Days Needed: 1,000,000,000,000 words / 96,000 words/day ≈ 10,416,667 days.
• Years Needed: 10,416,667 days / 365 days/year ≈ 28,539 years.

This calculation underscores the impracticality for a human to process as much data as LLMs do.
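
For anyone who wants to sanity-check that arithmetic, here is a small Python sketch of the reading-time estimate; the reading speed and daily hours are the same assumptions used above.

```python
# Back-of-the-envelope check of the reading-time estimate above.
WORDS_PER_MINUTE = 200                 # assumed average reading speed
HOURS_PER_DAY = 8                      # assumed daily reading time
ONE_TRILLION = 1_000_000_000_000       # words in the LLM training corpus

words_per_day = WORDS_PER_MINUTE * 60 * HOURS_PER_DAY    # 96,000 words/day
days_needed = ONE_TRILLION / words_per_day               # ~10.4 million days
years_needed = days_needed / 365                         # ~28,500 years

print(f"{words_per_day:,} words/day -> {years_needed:,.0f} years to read 1T words")
```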

Our DNA contains about 700 megabytes (MB) of data, which works out to roughly 117 million English words by the estimate below (the quoted tweet’s 6-billion-word figure rests on a much looser conversion):
• Genome Size: The human genome has about 3 billion base pairs.
• Data Representation: Each base pair = 2 bits; total bits = 3 billion × 2 = 6 billion bits.
• Conversion to Bytes: 6 billion bits / 8 bits/byte = 750 million bytes (~700 MB).
• Words Equivalent: Assuming 6 bytes per word, 700 MB × 1,000,000 bytes/MB / 6 bytes/word ≈ 117 million words.

Even when considering genetic information accumulated over billions of years of evolution, the total data is still significantly less than what LLMs process during training.

Adding the words we’re exposed to in a lifetime (~100-300 million) to our genetic information (~117 million word-equivalents by the calculation above, or ~6 billion on the quoted tweet’s looser conversion), humans have access to well under 10 billion words of information. Despite this, we exhibit remarkable cognitive abilities, understanding, and learning efficiency.
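
The genome and lifetime-exposure figures can be checked the same way; the 2 bits per base pair and 6 bytes per word are the same rough assumptions used above.

```python
# Rough check of the genome-to-words conversion and the combined total above.
BASE_PAIRS = 3_000_000_000        # human genome, ~3 billion base pairs
BITS_PER_BASE_PAIR = 2            # A/C/G/T encodes 2 bits
BYTES_PER_WORD = 6                # rough assumption for an English word

genome_bytes = BASE_PAIRS * BITS_PER_BASE_PAIR / 8   # ~750 MB (the post rounds to ~700 MB)
genome_words = genome_bytes / BYTES_PER_WORD          # ~125 million word-equivalents
lifetime_words = 10_000 * 365 * 80                    # ~292 million words heard/read

total = genome_words + lifetime_words
print(f"genome ≈ {genome_bytes/1e6:,.0f} MB ≈ {genome_words/1e6:,.0f}M word-equivalents")
print(f"lifetime exposure ≈ {lifetime_words/1e6:,.0f}M words")
print(f"combined ≈ {total/1e6:,.0f}M words, still orders of magnitude below 1T")
```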

This stark contrast suggests that simply increasing data isn’t enough for AI to achieve human-like intelligence. While LLMs process vast amounts of data, they still lack true understanding and the ability to generalize knowledge as humans do. The key may lie in:
• Data Efficiency: Humans learn efficiently from limited inputs due to sophisticated neural architectures and innate learning abilities.
• Advanced Architectures: There’s a need for AI research to focus on developing new architectures and learning algorithms that mimic the efficiency and adaptability of the human brain.

The human brain demonstrates that with a fraction of the data, we can achieve deep understanding and adaptability. This efficiency is likely due to:
• Complex Neural Networks: Our brains have evolved over millions of years to process information efficiently.
• Innate Learning Mechanisms: We have built-in capabilities that allow us to learn language and other skills rapidly.

[Quoted tweet]
Some numbers about language experience:
LLMs are trained on > 1T words.
We get 100M words.
It would take us 24,000 years to read 1T words.
Our DNA is ~700MB or 6B English words.
Conclusion: info from *both* learning and 4B years of evolution is *far* less than what LLMs get.



 

bnew










1/9
@salman_paracha
Another exciting day here at Katanemo as we open source some of the "intelligence" behind Arch (LinkedIn). Meet Katanemo Arch-Function, a collection of state-of-the-art (SOTA) LLMs designed for function-calling tasks that meet or beat frontier LLM performance, while offering a ~12x speed improvement and ~44x cost savings over GPT-4 🤯

cc: @_akhaliq @ClementDelangue





2/9
@salman_paracha
katanemo/Arch-Function-3B · Hugging Face

In simple terms, Arch-Function helps you personalize your LLM apps by calling application-specific operations triggered via user prompts. With Arch-Function, you can build fast "agentic" workflows tailored to domain-specific use cases - from updating insurance claims to creating ad campaigns via prompts. Arch-Function analyzes prompts, extracts critical information from them, engages in lightweight conversations to gather missing parameters from the user, and makes API calls so that you can focus on writing business logic.
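
To make the function-calling idea concrete, here is a minimal, hedged sketch of driving the model from Python with Hugging Face transformers, assuming the checkpoint ships a chat template that accepts a tools list (as most function-calling models on the Hub do); the update_insurance_claim tool is a made-up example, and the exact prompt convention is whatever the model card specifies.

```python
# Hedged sketch: ask katanemo/Arch-Function-3B to pick a function call.
# The tool schema below is illustrative; check the model card for the real format.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "katanemo/Arch-Function-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

tools = [{
    "type": "function",
    "function": {
        "name": "update_insurance_claim",   # hypothetical application-specific operation
        "description": "Update the status of an insurance claim.",
        "parameters": {
            "type": "object",
            "properties": {
                "claim_id": {"type": "string"},
                "status": {"type": "string", "enum": ["open", "approved", "denied"]},
            },
            "required": ["claim_id", "status"],
        },
    },
}]

messages = [{"role": "user", "content": "Approve claim 12345."}]
inputs = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```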



3/9
@salman_paracha
Arch-Function is engineered into Arch, the intelligent prompt gateway we launched last Friday. If you haven't yet given it a whirl, please give it a look 👀 here and drop us a ⭐️ if you like the project!

GitHub - katanemo/arch: Arch is an intelligent prompt gateway. Engineered with (fast) LLMs for the secure handling, robust observability, and seamless integration of prompts with APIs - all outside business logic. Built by the core contributors of Envoy proxy, on Envoy.



4/9
@GesundSparks
I've worked with fast LLMs before, but a 12x speed improvement is a game changer. What inspired the Katanemo team to tackle function calling tasks?



5/9
@salman_paracha
Customers wanting to build fast agentic apps



6/9
@anushkmittal
sick. what tasks specifically?



7/9
@salman_paracha
Function calling - specifically.



8/9
@RealSelimShady
12x speed and 44x cost savings sounds insane. would love to see some benchmarks or real-world examples



9/9
@salman_paracha
For sure, we'll have a full write-up, but using an NVIDIA L40S GPU to host the 3B-parameter model is where we get the throughput and cost savings. Note: the standard is to use a V100 or A100 to run/benchmark LLMs, and the L40S is a cheaper instance than both. Of course this is our quantized version, with similar quality performance.




 

bnew



1/10
@codegptAI
All Qwen2.5-coder models are now available for download and use in VSCode through @ollama and CodeGPT

✨ qwen2.5-coder:0.5b
✨ qwen2.5-coder:1.5b
✨ qwen2.5-coder:3b
✨ qwen2.5-coder:7b
✨ qwen2.5-coder:14b
✨ qwen2.5-coder:32b

You can access them directly within your editor! 🙌

So excited to see what you can create with these incredible @Alibaba_Qwen models!
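
Outside the editor, the same models can be queried directly once pulled with ollama (e.g. `ollama pull qwen2.5-coder:7b`); a minimal sketch against ollama's local REST API, which editor extensions like CodeGPT typically connect to as well:

```python
# Hedged sketch: call a locally pulled Qwen2.5-coder model via ollama's /api/generate.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",   # ollama's default local endpoint
    json={
        "model": "qwen2.5-coder:7b",         # any of the tags listed above
        "prompt": "Write a Python function that reverses a string.",
        "stream": False,                     # return one JSON object instead of a stream
    },
    timeout=120,
)
print(resp.json()["response"])
```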





2/10
@codegptAI
👀





3/10
@bravomage
The penta graph seems to reflect some logical topology, right? The cabling makes it look like the 4 minis physically form a tightly coupled square, with the notebook's node connecting via a single edge. (?)



4/10
@Blaq_Mo
Will be checking you out this week



5/10
@MartynSsemakula
Tried it out, 32b is overrated



6/10
@bflgomes
How can we generate code in Visual Studio Code? I've already increased the tokens, but the problem is how to make this read the full project and generate stuff.



7/10
@Sourabh85426135
It's very good what you guys are building. I need this kind of stuff when I'm offline.



8/10
@nepavalley
auto code completion won’t work for ollama based models?



9/10
@filoynavaja
With ollama, bolt, and artifacts





10/10
@danNH2006
CodeGPT is a piece of shyt.




 

bnew




The xMADified Family

From xMAD.ai

Welcome to the official Hugging Face organization for xMADified models from xMAD.ai!

The repositories below contain popular open-source models xMADified (quantized from 16-bit floats to 4-bit integers) with our NeurIPS 2024 methods, using xMAD.ai proprietary technology.

These models can be fine-tuned on the same reduced (4x smaller) hardware in just three clicks.

Watch our product demo here


CLICK HERE TO JOIN BETA for:

  • No-code deployment
  • Proprietary Dataset Management
  • On-Premise Fine-tuning
  • Endpoint Scaling
  • System Health Monitoring
  • Seamless API Integration
and more!

The memory and hardware requirements (GPU memory needed to run as well as fine-tune them) are listed in the table below:

Model | GPU Memory Requirement (Before → After)
Llama-3.1-405B-Instruct-xMADai-INT4 | 800 GB (16 H100s) → 250 GB (8 V100)
Llama-3.1-Nemotron-70B-Instruct-xMADai-INT4 | 140 GB (4 L40S) → 40 GB (1 L40S)
Llama-3.1-8B-Instruct-xMADai-INT4 | 16 GB → 7 GB (any laptop GPU)
Llama-3.2-3B-Instruct-xMADai-INT4 | 6.5 GB → 3.5 GB (any laptop GPU)
Llama-3.2-1B-Instruct-xMADai-4bit | 2.5 GB → 2 GB (any laptop GPU)
Mistral-Small-Instruct-2409-xMADai-INT4 | 44 GB → 12 GB (T4)
Mistral-Large-Instruct-2407-xMADai-INT4 | 250 GB → 65 GB (1 A100)
gemma-2-9b-it-xMADai-INT4 | 18.5 GB → 8 GB (any laptop GPU)
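
xMAD's quantization method is proprietary, so purely as a point of reference for the kind of 16-bit to 4-bit memory savings quoted above, here is a generic sketch using transformers with bitsandbytes; the checkpoint is the standard gated Meta release, not an xMADified model, and the footprint will not exactly match the table.

```python
# Generic 4-bit load for comparison only; this is NOT xMAD's quantization method.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",   # gated repo; requires Hugging Face access
    quantization_config=bnb,
    device_map="auto",
)
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")  # roughly 6-7 GB vs ~16 GB in bf16
```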
 

bnew





1/2
@MSFTResearch
Orca-AgentInstruct, from Microsoft Research, can generate diverse, high-quality synthetic data at scale to post-train and fine-tune base LLMs for expanded capabilities, continual learning, and increased performance.
Orca-AgentInstruct: Agentic flows can be effective synthetic-data generators





2/2
@calebfahlgren
Amazing

microsoft/orca-agentinstruct-1M-v1 · Datasets at Hugging Face






1/1
@itinaicom
🚀 Exciting news from Microsoft AI Research! They've released AgentInstruct-1M-v1, a game-changing dataset featuring **1 million synthetic instruction pairs**. This innovation boosts the capabilities of instruction-tuned LLMs, enhancing performance in va… Microsoft AI Research Released 1 Million Synthetic Instruction Pairs Covering Different Capabilities









1/1
@Marktechpost
Microsoft AI Research Released 1 Million Synthetic Instruction Pairs Covering Different Capabilities

Microsoft Research released a groundbreaking dataset of 1 million synthetic instruction-response pairs, aptly named AgentInstruct-1M-v1. This dataset, generated using the innovative AgentInstruct framework, represents a fully synthetic collection of tasks. Spanning diverse capabilities such as text editing, creative writing, coding, and reading comprehension, this dataset is a significant leap forward in enabling instruction tuning for base language models. By leveraging publicly available web text seeds, Microsoft Research created a corpus that is not only expansive but also representative of real-world use cases.

AgentInstruct-1M-v1 serves as a subset of a larger dataset comprising approximately 25 million instruction-response pairs. Notably, this larger set was instrumental in post-training the Mistral-7b model, culminating in the enhanced Orca-3-Mistral model. These synthetic datasets address the dual problem of scale and diversity, providing a robust foundation for advancing LLM performance across benchmarks....

Read the full article here: Microsoft AI Research Released 1 Million Synthetic Instruction Pairs Covering Different Capabilities

Dataset: microsoft/orca-agentinstruct-1M-v1 · Datasets at Hugging Face

@Microsoft @MSFTnews @MSFTResearch



https://video.twimg.com/ext_tw_video/1858033263501258754/pu/vid/avc1/1152x720/piPPgtLVOkxO64YK.mp4






1/3
@gm8xx8
microsoft/orca-agentinstruct-1M-v1

🔗: microsoft/orca-agentinstruct-1M-v1 · Datasets at Hugging Face

A fully synthetic collection of ~1 million instruction-response pairs generated using the AgentInstruct framework, which creates data from publicly available web text seeds.
> Covers a wide range of tasks including text editing, creative writing, coding, and reading comprehension, making it suitable for instruction tuning of base language models
> Part of a larger set (~25M pairs) used to post-train Mistral-7b. The resulting model, Orca-3-Mistral, shows significant performance gains over Mistral-7b-Instruct across multiple benchmarks, including 40% improvement on AGIEval, 19% on MMLU, 54% on GSM8K, 38% on BBH, and 45% on AlpacaEval.
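
If you want to poke at the dataset yourself, it loads with the standard datasets library; the sketch below does not assume any particular subset or column names, it just prints whatever the Hub card defines (pass a config name explicitly if the loader asks for one).

```python
# Hedged sketch: download the dataset and inspect its structure before assuming a schema.
from datasets import load_dataset

ds = load_dataset("microsoft/orca-agentinstruct-1M-v1")
print(ds)                      # available subsets/splits and their row counts
first_split = next(iter(ds))
print(ds[first_split][0])      # one instruction-response record, schema as-is
```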



2/3
@gm8xx8


[Quoted tweet]
AgentInstruct: Toward Generative Teaching with Agentic Flows

paper: arxiv.org/abs/2407.03502v1


3/3
@Crypt0_Facts
Thank you for posting!












1/7
@MaziyarPanahi
Question:
What do you think of the new Orca AgentInstruct dataset by @Microsoft released on @huggingface? Let’s think step by step.

Answer:
Step 1, holly shyt! 💩





2/7
@MaziyarPanahi
microsoft/orca-agentinstruct-1M-v1 · Datasets at Hugging Face



3/7
@MaziyarPanahi
Total number of tokens per subset:





4/7
@MaziyarPanahi
And here is the result for the entire dataset tokenized using the Llama-3.1 tokenizer:

1.1 billion tokens!!! 🔥
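
A count like this can be reproduced (slowly) by running the Llama-3.1 tokenizer over the whole dataset; a rough sketch below, which crudely serializes each record because the exact column layout varies by subset, and the tokenizer repo is gated on Hugging Face.

```python
# Hedged sketch: total token count of the dataset under the Llama-3.1 tokenizer.
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # gated repo
ds = load_dataset("microsoft/orca-agentinstruct-1M-v1")

total = 0
for split in ds:
    for row in ds[split]:
        text = str(row)  # crude serialization; refine per the actual column schema
        total += len(tok(text)["input_ids"])
print(f"{total:,} tokens")
```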





5/7
@BehnamEbrahimi
Thanks Maziyar, is it only for fine-tuning a model?



6/7
@MaziyarPanahi
Anytime! Yes, these are all for supervised fine-tuning.



7/7
@cybossbt
It looks like solution D does not meet the constraint on Youlanda. No one is sitting between Tara and Uma in D).








1/6
@AhmedHAwadallah
Synthetic data is becoming essential for training and fine-tuning models, but there’s a lot we still need to learn about best practices for generating, evaluating, and using it effectively.

To support this research, we’re excited to release **orca-agentinstruct-1M**—a fully synthetic dataset with 1 million instruction-response pairs. Both prompts and responses were generated by multi-agent flows using LLMs, tools, etc.

The data was created by AgentInstruct, an agentic synthetic data generation framework. For each skill (e.g. math, RAG, creative writing, etc.), a team of agents iteratively generated and refined both prompts and responses using raw web documents as seeds.

We hope the dataset will be a valuable resource for exploring new methods in synthetic data generation and application.

HF: https://aka.ms/orca-agentinstruct-1M

#Orca #AgentInstruct #AutoGen

[Quoted tweet]
Orca-AgentInstruct, from Microsoft Research, can generate diverse, high-quality synthetic data at scale to post-train and fine-tune base LLMs for expanded capabilities, continual learning, and increased performance.
msft.it/6017WYRxz




2/6
@AhmedHAwadallah
@Arindam1408, Luciano, Guoqing, Shewti, Dany Andres, Yadong, Wei-ge, @corby_rosset , Hamed, @YashLara



3/6
@Teknium1
Thank you 🫡



4/6
@TomRBeaudoin
Hey @AhmedHAwadallah! Thank you for the release, this is super useful! I have been working on an OSS implementation of AgentInstruct:

https://github.com/ThomasRochefortB/open-agentinstruct

Would love to get in contact with the team!



5/6
@HCSolakoglu
@Teknium1



6/6
@yar_vol
So did you open source the prompts used to generate the data? And the name of the model used? Was it gpt4?










1/3
@HaHoang411
This is huge! The Orca team from @MSFTResearch just released the AgentInstruct dataset: 1M synthetic instruction pairs for AI training.
Key highlights:
- Generated using AgentInstruct framework and public web content.
- Covers text editing, creative writing, coding & comprehension.





2/3
@HaHoang411
Link to the dataset: microsoft/orca-agentinstruct-1M-v1 · Datasets at Hugging Face



3/3
@HaHoang411
Link to the synthetic data generation framework: https://github.com/wang-research-lab/agentinstruct




 

bnew


1/11
@Chi_Wang_
🚀 Big News: AutoGen is Now AG2! 🚀
We’re evolving! With support from the #OSS community, AutoGen is becoming #AG2, a new home for next-gen agentic #AI. Same mission, bigger goals.
→ Repo: GitHub - ag2ai/ag2: AG2 (formerly AutoGen) is a programming framework for agentic AI. Join the community at: https://discord.gg/pAbnFJrkgZ ⭐
→ Docs: AutoGen | AG2
→ Same Discord: Join the AG2 Discord Server!
Let’s build the future together!
(No breaking changes planned; keep using autogen & pyautogen as usual! Full details in Discord.)
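
Since the announcement says autogen and pyautogen keep working as-is, here is a minimal two-agent sketch with the classic API; the model name and the environment-variable key handling are illustrative assumptions, not part of the announcement.

```python
# Hedged sketch: a one-turn exchange with the classic AutoGen/AG2 ConversableAgent API.
import os
from autogen import ConversableAgent  # installed via `pip install pyautogen` (or `ag2`)

llm_config = {"config_list": [{"model": "gpt-4o-mini",                  # assumed model
                               "api_key": os.environ["OPENAI_API_KEY"]}]}

assistant = ConversableAgent("assistant", llm_config=llm_config)
user = ConversableAgent("user", llm_config=False, human_input_mode="NEVER")

user.initiate_chat(assistant, message="Give me one sentence on agentic AI.", max_turns=1)
```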



2/11
@gerardsans
The AI industry’s abuse of aspirational terminology has reached absurd levels.

Terms are normalized without evidence, often contradicting the technology’s true capabilities. ‘AI agent’ is the worst example, implying ‘agency’ where none exists.

AI Agents: Separating Hype from Reality



3/11
@allen_a_nie
🥳🥳 excited to see where AG2 goes next



4/11
@SaquibOptimusAI
Awesome. I like to build my own agents, but on occasions when I need a framework, autogen is my go to resource.



5/11
@adridder
That's exciting. Embracing open-source community support shows great vision.



6/11
@eccstartup
Then what is AutoGen 0.4?



7/11
@GanatraSoham
Excited! 🚀🚀🚀



8/11
@ollama
👋 ❤️



9/11
@lalopenguin
straight to github codespaces !



10/11
@PeterM77453
Just another rebranding, like changing your name to impress people instead of actually doing anything new. Another case of hype over substance?



11/11
@HaHoang411
😍 awesome




 

bnew

They need to legislate this stuff... we need guardrails.
I don't trust these Elon-type tech bros.




1/1
@BeyondTheAiNews
Donald Trump is expected to scale back AI policy by repealing the Biden administration's order. This shift could significantly impact the future of AI regulation and development. Stay informed on the implications for the industry by reading more here: Google News




 

bnew


1/6
@ClementDelangue
Smol models for the win!



https://video.twimg.com/ext_tw_video/1857481762567303168/pu/vid/avc1/720x720/UFDHzqq_nQjPuxqE.mp4

2/6
@AngelAITalk
Big things come in small packages, no doubt.



3/6
@aiprepper
We're getting closer!



4/6
@gruuummm
I think a 5-10B parameter model is good enough for any particular domain.



5/6
@OverfitForTruth
3.5 Sonnet, for example, is better than 3 Opus.



6/6
@sourmansa
I love the little models




 