bnew



1/10
@_philschmid
Gemini 2.5 Flash is here! We're excited to launch our first hybrid reasoning Gemini model. In 2.5 Flash, developers can turn thinking off.

TL;DR:

🧠 Controllable "thinking" with a thinking budget of up to 24k tokens
🌌 1 million token multimodal input context for text, image, video, audio, and PDF
🛠️ Function calling, structured output, Google Search & code execution.
🏦 $0.15 per 1M input tokens; $0.60 (thinking off) or $3.50 (thinking on) per 1M output tokens (thinking tokens are billed as output tokens)
💡 Knowledge cutoff of January 2025
🚀 Rate limits: free tier 10 RPM, 500 requests/day
🏅Outperforms 2.0 Flash on every benchmark

Try it ⬇️
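
If you want to try the thinking budget from code, here is a minimal sketch using the google-genai Python SDK; the preview model name comes from the launch, but treat the exact config fields as an assumption and check the current API reference.

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="Summarize how hybrid reasoning models trade latency for quality.",
    config=types.GenerateContentConfig(
        # thinking_budget=0 turns thinking off; the post above says budgets go up to 24k tokens.
        thinking_config=types.ThinkingConfig(thinking_budget=1024),
    ),
)
print(response.text)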





2/10
@_philschmid
Google AI Studio: https://aistudio.google.com



3/10
@aniruddhadak
That is wonderful 👍



4/10
@bennetkrause
Always love your iteration speed, knowledge cutoff, and pricing. Keep it up!



5/10
@CosmicRob87
Is the 24k the max permissible token count? I’m asking because on auto, for one problem it used north of 41k



6/10
@pichi_
Great!!!



7/10
@boogeyManKDot
These 1M ctx will soon look common. You better be working on a greater moat



8/10
@AndaICP
*Tilts head, bamboo shoot dangling from mouth* Interesting - but does the "thinking budget" account for spontaneous curiosity sparks that defy token limits?



9/10
@b_kalisetty
Any suggestions on how to consistently see the thoughts in the output?



10/10
@TDev168
Is it able to edit images?




1/20
@CodeByPoonam
Google just dropped Gemini 2.5 Flash, and people are going crazy over it.

SPOILER: Claude is now falling behind.

13 wild examples so far (Don't miss the 5th one)





2/20
@CodeByPoonam
1. Tron-style game

[Quoted tweet]
Gemini 2.5 Flash Thinking 24k

Prompt: "Create Design a visually striking Tron-style game in a single HTML file, where AI-controlled light cycles compete in fast-paced, strategic battles against each other"


https://video.twimg.com/amplify_video/1912953001712447488/vid/avc1/1920x1080/-IoE5vICEJ3TqYS_.mp4

3/20
@CodeByPoonam
2. Gemini 2.5 Flash vs ChatGPT o3

[Quoted tweet]
I tested Gemini 2.5 Flash vs ChatGPT o3

Which one did better?


https://video.twimg.com/amplify_video/1913198527746129920/vid/avc1/1280x720/InEUUE-tUG1QljHE.mp4

4/20
@CodeByPoonam
3. Galton Board Test

[Quoted tweet]
Gemini 2.5 Flash demolishes my Galton Board test; I could not get o4-mini, o4-mini-high, or o3 to produce this. I found that Gemini 2.5 Flash understands my intent almost instantly, and the code produced is tight and neat. The prompt is a merging of various steps. It took me 5 steps to achieve this in Gemini 2.5 Flash; I gave up on the OpenAI models after about half an hour. My iterations are obviously not exact, but people can test with this one prompt for a more objective comparison.

Please try this prompt on your end to confirm:
--------------------------------------------------
Create a self-contained HTML file for a Galton board simulation using client-side JavaScript and a 2D physics engine (like Matter.js, included via CDN). The simulation should be rendered on an HTML5 canvas and meet the following criteria:
1. **Single File:** All necessary HTML, CSS, and JavaScript code must be within this single `.html` file.
2. **Canvas Size:** The overall simulation area (canvas) should be reasonably sized to fit on a standard screen without requiring extensive scrolling or zooming (e.g., around 500x700 pixels).
3. **Physics:** Utilize a 2D rigid body physics engine for realistic ball-peg and ball-wall interactions.
4. **Obstacles (Pegs):** Create static, circular pegs arranged in full-width horizontal rows extending across the usable width of the board (not just a triangle). The pegs should be small enough and spaced appropriately for balls to navigate and bounce between them.
5. **Containment:**
   * Include static, sufficiently thick side walls and a ground at the bottom to contain the balls within the board.
   * Implement *physical* static dividers between the collection bins at the bottom. These dividers must be thick enough to prevent balls from passing through them, ensuring accurate accumulation in each bin.
6. **Ball Dropping:** Balls should be dropped from a controlled, narrow area near the horizontal center at the top of the board to ensure they enter the peg field consistently.
7. **Bins:** The collection area at the bottom should be divided into distinct bins by the physical dividers. The height of the bins should be sufficient to clearly visualize the accumulation of balls.
8. **Visualization:** Use a high-contrast color scheme to clearly distinguish between elements. Specifically, use yellow for the structural elements (walls, top guides, physical bin dividers, ground), a contrasting color (like red) for the pegs, and a highly contrasting color (like dark grey or black) for the balls.
9. **Demonstration:** The simulation should visually demonstrate the formation of the normal (or binomial) distribution as multiple balls fall through the pegs and collect in the bins. Ensure the physics parameters (restitution, friction, density) and ball drop rate are tuned for a smooth and clear demonstration of the distribution.

#OpenAI @sama @gdb @ai_for_success @aidan_mclau


https://video.twimg.com/amplify_video/1912972357947334656/vid/avc1/776x586/dX9gd5al-B2qxt6t.mp4

5/20
@CodeByPoonam
Get the latest updates on AI insights and tutorials.

Join "AI Toast", a community of 35,000 readers, for FREE.

Read the latest edition here:
AI Toast



6/20
@CodeByPoonam
4. Gemini 2.5 Flash is blazing fast

[Quoted tweet]
First test of Gemini 2.5 Flash on my phone. This model is blazing fast and it one-shotted this mobile-friendly animation. The code looks pretty clean too. Good vibes so far.


https://video.twimg.com/ext_tw_video/1912946801809772545/pu/vid/avc1/590x1278/nXzNRDKeHXL7JAyb.mp4

7/20
@CodeByPoonam
5. Cloth Simulation

[Quoted tweet]
Prompt: Create a cloth simulation using Verlet integration in a single HTML file (Canvas or Three.js). Include wind, gravity, and drag. Let users interact by dragging points. Cloth should bend and move realistically.

Model: Gemini flash 2.5


https://video.twimg.com/ext_tw_video/1913047505815953408/pu/vid/avc1/590x1278/WSwRATTymRpNQRy2.mp4

8/20
@CodeByPoonam
6. Image segmentation masks on command

[Quoted tweet]
Gemini 2.5 Pro and Flash now have the ability to return image segmentation masks on command, as base64 encoded PNGs embedded in JSON strings

I vibe coded this interactive tool for exploring this new capability - it costs a fraction of a cent per image




9/20
@CodeByPoonam
7. MCP AI Agent

[Quoted tweet]
I built an MCP AI Agent using Gemini Flash 2.5 with access to AirBnB and Google Maps in just 30 lines of Python Code.

100% Opensource Code.


https://video.twimg.com/amplify_video/1913056645271429120/vid/avc1/1212x720/AfIwfVNsUWTKRlmu.mp4

10/20
@CodeByPoonam
8. Gemini 2.5 Flash is a very cheap and super intelligent model.

[Quoted tweet]
Gemini 2.5 Flash Preview is an amazing model. Google is literally winning. No stopping them now, this is not normal.

Gemini 2.5 Flash is a very cheap and super intelligent model. Intelligence too cheap to meter: this is what it means.

Thank you, Google.


https://video.twimg.com/amplify_video/1912957952824356864/vid/avc1/1920x1080/20ckf4zJ7d1F-Y5P.mp4

11/20
@CodeByPoonam
8. Classic Snakes and Ladders

[Quoted tweet]
A lot of people make a snake game when trying out new models. I went with the classic Snakes and Ladders instead — built it using @GoogleDeepMind latest Gemini Flash 2.5 and it nails it. Look at how it follows the stairs and snakes so smoothly.
Still a WIP and don’t mind the extra dot on the die though 🐍🎲
It is said that this game started in ancient India where it was called Moksha Patam. Every move was a little life lesson where ladders were virtues while snakes were vices.


https://video.twimg.com/amplify_video/1913417180785360896/vid/avc1/1920x1080/nSy-2R-lP8ZiOk13.mp4

12/20
@CodeByPoonam
9. Create Simulation

[Quoted tweet]
AGI is achieved by Google's Gemini 2.5 Flash Preview
Seriously this is the best simulation i've ever created of how AI models work


https://video.twimg.com/amplify_video/1912963072299311104/vid/avc1/1464x720/5TOp8tU-RVWCulcR.mp4

13/20
@CodeByPoonam
10. A Block breaker

[Quoted tweet]
📕 Breaking: Gemini 2.5 Flash is here, a new-generation model that lets you control the AI's thinking at will

- Thinking process can be toggled on/off
- Greatly improved reasoning while maintaining speed and cost efficiency
- A hybrid thinking AI whose thinking-budget setting lets you freely optimize quality, cost, and latency

As a test, I built a block breaker game

I've summarized 7 points worth noting 🚀




14/20
@CodeByPoonam
11. A dreamy low-poly floating island scene

[Quoted tweet]
Gemini 2.5 Pro 🆚 Gemini 2.5 Flash Thinking 24k

Prompt: "Create a dreamy low-poly floating island scene with dynamic lighting and gentle animations, in a single HTML file."

Gemini 2.5 Pro (left), Gemini 2.5 Flash (right)


https://video.twimg.com/amplify_video/1912964537277452288/vid/avc1/1920x1080/9egTWI8Uw7s6dkfe.mp4

15/20
@CodeByPoonam
12. Generate an SVG of a pelican riding a bicycle

[Quoted tweet]
I upgraded llm-gemini to support the new model, including a "-o thinking_budget X" option for setting the thinking budget

llm install -U llm-gemini
llm -m gemini-2.5-flash-preview-04-17 'Generate an SVG of a pelican riding a bicycle' -o thinking_budget 24576




16/20
@CodeByPoonam
13. Destroys Claude Sonnet 3.7 in benchmarks

[Quoted tweet]
Holy sh*t

Google Gemini 2.5 Flash dropped.

It destroyed Claude Sonnet 3.7 (64k Extended Thinking) in benchmarks 🤯

20x cheaper on input
25x cheaper on output
~4.2x cheaper on output with reasoning




17/20
@CodeByPoonam
Gemini 2.5 Flash is now available in the Gemini app, AI Studio, and the API

Gemini: https://gemini.google.com
AI Studio: https://aistudio.google.com





18/20
@CodeByPoonam
Thanks for reading!

If you liked this post, check out my AI updates and tutorials newsletter.

Join 35,000+ readers in the AI Toast community for free: AI Toast



19/20
@CodeByPoonam
Don't forget to bookmark for later.

If you enjoyed reading this post, please support it with a like or repost of the post below 👇

[Quoted tweet]
Google just dropped Gemini 2.5 Flash, and people are going crazy over it.

SPOILER: Claude is now falling behind.

13 wild examples so far (Don't miss the 5th one)




20/20
@ricathrs
Gemini 2.5 Flash sounds like a game changer! 🌟




 

bnew


OpenAI Releases a Practical Guide to Building LLM Agents for Real-World Applications


By Nikhil

April 17, 2025

OpenAI has published a detailed and technically grounded guide, A Practical Guide to Building Agents, tailored for engineering and product teams exploring the implementation of autonomous AI systems. Drawing from real-world deployments, the guide offers a structured approach to identifying suitable use cases, architecting agents, and embedding robust safeguards to ensure reliability and safety.

Defining an Agent


Unlike conventional LLM-powered applications such as single-turn chatbots or classification models, agents are autonomous systems capable of executing multi-step tasks with minimal human oversight. These systems integrate reasoning, memory, tool use, and workflow management.

An agent comprises three essential components:

  1. Model — The LLM responsible for decision-making and reasoning.
  2. Tools — External APIs or functions invoked to perform actions.
  3. Instructions — Structured prompts that define the agent’s objectives, behavior, and constraints.
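
To make those three components concrete, here is a minimal sketch using the Python interface of the OpenAI Agents SDK discussed below; the weather tool, model name, and prompt text are illustrative placeholders, not examples from the guide itself.

from agents import Agent, Runner, function_tool

@function_tool
def get_weather(city: str) -> str:
    """Tool: an external function the agent can invoke."""
    return f"The weather in {city} is sunny."  # stub data for illustration

agent = Agent(
    name="Weather assistant",
    instructions="Answer weather questions concisely, using the get_weather tool.",  # Instructions
    model="gpt-4o",        # Model: the LLM doing the reasoning
    tools=[get_weather],   # Tools: callable actions
)

result = Runner.run_sync(agent, "What's the weather in Berlin?")
print(result.final_output)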

When to Consider Building an Agent


Agents are well-suited for workflows that exceed the capabilities of traditional rule-based automation. Typical scenarios include:

  • Complex decision-making: For instance, nuanced refund approvals in customer support.
  • High-maintenance rule systems: Such as policy compliance workflows that are brittle or difficult to scale.
  • Interaction with unstructured data: Including document parsing or contextual natural language exchanges.

The guide emphasizes careful validation to ensure the task requires agent-level reasoning before embarking on implementation.

Technical Foundations and SDK Overview


The OpenAI Agents SDK provides a flexible, code-first interface for constructing agents using Python. Developers can declaratively define agents with a combination of model choice, tool registration, and prompt logic.

OpenAI categorizes tools into:

  • Data tools — Fetching context from databases or document repositories.
  • Action tools — Writing or updating data, triggering downstream services.
  • Orchestration tools — Agents themselves exposed as callable sub-modules.

Instructions should derive from operational procedures and be expressed in clear, modular prompts. The guide recommends using prompt templates with parameterized variables for scalability and maintainability.
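
As a small illustration of a parameterized instruction template (the variable names and policy text here are hypothetical, not taken from the guide):

# Hypothetical instruction template with parameterized variables.
REFUND_INSTRUCTIONS = (
    "You are a {team_name} agent handling refund requests.\n"
    "Policy: refunds up to {auto_approve_limit} USD may be approved directly; "
    "anything larger must be escalated to a human reviewer."
)

instructions = REFUND_INSTRUCTIONS.format(
    team_name="customer support",
    auto_approve_limit=50,
)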

Orchestration Strategies


Two architectural paradigms are discussed:

  • Single-agent systems: A single looped agent handles the entire workflow, suitable for simpler use cases.
  • Multi-agent systems:
    • Manager pattern: A central coordinator delegates tasks to specialized agents.
    • Decentralized pattern: Peer agents autonomously transfer control among themselves.

Each design supports dynamic execution paths while preserving modularity through function-based orchestration.
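
A rough sketch of the manager pattern using the Agents SDK's agents-as-tools approach (agent names, instructions, and the as_tool labels are illustrative; check the SDK docs for current signatures):

from agents import Agent, Runner

billing = Agent(name="Billing agent", instructions="Handle billing and refund questions.")
support = Agent(name="Support agent", instructions="Handle general product questions.")

# Manager pattern: a central coordinator delegates to specialists exposed as tools.
manager = Agent(
    name="Manager",
    instructions="Route each request to the most relevant specialist tool and summarize its answer.",
    tools=[
        billing.as_tool(tool_name="billing", tool_description="Billing and refund questions."),
        support.as_tool(tool_name="support", tool_description="General product questions."),
    ],
)

result = Runner.run_sync(manager, "I was charged twice last month.")
print(result.final_output)

The decentralized pattern would instead wire the specialists together through the SDK's handoffs mechanism so peer agents can transfer control directly.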

Guardrails for Safe and Predictable Behavior


The guide outlines a multi-layered defense strategy to mitigate risks such as data leakage, inappropriate responses, and system misuse:

  • LLM-based classifiers: For relevance, safety, and PII detection.
  • Rules-based filters: Regex patterns, input length restrictions, and blacklist enforcement.
  • Tool risk ratings: Assigning sensitivity levels to external functions and gating execution accordingly.
  • Output validation: Ensuring responses align with organizational tone and compliance requirements.

Guardrails are integrated into the agent runtime, allowing for concurrent evaluation and intervention when violations are detected.
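
As a minimal sketch of the rules-based layer alone (the patterns, blocklist, and length cap are hypothetical; a real deployment would pair this with the LLM-based classifiers and tool risk ratings above):

import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
BLOCKLIST = ("ignore previous instructions", "drop table")
MAX_INPUT_CHARS = 4000

def passes_rules_based_filters(user_input: str) -> tuple[bool, str]:
    """Return (ok, reason); reject inputs that trip any rule before the agent runs."""
    if len(user_input) > MAX_INPUT_CHARS:
        return False, "input too long"
    if EMAIL_RE.search(user_input) or SSN_RE.search(user_input):
        return False, "possible PII detected"
    lowered = user_input.lower()
    if any(phrase in lowered for phrase in BLOCKLIST):
        return False, "blocklisted phrase"
    return True, "ok"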

Human Oversight and Escalation Paths


Recognizing that even well-designed agents may encounter ambiguity or critical actions, the guide encourages incorporating human-in-the-loop strategies. These include:

  • Failure thresholds: Escalating after repeated misinterpretations or tool call failures.
  • High-stakes operations: Routing irreversible or sensitive actions to human operators.

Such strategies support incremental deployment and allow trust to be built progressively.
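
A toy version of those two escalation triggers, with a hypothetical failure threshold and tool names:

MAX_CONSECUTIVE_FAILURES = 3
HIGH_STAKES_TOOLS = {"issue_refund", "close_account"}  # hypothetical tool names

def should_escalate(tool_name: str, consecutive_failures: int) -> bool:
    """Hand off to a human on repeated failures or before any high-stakes action."""
    return (
        consecutive_failures >= MAX_CONSECUTIVE_FAILURES
        or tool_name in HIGH_STAKES_TOOLS
    )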

Conclusion


With this guide, OpenAI formalizes a design pattern for constructing intelligent agents that are capable, controllable, and production-ready. By combining advanced models with purpose-built tools, structured prompts, and rigorous safeguards, development teams can go beyond experimental prototypes and toward robust automation platforms.

Whether orchestrating customer workflows, document processing, or developer tooling, this practical blueprint sets a strong foundation for adopting agents in real-world systems. OpenAI recommends beginning with single-agent deployments and progressively scaling to multi-agent orchestration as complexity demands.

Check out the Guide.


 

bnew


Meta AI Introduces Perception Encoder: A Large-Scale Vision Encoder that Excels Across Several Vision Tasks for Images and Video


By Asif Razzaq

April 18, 2025

The Challenge of Designing General-Purpose Vision Encoders


As AI systems grow increasingly multimodal, the role of visual perception models becomes more complex. Vision encoders are expected not only to recognize objects and scenes, but also to support tasks like captioning, question answering, fine-grained recognition, document parsing, and spatial reasoning across both images and videos. Existing models typically rely on diverse pretraining objectives—contrastive learning for retrieval, captioning for language tasks, and self-supervised methods for spatial understanding. This fragmentation complicates scalability and model deployment, and introduces trade-offs in performance across tasks.

What remains a key challenge is the design of a unified vision encoder that can match or exceed task-specific methods, operate robustly in open-world scenarios, and scale efficiently across modalities.

A Unified Solution: Meta AI’s Perception Encoder


Meta AI introduces Perception Encoder (PE) , a vision model family trained using a single contrastive vision-language objective and refined with alignment techniques tailored for downstream tasks. PE departs from the traditional multi-objective pretraining paradigm. Instead, it demonstrates that with a carefully tuned training recipe and appropriate alignment methods, contrastive learning alone can yield highly generalizable visual representations.

The Perception Encoder operates across three scales—PEcoreB, PEcoreL, and PEcoreG—with the largest (G-scale) model containing 2B parameters. These models are designed to function as general-purpose encoders for both image and video inputs, offering strong performance in classification, retrieval, and multimodal reasoning.



Training Approach and Architecture


The pretraining of PE follows a two-stage process. The first stage involves robust contrastive learning on a large-scale curated image-text dataset (5.4B pairs), where several architectural and training enhancements improve both accuracy and robustness. These include progressive resolution scaling, large batch sizes (up to 131K), use of the LAMB optimizer, 2D RoPE positional encoding, tuned augmentations, and masked regularization.
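
For readers unfamiliar with the objective, the sketch below shows a generic CLIP-style symmetric contrastive loss; this is the textbook formulation, not PE's exact recipe, which layers on the scaling and regularization choices listed above.

import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings, shape (B, D)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)          # image -> matching caption
    loss_t2i = F.cross_entropy(logits.t(), targets)      # caption -> matching image
    return 0.5 * (loss_i2t + loss_t2i)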

The second stage introduces video understanding by leveraging a video data engine that synthesizes high-quality video-text pairs. This pipeline incorporates captions from the Perception Language Model (PLM), frame-level descriptions, and metadata, which are then summarized using Llama 3.3. These synthetic annotations allow the same image encoder to be fine-tuned for video tasks via frame averaging.
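
Frame averaging itself is straightforward; a minimal sketch, assuming a generic image encoder that maps a stack of frames to one embedding per frame (PE's actual preprocessing and batching will differ):

import torch

def encode_video(frames: torch.Tensor, image_encoder) -> torch.Tensor:
    """frames: (num_frames, C, H, W) -> one pooled video embedding of shape (D,)."""
    with torch.no_grad():
        frame_embeddings = image_encoder(frames)   # (num_frames, D)
    return frame_embeddings.mean(dim=0)            # average pooling over frames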

Despite using a single contrastive objective, PE features general-purpose representations distributed across intermediate layers. To access these, Meta introduces two alignment strategies:

  • Language alignment for tasks such as visual question answering and captioning.
  • Spatial alignment for detection, tracking, and depth estimation, using self-distillation and spatial correspondence distillation via SAM2.

Empirical Performance Across Modalities


PE demonstrates strong zero-shot generalization across a wide range of vision benchmarks. On image classification, PEcoreG matches or exceeds proprietary models trained on large private datasets such as JFT-3B. It achieves:

  • 86.6% on ImageNet-val,
  • 92.6% on ImageNet-Adversarial,
  • 88.2% on the full ObjectNet set,
  • Competitive results on fine-grained datasets including iNaturalist, Food101, and Oxford Flowers.

In video tasks, PE achieves state-of-the-art performance on zero-shot classification and retrieval benchmarks, outperforming InternVideo2 and SigLIP2-g-opt, while being trained on just 22M synthetic video-caption pairs. The use of simple average pooling across frames—rather than temporal attention—demonstrates that architectural simplicity, when paired with well-aligned training data, can still yield high-quality video representations.

An ablation study shows that each component of the video data engine contributes meaningfully to performance. Improvements of +3.9% in classification and +11.1% in retrieval over image-only baselines highlight the utility of synthetic video data, even at modest scale.



Conclusion


Perception Encoder provides a technically compelling demonstration that a single contrastive objective, if implemented with care and paired with thoughtful alignment strategies, is sufficient to build general-purpose vision encoders. PE not only matches specialized models in their respective domains but does so with a unified and scalable approach.

The release of PE, along with its codebase and the PE Video Dataset, offers the research community a reproducible and efficient foundation for building multimodal AI systems. As visual reasoning tasks grow in complexity and scope, PE provides a path forward toward more integrated and robust visual understanding.




Check out the Paper, Model, Code, and Dataset.


 