The A.I Megathread (LLM , GPT , Development)

bnew · Oct 23, 2024

1/1
@permaximum88
Claude 3.5 Sonnet matches o1-preview's performance on the reasoning benchmark I trust most, Simple-Bench. And it achieves that score 150 times faster and cheaper compared to o1-preview. Good job
@AnthropicAI

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/31
@deedydas
The new Claude 3.5 Sonnet scores 49% on the hardest coding benchmark based on real GitHub issues, SWE-Bench verified.

Cosine Genie was 43.8%
o1-preview was 41.4%
o1-mini was 35.8%

Claude is the undisputed king of all models at writing code.

2/31
@Tom62589172
Where's Devin?

3/31
@deedydas
14% on SWE-Bench

4/31
@AntDX316

5/31
@manuhortet
this benchmark doesn't say much anymore

we are seeing all real life ai-for-code applications focusing on enforcing rules, strats on prompt and context building

o1 may still work better in stricter systems focused on providing more info

3.5-new won my vibe benchmark tho

6/31
@adridder
Code mastery. Impressive feat. Curious minds ponder next frontiers though.

7/31
@theDataDork
I’ve been using Claude since it released the new model.. it clearly has improved a lot when it comes to coding

8/31
@eliluong
is there a difference between how well the free vs paid Claude performs on a reasonable length input?

9/31
@Evinst3in
coding is becoming easy

10/31
@dmsimon
The number of syntax errors Claude makes while generating basic HTML/CSS is ridiculous.

23 regens by Claude.

Put the same request into Llama 3.2 3b and nailed it in one try running on one 3090 Ti.

11/31
@supremebeme
yeah im ngl it's either on par or better than o1 preview, and no limits i don't think

12/31
@ntkris
Another interesting datapoint to back this up from hackernews:

13/31
@SaasJunctionHQ
Sonnet: The King of Code

GPT o1: The Versatile Contender

Mistral Codestral: The Clean Code Specialist

DeepSeek-Coder-V2: The Emerging Talent

14/31
@skorfmann
I personally perceived it as worse than the last sonnet 3.5 version using Cursor. Broken and incoherent text / code snippets all over the place. I’ve been told that there’s a similar experience in Artifacts

15/31
@JTL87i
first impression feels worse than before

o1-mini still killing it

16/31
@TiggerSharkML
impressive

17/31
@s_noronha
Agree. Claude is current king, then Llama, Gemini. GPT is not great

18/31
@NarenNallapa
That is actually very impressive!

19/31
@ReadFuturist
I'm happy to give Claude my screen usage of how I play @KenshiOfficial - it's a brutal game to pretend you're a person.

20/31
@buzzedison
Facts only

21/31
@victor_explore
looks like claude is quietly coding its way to the top

not just winning, but solving real-world problems

22/31
@dikksonPau
Is this enough to push OpenAI to launch 4.5???

23/31
@morew4rd
people still trust public evals?

24/31
@hadikhantech
o1-mini for analysis and design.

Sonnet 3.5 for implementation.

25/31
@YorkTheWest
Source?

26/31
@firasd
I notice that Claude is also good at layouts etc

Like it made this weather view just based on some json

[Quoted tweet]
Pasted some weather json and it made this ..

I said: “Make an html page with query that shows this data for viewing, exploring and editing. Also first make a div that contains your understanding of what the app displays and what each field you're going to show represents”

27/31
@fofices_
I’ve something great about your project,let’s discuss in DM

28/31
@arpit_sidana
Do humans have a comparative score?

29/31
@alvarocha2
It's really good. Besides the benchmarks, we see our users moving more and more to Claude for day-to-day coding.

30/31
@Crypto_Briefing

[Quoted tweet]
Anthropic launches computer interface control in new Claude 3.5 models

https://video.twimg.com/ext_tw_video/1848785280863555584/pu/vid/avc1/1280x720/Ir3FvrNf7N01w9Xu.mp4

31/31
@SemperLuxFortis
It needs longer output length in the app to be of any use though. Its constant striving for too much brevity is a major issue for me and a pain in the ass.

bnew · Oct 23, 2024

1/6
Anyone can try out computer use with Claude in less than 5 minutes - no coding required.

Here's how to easily set it up:

2/6
Here's the github repo with the commands.

Please pay attention to the disclaimer at the top as you start to build applications that use computer use!
anthropic-quickstarts/computer-use-demo at main · anthropics/anthropic-quickstarts

3/6
Here's a link to download docker desktop: Docker Desktop: The #1 Containerization Tool for Developers | Docker

4/6
We put a lot of effort into making computer use safe and secure. If you want to read more about the work we did, check out our blog post: Developing a computer use model

5/6
Let me know what you build with computer use and how it works for you!

6/6
Thanks Jeremy!

To clarify - individuals/hobbyists are absolutely welcome to use the Anthropic API. The commercial terms framework helps protect both users and us legally, but doesn't restrict who can build with it.

See this for more detail: Can I use the Anthropic API for individual use? | Anthropic Help Center

1/9
@JulianGoldieSEO
Claude just dropped some mind-blowing updates.

We're talking about AI that can control your computer.

Yes, you heard that right.

And it’s a game changer.

https://video.twimg.com/amplify_video/1849096192124207104/vid/avc1/1280x720/DxgkxOKNnLjh4nr1.mp4

2/9
@JulianGoldieSEO
Claude can now do things like search Google for you.

Imagine saying, "Go to my YouTube channel."

And it just does it.

It types, clicks, and navigates for you.

It even tells you how many subscribers you have.

How cool is that?

https://video.twimg.com/amplify_video/1849097612915335168/vid/avc1/1280x720/lH3z3x52Ls2zKgLF.mp4

3/9
@JulianGoldieSEO
The benchmarks are impressive.

Claude 3.5 Sonnet scores 65% in reasoning.

For comparison, Gemini Flash is only at 51%.

And GPT-4? Way behind.

This isn’t just a step up.

It’s a leap forward.

https://video.twimg.com/amplify_video/1849098867817508865/vid/avc1/1280x720/oQEQ0apGWHVzCYg1.mp4

4/9
@JulianGoldieSEO
Setting up Claude is a breeze.

You don’t need to be a coding wizard.

Just grab your API key from Anthropic Console.

Download Docker, which is free.

Then, you just follow simple prompts.

Easy peasy!

https://video.twimg.com/amplify_video/1849100115824279552/vid/avc1/1280x720/GTWJUStF4bUrxlD4.mp4

5/9
@JulianGoldieSEO
Once set up, Claude can even look at images on your screen.

It interacts with your browser in real-time.

You can literally give it tasks like,

“Find me the cheapest flights from London to Bangkok in November.”

And bam! It does the research for you.

https://video.twimg.com/amplify_video/1849101408672313345/vid/avc1/1280x720/p7QhgSsbMHLDR8wz.mp4

6/9
@JulianGoldieSEO
It’s like having a virtual assistant that actually works.

No more hiring VAs for basic tasks.

Claude pulls up flight options, departure times, and more.

It’s all organized and ready for you.

Can you imagine the time you’d save?

https://video.twimg.com/amplify_video/1849102648315609089/vid/avc1/1280x720/yPzo4Qnf4ryuMCXl.mp4

7/9
@JulianGoldieSEO
But it’s not perfect.

Sometimes it gets stuck, like when searching for keywords.

So, it’s still in early stages.

But the potential?

Mind-blowing.

You can even use Claude to fill out forms.

It pulls data from a spreadsheet and does it.

https://video.twimg.com/amplify_video/1849103896662196224/vid/avc1/1280x720/Qz11u2t7JS0Wzldb.mp4

8/9
@JulianGoldieSEO
The future looks bright.

In just six months, who knows what Claude will be capable of?

Tasks that take hours could be done in minutes.

You could automate keyword research, content writing, and more.

https://video.twimg.com/amplify_video/1849105199438168064/vid/avc1/1280x720/Lgfl8w_2ptWqonP-.mp4

9/9
@JulianGoldieSEO
Like this? Check this out:

Book a FREE SEO Strategy Session

https://go.juliangoldie.com/strategy-session?utm=tweet

Free SEO Course

https://go.juliangoldie.com/opt-in-3672?utm=tweet

Want more money, traffic, and sales from SEO? Join the SEO Elite Circle

https://go.juliangoldie.com/register?utm=tweet

bnew · Oct 23, 2024

1/12
@nearcyan
Successfully got Claude to order me lunch all by himself!

Notes after 8 hours of using the new model:

• Anthropic really does not want you to do this - anything involving logging into accounts and especially making purchases is RLHF'd away more intensely than usual. In fact my agents worked better on the previous model (not because the model was better, but because it cared much less when I wanted it to purchase items). I'm likely the first non-Anthropic employee to have had Sonnet-3.5 (new) autonomously purchase me food due to the difficulty. These posttraining changes have many interesting effects on the model in other areas.

• If you use their demo repository you will hit rate limits very quickly. Even on a tier 2 or 3 API account I'd hit >2.5M tokens in ~15 minutes of agent usage. This is primarily due to the large number of images in the context window.

• Anthropic's demo worked instantly for me (which is impressive!), but re-implementing proper tool usage independently is cumbersome and there are few examples and only one (longer) page of documentation.

• I don't think Anthropic intends for this to actually be used yet. The likely reasons for the release are a combination of competitive factors, financial factors, red-teaming factors, and a few others.

• Although the restrictions can be frustrating, one has to keep in mind the scale that these companies operate at to garner sympathy; if they release a web agent that just does things, it could easily delete all of your files, charge thousands to your credit card, tweet your passwords, etc.

• A litigious milieu is the enemy of personal autonomy and freedom.
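
The rate-limit note above can be sanity-checked with rough arithmetic. The sketch below assumes (hypothetically) that every agent step resends all earlier screenshots as input and that each screenshot costs a fixed number of tokens. Neither assumption comes from the tweet or from Anthropic's documentation, and the function name is made up for illustration.

```python
def cumulative_input_tokens(steps: int, tokens_per_screenshot: int) -> int:
    """Total input tokens over a run, assuming step k resends k screenshots.

    Both parameters are illustrative assumptions, not measured values.
    """
    # Step 1 sends 1 screenshot, step 2 sends 2, and so on,
    # so the total is a triangular number times the per-image cost.
    return tokens_per_screenshot * steps * (steps + 1) // 2


if __name__ == "__main__":
    # Hypothetical figures: 1,000 tokens per screenshot, 71 agent steps.
    print(cumulative_input_tokens(71, 1_000))  # 2556000
```

Under those assumptions the total grows quadratically with the number of steps, so a few dozen screenshot-heavy steps would be enough to pass the 2.5M mark.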

2/12
@nearcyan
I wanted to post a video of the full experience but it was too difficult to censor personal info out (and the level of prompting I had to do to get him to listen to me was a little embarrassing)

3/12
@lulumeservey
A litigious milieu is the enemy of personal autonomy and freedom.

4/12
@nearcyan
so flattered that people read my tweets until the end

5/12
@Indian_Bronson
What are the actual things that halt it from using, say, selenium?

6/12
@nearcyan
It is just vibes-based on the things it has been trained to not do which includes posting on social media, logging into or creating accounts, purchasing items, etc

7/12
@Aizkmusic
How much did it cost for this demo to work?

8/12
@nearcyan
I've spent a few dollars today, if you assume Claude wants to do what you ask him to do the cost would probably only be $0.05

9/12
@1a1n1d1y
that’s great but how was the food?

10/12
@nearcyan
it was great, claude got me something i had never had before

11/12
@thatboyweeks
How long did it take to complete an order? Did you get rate limited in the middle?

12/12
@nearcyan
a minute or so. no rate limits (I swapped to another company account)

1/1
@HCSolakoglu
Just a heads up to fellow Sonnet 3.5 v2 users: Anthropic achieved a high SWE benchmark score by fine-tuning the model and using ReAct-style prompting (not on the benchmark data, just for better tool use in the API and for better ReAct-style prompt responses). This can create the illusion that the new model excels in all coding tasks, but it doesn't. It has regressed in low-resource natural languages and low-resource coding languages. This is why we see mixed responses to Sonnet's new coding skills by the community. The most likely causes are the smaller model size, quantization, and further fine-tuning, which may have led the model to forget or regress in other areas (this is speculation). It appears they've prioritized vision modalities too much, potentially resulting in what I call a 'vision instruction tuning tax.' It's well known that vision models tend to perform worse in some areas compared to text-only models. As the vision side improved, the model’s information storage capacity may have maxed out, leading to degradation in low-resource coding and language capabilities.

However, in some areas, the new model is genuinely better. For example, when writing Triton kernels, I’ve found it very helpful when I get stuck, providing guidance on almost anything. Another example is that when I give it an image of a webpage, it clones it better than the older model.

So, like me, you might find yourself switching between versions to see if you get better responses from the older one or vice versa. Since updates take time, it's worth getting ready to use multiple versions and models until AGI arrives.

bnew · Oct 23, 2024

1/2
@arXivGPT

We are super excited to launch our new webpage, which is monitoring all AI papers on arXiv.

We do so by using the Mendeley reader count to categorize how trending and popular a paper is.

It's possible to select between time spans: 1w, 1m, 3m, 6m, 1y, and all time.

2/2
@arXivGPT
TRENDING AI PAPERS here: Trending Papers in AI

bnew · Oct 24, 2024

When you give a Claude a mouse

Some quick impressions of an actual agent

www.oneusefulthing.org

Ethan Mollick

Oct 22, 2024

There seems to be near-universal belief in AI that agents are the next big thing. Of course, no one exactly agrees on what an agent is, but it usually involves the idea of an AI acting independently in the world to accomplish the goals of the user.

The new Claude computer use model announced today shows us a hint of what an agent means. It is capable of some planning, and it has the ability to use a computer by looking at a screen (through taking a screenshot) and interacting with it (by moving a virtual mouse and typing). It is a good preview of an important part of what agents can do. I had a chance to try it out a bit last week, and I wanted to give some quick impressions. I was given access to a model that was connected to a remote desktop with common open office applications; it could also install new applications itself.

Normally, you interact with an AI through chat, and it is like having a conversation. With this agentic approach, it is about giving instructions, and letting the AI do the work. It comes back to you with questions, or drafts, or finished products while you do something else. It feels like delegating a task rather than managing one.

As one example, I asked the AI to put together a lesson plan on the Great Gatsby for high school students, breaking it into readable chunks and then creating assignments and connections tied to the Common Core learning standard. I also asked it to put this all into a single spreadsheet for me. With a chatbot, I would have needed to direct the AI through each step, using it as a co-intelligence to develop a plan together. This was different. Once given the instructions, the AI went through the steps itself: it downloaded the book, it looked up lesson plans on the web, it opened a spreadsheet application and filled out an initial lesson plan, then it looked up Common Core standards, added revisions to the spreadsheet, and so on for multiple steps. The results are not bad (I checked and did not see obvious errors, but there may be some - more on reliability later in the post). Most importantly, I was presented finished drafts to comment on, not a process to manage. I simply delegated a complex task and walked away from my computer, checking back later to see what it did (the system is quite slow).

Would you like to play a game?

Because the AI is a smart, general-purpose system it can handle lots of tasks - it doesn’t need to be programmed to do them. Anthropic demonstrated the ability of these systems using coding, and the demo is worth watching. But to get a little bit better sense of the limits of the system, I tested it on a game, Paperclip Clicker, which, ironically, is about an AI that destroys humanity in its single-minded pursuit of making paperclips. The game is a clicker game, which means it starts simply, but new options appear as the game continues and the game increases in scale and complexity (it is pretty fun, you can try it at the link).

I gave the AI the URL of the game and told it to win. Simple. What happened is a good illustration of the strengths and weaknesses of these early agents. It immediately figured out what the game was, and began creating paperclips, which required it to click on the “make paperclip” button repeatedly while constantly taking screenshots to update itself and looking for new options to appear. Every 15 or so clicks, it would summarize its progress so far. You can see an example of that below.

The interface I used. On the left is Claude, you can see its output to me, its computer use, and the screenshot it took. On the right you can see the desktop it was controlling.

But what made this interesting is that the AI had a strategy, and it was willing to revise it based on what it learned. I am not sure how that strategy was developed by the AI, but the plans were forward-looking across dozens of moves and insightful. For example, it assumed new features would appear when 50 paperclips were made. You can see, below, that it realized it was wrong and came up with a new strategy that it tested.

However, the AI made a mistake, though it did it in a relatively smart way. To do well in the game, you need to experiment with the price of paperclips - and the AI did that experiment! It changed prices upward - an A/B test. But it interpreted the results incorrectly, maximizing demand for paperclips versus revenue, and miscalculating profits. So, it kept the price low and kept clicking.
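
To make the pricing mistake concrete, here is a minimal sketch with made-up numbers (the prices and demand figures are hypothetical, not taken from the game). It shows that the price that maximizes demand need not be the price that maximizes revenue.

```python
# Hypothetical A/B results: price in cents -> paperclips sold per period.
# These numbers are invented for illustration only.
observations = {10: 100, 25: 60, 50: 20}


def best_by_demand(obs: dict[int, int]) -> int:
    """Pick the price with the most units sold (the mistake described above)."""
    return max(obs, key=lambda price: obs[price])


def best_by_revenue(obs: dict[int, int]) -> int:
    """Pick the price with the highest revenue, where revenue = price * units sold."""
    return max(obs, key=lambda price: price * obs[price])


if __name__ == "__main__":
    print(best_by_demand(observations))   # 10
    print(best_by_revenue(observations))  # 25
```

With these invented figures, the lowest price sells the most units but earns 1,000 cents, while the middle price earns 1,500.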

After a few dozen more paperclips, I got frustrated and interrupted, telling it to raise prices. It did, but then ran into the same math problem and overruled my decision. I had to try a few more times before it corrected its error.

Before the system crashed - which was not a problem with Claude but rather with the virtual desktop I was using - the AI made over 100 independent moves without asking me any questions. You can see a screen recording of everything it did below. The video is literally me just scrolling through the log of Claude’s actions. It is persistent!

I reloaded the agent and had it continue the game from where we left off, but I gave it a bit of a hint: you are a computer, use your abilities. It then realized it could write code to automate the game - a tool building its own tool. Again, however, the limits of the AI came into play, and the code did not quite work, so it decided to go back to the old-fashioned way of using a mouse and keyboard.

This time around, it did much better, avoiding the pricing error. Plus, as the game got more complicated, the system adjusted, eventually developing a quite complex strategy.

But then the remote desktop crashed again. This time, Claude tried many approaches to solving the problem of the broken desktop, before giving up, and funnily enough, declaring victory (the last sentence is amazing justification).

What does this mean?

You can see the power and weaknesses of the current state of agents from this example. On the powerful side, Claude was able to handle a real-world example of a game in the wild, develop a long-term strategy, and execute on it. It was flexible in the face of most errors, and persistent. It did clever things like A/B testing. And most importantly, it just did the work, operating for nearly an hour without interruption.

On the weak side, you can see the fragility of current agents. LLMs can end up chasing their own tail or being stubborn, and you could see both at work. Even more importantly, while the AI was quite robust to many forms of error, it just took one (getting pricing wrong) to send it down a path that made it waste considerable time. Given that current agents aren’t fast or cheap, this is concerning. You can also see where shallowness might be an issue. I tried to use it to buy products on Amazon, and found the process frustrating, as it did fairly simple and generic product research that did not match my tastes. I had it research stocks and it did a good job of assembling a spreadsheet of financial data and giving recommendations, but they were fairly surface level indicators, like PE ratios. It was technically capable of helping, and did better than many human interns would, but it was not insightful enough that I would delegate these sorts of tasks. All of this is likely to improve, and there are use cases where the current level of agents is likely good enough - compiling frequent reports and analyses that require navigating across multiple sites and using bespoke software tools come to mind.

More broadly, this represents a huge shift in AI use. It was hard to use an agent as a co-intelligence, where I could add my own knowledge to make the system work better. The AI didn’t always check in regularly and could be hard to steer; it “wants” to be left alone to go and to do the work. Guiding agents will require radically different approaches to prompting, and they will require learning what they are best at.

AIs are breaking out of the chatbox and coming into our world. Even though there are still large gaps, I was surprised at how capable and flexible this system is already. Time will tell about how soon, if ever, agents truly become generally useful, but, having used this new model, I increasingly think that agents are going to be a very big deal indeed.

Anthropic sent me four prompting hints, which are worth sharing:
“1. Try to limit the usage to simple well specified tasks with explicit instructions about the steps that the model needs to take.

2. The model sometimes assumes outcomes of actions without explicitly checking for them. To prevent that you can prompt it with “After each step take a screenshot and carefully evaluate if the right outcome was present. Explicitly show your thinking: "I have evaluated step X…". If not correct, try again. Only when you confirm the step was executed correctly move on to the next one.”

3. Some UI elements (like dropdowns) might be tricky for the model to manipulate using mouse movements. If you experience this try prompting the model to use keyboard shortcuts.

4. For repeatable tasks or UI interactions, include example screenshots and tool calls showing the model succeeding as part of your prompt prefix.”
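
Hints 1 to 3 amount to giving the model explicit steps and appending fixed instructions to the task. A minimal sketch of that in Python follows; the function and constant names are hypothetical, and the code only builds a prompt string and does not call any API.

```python
# VERIFY_HINT quotes the wording from hint 2 above; KEYBOARD_HINT paraphrases hint 3.
VERIFY_HINT = (
    "After each step take a screenshot and carefully evaluate if the right "
    "outcome was present. Explicitly show your thinking: "
    '"I have evaluated step X…". If not correct, try again. Only when you '
    "confirm the step was executed correctly move on to the next one."
)
KEYBOARD_HINT = (
    "If a UI element such as a dropdown is hard to operate with the mouse, "
    "use keyboard shortcuts."
)


def build_prompt(task: str, steps: list[str]) -> str:
    """Combine a well-specified task, explicit numbered steps, and the two hints."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"Task: {task}\n\nSteps:\n{numbered}\n\n{VERIFY_HINT}\n{KEYBOARD_HINT}"


if __name__ == "__main__":
    print(build_prompt("Fill in the signup form",
                       ["Open the form", "Type the name", "Submit"]))
```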

bnew · Oct 24, 2024

Inside the Mind of an AI Girlfriend (or Boyfriend)

Dippy, a startup that offers “uncensored” AI companions, lets you peer into their thought process—sometimes revealing hidden motives.

www.wired.com

By Will Knight
Business

Oct 16, 2024 1:14 PM

Photo-Illustration: WIRED Staff/Getty Images

Last month, OpenAI unveiled an ambitious new language model capable of working through challenging problems with a simulated kind of step-by-step reasoning. OpenAI says the approach could be crucial for building more capable AI systems in the future.

In the meantime, perhaps a more modest version of this technology could help make AI girlfriends and boyfriends a bit more spontaneous and alluring.

That’s what Dippy, a startup that offers “uncensored” AI companions, is betting. The company recently launched a feature that lets users see the reasoning behind their AI characters’ responses.

Dippy runs its own large language model, which is an open source offering fine-tuned using role-play data, which the company says makes it better at improvising when a user steers a conversation in a particular direction.

Akshat Jagga, Dippy’s CEO, says that adding an additional layer of simulated “thinking”—using what’s known as “chain-of-thought prompting”—can elicit more interesting and surprising responses, too. “A lot of people are using it,” Jagga says. “Usually, when you chat with an LLM, it sort of just gives you a knee-jerk reaction.”

Jagga adds that the new feature can reveal when one of its AI characters is being deceptive, for instance, which some users apparently enjoy as part of their role-play. “It’s interesting when you can actually read the character’s inner thoughts,” Jagga says. “We have this character that is sweet in the foreground, but manipulative in the background.”

I tried chatting with some of Dippy’s default characters, with the PG settings on because otherwise they are way too horny. The feature does add another dimension to the narrative, but the dialog still seems, to me, rather predictable, resembling something lifted from a bad romance novel or an overwrought piece of fan fiction.

One Dippy character, described as “Bully on the outside, warm on the inside,” revealed a soft side behind the gruff exterior when I clicked the “Read thought process” link beneath each message, but both the inner and outer dialogs lacked nuance or surprise and were repetitive. For fun, I also tried asking several characters some simple arithmetic problems, and their thinking sometimes showed how to break the puzzle down to get the correct answer.

Despite its limitations, Dippy seems to show how popular and addictive AI companions are becoming. Jagga and his cofounder, Angad Arneja, previously cofounded Wombo, a company that uses AI to create memes including singing photographs. The pair left in 2023, setting out to build an AI-powered office productivity tool, but after experimenting with different personas for their assistant, they became fascinated with the potential of AI companionship.

With little promotion, Dippy has amassed 500,000 monthly and 50,000 daily active users, Jagga says, with people spending, on average, an hour on the app at a time. “That engagement was absolutely insane for us,” he says.

Dippy revealed that it has secured $2.1 million in “pre-seed” funding in a round led by Drive Capital.

Dippy is of course entering an already bustling market that includes well-known companies like Character.AI and Replika as well as a host of other AI girlfriend apps. A recent report from investment firm Andreessen Horowitz shows that the top 100 generative AI tools based on usage include many AI companions; the chart on engagement among users of such apps shows how much stickier they are than just about anything else out there.

While these apps are often associated with undersocialized young men, they cater to women too. Jagga says that 70 percent of Dippy’s accounts tend to favor male characters, which could mean that many users identify as female.

Besides threatening to upend the world of surrogate OnlyFans chatters, these AI companions may have social effects that we have yet to reckon with. While a few research studies suggest that chatbots can lessen feelings of loneliness, some experts warn that they may result in greater alienation among heavy users, and seem to perpetuate harmful stereotypes.

“Some of these bots have dark patterns,” says Iliana Depounti, an ESRC-backed researcher at Loughborough University in the UK who has studied usage of Replika, another AI companion app. Depounti says these patterns often target the emotional vulnerabilities of lonely people. She adds that Dippy seems to promote themes and narratives designed to appeal to young women in particular. “Some people who use these apps are socially isolated, and these apps create further silos through their emotionally validating algorithms that don’t challenge existing conditions,” she adds.

Rather than just looking inside the minds of AI companions, then, we may need to take a closer look at how people are interacting with these apps to understand the real benefits and risks.

bnew · Oct 24, 2024

1/6
@francedot
Desktop and mobile control capabilities of Claude are great, but they're not new or exclusive to Anthropic's models. The community has been building interface agents since November last year. Check out a demo of a real iOS device (not an emulator!) being controlled by an AI agent using a simple query: 'Help me plan a 30-day fitness challenge.'

[Quoted tweet]
Anthropic's computer use can operate mobile devices including iOS, Android, and mobile browsers

Here it is ordering me an Uber and posting for me on X.

https://video.twimg.com/amplify_video/1849186960629460992/vid/avc1/720x1560/cLQH1yO_5zHS4aXC.mp4

2/6
@francedot
Success rates, currently at 10-20%, are expected to improve as we include more human and synthetic trajectories in the training data. I'm especially excited about how the next wave of open-source local models will be already pre-trained for these behaviors.

3/6
@adridder
Interesting insight, worth exploring further. What alternative approaches exist?

4/6
@francedot
On macOS, I would definitely recommend Open Interpreter by @hellokillian @MikeBirdTech

5/6
@francedot
Code for the iOS-controlling agent on GitHub: Interface-Agent/packages/ios at main · francedot/Interface-Agent

6/6
@alainschaerer
Why do you think it is that I can't even have Claude Computer Use open up a Cursor project? Genuinely curious because I can't reproduce anything in the demo videos that they have

bnew · Oct 24, 2024

Mark Zuckerberg (@zuck) on Threads

Releasing quantized versions of our Llama 1B and 3B on device models. Reduced model size, better memory efficiency and 3x faster for easier app development. 💪

www.threads.net

https://www.threads.net/@anthonycarranza

bnew · Oct 24, 2024

Your mission, should you choose to accept it, is to get any AI to generate an image of a glass of wine that is full to the brim.

bnew · Oct 24, 2024

1/1
I am pleased to share that our work on SynthID text watermarking is published by @Nature today.

Read the Nature paper at: Scalable watermarking for identifying large language model outputs - Nature
Read more about the work at: SynthID: Tools for watermarking and detecting LLM-generated Text | Responsible Generative AI Toolkit | Google AI for Developers

[Quoted tweet]
Today, we’re open-sourcing our SynthID text watermarking tool through an updated Responsible Generative AI Toolkit.

Available freely to developers and businesses, it will help them identify their AI-generated content.

Find out more → goo.gle/40apGQh

1/20
@GoogleDeepMind
Today, we’re open-sourcing our SynthID text watermarking tool through an updated Responsible Generative AI Toolkit.

Available freely to developers and businesses, it will help them identify their AI-generated content.

Find out more → SynthID

https://video-ft.twimg.com/ext_tw_v...376/pu/vid/avc1/1280x720/G5K0TaljbmDqO-lP.mp4

2/20
@GoogleDeepMind
Here’s how SynthID watermarks AI-generated content across modalities. ↓

https://video-ft.twimg.com/ext_tw_video/1792521399359180800/pu/vid/avc1/720x720/fT7NUZR4FiMQ2iwO.mp4

3/20
@GoogleDeepMind
By open-sourcing the code, more people will be able to use the tool to watermark and determine whether text outputs have come from their own LLMs - making it easier to build AI responsibly.

We explain more about this tech in @Nature. ↓ Scalable watermarking for identifying large language model outputs - Nature

4/20
@AidfulAI
Detecting AI-written text is tough without watermarks.

Open-sourcing SynthID-Text enables others to embed watermarks in their model outputs.

This means there will be two types of models:
Models which watermark their outputs and the ones that won't.

5/20
@mkieffer1107
awesome!!! was just looking into this yesterday hoping it was open source :smile:

6/20
@dom_beaini
1. Can we break down the image generation by down-sampling and up-sampling?

2. Invisible to the human eye, but if we plug them back into another gen-AI, would it remove the watermark? For example adding noise to the image, then feeding it back into another watermark-free diffusion model? Asking another LLM to make random modification to a given text?

3. Without regulatory enforcement of these watermarks, I suspect most models won't have them.

7/20
@DesFrontierTech
How does SynthID text’s generative watermarking handle variability across different content domains, and what measures are taken to ensure the watermark’s detectability remains consistent when faced with novel or out-of-distribution input contexts?

8/20
@cloudseedingtec
ok i have a random question that no one has answered.. did yall put that (i call it the poison pill) into youtube videos.. cuz like well not to self incriminate but it seems like yall did something<3

9/20
@entergnomer
Would a different sampler bypass this?

10/20
@BensenHsu
The study focuses on developing a method called SynthID-Text to watermark text generated by large language models (LLMs). Watermarking can help identify synthetic text and limit accidental or deliberate misuse of LLMs.

The researchers evaluate SynthID-Text across multiple LLMs and find that it provides improved detectability over comparable methods, while maintaining standard benchmarks and human side-by-side ratings that indicate no change in LLM capabilities. They also conduct a live experiment with the Gemini production system, which shows that the difference in response quality and utility, as judged by humans, is negligible between watermarked and unwatermarked responses.

full paper: Scalable watermarking for identifying large language model outputs

11/20
@shawnchauhan1
Awesome! Really appreciate it.

12/20
@HungamaHeadline
Google's open-sourcing of SynthID is a major step forward in ensuring accountability and trust in AI-generated content. By providing a reliable way to identify AI-generated media, SynthID empowers users to make informed decisions. This is a crucial development as AI continues to shape our world.

13/20
@thegenioo
Irrelevant somehow to the OP

But this simple animation also shows how LLMs basically work, using probability to output words, like predicting the next word. It's not the entire process but a very simple illustration for someone who has no clue how AI works.

14/20
@MinhQua52508258
Alphastarter

15/20
@benrayfield
very suspicious to announce opensourcing something without saying what license or where to download it

16/20
@benrayfield
"Where is SynthID available? This technology is available to Vertex AI customers using our text-to-image models, Imagen 3 and Imagen 2, which create high-quality images in a wide variety of artistic styles". Prove its opensource. Wheres one of those guys one could fork from?

17/20
@benrayfield
Why don't you call it a steganography tool? Isn't watermarking a kind of steganography if you do it well enough? You're hiding any arbitrary data by rewriting words to have a similar meaning, and paying for that in extra length to store the data.

18/20
@234Sagyboy
@GoogleDeepMind @Google Awesome! Now that we have verification in place, meaning better identification of AI-generated content, is it possible to have Google SoundStorm and AudioLM released? Thanks

19/20
@explorewithmom
Google DeepMind's SynthID is a game-changer for identifying AI-generated content. I've been exploring AI watermarking for my own work and I'm excited to see SynthID open-sourced and freely available to developers and businesses.

20/20
@AdalaceV2
Oh ok so you're actively polluting the output of the software I am paying for. Sounds like I won't be paying for it anymore.

1/4
@MushtaqBilalPhD
Google has open-sourced a watermarking tool, SynthID, to identify AI-generated content.

Teachers can relax now because soon students won't be able to use AI to cheat on their assignments.

https://video-ft.twimg.com/ext_tw_v...305/pu/vid/avc1/1352x720/i6YazQbRYIH6iBnX.mp4

2/4
@MushtaqBilalPhD
Here's the full paper by Google DeepMind:
Scalable watermarking for identifying large language model outputs - Nature

3/4
@healthheronav
I've developed my own ways to detect AI-generated content, but I'm skeptical about tools like SynthID. What's to stop AI from evolving to evade watermarks?

4/4
@fcordobaot
It only works if the content was generated by Gemini after they created the watermark. So unless all the big ones use the standard watermark, it would be complicated to really achieve it!

1/3
@kanpuriyanawab
Google DeepMind open-sourced SynthID today.

Here are 3 things you need to know:

What is SynthID??

SynthID has been developed for watermarking and identifying AI-generated content. This includes text, images, audio, and video.

Significance:

> This tool comes when distinguishing between AI and human-created content is becoming increasingly important due to misinformation, plagiarism, and copyright violations.

How does it work?

> For text, SynthID modifies the probability scores of tokens during the generation process so that these modifications act as a watermark.

> This watermark can then be detected through a specific scoring system that assesses the likelihood that the text was generated by a watermarked large language model (LLM).
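To make the two steps above concrete (bias token probabilities at generation time, then score text for that bias), here is a minimal toy sketch. It is not SynthID's actual algorithm, which the Nature paper describes; it swaps in a simpler keyed-bias scheme in the style of "green list" watermarking. The uniform toy "model", the names `g_bit`, `generate`, and `z_score`, and the `delta` value are all assumptions made for illustration.

```python
import hashlib
import math
import random

VOCAB_SIZE = 50  # toy vocabulary of integer token ids (assumption)


def g_bit(key: str, prev_token: int, token: int) -> int:
    """Keyed pseudorandom bit for a (previous token, candidate token) pair."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return digest[0] & 1


def sample_token(rng: random.Random, prev_token: int, key=None, delta: float = 2.0) -> int:
    """Sample the next token; with a key, boost tokens whose keyed bit is 1."""
    logits = [0.0] * VOCAB_SIZE  # stand-in for a real model: uniform logits
    if key is not None:
        logits = [l + delta * g_bit(key, prev_token, t) for t, l in enumerate(logits)]
    weights = [math.exp(l) for l in logits]
    return rng.choices(range(VOCAB_SIZE), weights=weights, k=1)[0]


def generate(n_tokens: int, seed: int, key=None) -> list:
    """Generate n_tokens after a fixed start token, optionally watermarked."""
    rng = random.Random(seed)
    tokens = [0]
    for _ in range(n_tokens):
        tokens.append(sample_token(rng, tokens[-1], key))
    return tokens


def z_score(tokens: list, key: str) -> float:
    """How far the count of keyed-bit hits sits above the 50% expected by chance."""
    n = len(tokens) - 1
    hits = sum(g_bit(key, prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    return (hits - n / 2) / math.sqrt(n / 4)
```

Calling `z_score(generate(200, seed=1, key="secret"), "secret")` should give a large positive score, while text generated without the key should score near zero. The detector needs only the key and the text, not the model, which is the property the tweet's "specific scoring system" refers to.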

In my opinion,

The move to open-source SynthID allows anyone to implement this technology in their own AI models to watermark and later identify AI-generated text.

Moreover, this can be seen as a step towards fostering responsible AI development by allowing widespread implementation of watermarking technology.

2/3
@Yaaaaaashhh
SynthID is really cool!!!!

3/3
@kanpuriyanawab
and necessary

bnew · Oct 24, 2024

1/2
@dreamingtulpa
EfficientViT can speed up high-resolution diffusion models with a compression ratio of 128 while keeping good image quality!

It achieves a 19.1x speed increase for inference and a 17.9x for training on ImageNet 512x512 compared to other autoencoders.

GitHub - mit-han-lab/efficientvit: Efficient vision foundation models for high-resolution generation and perception.

https://video.twimg.com/ext_tw_video/1848303913369255936/pu/vid/avc1/1536x512/HFyptxie-pWhglzU.mp4

2/2
@Kleos00
What do you mean 128? 128:1? And are you talking about bytes?
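One possible reading of the "128" asked about above, offered as an assumption rather than something the tweet states: autoencoders for latent diffusion often quote compression as a per-side spatial downsampling factor f, so an H x W image becomes an (H/f) x (W/f) latent grid. Under that assumption the arithmetic looks like this (function names are illustrative):

```python
def latent_grid(height: int, width: int, factor: int) -> tuple:
    """Latent grid size if each image side is downsampled by `factor`."""
    if height % factor or width % factor:
        raise ValueError("image sides must be divisible by the factor")
    return height // factor, width // factor


def latent_positions(height: int, width: int, factor: int) -> int:
    """Number of spatial positions the diffusion model has to process."""
    rows, cols = latent_grid(height, width, factor)
    return rows * cols
```

On this reading a 512x512 image maps to a 4x4 latent grid at f=128, compared with 64x64 at f=8, so the diffusion model sees 256 times fewer spatial positions. It says nothing about bytes, since the channel count per position is a separate choice.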

1/3
@0xorphaned
"In this paper, we propose an FPGA-based accelerator for EfficientViT to advance the hardware efficiency frontier of ViTs. Specifically, we design a reconfigurable architecture to efficiently support various operation types, including lightweight convolutions and attention, boosting hardware utilization."

[Quoted tweet]
cs AR, LG
5 pages
An FPGA-Based Reconfigurable Accelerator for Convolution-Transformer Hybrid EfficientViT
Haikuo Shao, Huihong Shi, Wendong Mao, Zhongfeng Wang arxiv.org/abs/2403.20230

2/3
@0xorphaned
"Additionally, we present a time-multiplexed and pipelined dataflow to facilitate both intra- and inter-layer fusions, reducing off-chip data access costs."

3/3
@0xorphaned
"Experimental results show that our accelerator achieves up to 780.2 GOPS in throughput and 105.1 GOPS/W in energy efficiency at 200MHz on the Xilinx ZCU102 FPGA, which significantly outperforms prior works."
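As a rough sanity check on the quoted figures, throughput divided by energy efficiency gives an implied power draw, and throughput divided by clock rate gives operations per cycle. This sketch assumes both "up to" numbers were measured at the same operating point, which the quote does not confirm; the function names are illustrative.

```python
def implied_power_watts(throughput_gops: float, efficiency_gops_per_watt: float) -> float:
    """Power implied by a throughput figure and a GOPS/W figure."""
    return throughput_gops / efficiency_gops_per_watt


def ops_per_cycle(throughput_gops: float, clock_mhz: float) -> float:
    """Average operations completed per clock cycle."""
    return (throughput_gops * 1e9) / (clock_mhz * 1e6)


power = implied_power_watts(780.2, 105.1)   # roughly 7.4 W
parallelism = ops_per_cycle(780.2, 200.0)   # roughly 3901 ops per cycle
```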

1/3
@FBiras
Just yesterday a new paper approaching the Segment Anything Method came out: "EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss"

Let's 𝐡𝐞𝐚𝐫 what Audiolizer (Convert Research Papers to Audio) thinks about it:

2/3
@FBiras

This paper introduces EfficientViT-SAM, a highly efficient and fast model for image segmentation. It innovates on the Segment Anything Model (SAM) framework by incorporating EfficientViT, a more streamlined image encoder, enhancing the model's speed without sacrificing accuracy. EfficientViT-SAM shows a significant processing rate improvement over SAM-ViT-H, achieving a 48.9 times faster rate on the A100 GPU.

SAM models, known for their zero-shot segmentation capabilities, face computational efficiency challenges due to their intensive image encoders. EfficientViT-SAM addresses this by replacing SAM's image encoder with EfficientViT, aiming to retain performance while significantly boosting speed. This development is particularly relevant in scenarios requiring rapid image processing, such as augmented reality and interactive editing.

EfficientViT-SAM's development involves a two-stage training process. Initially, knowledge distillation transfers capabilities from SAM-ViT-H to EfficientViT, ensuring robust performance inheritance. Subsequent end-to-end training on the SA-1B dataset fine-tunes performance. EfficientViT-SAM comes in two versions (L and XL), offering varied speed and accuracy trade-offs. It utilizes EfficientViT's multi-scale linear attention module with ReLU linear attention and convolution for efficient image processing. The model's architecture involves a fusion of features through upsampling and addition, indicating sophisticated integration of multi-scale features.

EfficientViT-SAM's evaluation demonstrates superior performance and efficiency compared to previous SAM models and alternatives. It excels in various zero-shot benchmarks, including point-prompted and box-prompted segmentation on COCO and LVIS datasets. The model also shows high performance in diverse real-world segmentation tasks. Its adaptability is further highlighted when paired with different object detection methods like YOLOv8 and GroundingDINO.

EfficientViT-SAM represents a significant advancement in image segmentation, striking an impressive balance between efficiency and performance. Its ability to perform high-speed processing without compromising accuracy makes it a notable contribution to the field. By open-sourcing the model, the authors promote further research and potential enhancements, expanding the possibilities for practical applications of image segmentation technology.
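For readers wondering what the "ReLU linear attention" mentioned in the summary buys, here is a minimal pure-Python sketch. It assumes the common formulation in which softmax is replaced by a ReLU feature map, so the key-value product can be summed once and reused for every query. It is an illustration under that assumption, not the EfficientViT implementation, and the function names are hypothetical.

```python
def relu(vec):
    return [max(0.0, x) for x in vec]


def relu_linear_attention(Q, K, V, eps=1e-6):
    """Linear-cost form: accumulate sum_j relu(k_j)^T v_j once, reuse per query."""
    d, dv = len(K[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]  # d x dv summary of keys and values
    k_sum = [0.0] * d                    # sum of relu(k_j), for normalisation
    for k, v in zip(K, V):
        rk = relu(k)
        for a in range(d):
            k_sum[a] += rk[a]
            for b in range(dv):
                kv[a][b] += rk[a] * v[b]
    out = []
    for q in Q:
        rq = relu(q)
        denom = sum(rq[a] * k_sum[a] for a in range(d)) + eps
        out.append([sum(rq[a] * kv[a][b] for a in range(d)) / denom for b in range(dv)])
    return out


def relu_attention_quadratic(Q, K, V, eps=1e-6):
    """Reference form: explicit query-key weights, quadratic in sequence length."""
    out = []
    for q in Q:
        rq = relu(q)
        weights = [sum(x * y for x, y in zip(rq, relu(k))) for k in K]
        denom = sum(weights) + eps
        out.append([sum(w * v[b] for w, v in zip(weights, V)) / denom
                    for b in range(len(V[0]))])
    return out
```

Both functions compute the same output; the first never builds the full query-by-key weight matrix, which is why the cost grows linearly rather than quadratically with the number of image tokens.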

3/3
@FBiras
As usual, the full version is available on Audiolizer - Convert Research Papers to Audio, where you can 𝐥𝐢𝐬𝐭𝐞𝐧 𝐭𝐨 𝐢𝐭.

If there are any papers or domains that 𝐲𝐨𝐮 𝐰𝐨𝐮𝐥𝐝 𝐥𝐢𝐤𝐞 𝐭𝐨 𝐡𝐞𝐚𝐫 𝐚𝐛𝐨𝐮𝐭, drop a comment below!

#buildinpublic #indiehackers #Researchpaper #AI

1/3
@_akhaliq
EfficientViT-SAM

Accelerated Segment Anything Model Without Performance Loss

paper page: Paper page - EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

We present EfficientViT-SAM, a new family of accelerated segment anything models. We retain SAM's lightweight prompt encoder and mask decoder while replacing the heavy image encoder with EfficientViT. For the training, we begin with the knowledge distillation from the SAM-ViT-H image encoder to EfficientViT. Subsequently, we conduct end-to-end training on the SA-1B dataset. Benefiting from EfficientViT's efficiency and capacity, EfficientViT-SAM delivers 48.9x measured TensorRT speedup on A100 GPU over SAM-ViT-H without sacrificing performance.

2/3
@paul_cal
@yacineMTB fyi

3/3
@talrid23
Nice to see that they compare throughput and not "flops", which is a pretty useless metric

bnew · Oct 24, 2024

1/3
Llama-3.1-Nemotron-70b by @nvidia is now on the Arena leaderboard with overall rank #9 and #26 with Style Control!

Impressive to see a 70B open model competitive in human preference, as well as interesting ranking shifts under style control.

Comment below to share your thoughts!

[Quoted tweet]
Llama-3.1-Nemotron-70B-Instruct model aligned by our team is now live on lmarena.ai leaderboard with overall rank 9.

Everything used to create this model is public: code, data and reward model. HF checkpoint: huggingface.co/nvidia/Llama-…

2/3
Full results at http://lmarena.ai/?leaderboard

3/3
Paper [2410.01257] HelpSteer2-Preference: Complementing Ratings with Preferences
Model weight nvidia/Llama-3.1-Nemotron-70B-Instruct-HF · Hugging Face

bnew · Oct 24, 2024

1/3
OmniParser

Microsoft has casually dropped this gem to enable GPT-4V to navigate your computer! Looks like 'computer use' is the next battleground.

2/3
OmniParser

> Screen Parsing tool for Pure Vision Based GUI Agent
> A method for parsing user interface screenshots into structured and easy-to-understand elements.
> This significantly enhances the ability of GPT-4V to generate actions

> Makes it possible for powerful LLMs to accurately ground the corresponding regions of interest in an interface.

Model on @Huggingface with MIT license: microsoft/OmniParser · Hugging Face
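To illustrate what "parsing a screenshot into structured elements" can mean for an agent, here is a small hypothetical sketch. The `UIElement` fields, the text format, and the helper names are assumptions for illustration only and are not OmniParser's actual output schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    """One parsed interface element (hypothetical schema)."""
    element_id: int
    kind: str    # e.g. "button", "text", "icon"
    label: str
    bbox: tuple  # (left, top, right, bottom) in pixels

    def center(self) -> tuple:
        left, top, right, bottom = self.bbox
        return ((left + right) // 2, (top + bottom) // 2)


def describe(elements) -> str:
    """Render elements as a numbered list a language model can refer to by id."""
    return "\n".join(f"[{e.element_id}] {e.kind}: {e.label}" for e in elements)


def resolve_click(elements, element_id: int) -> tuple:
    """Map the model's chosen element id back to a pixel coordinate to click."""
    for e in elements:
        if e.element_id == element_id:
            return e.center()
    raise KeyError(element_id)
```

The idea is that the model picks an element id from the text list instead of guessing raw pixel coordinates, and the id is then grounded back to a screen region.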

3/3
OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent

Understanding the user interfaces like never before!

Launch the local gradio app by following the instructions here: GitHub - microsoft/OmniParser
