Is it a wrap for Software Engineers? Devin autonomous AI software engineer...

IIVI · Jun 23, 2024

bnew said:

I think an unintended side effect of this tech is that people will actually have less expertise.

Like the first tweet says, you need to be a Senior to understand and write the best code with it possible. That includes fixing code it's outputting that you know is wrong.

However, Seniors really found out those shortcomings because they've gone through what every single profession must go through: the learning cycle.

You write code, shyt doesn't work, you got to fix it and understand what went wrong. With that piece of knowledge and lessons, you can now find the error in any piece of code about that topic whether written by a person or A.I. People, especially Juniors who are going to use A.I for like 99% of everything won't have the experience of fixing and making much of anything from pure scratch so they won't understand if what the A.I is doing is truly correct or not. They'll more or less accept it without verifying if it's optimal or even correct.

It'll definitely have impact on writing code no dount, but it'll make worse engineers, codebases and some very costly bugs imo without people who know what they're doing. Unfortunately, I think the longer people use it without keeping their own skills sharp, the lower their ceiling gets.

It really is like the calculator: if someone studies up and knows their material, they can do all kinds of crazy shyt with it. The normal person with weaker math skills can still add do some operations, but nothing advanced.

Strapped · Jun 23, 2024

It will get real interesting when we get full fledge universal soldiers powered by AI

IIVI · Jun 23, 2024

Strapped said:
It will get real interesting when we get full fledge universal soldiers powered by AI

Man, we probably are closer to Universal Soldiers rather than Universal Basic Income :mjlol:

bnew · Jul 25, 2024

Forget coding bootcamps: Airtable’s AI can build your app in seconds

Airtable's Cobuilder transform app development with AI, allowing non-coders to create custom applications in seconds using natural language prompts.

venturebeat.com

Forget coding bootcamps: Airtable’s AI can build your app in seconds

Michael Nuñez@MichaelFNunez

July 25, 2024 11:00 AM

Credit: VentureBeat made with Midjourney

Airtable, the $11 billion no-code platform unicorn, has unveiled Cobuilder, an AI-powered tool that generates customizable applications in seconds using natural language prompts. This launch could reshape the landscape of enterprise software development by allowing non-technical employees to build complex applications without coding knowledge.

Kelly O’Shaughnessy, head of core product and product lead for Airtable Cobuilder, explained the tool’s significance in an interview with VentureBeat. “Cobuilder is the fastest way to build no-code applications, making it possible to create customizable apps with natural language in just seconds,” she said. “Paired with the vast amounts of knowledge across industries, use cases, and business concepts that today’s LLMs have, Cobuilder helps anyone take an idea from concept to reality in seconds and transform their workflows.”

A screenshot of Airtable’s new Cobuilder tool shows how it generates a customized app for managing the release of a women’s skateboarding shoe. The interface displays a kanban board with various stages of the product launch, demonstrating the AI’s ability to create complex project management solutions from simple natural language inputs. (Image Credit: Airtable)

AI-powered app generation: How Cobuilder transforms ideas into software

The technology behind Cobuilder uses large language models (LLMs) to interpret user prompts and generate appropriate application structures. O’Shaughnessy elaborated on the process, telling VentureBeat, “Cobuilder generates an application by analyzing a user’s natural language prompt and matching the request with relevant publicly available data the LLM provider has access to.”

This approach could significantly reduce the time and resources required for application development, a process that traditionally involves multiple stakeholders and can take months or even years. O’Shaughnessy highlighted this advantage, citing Airtable CEO Howie Liu’s LinkedIn post on the new product launch, where he said, “Traditional software development is expensive and slow, but the bigger problem is the disconnect that happens when there are layers of separation between the technical software developers building an app and the business stakeholders who understand the actual requirements for the app.”

The adoption of AI-generated applications in enterprise environments raises questions about data privacy and security. Addressing these concerns, O’Shaughnessy told VentureBeat, “Airtable protects the privacy and security of customers’ data. No customer data is used to train current or future LLMs.”

Balancing innovation and limitations: The current state of AI-generated apps

While Cobuilder represents a leap forward in no-code development, it currently relies on publicly available data and user-provided information to create applications. Airtable plans to enhance Cobuilder’s capabilities, including the ability to incorporate existing company data from Airtable and embed AI automations within generated apps.

The launch of Cobuilder is part of Airtable’s broader strategy to integrate AI across its platform. Earlier this year, the company introduced Airtable AI, which has already seen adoption by major clients like AWS. Future plans include expanding document extraction capabilities and enabling AI-powered internet search integration.

For Airtable, this move represents a significant bet on the future of enterprise software development. As businesses increasingly seek ways to empower non-technical staff and reduce reliance on traditional development processes, tools like Cobuilder could become increasingly attractive.

The future of enterprise software: Airtable’s vision for AI-driven development

O’Shaughnessy envisions a transformative impact. “This combo of no-code and AI unlocks the ability for non-experts and non-developers to describe the workflow they need in plain language—as though having a conversation with a developer—and then Cobuilder helps create an app with the best design and operational structures in seconds.”

As Airtable continues to push the boundaries of no-code development with AI integration, it positions itself at the forefront of a potential new era in enterprise software creation. The success of Cobuilder could not only solidify Airtable’s position in the market but also potentially reshape how businesses approach software development in the coming years.

KBtheKey · Jul 25, 2024

Prolly going to be some serious pay walls coming for everyone who doesn't know how (or from who) to get this stuff for the free

bnew · Aug 15, 2024

Move over, Devin: Cosine’s Genie takes the AI coding crown

Cosine has spent nearly a year curating a dataset and training an experimental OpenAI model with long context.

venturebeat.com

Move over, Devin: Cosine’s Genie takes the AI coding crown

Carl Franzen@carlfranzen

August 12, 2024 9:53 AM

Oil lamp abstract artwork emitting code cloud

Credit: VentureBeat made with Midjourney V6

It wasn’t long ago that the startup Cognition was blowing minds with its product Devin, an AI-based software engineer powered by OpenAI’s GPT-4 foundation large language model (LLM) on the backend that could autonomously write and edit code when given instructions in natural language text.

But Devin emerged in March 2024 — five months ago — an eternity in the fast-moving generative AI space.

Now, another “C”-named startup, Cosine, which was founded through the esteemed Y Combinator startup accelerator in San Francisco, has announced its own new autonomous AI-powered engineer Genie, which it says handily outperforms Devin, scoring 30% on third-party benchmark test SWE-Bench compared to Devin’s 13.8%, and even surpassing the 19% scored by Amazon’s Q and Factory’s Code Droid.

Screenshot-2024-08-12-at-12.12.46%E2%80%AFPM.png

Screenshot from Cosine’s website showing Genie’s performance on SWE-Bench compared to other AI coding engineer models. Credit: Cosine
“This model is so much more than a benchmark score: it was trained from the start to think and behave like a human SWE [software engineer],” wrote Cosine’s co-founder and CEO Alistair Pullen in a post on his account on the social network X.

I'm excited to share that we've built the world's most capable AI software engineer, achieving 30.08% on SWE-Bench – ahead of Amazon and Cognition. This model is so much more than a benchmark score: it was trained from the start to think and behave like a human SWE. pic.twitter.com/OyvqKLxcGV

— Alistair (@AlistairPullen) August 12, 2024

I'm excited to share that we've built the world's most capable AI software engineer, achieving 30.08% on SWE-Bench – ahead of Amazon and Cognition. This model is so much more than a benchmark score: it was trained from the start to think and behave like a human SWE.

What is Genie and what can it do?

Genie is an advanced AI software engineering model designed to autonomously tackle a wide range of coding tasks, from bug fixing to feature building, code refactoring and validation through comprehensive testing, as instructed by human engineers or managers.

It operates either fully autonomously or in collaboration with users and aims to provide the experience of working alongside a skilled colleague.

“We’ve been chasing the dream of building something that can genuinely automatically perform end-to-end programming tasks with no intervention and a high degree of reliability – an artificial colleague. Genie is the first step in doing exactly that,” wrote Pullen in the Cosine blog post announcing Genie’s performance and limited, invitation-only availability.

The AI can write software in a multitude of languages — there are 15 listed in its technical report as being sources of data, including:

JavaScript
Python
TypeScript
TSX
Java
C#
C++
C
Rust
Scala
Kotlin
Swift
Golang
PHP
Ruby

Cosine claims Genie can emulate the cognitive processes of human engineers.

“My thesis on this is simple: make it watch how a human engineer does their job, and mimic that process,” Pullen explained in the blog post.

The code Genie generates is stored in a user’s GitHub repo, meaning Cosine does not retain a copy, nor any of the attendant security risks.

Furthermore, Cosine’s software platform is already integrated with Slack and system notifications, which it can use to alert users of its state, ask questions, or flag issues as a good human colleague would.

”Genie also can ask users clarifying questions as well as respond to reviews/comments on the PRs [pull requests] it generates,” Pullen wrote to VentureBeat. “We’re trying to get Genie to behave like a colleague, so getting the model to use the channels a colleague would makes the most sense.”

Powered by a long context OpenAI model

Unlike many AI models that rely on foundational models supplemented with a few tools, Genie was developed through a proprietary process that involves training and fine-tuning a long token output AI model from OpenAI .

“In terms of the model we’re using, it’s a (currently) non-general availability GPT-4o variant that OpenAI have allowed us to train as part of the experimental access program,” Pullen wrote to VentureBeat via email. “The model has performed well and we’ve shared our learnings with the OpenAI finetuning team and engineering leadership as a result. This was a real turning point for us as it convinced them to invest resources and attention in our novel techniques.”

While Cosine doesn’t specify the particular model, OpenAI just recently announced the limited availability of a new GPT-4o Long Output Context model which can spit out up to 64,000 tokens of output instead of GPT-4o’s initial 4,000 — a 16-fold increase.

The training data was key

“For its most recent training run Genie was trained on billions of tokens of data, the mix of which was chosen to make the model as competent as possible on the languages our users care about the most at the current time,” wrote Pullen in Cosine’s technical report on the agent.

With its extensive context window and a continuous loop of improvement, Genie iterates and refines its solutions until they meet the desired outcome.

Cosine says in its blog post that it spent nearly a year curating a dataset with a wide range of software development activities from real engineers.

“In practice, however, getting such and then effectively utilising that data is extremely difficult, because essentially it doesn’t exist,” Pullen elaborated in his blog post, adding. “Our data pipeline uses a combination of artefacts, static analysis, self-play, step-by-step verification, and fine-tuned AI models trained on a large amount of labelled data to forensically derive the detailed process that must have happened to have arrived at the final output. The impact of the data labelling can’t be understated, getting hold of very high-quality data from competent software engineers is difficult, but the results were worth it as it gave so much insight as to how developers implicitly think about approaching problems.”

In an email to VentureBeat, Pullen clarified that: “We started with artefacts of SWEs doing their jobs like PRs, commits, issues from OSS repos (MIT licensed) and then ran that data through our pipeline to forensically derive the reasoning, to reconstruct how the humans came to the conclusions they did. This proprietary dataset is what we trained the v1 on, and then we used self-play and self-improvement to get us the rest of the way.”

This dataset not only represents perfect information lineage and incremental knowledge discovery but also captures the step-by-step decision-making process of human engineers.

“By actually training our models with this dataset rather than simply prompting base models which is what everyone else is doing, we have seen that we’re no longer just generating random code until some works, it’s tackling problems like a human,” Pullen asserted.

Pricing

In a follow-up email, Pullen described how Genie’s pricing structure will work.

He said it will initially be broken into two tiers:

“1. An accessible option priced competitively with existing AI tools, around the $20 mark. This tier will have some feature and usage limitations but will showcase Genie’s capabilities for individuals and small teams.

2. An enterprise-level offering with expanded features, virtually unlimited usage and the ability to create a perfect AI colleague who’s an expert in every line code ever written internally. This tier will be priced more substantially, reflecting its value as a full AI engineering colleague.”

Implications and Future Developments

Genie’s launch has far-reaching implications for software development teams, particularly those looking to enhance productivity and reduce the time spent on routine tasks. With its ability to autonomously handle complex programming challenges, Genie could potentially transform the way engineering resources are allocated, allowing teams to focus on more strategic initiatives.

“The idea of engineering resource no longer being a constraint is a huge driver for me, particularly since starting a company,” wrote Pullen. “The value of an AI colleague that can jump into an unknown codebase and solve unseen problems in timeframes orders of magnitude quicker than a human is self-evident and has huge implications for the world.”

Cosine has ambitious plans for Genie’s future development. The company intends to expand its model portfolio to include smaller models for simpler tasks and larger models capable of handling more complex challenges. Additionally, Cosine plans to extend its work into open-source communities by context-extending one of the leading open-source models and pre-training on a vast dataset.

Availability and Next Steps

While Genie is already being rolled out to select users, broader access is still being managed.

Interested parties can apply for early access to try Genie on their projects by filling out a web form on the Cosine website.

Cosine remains committed to continuous improvement, with plans to ship regular updates to Genie’s capabilities based on customer feedback.

“SWE-Bench recently changed their submission requirements to include the full working process of AI models, which poses a challenge for us as it would require revealing proprietary methodologies,” noted Pullen. “For now, we’ve decided to keep these internal processes confidential, but we’ve made Genie’s final outputs publicly available for independent verification on GitHub.”

More on Cosine

Cosine is a human reasoning lab focused on researching and codifying how humans perform tasks, intending to teach AI to mimic, excel at, and expand on these tasks.

Founded in 2022 by Pullen, Sam Stenner, and Yang Li, the company’s mission is to push the boundaries of AI by applying human reasoning to solve complex problems, starting with software engineering.

Cosine has already raised $2.5 million in seed funding from Uphonest and SOMA Capital, with participation from Lakestar, Focal and others.

With a small but highly skilled team, Cosine has already made significant strides in the AI field, and Genie is just the beginning.

“We truly believe that we’re able to codify human reasoning for any job and industry,” Pullen stated in the announcement blog post. “Software engineering is just the most intuitive starting point, and we can’t wait to show you everything else we’re working on.”

bnew · Aug 28, 2024

Amazon's AI assistant saves 4,500 years of development time, CEO Andy Jassy says

Amazon CEO Andy Jassy says the company's AI assistant Amazon Q has dramatically cut the time needed for Java updates. Amazon estimates this saved thousands of developer years and hundreds of millions of dollars.

the-decoder.com

AI in practice

Aug 23, 2024

Amazon's AI assistant saves 4,500 years of development time, CEO Andy Jassy says

Amazon

Amazon CEO Andy Jassy says the company's AI assistant Amazon Q has dramatically cut the time needed for Java updates. Amazon estimates this saved thousands of developer years and hundreds of millions of dollars.

In a LinkedIn post, Jassy highlighted efficiency gains from using Amazon Q for software development. The AI tool significantly reduced time and costs for updating Java applications to newer versions.

Jassy called updating core software one of the most tedious but crucial tasks for development teams. Since it doesn't add new features or visibly improve user experience, teams often avoid or delay these updates in favor of new projects.

Using a new code transformation feature, Amazon Q cut the average time to upgrade an app to Java 17 from about 50 developer days to just a few hours, Jassy said. The company estimates this saved about 4,500 developer years.

Amazon Q Code Transformation analyzes existing code, suggests changes, and implements them. It updates package dependencies, revises outdated and inefficient code, and integrates security practices.

Nearly 80 percent of AI-generated code used without changes

In six months, Amazon updated over half its Java production systems to newer versions much faster and with less effort. Developers used 79 percent of Amazon Q's auto-generated code reviews without changes.

Jassy said the benefits go beyond saved development time: The updates improved security and lowered infrastructure costs, leading to estimated yearly efficiency gains of $260 million.

He sees this as proof that large companies can achieve major efficiency gains in maintaining core software by using AI. For Amazon, it was a "game changer."

Of course, keep in mind that Jassy is trying to sell his company's AI software here. But even if Amazon's estimates are way off and the actual savings are a fraction of what Jassy claims, the numbers would still be significant.

Then again, others have criticized AI code for being crap and creating more work than it solves. We won't know until some neutral long-term studies of AI's impact on coding come out.

Summary

Amazon CEO Andy Jassy says the company's AI assistant Amazon Q has dramatically reduced the time needed to update Java applications to newer versions, saving an estimated 4,500 developer years and hundreds of millions of dollars.
Using Amazon Q's new code transformation feature, the average time to upgrade an app to Java 17 was cut from about 50 developer days to just a few hours. In six months, Amazon updated over half its Java production systems, with developers using 79 percent of the AI-generated code reviews without changes.
Jassy sees the benefits going beyond saved development time, with the updates improving security and lowering infrastructure costs, leading to estimated yearly efficiency gains of $260 million. He believes this proves that large companies can achieve major efficiency gains in maintaining core software by using AI.

Sources
LinkedIn

bnew · Sep 14, 2024

1/40
@amasad
AI is incredible at writing code.

But that's not enough to create software. You need to set up a dev environment, install packages, configure DB, and, if lucky, deploy.

It's time to automate all this.

Announcing Replit Agent in early access—available today for subscribers:

2/40
@amasad
Just go to Replit logged in homepage. Write what you want to make and click "start building"!

3/40
@amasad
Will thread in some runs in here. 4-min run for a startup landing page with waiting list complete with postgres persistence

4/40
@amasad

5/40
@amasad

6/40
@amasad

7/40
@amasad
idea to software all on your phone

8/40
@amasad

9/40
@amasad
Healthcare app!

10/40
@amasad
Ad spend calc from someone who can't code

11/40
@amasad
Stripe!

12/40
@amasad

13/40
@amasad
Replit clone w/ Agent!

14/40
@amasad
Sentiment analysis in 23 minutes!

15/40
@amasad
Website with CMS in 10 minutes!

16/40
@amasad
Love the work tools people are building

17/40
@amasad
Mobile game made on mobile

18/40
@amasad
Passed the frontend hiring bar

19/40
@amasad
Social media ad generator

20/40
@amasad
3D platformer game

21/40
@elonmusk
It can’t write good video games (yet)

22/40
@amasad
Grok-3 will be Donkey-Kong-complete

23/40
@msg
its cute that the replit agent refers to the project as OUR project and when it updates the replit functionality it states that it communicated with the TEAM

24/40
@amasad
It’s actually implemented as a team! Hahaha

25/40
@007musk
When can existing users get this feature?

26/40
@amasad
Paid users already have it. It's on your loggedin homepage

27/40
@alexchristou_
Looks sweet

Would be cool to try out without having to upgrade out of the gate.

Have been a paid member before

28/40
@amasad
Will do that after beta

29/40
@0xastro98
Hey Amjad, is this available for Core members?

30/40
@amasad
yes, just go to your homepage

31/40
@itsPaulAi
Ok that's just insane. Congrats on the launch!

32/40
@amasad
Thank you! Aider in Replit is still super useful as they serve slightly different use cases. I used it yesterday. Thanks for the demos.

33/40
@hwchase17

34/40
@amasad
The team spent more time in langsmith than their significant others the past few weeks :D

35/40
@arben777
To this day I have not used Replit at all. I will be booting it up and seeing how this agent performs advancing a project with ~8,000 lines.

I have found many of these tools are solid for quick boot ups or simple "shallow" demos but many fall short in bigger codebases.

36/40
@DarbyBaileyXO
are you kidding!? less than 10 minutes from idea and now it's building in the background

prompt:

i want to build an app that shows all the hot springs on a map for idaho and oregon, where people can plan a road trip and also see what KOA's or AIRBnB's are nearby so they can plan an itinerary and see driving times and optimal stops along with gas stations for those stops and the scenic views they will see at those stops, on their way to the selected hot spring https://nitter.poast.org/messages/media/1831781177122025961

37/40
@seflless
The mobile support out of the gate! This is

. The mobile experience will be so enhanced by this type of thing.

38/40
@mckaywrigley
It’s time.

What a release.

Replit is AGI infrastructure.

39/40
@0interestrates
congrats amjad! this is fire

40/40
@BasedBeffJezos
This is sickk

Congrats @amasad

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Oct 23, 2024

1/31
@deedydas
The new Claude 3.5 Sonnet scores 49% on the hardest coding benchmark based of real GitHub issues, SWE-Bench verified.

Cosine Genie was 43.8%
o1-preview was 41.4%
o1-mini was 35.8%

Claude is the undisputed king of all models at writing code.

2/31
@Tom62589172
Where's Devin?

3/31
@deedydas
14% on SWE-Bench

4/31
@AntDX316

5/31
@manuhortet
this benchmark doesn't say much anymore

we are seeing all real life ai-for-code applications focusing on enforcing rules, strats on prompt and context building

o1 may still work better in stricter systems focused on providing more info

3.5-new won my vibe benchmark tho

6/31
@adridder
Code mastery. Impressive feat. Curious minds ponder next frontiers though.

7/31
@theDataDork
I’ve been using Claude since it released new model.. it clearly has improved a lot when it comes to coding

8/31
@eliluong
is there a difference between how well the free vs paid Claude performs on a reasonable length input?

9/31
@Evinst3in
coding is becoming easy

10/31
@dmsimon
The number of syntax errors Claude makes while generating basic HTML/CSS is ridiculous.

23 regens by Claude.

Put the same request into Llama 3.2 3b and nailed it in one try running on one 3090 Ti.

11/31
@supremebeme
yeah im ngl it's either on par or better than o1 preview, and no limits i don't think

12/31
@ntkris
Another interesting datapoint to back this up from hackernews:

13/31
@SaasJunctionHQ
Sonnet: The King of Code

GPT o1: The Versatile Contender

Mistral Codestral: The Clean Code Specialist

DeepSeek-Coder-V2: The Emerging Talent

14/31
@skorfmann
I personally perceived it as worse than the last sonnet 3.5 version using Cursor. Broken and incoherent text / code snippets all over the place. I’ve been told that there’s a similar experience in Artifacts

15/31
@JTL87i
first impression feels worse than before

o1-mini still killing it

16/31
@TiggerSharkML
impressive

17/31
@s_noronha
Agree. Claude is current king, then Llama, Gemini. GPT is not great

18/31
@NarenNallapa
That is actually very impressive!

19/31
@ReadFuturist
I'm happy to give Claude my screen usage of how I play @KenshiOfficial - it's a brutal game to pretend you're a person.

20/31
@buzzedison
Facts only

21/31
@victor_explore
looks like claude is quietly coding its way to the top

not just winning, but solving real-world problems

22/31
@dikksonPau
Is this enough to push OpenAI to launch 4.5???

23/31
@morew4rd
people still trust public evals?

24/31
@hadikhantech
o1-mini for analysis and design.

Sonnet 3.5 for implementation.

25/31
@YorkTheWest
Source?

26/31
@firasd
I notice that Claude is also good at layouts etc

Like it made this weather view just based on some json

[Quoted tweet]
Pasted some weather json and it made this ..

I said: “Make an html page with query that shows this data for viewing, exploring and editing. Also first make a div that contains your understanding of what the app displays and what each field you're going to show represents”

27/31
@fofices_
I’ve something great about your project,let’s discuss in DM

28/31
@arpit_sidana
Do humans have a comparative score?

29/31
@alvarocha2
It's really good. Besides the benchmarks, we see our users moving more and more to Claude for day-to-day coding.

30/31
@Crypto_Briefing

[Quoted tweet]
Anthropic launches computer interface control in new Claude 3.5 models

https://video.twimg.com/ext_tw_video/1848785280863555584/pu/vid/avc1/1280x720/Ir3FvrNf7N01w9Xu.mp4

31/31
@SemperLuxFortis
It needs longer output length in the app to be of any use though. It's constant striving for too much brevity is a major issue for me and a pain in the ass.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/11
@paulgauthier
The new Sonnet tops aider's code editing leaderboard at 84.2%. Using --architect mode it sets SOTA at 85.7% with DeepSeek as the editor model.

To give it a try:

pip install -U aider-chat
aider --sonnet

Aider LLM Leaderboards

2/11
@paulgauthier
Here's a breakdown of most of the top model scores on aider's code editing benchmark:

84% Claude 3.5 Sonnet 10/22
80% o1-preview
77% Claude 3.5 Sonnet 06/20
72% DeepSeek V2.5
72% GPT-4o 08/06
71% o1-mini
68% Claude 3 Opus

3/11
@paulgauthier
The new Sonnet also sets SOTA on aider's more demanding refactoring benchmark with a score of 92.1%!

92% Sonnet 10/22
75% o1-preview
72% Opus
64% Sonnet 06/20
49% GPT-4o 08/06
45% o1-mini

Aider LLM Leaderboards

4/11
@SystemSculpt
This is the only benchmark that matters.

5/11
@eztati
Is not o1 mini better at coding than o1 preview?

6/11
@clumma
How is leakage of the benchmark into training data prevented?

7/11
@spyced
I thought the point of architect/editor split was for when your architect model is (1) bad at formatting diffs or (2) too slow to generate them. Neither of which applies to Sonnet?

8/11
@calebfahlgren
massive

9/11
@leonard_cremer
Was waiting for this when the model was released. Great improvement

Aider is great!

10/11
@zlumer_eth
Getting too close to 100%, the benchmark will need extra coverage soon it seems

11/11
@anushkmittal
cool. how's the real world performance tho

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Oct 23, 2024

1/9
@deedydas
Got Claude Computer Use to "build a Timer app" from scratch!

—downloaded React + deps
—wrote the code
—opened local server (in a Docker VM)
—fixed its own styling
—TESTED it fully
—restyled it like an iPhone
—fixed a syntax error

Software engineering will not be the same.

1/4

https://video.twimg.com/amplify_video/1849118641935253504/vid/avc1/1280x720/XX-p0M2VzMHfLHR1.mp4

2/9
@deedydas
Observations:
— The new Claude Sonnet has been RLHF'd to not do a bunch of things like fill forms, solve captchas or login to pages.
— It's still slow. This video took ~7mins in realtime, was sped up 16x
— New abilities I hadn't seen before: point + click, bulk copy strings

2/4

3/9
@deedydas
— This demo loads Ubuntu in a Docker image. UI is nowhere like the demo videos, and its a bit clunky. 120s timeout, you can't tell the progress of longer running bash tasks.
— Managed to build backend + frontend apps
— Can learn how to use APIs from the internet

3/4

4/9
@deedydas
So much interesting stuff can be built on top of this for the first time ever.

If you like hacking on this, apply quickly to join us on Nov 2 for the Menlo x Anthropic Builder Day 2024 (sadly Computer Use can't fill this form for you yet)!

Menlo/Anthropic Builder Day 2024

4/4

5/9
@dmsimon
Wow, a timer.

6/9
@pathuglife
Great demo Deedy!

One observation I’ve made while working with browsers is that screenshots are often taken and processed even when pages are still loading, resulting in unnecessary token usage and computation costs. By ensuring that screenshots are only captured after the page has fully loaded, we could significantly reduce compute overhead and improve efficiency.

7/9
@typesteady

[Quoted tweet]
"I built this entirely with AI"

Yes, I can tell.

8/9
@HadijPk
good demo. I feel like Claude is on to something very interesting. I can't wait it to replace old RPA in legacy enterprises

9/9
@FeatureCrewPod
Its better at everyday tasks than coding imp

[Quoted tweet]
New #AI Agent from @AnthropicAI can now...

Delete emails

Manage files

Try to draw
Watch the full video:

https://video.twimg.com/amplify_video/1848941581199400960/vid/avc1/1920x1080/NGOul_PPaOzxRNk9.mp4

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

Serious · Oct 23, 2024

bnew said:
1/31
@deedydas
The new Claude 3.5 Sonnet scores 49% on the hardest coding benchmark based of real GitHub issues, SWE-Bench verified.

Cosine Genie was 43.8%
o1-preview was 41.4%
o1-mini was 35.8%

Claude is the undisputed king of all models at writing code.

2/31
@Tom62589172
Where's Devin?

3/31
@deedydas
14% on SWE-Bench

4/31
@AntDX316

5/31
@manuhortet
this benchmark doesn't say much anymore

we are seeing all real life ai-for-code applications focusing on enforcing rules, strats on prompt and context building

o1 may still work better in stricter systems focused on providing more info

3.5-new won my vibe benchmark tho

6/31
@adridder
Code mastery. Impressive feat. Curious minds ponder next frontiers though.

7/31
@theDataDork
I’ve been using Claude since it released new model.. it clearly has improved a lot when it comes to coding

8/31
@eliluong
is there a difference between how well the free vs paid Claude performs on a reasonable length input?

9/31
@Evinst3in
coding is becoming easy

10/31
@dmsimon
The number of syntax errors Claude makes while generating basic HTML/CSS is ridiculous.

23 regens by Claude.

Put the same request into Llama 3.2 3b and nailed it in one try running on one 3090 Ti.

11/31
@supremebeme
yeah im ngl it's either on par or better than o1 preview, and no limits i don't think

12/31
@ntkris
Another interesting datapoint to back this up from hackernews:

13/31
@SaasJunctionHQ
Sonnet: The King of Code

GPT o1: The Versatile Contender

Mistral Codestral: The Clean Code Specialist

DeepSeek-Coder-V2: The Emerging Talent

14/31
@skorfmann
I personally perceived it as worse than the last sonnet 3.5 version using Cursor. Broken and incoherent text / code snippets all over the place. I’ve been told that there’s a similar experience in Artifacts

15/31
@JTL87i
first impression feels worse than before

o1-mini still killing it

16/31
@TiggerSharkML
impressive

17/31
@s_noronha
Agree. Claude is current king, then Llama, Gemini. GPT is not great

18/31
@NarenNallapa
That is actually very impressive!

19/31
@ReadFuturist
I'm happy to give Claude my screen usage of how I play @KenshiOfficial - it's a brutal game to pretend you're a person.

20/31
@buzzedison
Facts only

21/31
@victor_explore
looks like claude is quietly coding its way to the top not just winning, but solving real-world problems

22/31
@dikksonPau
Is this enough to push OpenAI to launch 4.5???

23/31
@morew4rd
people still trust public evals?

24/31
@hadikhantech
o1-mini for analysis and design.

Sonnet 3.5 for implementation.

25/31
@YorkTheWest
Source?

26/31
@firasd
I notice that Claude is also good at layouts etc

Like it made this weather view just based on some json

[Quoted tweet]
Pasted some weather json and it made this ..

I said: “Make an html page with query that shows this data for viewing, exploring and editing. Also first make a div that contains your understanding of what the app displays and what each field you're going to show represents”

27/31
@fofices_
I’ve something great about your project,let’s discuss in DM

28/31
@arpit_sidana
Do humans have a comparative score?

29/31
@alvarocha2
It's really good. Besides the benchmarks, we see our users moving more and more to Claude for day-to-day coding.

30/31
@Crypto_Briefing

[Quoted tweet]
Anthropic launches computer interface control in new Claude 3.5 models

https://video.twimg.com/ext_tw_video/1848785280863555584/pu/vid/avc1/1280x720/Ir3FvrNf7N01w9Xu.mp4

31/31
@SemperLuxFortis
It needs longer output length in the app to be of any use though. It's constant striving for too much brevity is a major issue for me and a pain in the ass.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/11
@paulgauthier
The new Sonnet tops aider's code editing leaderboard at 84.2%. Using --architect mode it sets SOTA at 85.7% with DeepSeek as the editor model.

To give it a try:

pip install -U aider-chat
aider --sonnet

Aider LLM Leaderboards

2/11
@paulgauthier
Here's a breakdown of most of the top model scores on aider's code editing benchmark:

84% Claude 3.5 Sonnet 10/22
80% o1-preview
77% Claude 3.5 Sonnet 06/20
72% DeepSeek V2.5
72% GPT-4o 08/06
71% o1-mini
68% Claude 3 Opus

3/11
@paulgauthier
The new Sonnet also sets SOTA on aider's more demanding refactoring benchmark with a score of 92.1%!

92% Sonnet 10/22
75% o1-preview
72% Opus
64% Sonnet 06/20
49% GPT-4o 08/06
45% o1-mini

Aider LLM Leaderboards

4/11
@SystemSculpt
This is the only benchmark that matters.

5/11
@eztati
Is not o1 mini better at coding than o1 preview?

6/11
@clumma
How is leakage of the benchmark into training data prevented?

7/11
@spyced
I thought the point of architect/editor split was for when your architect model is (1) bad at formatting diffs or (2) too slow to generate them. Neither of which applies to Sonnet?

8/11
@calebfahlgren
massive

9/11
@leonard_cremer
Was waiting for this when the model was released. Great improvement

Aider is great!

10/11
@zlumer_eth
Getting too close to 100%, the benchmark will need extra coverage soon it seems

11/11
@anushkmittal
cool. how's the real world performance tho

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

Just tried Claude, code looks clean :ehh:

@Matt504

Matt504 · Oct 23, 2024

Serious said:
Just tried Claude, code looks clean

@Matt504

I plan to try it out soon

RageKage · Oct 23, 2024

Not where I work, it's a tool but not a replacement, businesses are wary of sharing info to cloud ai systems

Serious · Oct 23, 2024

Matt504 said:
I plan to try it out soon

I’m tempted to login right now.

I got a couple problems I’m working on

TRUEST · Oct 24, 2024

Strapped said:
It will get real interesting when we get full fledge universal soldiers powered by AI

I’m working on this. A thousand soldiers assembled at whim, with materials found everywhere. Go to Russia. Take out their leader. Erect yourself as the new leader. But you? Who is you? The unseen, the wind, the branch, the tree, the power behind the throne. The one who was before you were.

Is it a wrap for Software Engineers? Devin autonomous AI software engineer...

Superstar

Superstar

Superstar

Veteran

Forget coding bootcamps: Airtable’s AI can build your app in seconds​

AI-powered app generation: How Cobuilder transforms ideas into software​

Balancing innovation and limitations: The current state of AI-generated apps​

The future of enterprise software: Airtable’s vision for AI-driven development​

Top Tier

Veteran

Move over, Devin: Cosine’s Genie takes the AI coding crown​

What is Genie and what can it do?​

Powered by a long context OpenAI model​

The training data was key​

Pricing​

Implications and Future Developments​

Availability and Next Steps​

More on Cosine​

Veteran

Amazon's AI assistant saves 4,500 years of development time, CEO Andy Jassy says​

Nearly 80 percent of AI-generated code used without changes​

Veteran

Veteran

Veteran

Veteran

YSL as a gang must end

All Star

Veteran

Superstar

Similar threads

Forget coding bootcamps: Airtable’s AI can build your app in seconds

AI-powered app generation: How Cobuilder transforms ideas into software

Balancing innovation and limitations: The current state of AI-generated apps

The future of enterprise software: Airtable’s vision for AI-driven development

Move over, Devin: Cosine’s Genie takes the AI coding crown

What is Genie and what can it do?

Powered by a long context OpenAI model

The training data was key

Pricing

Implications and Future Developments

Availability and Next Steps

More on Cosine

Amazon's AI assistant saves 4,500 years of development time, CEO Andy Jassy says

Nearly 80 percent of AI-generated code used without changes