I didn't say they trained it strictly on their own interview-question data.
I said that an AI being able to regurgitate (via an LLM) solutions to problems typically presented in interviews doesn't create an AI worker capable of creating, maintaining, updating, supporting and iterating on a software product with even a small codebase.
Right now o1 is a template-creator at best, trained on public data, including repositories. It is not trained on proprietary libraries, and it doesn't have reasoning capability (nor does any generative LLM); the more obscure or undocumented a library, API or codebase is, the worse the AI's output will be. After all, an LLM is ALWAYS only as good as its training data. It cannot reason, it cannot deduce, and it does not have state-based knowledge. The whole point of transformers is for them to be stateless.
It can help create a skeleton via a copycat approach for folks who don't know programming or know only the basics, but that's about it. It may help you learn a language faster or produce very skeleton-y code.
I think every time we talk about LLMs taking real programming jobs, we need to emphasize that LLMs are stateless by nature. Transformers are stateless; the only way state is passed is via the context window. This is a gargantuan problem for maintaining, adding to, and siloing knowledge that is up-to-date, correct and compartmentalized. Right now there are APIs that feed back the prompt history on each call (sketched below), but this isn't a real solution for real programming tasks.
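To make that concrete, here's a minimal sketch of what that history feedback looks like, assuming the OpenAI Python SDK; the model name and messages are placeholders, not anything specific:

```python
# Minimal sketch: the model keeps no state between calls, so the client
# must resend the ENTIRE conversation on every turn.
# Assumes the OpenAI Python SDK (`pip install openai`) and an API key.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a coding assistant."}]

def ask(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    # The full `history` list is the only "memory" the model sees;
    # each API call is independent on the server side.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=history,
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

ask("What does this C function do? ...")
ask("Now refactor it.")  # only works because we resent the first exchange
```

Note that the context window caps how much of `history` fits, which is exactly the compartmentalization problem described above.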
For example, I have had the real-world task of maintaining a codebase spread across several legacy repositories (including ClearCase), and of making fixes to proprietary C libraries that exist only in my company but are used by dozens of different products/components. I also need to know how the library is used by those components, and to be able to interpret how it may be used, how often, and what issues can arise depending on which component uses it.
Only in a fantasy world does someone sit down at a programming job and solve interview tasks; the real world doesn't work like that. Not to mention, LLMs rely on human language for prompts. In programming, many things cannot be easily defined or specified within the rigidity of language; many require visual feedback (web development, for example, or texture creation and rig wiring in graphics) or can be described succinctly in diagrams but not in words.
I don't see real programming jobs held by experienced people being taken by AI anytime soon. Maybe down the road, when someone attaches state and logic to achieve actual reasoning, not merely inference.
September 12, 2024
Introducing OpenAI o1-preview
A new series of reasoning models for solving hard problems. Available now.

Update on September 17, 2024: Rate limits are now 50 queries per week for o1-preview and 50 queries per day for o1-mini.
We've developed a new series of AI models designed to spend more time thinking before they respond. They can reason through complex tasks and solve harder problems than previous models in science, coding, and math.
Today, we are releasing the first of this series in ChatGPT and our API. This is a preview and we expect regular updates and improvements. Alongside this release, we’re also including evaluations for the next update, currently in development.
How it works
We trained these models to spend more time thinking through problems before they respond, much like a person would. Through training, they learn to refine their thinking process, try different strategies, and recognize their mistakes.

In our tests, the next model update performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. We also found that it excels in math and coding. In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13% of problems, while the reasoning model scored 83%. Their coding abilities were evaluated in contests and reached the 89th percentile in Codeforces competitions. You can read more about this in our technical research post.
As an early model, it doesn't yet have many of the features that make ChatGPT useful, like browsing the web for information and uploading files and images. For many common cases GPT-4o will be more capable in the near term.
But for complex reasoning tasks this is a significant advancement and represents a new level of AI capability. Given this, we are resetting the counter back to 1 and naming this series OpenAI o1.
I'm not a coder, but there have been a few times where I ran into issues with various LLMs not knowing some API calls even when they're well documented. I usually just paste as much of the documentation as I'm allowed to, to see if the model can make sense of it, and sometimes that works for me. Other times I just modify the prompt to instruct it on crucial info to jump-start a more accurate response.
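That doc-pasting amounts to something like this sketch (assuming the OpenAI Python SDK; `docs/payments_api.md` and the prompts are hypothetical stand-ins):

```python
# Sketch of "paste the docs into the prompt": prepend relevant API
# documentation so the model isn't relying on its training data alone.
# `docs/payments_api.md` is a hypothetical file standing in for whatever
# documentation you would paste. Assumes the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

with open("docs/payments_api.md") as f:
    api_docs = f.read()  # must fit within the model's context limit

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Answer using ONLY the API documentation below.\n\n" + api_docs},
        {"role": "user",
         "content": "Write a function that creates a refund using this API."},
    ],
)
print(response.choices[0].message.content)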
These AI companies, including OpenAI, offer fine-tuning services for their models, so companies can train them on proprietary libraries and other data when RAG isn't good enough to cover the large amount of data they have.
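As a rough sketch of what kicking off such a fine-tune looks like (assuming the OpenAI Python SDK; `train.jsonl` and the base-model name are assumptions, not specifics from this thread):

```python
# Sketch of starting a fine-tuning job on proprietary examples.
# `train.jsonl` is a hypothetical file of {"messages": [...]} chat examples
# drawn from internal code and docs. Assumes the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

# Upload the training data first.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the job; the base model must be one your account can fine-tune
# (the name here is an assumption).
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)
```

The resulting fine-tuned model still has the same stateless, context-window-bound behavior; fine-tuning only bakes the proprietary patterns into the weights.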
The real solution to statelessness, if not fine-tuning the model, is increasing the context window, which has been happening, albeit not as fast as people would like.
To your point about visual feedback being needed to properly address certain tasks: text-to-text large language models have evolved into multimodal models that support text, images, speech and video. I know exactly what you mean, though, because I've gone down the rabbit hole of trying to describe visual modifications I wanted in desktop apps and webpages, and it can be frustrating with natural language alone, but not entirely impossible.
I think ChatGPT's Memory feature is how they're currently augmenting the model with state-based knowledge. I actually like that transformers are stateless, because a chat can go in a direction I don't want, or get contaminated with bad responses I can't remove, which taints further output.
I don't think a larger context window is better than a state-based model, especially when dealing with information that changes constantly.
I agree with you that logic/reasoning will need to be greatly improved for these models to make real headway into software engineering and other logic-heavy domains.
What do you think of the progress being made on agents and models for solving GitHub issues, e.g. SWE-bench?