In this paper, we explore the capabilities of state-of-the-art large language models (LLMs) such as GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra in solving undergraduate-level control problems. Controls provides an interesting case study for LLM reasoning due to its combination of mathematical theory and engineering design. We introduce ControlBench, a benchmark dataset tailored to reflect the breadth, depth, and complexity of classical control design. We use this dataset to study and evaluate the problem-solving abilities of these LLMs in the context of control engineering. We present evaluations conducted by a panel of human experts, providing insights into the accuracy, reasoning, and explanatory prowess of LLMs in control engineering. Our analysis reveals the strengths and limitations of each LLM in the context of classical control, and our results imply that Claude 3 Opus has become the state-of-the-art LLM for solving undergraduate control problems. Our study serves as an initial step towards the broader goal of employing artificial general intelligence in control engineering.
Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2404.03647 [math.OC] (or arXiv:2404.03647v1 [math.OC] for this version)
[2404.03647] Capabilities of Large Language Models in Control Engineering: A Benchmark Study on GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra
1/1
There are people paid $140k as junior analysts to literally do this all day
1/2
Uh, so Claude AI is pretty good.
Step 1. Upload PDF of finance paper to Claude AI (https://sciencedirect.com/science/a...EP_U9V5G0w7pAzgpd1Ib0wIpRd7hFkLaTU7IE-rxJ9swA)
Step 2: Ask Claude to make a bar graph from Table 7 for one column and first five rows
Step 3: Profit
2/2
Yeah, I've found Claude much better so far, although the code UI is worse on Claude
1/4
Our new GPT-4 Turbo is now available to paid ChatGPT users. We’ve improved capabilities in writing, math, logical reasoning, and coding.
Source: GitHub - openai/simple-evals
2/4
For example, when writing with ChatGPT, responses will be more direct, less verbose, and use more conversational language.
3/4
We continue to invest in making our models better and look forward to seeing what you do. If you haven’t tried it yet, GPT-4 Turbo is available in ChatGPT Plus, Team, Enterprise, and the API.
4/4
UPDATE: the MMLU points weren’t clear on the previous graph. Here’s an updated one.
I wonder if 4.0 will ever be available to free users.