1/40
Automating AI research is exciting! But can LLMs actually produce novel, expert-level research ideas?
After a year-long study, we obtained the first statistically significant conclusion: LLM-generated ideas are more novel than ideas written by expert human researchers.
2/40
In our new paper:
[2409.04109] Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
We recruited 49 expert NLP researchers to write novel ideas on 7 NLP topics.
We built an LLM agent to generate research ideas on the same 7 topics.
After that, we recruited 79 experts to blindly review all the human and LLM ideas.
2/
3/40
When we say “experts”, we really do mean some of the best people in the field.
Coming from 36 different institutions, our participants are mostly PhDs and postdocs.
As a proxy metric, our idea writers have a median citation count of 125, and our reviewers have 327.
3/
4/40
We specify a very detailed idea template to make sure both human and LLM ideas cover all the necessary details to the extent that a student can easily follow and execute all the steps.
We paid $300 for each idea, plus a $1000 bonus to the top 5 human ideas.
4/
5/40
We also used an LLM to standardize the writing styles of human and LLM ideas to avoid potential confounders, while preserving the original content.
Shown below is a randomly selected LLM-generated idea, as an example of what our ideas look like.
5/
6/40
Our 79 expert reviewers submitted 298 reviews in total, so each idea got 2-4 independent reviews.
Our review form is inspired by ICLR & ACL, with breakdown scores + rationales on novelty, excitement, feasibility, and expected effectiveness, apart from the overall score.
6/
7/40
With these high-quality human ideas and reviews, we compare the results.
We performed 3 different statistical tests accounting for all the possible confounders we could think of.
It holds robustly that LLM ideas are rated as significantly more novel than human expert ideas.
7/
8/40
Apart from the human-expert comparison, I’ll also highlight two interesting analyses of LLMs:
First, we find LLMs lack diversity in idea generation. They quickly start repeating previously generated ideas even though we explicitly told them not to.
8/
9/40
Second, LLMs cannot evaluate ideas reliably yet. When we benchmarked previous automatic LLM reviewers against human expert reviewers using our ideas and reviewer scores, we found that all LLM reviewers showed low agreement with human judgments.
9/
10/40
We include many more quantitative and qualitative analyses in the paper, including examples of human and LLM ideas with the corresponding expert reviews, a summary of experts’ free-text reviews, and our thoughts on how to make progress in this emerging research direction.
10/
11/40
For the next step, we are recruiting more expert participants for the second phase of our study, where experts will implement AI and human ideas into full projects for a more reliable evaluation based on real research outcomes.
Sign-up link:
Interest Form for Participating in the AI Researcher Human Study (Execution Stage)
11/
12/40
This project wouldn't have been possible without our amazing participants who wrote and reviewed ideas. We can't name them publicly yet as more experiments are ongoing and we need to preserve anonymity. But I want to thank you all for your tremendous support!!
12/
13/40
Also shout out to many friends who offered me helpful advice and mental support, esp. @rose_e_wang @dorazhao9 @aryaman2020 @irena_gao @kenziyuliu @harshytj__ @IsabelOGallegos @ihsgnef @gaotianyu1350 @xinranz3 @xiye_nlp @YangjunR @neilbband @mertyuksekgonul @JihyeonJe
13/
14/40
Finally I want to thank my supportive, energetic, insightful, and fun advisors @tatsu_hashimoto @Diyi_Yang
Thank you for teaching me how to do the most exciting research in the most rigorous way, and letting me spend so much time and $$$ on this crazy project!
14/14
15/40
Great work and cool findings. I have to say there is a huge confounding factor: intrinsic motivation. Experts (here, PhD students) might not be sharing their best novel ideas because the incentive is monetary rather than publication, which brings long-term benefits for students such as job offers, prestige, and visibility. Your Section 6.1 already mentions this.
Another confounding factor is time. Do you think the best research or good work happens in 10 days? I personally have never been able to come up with a good idea in 10 days. But at the same time, I don’t know if an LLM can match my ideation (at least not yet).
16/40
this is a really cool study!
did the LLM ideas get run past a Google search? i see the novelty scores for LLM ideas are higher, but i've personally noticed that asking for novel ideas sometimes results in copy-pasta from the LLM, lifted from obscure blog posts/research papers
17/40
18/40
This is cool! Want to come on my podcast and chat about it?
19/40
Great read thank you!
20/40
awesome study! but what’s the creativity worth if it’s less feasible?
in my own experience it often suggests ideas that are flawed in some fundamental respects that it misses.
there are only so many facts and constraints the attention mechanism can take into account
21/40
Your thread is very popular today!
#TopUnroll Thread by @ChengleiSi on Thread Reader App (@vankous for unroll)
22/40
In my paper "Automating psychological hypothesis generation with AI: when large language models meet causal graph" (Humanities and Social Sciences Communications), we showed that our workflow/algorithm can generate novel hypotheses even better than GPT-4 and PhDs.
23/40
What is the best rated LLM generated idea?
24/40
looks like we're out of a job
25/40
Skynet when tho?
26/40
Interesting! I may be doing the exact same experiment, only I let both ChatGPT and Claude participate in a language experiment I choose.
27/40
Interesting work, but would be interesting to know which LLM and prompts were used for this.
28/40
I'd love to see some examples of these novel research ideas. How do they hold up in peer review and actual experimentation?
29/40
All words are arbitrary, yet NLP is a field that treats language and words as statistically significant, excluding the social semiotic and status gain illusions inherent to language. So NLP begins as a notably degenerate field. A novel idea in NLP is technically oxymoronic.
30/40
@threadreaderapp unroll
31/40
Question, if you believe this result, are you going to switch to primarily using LLMs to pick research ideas for yourself?
32/40
Great work! Novelty can be subjective, varying with a topic’s maturity and reviewers’ perspectives. Rather than fully automating research, building practical LLM research assistants could be exciting. Looking forward to the next stage, making LLM research agents more powerful!
33/40
Nice job, eager to read. One question: what if you change the topic… biology, math, arts?
34/40
kudos, super refreshing to see people invest in long term and interesting questions! gg
35/40
asking chatgpt to come up with a product, "an angry birds delivery dating app", VCs are jumping through my window, slapping my face with wads of cash
36/40
@Piniisima
37/40
Hahahahahahahahahahah, no.
38/40
Did LLMs write this derivative drivel?
39/40
Awesome work, you're a legend!
40/40
The infinite unknown is a temptation, and in the face of the limits of our feelings we always have room to manoeuvre. Understanding is only the starting point for crossing over. Creation is the temptation to go beyond the unknown.