Artificial intelligence is taking over drug development

bnew

Veteran

Artificial intelligence is taking over drug development​


Regulators need to up their game to keep up​

3D rendered illustration of an orange hand with circuitry texture  holding a syringe, on a pink background
illustration: timo lenzen

Mar 27th 2024


The most striking evidence that artificial intelligence can provide profound scientific breakthroughs came with the unveiling of a program called AlphaFold by Google DeepMind. In 2016 researchers at the company had scored a big success with AlphaGo, an ai system which, having essentially taught itself the rules of Go, went on to beat the most highly rated human players of the game, sometimes by using tactics no one had ever foreseen. This emboldened the company to build a system that would work out a far more complex set of rules: those through which the sequence of amino acids which defines a particular protein leads to the shape that sequence folds into when that protein is actually made. AlphaFold found those rules and applied them with astonishing success.

The achievement was both remarkable and useful. Remarkable because a lot of clever humans had been trying for decades to create computer models of the processes which fold chains of amino acids into proteins. AlphaFold bested their best efforts almost as thoroughly as the system that inspired it trounced human Go players. Useful because the shape of a protein is of immense practical importance: it determines what the protein does and what other molecules can do to it. All the basic processes of life depend on what specific proteins do. Finding molecules that do desirable things to proteins (sometimes blocking their action, sometimes encouraging it) is the aim of the vast majority of the world’s drug development programmes.

Because of the importance of proteins’ three-dimensional structure there is an entire sub-discipline largely devoted to it: structural biology. It makes use of all sorts of technology to look at proteins through nuclear-magnetic-resonance techniques or by getting them to crystallise (which can be very hard) and blasting them with x-rays. Before AlphaFold over half a century of structural biology had produced a couple of hundred thousand reliable protein structures through these means. AlphaFold and its rivals (most notably a program made by Meta) have now provided detailed predictions of the shapes of more than 600m.

As a way of leaving scientists gobsmacked it is a hard act to follow. But if AlphaFold’s products have wowed the world, the basics of how it made them are fairly typical of the sort of things deep learning and generative ai can offer biology. Trained on two different types of data (amino-acid sequences and three-dimensional descriptions of the shapes they fold into) AlphaFold found patterns that allowed it to use the first sort of data to predict the second. The predictions are not all perfect. Chris Gibson, the boss of Recursion Pharmaceuticals, an ai-intensive drug-discovery startup based in Utah, says that his company treats AlphaFold’s outputs as hypotheses to be tested and validated experimentally. Not all of them pan out. But Dr Gibson also says the model is quickly getting better.
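AlphaFold's predictions for most known proteins are published in the public AlphaFold Protein Structure Database, so a researcher can pull a predicted structure and treat it as a testable hypothesis in just the way Dr Gibson describes. A minimal sketch, assuming the database's REST endpoint and the "pdbUrl" field name (check the current API documentation before relying on either):

Python:
import requests

accession = "P68871"  # example UniProt accession (human haemoglobin subunit beta)
resp = requests.get(f"https://alphafold.ebi.ac.uk/api/prediction/{accession}", timeout=30)
resp.raise_for_status()

entry = resp.json()[0]         # the API returns a list of prediction entries
print(sorted(entry.keys()))    # inspect which fields are actually provided

pdb_url = entry.get("pdbUrl")  # assumed field name for the predicted-structure file
if pdb_url:
    with open(f"{accession}_alphafold.pdb", "wb") as fh:
        fh.write(requests.get(pdb_url, timeout=30).content)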

Crystal dreams​

This is what a whole range of ais are now doing in the world of biomedicine and, specifically, drug research: making suggestions about the way the world is that scientists could or would not come up with on their own. Trained to find patterns that extend across large bodies of disparate data, ai systems can discover relationships within those data that have implications for human biology and disease. Presented with new data they can use those patterns of implication to produce new hypotheses which can then be tested.

The ability of ai to generate new ideas provides users with insights that can help to identify drug targets and to predict the behaviour of novel compounds, sometimes never previously imagined, that might act as drugs. It is also being used to find new applications for old drugs, to predict the side effects of new drugs, and to find ways of telling those patients whom a drug might help from those it might harm.

Such computational ambitions are not new. Large-scale computing, machine learning and drug design were already coming together in the 2000s, says Vijay Pande, who was a researcher at Stanford University at the time. This was in part a response to biology’s fire hose of new findings: there are now more than a million biomedical research papers published every year.

One of the early ways in which ai was seen to help with this was through “knowledge graphs”, which allowed all that information to be read by machines and mined for insights about, say, which proteins in the blood might be used as biomarkers revealing the presence or severity of a disease. In 2020 Benevolentai, based in London, used this method to see the potential which baricitinib, sold by Eli Lilly as a treatment for rheumatoid arthritis, had for treating covid-19.
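A toy sketch of the knowledge-graph idea: relations extracted from the literature are stored as edges, and indirect drug-to-disease paths become repurposing hypotheses for experts to check. The edges below are illustrative stand-ins (BenevolentAI's published reasoning ran through the kinase AAK1 and viral endocytosis), not its actual graph:

Python:
import networkx as nx

kg = nx.DiGraph()
# Literature-derived relations would normally be extracted at scale; these are stand-ins.
kg.add_edge("baricitinib", "AAK1", relation="inhibits")
kg.add_edge("baricitinib", "JAK1/JAK2", relation="inhibits")
kg.add_edge("AAK1", "clathrin-mediated endocytosis", relation="regulates")
kg.add_edge("clathrin-mediated endocytosis", "SARS-CoV-2 cell entry", relation="enables")
kg.add_edge("SARS-CoV-2 cell entry", "covid-19", relation="causes")
kg.add_edge("JAK1/JAK2", "cytokine signalling", relation="drives")
kg.add_edge("cytokine signalling", "covid-19", relation="exacerbates")

# Indirect drug -> ... -> disease paths surface candidate repurposing hypotheses.
for path in nx.all_simple_paths(kg, source="baricitinib", target="covid-19", cutoff=4):
    print(" -> ".join(path))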

This January, research published in Science described how ai algorithms of a different sort had accelerated efforts to find biomarkers of long covid in the blood. Statistical approaches to the discovery of such biomarkers can be challenging given the complexity of the data. ais offer a way of cutting through this noise and advancing the discovery process in diseases both new, like long covid, and hard to diagnose, like the early stages of Alzheimer’s.
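The flavour of that kind of biomarker search can be shown with a small, entirely synthetic sketch: fit a classifier to patient-versus-control blood-protein measurements and rank the proteins by how much they contribute to the separation. The data and protein names below are made up for illustration.

Python:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
proteins = [f"protein_{i}" for i in range(50)]

# Synthetic cohort: 200 samples, 50 measured proteins; make two of them informative.
X = rng.normal(size=(200, 50))
y = rng.integers(0, 2, size=200)
X[y == 1, 3] += 1.5   # pretend protein_3 is elevated in cases
X[y == 1, 17] -= 1.0  # pretend protein_17 is depressed in cases

model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Rank proteins by importance; the top hits become hypotheses for experimental validation.
ranked = sorted(zip(proteins, model.feature_importances_), key=lambda t: -t[1])
for name, score in ranked[:5]:
    print(f"{name}\t{score:.3f}")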

The time is right​

But despite this past progress, Dr Pande, now at Andreessen Horowitz, a venture-capital firm that is big on ai, thinks that more recent advances mark a step change. Biomedical research, particularly in biotech and pharma, was steadily increasing its reliance on automation and engineering before the new foundation models came into their own; now that has happened, the two seem to reinforce each other. The new foundation models do not just provide a way to cope with big bodies of data; they demand them. The scads of reliable data highly automated labs can produce in abundance are just the sort of thing for training foundation models. And biomedical researchers need all the help they can get to understand the torrents of data they are now capable of generating.

Biology can be thought of as “an information processing system, albeit an extraordinarily complex one”

By finding patterns humans had not thought to look for, or had no hope of finding unaided, ai offers researchers new ways to explore and understand the mysteries of life. Some talk of ais mastering the “language of biology”, learning to make sense of what evolution has wrought directly from the data in the same way that, trained on lots of real language, they can fluently generate meaningful sentences never uttered before.

Demis Hassabis, the boss of DeepMind, points out that biology itself can be thought of as “an information processing system, albeit an extraordinarily complex and dynamic one”. In a post on Medium, Serafim Batzoglou, the chief data officer at Seer Bio, a Silicon Valley company that specialises in looking at how proteins behave, predicts the emergence of open foundation models that will integrate data spanning from genome sequences to medical histories. These, he argues, will vastly accelerate innovation and advance precision medicine.

Like many of the enthusiasts piling into ai, Dr Pande talks of an “industrial revolution…changing everything”. But his understanding of the time taken so far leads him to caution that achievements justifying that long-term enthusiasm will not come overnight: “We are in a transitory period where people can see the difference but there is still work to do.”
 

bnew

Veteran

All the data from everywhere all at once​

A lot of pharma firms have made significant investments in the development of foundation models in recent years. Alongside this has been a rise in ai-centred startups such as Recursion, Genesis Therapeutics, based in Silicon Valley, Insilico, based in Hong Kong and New York, and Relay Therapeutics, in Cambridge, Massachusetts. Daphne Koller, the boss of Insitro, an ai-heavy biotech in South San Francisco, says one sign of the times is that she no longer needs to explain large language models and self-supervised learning. And Nvidia—which makes the graphics-processing units that are essential for powering foundation models—has shown a keen interest. In the past year it has invested in or made partnership deals with at least six different ai-focused biotech firms, including Schrödinger, another New York-based firm, Genesis, Recursion and Genentech, an independent subsidiary of Roche, a big Swiss pharmaceutical company.

The drug-discovery models many of the companies are working with can learn from a wide variety of biological data including gene sequences, pictures of cells and tissues, the structures of relevant proteins, biomarkers in the blood, the proteins being made in specific cells and clinical data on the course of disease and effect of treatments in patients. Once trained, the ais can be fine tuned with labelled data to enhance their capabilities.

The use of patient data is particularly interesting. For fairly obvious reasons it is often not possible to discover the exact workings of a disease in humans through experiment. So drug development typically relies a lot on animal models, even though they can be misleading. ais that are trained on, and better attuned to, human biology may help avoid some of the blind alleys that stymie drug development.

Insitro, for example, trains its models on pathology slides, gene sequences, mri data and blood proteins. One of its models is able to connect changes in what cells look like under the microscope with underlying mutations in the genome and with clinical outcomes across various different diseases. The company hopes to use these and similar techniques to find ways to identify sub-groups of cancer patients that will do particularly well on specific courses of treatment.

Recursion says it can use laboratory robotics to conduct 2.2m experiments each week

Sometimes finding out what aspect of the data an ai is responding to is useful in and of itself. In 2019 Owkin, a Paris-based “ai biotech”, published details of a deep neural network trained to predict survival in patients with malignant mesothelioma, a cancer of the tissue surrounding the lung, on the basis of tissue samples mounted on slides. It found that the cells most germane to the ai’s predictions were not the cancer cells themselves but non-cancerous cells nearby. The Owkin team brought additional cellular and molecular data into the picture and discovered a new drug target. In August last year a team of scientists from Indiana University Bloomington trained a model on data about how cancer cells respond to drugs (including genetic information) and the chemical structures of drugs, allowing it to predict how effective a drug would be in treating a specific cancer.
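The general recipe behind that kind of response model can be sketched as follows: featurise the drug (here with an RDKit Morgan fingerprint) and the cancer cell line (here with synthetic stand-ins for expression data), concatenate the two, and regress on the measured response. This illustrates the approach under those assumptions, not the Indiana team's published model.

Python:
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

def drug_features(smiles, n_bits=1024):
    # Morgan (circular) fingerprint as a fixed-length bit vector.
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    arr = np.zeros((n_bits,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

rng = np.random.default_rng(1)
smiles_list = ["CC(=O)Oc1ccccc1C(=O)O", "CN1CCC[C@H]1c1cccnc1"]  # aspirin, nicotine

X, y = [], []
for smi in smiles_list:
    for _ in range(20):                  # 20 synthetic "cell lines" per drug
        cell = rng.normal(size=100)      # stand-in for gene-expression features
        X.append(np.concatenate([drug_features(smi), cell]))
        y.append(rng.normal())           # stand-in for measured drug sensitivity

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(np.array(X), np.array(y))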

Many of the companies using ai need such great volumes of high-quality data that they are generating it themselves as part of their drug-development programmes rather than waiting for it to be published elsewhere. One variation on this theme comes from a new computational-sciences unit at Genentech which uses a “lab in the loop” approach to train its ai. The system’s predictions are tested at a large scale by means of experiments run with automated lab systems. The results of those experiments are then used to retrain the ai and enhance its accuracy. Recursion, which is using a similar strategy, says it can use automated laboratory robotics to conduct 2.2m experiments each week.
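A hedged sketch of what a "lab in the loop" amounts to computationally: the model nominates the candidates it is least certain about, an automated lab (simulated below by an oracle function) measures them, and the new results go back into training. This is a generic active-learning loop, not either company's actual pipeline.

Python:
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
pool = rng.uniform(-3, 3, size=(500, 5))   # candidate experiments (e.g. compounds)

def oracle(x):
    # Stand-in for the automated lab: a noisy measurement of an unknown response.
    return np.sin(x[:, 0]) + 0.1 * rng.normal(size=len(x))

# Start from a small labelled set, then iterate model -> experiments -> retraining.
idx = rng.choice(len(pool), size=10, replace=False)
X_lab, y_lab = pool[idx], oracle(pool[idx])

for round_ in range(5):
    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_lab, y_lab)
    # Uncertainty proxy: disagreement between the ensemble's trees.
    per_tree = np.stack([tree.predict(pool) for tree in model.estimators_])
    uncertain = per_tree.std(axis=0).argsort()[-20:]          # 20 most uncertain candidates
    X_new, y_new = pool[uncertain], oracle(pool[uncertain])   # "run" those experiments
    X_lab, y_lab = np.vstack([X_lab, X_new]), np.concatenate([y_lab, y_new])
    print(f"round {round_}: {len(X_lab)} labelled experiments")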

The point is to change it​

As pharma firms become increasingly hungry for data, concerns about the privacy of patient data are becoming more prominent. One way of dealing with the problem, used by Owkin among others, is “federated learning”, in which the training data Owkin needs to build an atlas of cancer-cell types never leave the hospital where the tissue samples are stored: what the data can offer in terms of training is taken away. The data themselves remain.
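A minimal sketch of the federated idea, assuming nothing about Owkin's actual implementation: each hospital fits a model on the data it holds, only the fitted parameters travel, and a central server averages them. A simple linear model keeps the sketch short.

Python:
import numpy as np

rng = np.random.default_rng(3)
true_w = np.array([0.5, -1.2, 2.0])

def local_update(global_w, X, y, lr=0.1, steps=50):
    # Plain gradient descent on one site's private data; only the parameters are returned.
    w = global_w.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Three "hospitals", each with private data that never leaves the site.
hospitals = []
for _ in range(3):
    X = rng.normal(size=(100, 3))
    y = X @ true_w + 0.1 * rng.normal(size=100)
    hospitals.append((X, y))

global_w = np.zeros(3)
for _ in range(10):
    local_ws = [local_update(global_w, X, y) for X, y in hospitals]
    global_w = np.mean(local_ws, axis=0)   # the server only ever sees parameters
print(global_w)   # approaches true_w without any raw records being shared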

chart: the economist

The implications of ai go beyond understanding disease and on into figuring out how to intervene. Generative ai models, such as ProteinSGM from the University of Toronto, are now powerful tools in protein design because they are not merely able to picture existing proteins but can also design new ones—with desired characteristics—that do not currently exist in nature but are plausible ways of embodying a desired function. Other systems allow chemists to design small molecules that might be useful as drugs because they interact with a target in a desired way.

At every stage the ai hypotheses need to be checked against reality. Even so, such an approach seems to speed up discovery. A recent analysis of drugs from “ai-intensive” firms carried out by bcg, a consulting group, found that of eight drugs for which information was available, five had reached clinical trials in less than the typical time for doing so. Other work suggests ai could yield time and cost savings of 25% to 50% in the preclinical stage of drug development, which can take four to seven years. Given the cost in time and money of the whole process, which can run to several billion dollars for a single drug, such improvements could transform the industry’s productivity. But it will take time to know for sure. Drug pipelines are still slow; none of these promised new drugs has yet got to market.

Insilico Medicine is one of the companies hoping for that to change. It uses a range of models in its drug development process. One identifies the proteins that might be targeted to influence a disease. Another can design potential new drug compounds. Using this approach it identified a drug candidate which might be useful against pulmonary fibrosis in less than 18 months and at a cost of $3m—a fraction of the normal cost. The drug recently started Phase 2 trials.

A lot of pharma firms in China are doing deals with ai-driven companies like Insilico in the hope of seeing more of the same. Some hope that such deals might be able to boost China’s relatively slow-growing drug-development businesses. China’s contract-research organisations are already feeling the benefits of ai-fuelled interest in new molecules from around the world. Investment in ai-assisted drug discovery in China was more than $1.26bn in 2021.

The world has seen a number of groundbreaking new drugs and treatments in the past decade: the drugs targeting glp-1 that are transforming the treatment of diabetes and obesity; the car-t therapies enlisting the immune system against cancer; the first clinical applications of genome editing. But the long haul of drug development, from discerning the biological processes that matter to identifying druggable targets to developing candidate molecules to putting them through preclinical tests and then clinical trials, remains generally slow and frustrating work. Approximately 86% of all drug candidates developed between 2000 and 2015 failed to meet their primary endpoints in clinical trials. Some argue that drug development has picked off most of biology’s low-hanging fruit, leaving diseases which are intractable and drug targets that are “undruggable”.

The next few years will demonstrate conclusively whether ai is able to materially shift that picture. If it offers merely incremental improvements, that could still be a real boon. If it allows biology to be deciphered in a whole new way, as the most boosterish suggest, it could make the whole process far more successful and efficient—and drug the undruggable very rapidly indeed. The analysts at bcg see signs of a fast-approaching ai-enabled wave of new drugs. Dr Pande warns that drug regulators will need to up their game to meet the challenge. It would be a good problem for the world to have.
 

bnew

Veteran

1/1
@kimmonismus
That’s what I love google and especially DeepMind for

„Tx-LLM is a single model that is fine-tuned to predict properties for tasks related to therapeutic development, ranging from early-stage target identification to late-stage clinical trial approval.“

In combination with AlphaFold2 an absolute game changer.

[Quoted tweet]
Development of therapeutic drugs is often difficult and time consuming. A new model, Tx-LLM, is able to predict the properties of many entities of potential interest for therapeutic development with accuracy comparable to state-of-the-art specialty models. → goo.gle/3Zb3AfM









1/11
@GoogleAI
Development of therapeutic drugs is often difficult and time consuming. A new model, Tx-LLM, is able to predict the properties of many entities of potential interest for therapeutic development with accuracy comparable to state-of-the-art specialty models. → Tx-LLM: Supporting therapeutic development with large language models





2/11
@BitValentine
Fascinating breakthrough! But how adaptable is this model across diverse therapeutic areas? Could accelerate drug discovery substantially.



3/11
@gpt_biz
This new model Tx-LLM sounds like a real breakthrough for faster drug development, looking forward to its impact!



4/11
@AlexFridd
Exciting breakthrough! The Tx-LLM model could streamline drug development by predicting therapeutic properties more efficiently.



5/11
@ak_panda
“Tx-LLM, is able to predict the properties of many entities of potential interest for therapeutic development with accuracy”



6/11
@Karenhalmi
Tx-LLM sounds like a breakthrough in drug development! Its ability to predict properties across multiple stages could significantly speed up the process. This kind of AI application has the potential to revolutionize healthcare and medicine. Exciting progress!



7/11
@softech28
Jesus Christ has returned
support @elonmusk
vote @realDonaldTrump
stand with @israel
.



8/11
@diegonov1
Let’s get that chemistry novel



9/11
@DLabz
@jezzichara



10/11
@MedExpertAI
A promising development in therapeutic research. Tx-LLM's ability to predict properties of various biological entities is a significant step forward in streamlining the drug discovery process. Happy to see the search for solutions to improve health.



11/11
@Miguel65370735
@DotCSV




 

bnew

Veteran



1/13
@GoogleDeepMind
Developing new drugs can be notoriously slow, expensive and risky. 🔬

Enter TxGemma: a more general, open and scalable set of AI models to help improve this process for researchers and developers. Find out more → Introducing TxGemma: Open models to improve therapeutics development - Google Developers Blog



https://video.twimg.com/ext_tw_video/1905274394126991360/pu/vid/avc1/720x900/WyqiqwjOw5ue0U7q.mp4

2/13
@GoogleDeepMind
Built on our state-of-the-art Gemma models, TxGemma can understand and predict the properties of small molecules, chemicals, proteins and more.

This could help scientists identify promising targets faster, predict clinical trial outcomes, and reduce overall costs. https://goo.gle/TxGemma





3/13
@GoogleDeepMind
Researchers can even ask TxGemma questions on the path to developing potential new treatments - like how safe or effective they might be.

Get started with the model on @GoogleCloud’s #VertexAI platform and on @HuggingFace ↓ Introducing TxGemma: Open models to improve therapeutics development - Google Developers Blog



4/13
@robertkainz04
The acceleration is accelerating



5/13
@YowayBuorn
Exactly what we need...more drugs 🤣



6/13
@philtrem22
❤️



7/13
@bmorphism
gotta get me dat peptide



8/13
@dgreller
@DeryaTR_ Thought you’d find this interesting.



9/13
@m4zas24
@GoogleDeepMind Only available for scientists or for everyone ?



10/13
@uset82
I want to make my own smart pills 🤓💪



11/13
@UtopicDev
imagine the danger if @elder_plinius gets his hands on this 🤯



12/13
@E_FutureFan
TxGemma's clinical trial prediction could be a game-changer for drug development timelines. This is exactly the kind of AI application we need more of! 🔬



13/13
@NaveenSankarS
Children exposed to prenatal steroids may face modestly higher risks of autism, ADHD, and mood disorders.

Check out the latest research at Just Healthcare | Naveen Sankar S | Substack
#PregnancyCare 🍼 #Neurodevelopment 🧠 #MedicalResearch #ADHD #Autism






 

bnew

Veteran

Google AI Released TxGemma: A Series of 2B, 9B, and 27B LLMs for Multiple Therapeutic Tasks in Drug Development, Fine-Tunable with Transformers​


By Asif Razzaq

March 27, 2025


Developing therapeutics continues to be an inherently costly and challenging endeavor, characterized by high failure rates and prolonged development timelines. The traditional drug discovery process necessitates extensive experimental validations from initial target identification to late-stage clinical trials, consuming substantial resources and time. Computational methodologies, particularly machine learning and predictive modeling, have emerged as pivotal tools to streamline this process. However, existing computational models are typically highly specialized, limiting their effectiveness in addressing diverse therapeutic tasks and offering limited interactive reasoning capabilities required for scientific inquiry and analysis.

To address these limitations, Google AI has introduced TxGemma, a collection of generalist large language models (LLMs) designed explicitly to facilitate various therapeutic tasks in drug development. TxGemma distinguishes itself by integrating diverse datasets, encompassing small molecules, proteins, nucleic acids, diseases, and cell lines, which allows it to span multiple stages within the therapeutic development pipeline. TxGemma models, available with 2 billion (2B), 9 billion (9B), and 27 billion (27B) parameters, are fine-tuned from the Gemma-2 architecture using comprehensive therapeutic datasets. Additionally, the suite includes TxGemma-Chat, an interactive conversational model variant that enables scientists to engage in detailed discussions and mechanistic interpretations of predictive outcomes, fostering transparency in model utilization.


From a technical standpoint, TxGemma capitalizes on the extensive Therapeutic Data Commons (TDC), a curated dataset containing over 15 million datapoints across 66 therapeutically relevant datasets. TxGemma-Predict, the predictive variant of the model suite, demonstrates significant performance across these datasets, matching or exceeding the performance of both generalist and specialist models currently employed in therapeutic modeling. Notably, the fine-tuning approach employed in TxGemma optimizes predictive accuracy with substantially fewer training samples, providing a crucial advantage in domains where data scarcity is prevalent. Further extending its capabilities, Agentic-Tx, powered by Gemini 2.0, dynamically orchestrates complex therapeutic queries by combining predictive insights from TxGemma-Predict and interactive discussions from TxGemma-Chat with external domain-specific tools.
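The Therapeutics Data Commons that TxGemma trains on is itself accessible programmatically through the PyTDC package. A hedged sketch: the dataset name below is one of TDC's ADME benchmarks, and the column names should be checked against TDC's documentation.

Python:
# pip install PyTDC
from tdc.single_pred import ADME

data = ADME(name="Caco2_Wang")    # small-molecule permeability benchmark from TDC
split = data.get_split()           # dict of train/valid/test pandas DataFrames
print(split["train"].head())       # expected columns: Drug_ID, Drug (SMILES), Y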

Empirical evaluations underscore TxGemma’s capability. Across 66 tasks curated by the TDC, TxGemma-Predict consistently achieved performance comparable to or exceeding existing state-of-the-art models. Specifically, TxGemma’s predictive models surpassed state-of-the-art generalist models in 45 tasks and specialized models in 26 tasks, with notable efficiency in clinical trial adverse event predictions. On challenging benchmarks such as ChemBench and Humanity’s Last Exam, Agentic-Tx demonstrated clear advantages over previous leading models, enhancing accuracy by approximately 5.6% and 17.9%, respectively. Moreover, the conversational capabilities embedded in TxGemma-Chat provided essential interactive reasoning to support in-depth scientific analyses and discussions.

TxGemma’s practical utility is particularly evident in adverse event prediction during clinical trials, an essential aspect of therapeutic safety evaluation. TxGemma-27B-Predict demonstrated robust predictive performance while utilizing significantly fewer training samples compared to conventional models, illustrating enhanced data efficiency and reliability. Moreover, computational performance assessments indicate that the inference speed of TxGemma supports practical real-time applications, such as virtual screening, with the largest variant (27B parameters) capable of efficiently processing large sample volumes daily when deployed on scalable infrastructure.
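Because the weights are published on Hugging Face, a researcher can load a TxGemma checkpoint with the standard transformers API. A minimal sketch; the model id and the free-text prompt below are assumptions, and the official model card describes the exact TDC-style prompt templates the predictive variants expect.

Python:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/txgemma-2b-predict"   # assumed id; 9B and 27B variants also exist
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = (
    "Instructions: Answer the following question about drug properties.\n"
    "Question: Does the molecule with SMILES CC(=O)Oc1ccccc1C(=O)O cross the blood-brain barrier?\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(output[0], skip_special_tokens=True))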

In summary, the introduction of TxGemma by Google AI represents a methodical advancement in computational therapeutic research, combining predictive efficacy, interactive reasoning, and improved data efficiency. By making TxGemma publicly accessible, Google enables further validation and adaptation on diverse, proprietary datasets, thereby promoting broader applicability and reproducibility in therapeutic research. With sophisticated conversational functionality via TxGemma-Chat and complex workflow integration through Agentic-Tx, the suite provides researchers with advanced computational tools capable of significantly enhancing decision-making processes in therapeutic development.

Check out the Paper and Models on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit.

 

WIA20XX

Superstar
Has anyone asked the thalidomide question? Is the model taking into account stereo-chemistry?

I'm honestly surprised we have that type of data on (tens? hundreds?) thousands of drugs/chemicals vs billions of people, all with similar but ultimately unique bodies/body chemistry...
 

bnew

Veteran
Has anyone asked the thalidomide question? Is the model taking into account stereo-chemistry?

I'm honestly surprised we have that type of data on (tens? hundreds?) thousands of drugs/chemicals vs billions of people, all with similar but ultimately unique bodies/body chemistry...

had to use A.I cause your post went over my head.



In simple terms, the person is asking two questions about an artificial intelligence model called TxGemma that was developed by Google for drug discovery:

1. "Has anyone asked the thalidomide question?" - This refers to whether the model has been tested on its ability to predict the harmful effects of thalidomide, a drug that caused birth defects in the 1950s-60s when given to pregnant women. The thalidomide case is often used as an example of why thorough drug safety testing is crucial.

2. "Is the model taking into account stereo-chemistry?" - Stereochemistry refers to the 3D arrangement of atoms in molecules. For drugs, this can be very important because different 3D arrangements (called stereoisomers) of the same molecule can have very different effects in the body. In the case of thalidomide, one stereoisomer was helpful for morning sickness while the other caused birth defects.

The person is asking these questions in the context of Google releasing new AI models (TxGemma) for helping with drug development. They want to know if these important safety considerations from past drug disasters have been factored into the new AI system's capabilities.

---



Here's a template for a conversation starter prompt based on the original query:

Code:
Explain this concept in simple terms, as I'm unfamiliar with the terminology:
```
[Insert technical question or statement here]
```
in the context of this:
```
[Insert longer related article or text here]
```

This template allows users to ask for explanations of complex topics within a broader context. It's particularly useful for:

1. Breaking down technical jargon or specialized concepts
2. Providing context for specific questions within a larger topic
3. Helping users understand parts of longer articles or texts
4. Encouraging more detailed and relevant explanations

To use this template, simply replace the bracketed sections with the specific question and contextual information you want to explore.

---
 

WIA20XX

Superstar
had to use A.I cause your post went over my head.
---


Thalidomide was a drug aimed at pregnant women for morning sickness.

When you make it in the lab, you get 2 molecules with the same exact same formula. They are "mirror" images of each other. (stereo-isomer)
Kinda like a left shoe and right shoe are identical but different.

With thalidomide.
  • One version, treats/prevents morning sickness.
  • The other version causes birth defects.
If you just looked at the chemical formula, there's no way to know, for one, that there are going to be stereoisomers, much less that one of them is gonna cause birth defects.

There's also no way to just make the "good" one. (Although that was a problem I used to think about back in the day...)

All that said, the drug is still used today, just not on pregnant women.
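A minimal RDKit sketch of that point, using alanine only because it is small: the two mirror-image SMILES below have the identical molecular formula but are not the same molecule, which is exactly the information a formula-only view throws away.

Python:
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

a = Chem.MolFromSmiles("C[C@H](N)C(=O)O")    # one enantiomer of alanine
b = Chem.MolFromSmiles("C[C@@H](N)C(=O)O")   # its mirror image

print(rdMolDescriptors.CalcMolFormula(a))    # C3H7NO2
print(rdMolDescriptors.CalcMolFormula(b))    # C3H7NO2 -- same formula
print(Chem.MolToSmiles(a) == Chem.MolToSmiles(b))   # False: canonical SMILES differ
print(Chem.FindMolChiralCenters(a), Chem.FindMolChiralCenters(b))  # opposite R/S labels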

If TxGemma is looking at, suggesting, or generating novel chemical compounds for further research - do we have the type of data that can reliably do what we task the AI to do?

It's been a minute since I was getting my hands dirty in that stuff - but I'd be HELLA surprised that there are training sets out there like that.

That sounds like unfathomable amounts of complexity to me. Just thinking about it as a linear algebra problem - you've got 3d architecture of the proposed chemical (which itself is super complicated) being multiplied across the millions/billions/trillions? of complex chemical interactions in millions of different human bodies.

It's like throwing a random 3d Tetris piece into a junk yard that resulted in a line being completed....

(Honestly, I'm amazed that modern drugs get made in the first place, but I digress.)

And this is one of those AI joints, where imo, they really can't let it "black box" an answer...
 

bnew

Veteran
A.I generated by R1-1776:

Code:
is it possible the model can take into account stereo-chemistry?
```
[inserted data from https://developers.google.com/health-ai-developer-foundations/txgemma]
and later
https://developers.google.com/health-ai-developer-foundations/txgemma/model-card
```

Okay, I need to figure out if TxGemma can account for stereochemistry based on the new info in the model card. Let's start by recalling the user's original question about whether the model considers stereochemistry. In the previous answer, I noted that it depends on the training data and tokenization, but the original model card didn't specify. Now, the user provided an updated model card, so I need to check if there's new info here.
First, looking at the model card's "Dataset details" section, it mentions that the training data comes from the Therapeutics Data Commons (TDC), which covers 66 tasks with over 15 million data points. But does it specify if stereochemistry is included? The model card says that TxGemma is trained on datasets with commercial licenses, but there's no explicit mention of whether those datasets include stereochemical annotations in SMILES strings.

Next, the model's architecture is based on Gemma 2, a decoder-only transformer. The technical specs don't mention any modifications to handle SMILES-specific features like chiral indicators. However, the "Inputs and outputs" section states that inputs can include SMILES strings. But the key point is whether the SMILES processing in the tokenizer preserves stereochemical symbols like @, /, and \. The model card doesn't discuss the tokenization process in detail, so we can't be sure. However, the example code shows a sample SMILES string (CN1C(=O)CN=C(C2=CCCCC2)c2cc(Cl)ccc21) that doesn't have any stereochemistry indicators. That might just be an example, though.

The "Performance & Validation" section notes validation on TDC's 66 tasks, which likely includes tasks sensitive to stereochemistry, like binding affinity. But without explicit mention, it's unclear. The "Key performance metrics" mention surpassing specialist models on 26 tasks, which could imply handling of stereochemistry if those tasks require it. However, this is speculative.

The "Limitations" section says that the model is trained on public data from TDC and that task-specific validation is necessary. This suggests users should verify for their specific cases, including stereochemistry. The example prompts in the code use SMILES without stereochemistry, but maybe the model can still process them if present.

In the previous answer, the user was advised to check tokenization and test with chiral examples. Given the updated model card still doesn't explicitly state whether stereochemical data was included or how the tokenizer handles those symbols, the conclusion would still be cautious. However, since TxGemma is fine-tuned on TDC data, which includes various therapeutic tasks, there's a possibility that some tasks in TDC require stereochemistry, thus the model might have been exposed to such data. But without confirmation, it's still a maybe.

So, the revised answer should incorporate the new model card details but note that stereochemistry isn't explicitly mentioned. The advice would still be to test with known examples and check the specific TDC datasets used. The user should also look into the TxGemma paper mentioned (https://goo.gle/TxGemma) for more details on data processing, as the model card references it but doesn't provide specifics on stereochemistry.

TxGemma's ability to account for stereochemistry remains uncertain but more contextually addressable given the additional details in the model card. Here’s the revised analysis:

1. Training data (TDC)

The model card specifies TxGemma was fine-tuned on TDC datasets (Therapeutics Data Commons), which include tasks requiring stereochemical awareness (e.g., drug-target binding affinity, BBB permeability). However:
  • The model card clarifies TxGemma uses commercially licensed TDC datasets only, but it does not explicitly confirm whether these subsets include stereochemical annotations in SMILES strings.
  • TDC itself contains stereosensitive tasks (e.g., hERG, Solubility, Bioavailability), but whether the licensed subsets used for TxGemma explicitly included stereoisomer data is unclear.

2. SMILES input handling and tokenization

The model card explicitly states that SMILES strings are valid inputs, but note:
  • The provided code examples use SMILES without stereochemistry (e.g., CN1C(=O)CN=C...), which is not definitive proof the model ignores stereochemistry.
  • The tokenizer (inherited from Gemma) is not explicitly modified to handle stereochemical symbols (@, /, \), but SMILES syntax is preserved in input formatting. If the training data included these symbols, the model could learn their significance.

3. Performance on stereosensitive tasks

TxGemma achieves best-in-class or competitive performance on 50/66 TDC tasks, including stereosensitive tasks like:
  • Binding affinity prediction (affected by 3D structure)
  • Drug-target interaction (chirality-dependent)
  • Toxicity classification (stereochemistry-dependent for some compounds)
This suggests potential stereochemical awareness, though the model card does not isolate stereochemistry as a tested variable.

4. How to verify empirically

To verify stereochemical capability:

  • Test with Enantiomers: Input SMILES pairs of enantiomers (e.g., C[C@H](O)COOH vs. C[C@@H](O)COOH) for tasks like toxicity or BBB permeability prediction. Compare outputs for expected divergence.
  • Inspect TDC Dataset Details: Review the specific TDC datasets used for training (via TDC documentation) to confirm stereochemical annotations.

  • Probe Tokenization: Check if the tokenizer preserves stereochemical symbols (e.g., @):


Python:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/txgemma-27b-chat")
print(tokenizer.tokenize("C[C@H](O)COOH"))  # check whether `@` is kept as a token or split

5. Mitigations via TxGemma-Chat and agentic workflows

  • TxGemma-Chat models can explain their reasoning (via "Explain your reasoning based on the molecule structure" prompts). Leverage this to assess if stereochemistry is mentioned in responses.
  • The model’s agentic workflow integration (via Gemini 2) could compensate for stereochemical limitations by augmenting with external tools (e.g., molecular docking simulations).

Conclusion

While the model card does not explicitly confirm stereochemical handling, TxGemma’s strong performance on stereosensitive TDC tasks and support for SMILES inputs suggest potential capability. For critical applications:
  • Validate empirically with known stereoisomers.
  • Use TxGemma-Chat to request stereochemical rationales.
  • Consult the TxGemma manuscript for dataset/architecture specifics.
Proceed cautiously but optimistically: stereochemical awareness is likely partial and task-dependent.




 