
Photorealistic Video Generation with Diffusion Models​

Agrim Gupta1,2,*, Lijun Yu2, Kihyuk Sohn2, Xiuye Gu2, Meera Hahn2, Li Fei-Fei1,

Irfan Essa2, 3, Lu Jiang2, José Lezama2

1Stanford; 2Google Research; 3Georgia Institute of Technology

*Work partially done during an internship at Google.





Abstract​

We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos at 512×896 resolution and 8 frames per second.





W.A.L.T: We encode images and videos into a shared latent space. The transformer backbone processes these latents with blocks containing two layers of window-restricted attention: spatial layers capture spatial relations in both images and video, while spatiotemporal layers model temporal dynamics in videos and pass images through via an identity attention mask. Text conditioning is done via spatial cross-attention.
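To make the block structure concrete, here is a minimal, hypothetical sketch of window-restricted attention over video latents in Python/PyTorch. It is not the authors' code: the latent shapes, window sizes, and the plain nn.MultiheadAttention layer are illustrative assumptions, and the real model also handles text cross-attention and the image identity mask described above.

```python
# Hypothetical sketch (not the W.A.L.T authors' code): alternating spatial and
# spatiotemporal window attention over video latents. Shapes and window sizes
# are illustrative assumptions only.
import torch
import torch.nn as nn

class WindowAttentionBlock(nn.Module):
    def __init__(self, dim, heads, window):  # window = (t, h, w)
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (B, T, H, W, C) video latents
        B, T, H, W, C = x.shape
        t, h, w = self.window
        # Partition the latent grid into non-overlapping (t, h, w) windows.
        x = x.view(B, T // t, t, H // h, h, W // w, w, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, t * h * w, C)
        y = self.norm(x)
        y, _ = self.attn(y, y, y, need_weights=False)   # attention within each window
        x = x + y
        # Undo the window partition.
        x = x.view(B, T // t, H // h, W // w, t, h, w, C)
        x = x.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)
        return x

# Spatial layer: windows span a single frame; spatiotemporal layer: windows also
# span time. In the real model, image inputs pass through the spatiotemporal
# layers via an identity attention mask; that path is omitted here.
spatial = WindowAttentionBlock(dim=256, heads=8, window=(1, 8, 8))
spatiotemporal = WindowAttentionBlock(dim=256, heads=8, window=(4, 8, 8))

latents = torch.randn(2, 4, 16, 16, 256)   # (B, T, H, W, C) toy video latents
out = spatiotemporal(spatial(latents))
print(out.shape)                           # torch.Size([2, 4, 16, 16, 256])
```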
 


Phi-2: The surprising power of small language models​

Published December 12, 2023

By Mojan Javaheripi, Senior Researcher, and Sébastien Bubeck, Partner Research Manager

Contributors​

Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Suriya Gunasekar, Mojan Javaheripi, Piero Kauffmann, Yin Tat Lee, Yuanzhi Li, Anh Nguyen, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Michael Santacroce, Harkirat Singh Behl, Adam Tauman Kalai, Xin Wang, Rachel Ward, Philipp Witte, Cyril Zhang, Yi Zhang







Figure 1. Satya Nadella announcing Phi-2 at Microsoft Ignite 2023.

Over the past few months, our Machine Learning Foundations team at Microsoft Research has released a suite of small language models (SLMs) called “Phi” that achieve remarkable performance on a variety of benchmarks. Our first model, the 1.3 billion parameter Phi-1, achieved state-of-the-art performance on Python coding among existing SLMs (specifically on the HumanEval and MBPP benchmarks). We then extended our focus to common sense reasoning and language understanding and created a new 1.3 billion parameter model named Phi-1.5, with performance comparable to models 5x larger.

We are now releasing Phi-2, a 2.7 billion-parameter language model that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with less than 13 billion parameters. On complex benchmarks Phi-2 matches or outperforms models up to 25x larger, thanks to new innovations in model scaling and training data curation.

With its compact size, Phi-2 is an ideal playground for researchers, including for exploration around mechanistic interpretability, safety improvements, or fine-tuning experimentation on a variety of tasks. We have made Phi-2 available in the Azure AI Studio model catalog to foster research and development on language models.




Key Insights Behind Phi-2​

The massive increase in the size of language models to hundreds of billions of parameters has unlocked a host of emerging capabilities that have redefined the landscape of natural language processing. A question remains whether such emergent abilities can be achieved at a smaller scale using strategic choices for training, e.g., data selection.

Our line of work with the Phi models aims to answer this question by training SLMs that achieve performance on par with models of much higher scale (yet still far from the frontier models). Our key insights for breaking the conventional language model scaling laws with Phi-2 are twofold:

Firstly, training data quality plays a critical role in model performance. This has been known for decades, but we take this insight to its extreme by focusing on “textbook-quality” data, following up on our prior work “Textbooks Are All You Need.” Our training data mixture contains synthetic datasets specifically created to teach the model common sense reasoning and general knowledge, including science, daily activities, and theory of mind, among others. We further augment our training corpus with carefully selected web data that is filtered based on educational value and content quality.

Secondly, we use innovative techniques to scale up, starting from our 1.3 billion parameter model, Phi-1.5, and embedding its knowledge within the 2.7 billion parameter Phi-2. This scaled knowledge transfer not only accelerates training convergence but also shows a clear boost in Phi-2's benchmark scores.






A bar plot comparing the performance of Phi-2 (with 2.7B parameters) and Phi-1.5 (with 1.3B parameters) on common sense reasoning, language understanding, math, coding, and the BIG-Bench Hard benchmark. Phi-2 outperforms Phi-1.5 in all categories. The commonsense reasoning tasks are PIQA, WinoGrande, ARC easy and challenge, and SIQA. The language understanding tasks are HellaSwag, OpenBookQA, MMLU, SQuADv2, and BoolQ. The math task is GSM8k, and coding includes the HumanEval and MBPP benchmarks.


Figure 2. Comparison between Phi-2 (2.7B) and Phi-1.5 (1.3B) models. All tasks are evaluated in 0-shot except for BBH and MMLU which use 3-shot CoT and 5-shot, respectively.

Training Details​

Phi-2 is a Transformer-based model with a next-word prediction objective, trained on 1.4T tokens from multiple passes over a mixture of synthetic and web datasets for NLP and coding. The training for Phi-2 took 14 days on 96 A100 GPUs. Phi-2 is a base model that has not undergone alignment through reinforcement learning from human feedback (RLHF), nor has it been instruct fine-tuned. Despite this, we observed better behavior with respect to toxicity and bias compared to existing open-source models that went through alignment (see Figure 3). This is in line with what we saw in Phi-1.5 thanks to our tailored data curation technique; see our previous tech report for more details. For more information about the Phi-2 model, please visit Azure AI | Machine Learning Studio.
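For readers who want to poke at the model themselves, a minimal sampling sketch with the Hugging Face transformers library is shown below. The post itself only points to Azure AI Studio, so treat the "microsoft/phi-2" checkpoint name as an assumption and swap in wherever the weights are hosted for you.

```python
# Minimal sketch for sampling from Phi-2 locally. The "microsoft/phi-2"
# checkpoint name is an assumption (the post points to Azure AI Studio);
# adjust it to wherever the weights are hosted for you.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", trust_remote_code=True)

prompt = "Write a short explanation of why the sky is blue."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```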





A barplot comparing the safety score of Phi-1.5, Phi-2, and Llama-7B models on 13 categories of the ToxiGen benchmark. Phi-1.5 achieves the highest score on all categories, Phi-2 achieves the second-highest scores and Llama-7B achieves the lowest scores across all categories.


Figure 3. Safety scores computed on 13 demographics from ToxiGen. A subset of 6,541 sentences is selected and scored between 0 and 1 based on scaled perplexity and sentence toxicity. A higher score indicates the model is less likely to produce toxic sentences compared to benign ones.
 


Phi-2 Evaluation​

Below, we summarize Phi-2 performance on academic benchmarks compared to popular language models. Our benchmarks span several categories, namely, Big Bench Hard (BBH) (3-shot with CoT), commonsense reasoning (PIQA, WinoGrande, ARC easy and challenge, SIQA), language understanding (HellaSwag, OpenBookQA, MMLU (5-shot), SQuADv2 (2-shot), BoolQ), math (GSM8k (8-shot)), and coding (HumanEval, MBPP (3-shot)).

With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance than the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller in size.

Of course, we acknowledge the current challenges with model evaluation, and that many public benchmarks might leak into the training data. For our first model, Phi-1, we did an extensive decontamination study to discard this possibility, which can be found in our first report “Textbooks Are All You Need.” Ultimately, we believe that the best way to judge a language model is to test it on concrete use cases. Following that spirit, we also evaluated Phi-2 using several Microsoft internal proprietary datasets and tasks, comparing it again to Mistral and Llama-2. We observed similar trends, i.e. on average, Phi-2 outperforms Mistral-7B, and the latter outperforms the Llama-2 models (7B, 13B, and 70B).


Model | Size | BBH | Commonsense Reasoning | Language Understanding | Math | Coding
Llama-2 | 7B | 40.0 | 62.2 | 56.7 | 16.5 | 21.0
Llama-2 | 13B | 47.8 | 65.0 | 61.9 | 34.2 | 25.4
Llama-2 | 70B | 66.5 | 69.2 | 67.6 | 64.1 | 38.3
Mistral | 7B | 57.2 | 66.4 | 63.7 | 46.4 | 39.4
Phi-2 | 2.7B | 59.2 | 68.8 | 62.0 | 61.1 | 53.7


Table 1. Averaged performance on grouped benchmarks compared to popular open-source SLMs.

Model | Size | BBH | BoolQ | MBPP | MMLU
Gemini Nano 2 | 3.2B | 42.4 | 79.3 | 27.2 | 55.8
Phi-2 | 2.7B | 59.3 | 83.3 | 59.1 | 56.7


Table 2. Comparison between Phi-2 and Gemini Nano 2 Model on Gemini’s reported benchmarks.

In addition to these benchmarks, we also performed extensive testing on commonly used prompts from the research community. The behavior we observed was in line with the expectations set by the benchmark results. For example, we tested a prompt used to probe a model’s ability to solve physics problems, most recently used to evaluate the capabilities of the Gemini Ultra model, and achieved the following result:




An example prompt is given to Phi-2 which says “A skier slides down a frictionless slope of height 40m and length 80m. What's the skier’s speed at the bottom?”. Phi-2 then answers the prompt by explaining the conversion of potential energy to kinetic energy and providing the formulas to compute each one. It then proceeds to compute the correct speed using the energy formulas.


Figure 4. Phi-2’s output on a simple physics problem, which includes an approximately correct square root calculation.
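For reference, the calculation Phi-2 is reproducing is straightforward energy conservation, which the short check below (our own arithmetic, not model output) confirms:

```python
# Energy conservation for the Figure 4 prompt: m*g*h = 0.5*m*v^2. The mass
# cancels, and the 80 m slope length is irrelevant because the slope is
# frictionless.
import math

g, h = 9.8, 40.0                              # gravity (m/s^2), slope height (m)
v = math.sqrt(2 * g * h)
print(f"speed at the bottom ≈ {v:.1f} m/s")   # ≈ 28.0 m/s
```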


The model is then provided with a student’s wrong answer to the skier physics problem and asked if it can correct the student’s mistake. Phi-2 identifies the student’s mistake, i.e., using the wrong formula for potential energy, and provides the correct formula.


Figure 5. Similarly to Gemini’s test, we further queried Phi-2 with a student’s wrong answer to see if Phi-2 could identify where the mistake is (it did, despite Phi-2 not being fine-tuned for chat or instruction-following). We note, however, that this is not fully an apples-to-apples comparison with the Gemini Ultra output described in the Gemini report; in particular, in that case the student’s answer was given as an image with handwritten text, whereas in our case it was given as raw text.
 


Human brain-like supercomputer with 228 trillion links coming in 2024​


Australians develop a supercomputer capable of simulating networks at the scale of the human brain.​


Sejal Sharma
Published: Dec 13, 2023 07:27 AM EST


An artist’s impression of the DeepSouth supercomputer.
WSU



Australian scientists are building a groundbreaking supercomputer that aims to simulate the synapses of a human brain at full scale.

The neuromorphic supercomputer will be capable of 228 trillion synaptic operations per second, which is on par with the estimated number of operations in the human brain.

A team of researchers at the International Centre for Neuromorphic Systems (ICNS) at Western Sydney University has named it DeepSouth.


To be operational by April 2024​

The incredible computational power of the human brain can be seen in the way it performs the equivalent of a billion billion mathematical operations per second using only 20 watts of power. DeepSouth achieves similar levels of parallel processing by employing neuromorphic engineering, a design approach that mimics the brain's functioning.


Highlighting the distinctive features of DeepSouth, Professor André van Schaik, the Director of the ICNS, emphasized that the supercomputer is designed with a unique purpose – to operate in a manner similar to networks of neurons, the basic units of the human brain.

Neuromorphic systems utilize interconnected artificial neurons and synapses to perform tasks. These systems attempt to emulate the brain's ability to learn, adapt, and process information in a highly parallel and distributed manner.

Often applied in the field of AI and machine learning, a neuromorphic system is used with the goal of creating more efficient and brain-like computing systems.

Traditional computing architectures are typically based on the von Neumann design, in which computers are composed of separate CPU and memory units, with data and instructions stored in the latter.

DeepSouth can handle large amounts of data at a rapid pace while consuming significantly less power and being physically smaller than conventional supercomputers.

"Progress in our understanding of how brains compute using neurons is hampered by our inability to simulate brain-like networks at scale. Simulating spiking neural networks on standard computers using Graphics Processing Units (GPUs) and multicore Central Processing Units (CPUs) is just too slow and power intensive. Our system will change that," Professor van Schaik said.




The system is scalable​

The team named the supercomputer DeepSouth as a nod to IBM's TrueNorth system, which pioneered the idea of building computers that act like large networks of neurons, and to Deep Blue, the first computer to beat a world chess champion.

The name also gives a nod to where the supercomputer is located geographically: Australia, which is situated in the southern hemisphere.

The team believes DeepSouth will help in advancements in diverse fields like sensing, biomedical, robotics, space, and large-scale AI applications.

The team believes DeepSouth will also revolutionize smart devices. This includes devices like mobile phones and sensors used in manufacturing and agriculture.
 




The brain may learn about the world the same way some computational models do

Two studies find “self-supervised” models, which learn about their environment from unlabeled data, can show activity patterns similar to those of the mammalian brain.

To make our way through the world, our brain must develop an intuitive understanding of the physical world around us, which we then use to interpret sensory information coming into the brain.

How does the brain develop that intuitive understanding? Many scientists believe that it may use a process similar to what’s known as “self-supervised learning.” This type of machine learning, originally developed as a way to create more efficient models for computer vision, allows computational models to learn about visual scenes based solely on the similarities and differences between them, with no labels or other information.
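As a concrete illustration of that idea (a generic contrastive objective, not the specific models used in these studies), the sketch below pulls together the embeddings of two views of the same scene and pushes apart embeddings of different scenes, with no labels involved:

```python
# Generic contrastive self-supervised objective (illustrative only):
# embeddings of two views of the same scene (z1[i], z2[i]) are pulled
# together; all other pairs in the batch are pushed apart.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                 # (2N, D) stacked embeddings
    sim = z @ z.t() / temperature                  # cosine similarities
    sim.fill_diagonal_(float("-inf"))              # ignore self-similarity
    n = z1.shape[0]
    # The positive for row i is its other view: i <-> n + i.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```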

A pair of studies from researchers at the K. Lisa Yang Integrative Computational Neuroscience (ICoN) Center at MIT offers new evidence supporting this hypothesis. The researchers found that when they trained models known as neural networks using a particular type of self-supervised learning, the resulting models generated activity patterns very similar to those seen in the brains of animals that were performing the same tasks as the models.

The findings suggest that these models are able to learn representations of the physical world that they can use to make accurate predictions about what will happen in that world, and that the mammalian brain may be using the same strategy, the researchers say.









[Submitted on 19 May 2023 (v1), last revised 25 Oct 2023 (this version, v2)]

Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes​

Aran Nayebi, Rishi Rajalingham, Mehrdad Jazayeri, Guangyu Robert Yang
Humans and animals have a rich and flexible understanding of the physical world, which enables them to infer the underlying dynamical trajectories of objects and events, plausible future states, and use that to plan and anticipate the consequences of actions. However, the neural mechanisms underlying these computations are unclear. We combine a goal-driven modeling approach with dense neurophysiological data and high-throughput human behavioral readouts to directly impinge on this question. Specifically, we construct and evaluate several classes of sensory-cognitive networks to predict the future state of rich, ethologically-relevant environments, ranging from self-supervised end-to-end models with pixel-wise or object-centric objectives, to models that future predict in the latent space of purely static image-based or dynamic video-based pretrained foundation models. We find strong differentiation across these model classes in their ability to predict neural and behavioral data both within and across diverse environments. In particular, we find that neural responses are currently best predicted by models trained to predict the future state of their environment in the latent space of pretrained foundation models optimized for dynamic scenes in a self-supervised manner. Notably, models that future predict in the latent space of video foundation models that are optimized to support a diverse range of sensorimotor tasks, reasonably match both human behavioral error patterns and neural dynamics across all environmental scenarios that we were able to test. Overall, these findings suggest that the neural mechanisms and behaviors of primate mental simulation are thus far most consistent with being optimized to future predict on dynamic, reusable visual representations that are useful for Embodied AI more generally.
Comments: 20 pages, 10 figures; NeurIPS 2023 camera-ready version (spotlight)
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Neurons and Cognition (q-bio.NC)
Cite as: arXiv:2305.11772 [cs.AI]
(or arXiv:2305.11772v2 [cs.AI] for this version)

Submission history​

From: Aran Nayebi
[v1] Fri, 19 May 2023 15:56:06 UTC (4,080 KB)
[v2] Wed, 25 Oct 2023 15:34:16 UTC (4,347 KB)
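The core recipe the paper evaluates can be sketched in a few lines: freeze a pretrained encoder, map frames to latents, and train a small head to predict the next latent. The encoder and predictor below are toy stand-ins chosen for brevity, not the foundation models or sensory-cognitive networks actually tested in the paper.

```python
# Minimal sketch of latent future prediction (stand-ins only): a frozen
# encoder maps frames to latent vectors, and a small trained head predicts
# the next latent from the current one.
import torch
import torch.nn as nn

encoder = nn.Sequential(                  # frozen stand-in for a pretrained encoder
    nn.Flatten(), nn.Linear(64 * 64 * 3, 256), nn.ReLU(), nn.Linear(256, 128)
)
for p in encoder.parameters():
    p.requires_grad_(False)

predictor = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

frames = torch.randn(16, 8, 3, 64, 64)    # (batch, time, C, H, W) toy video
with torch.no_grad():
    z = encoder(frames.flatten(0, 1)).view(16, 8, 128)   # per-frame latents

for _ in range(100):                      # train to forecast the next latent
    pred = predictor(z[:, :-1])           # predict z_{t+1} from z_t
    loss = nn.functional.mse_loss(pred, z[:, 1:])
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final loss: {loss.item():.4f}")
```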


 


Researchers perform AI speech recognition with human brain cells​


DECEMBER 11, 2023

  • Researchers used lab-grown brain cells to perform complex speech recognition tasks
  • The 'mini brain' organoid system, called Brainoware, demonstrates a new form of AI
  • It was capable of distinguishing a single voice from 240 audio recordings of multiple Japanese vowel sounds



SAM JEANS


Clusters of human brain cells cultivated in Petri dishes have been integrated with computers to achieve a fundamental level of speech recognition.

Feng Guo of Indiana University Bloomington, whose team published the study in Nature Electronics, explains: “This is a preliminary demonstration to show the feasibility of the concept. There’s still a considerable journey ahead.”

Guo points out two primary challenges in conventional AI that this form of biological AI seeks to solve: high energy consumption and the inherent limitations of silicon chips, like their distinct processing and information storage functions.

Guo’s team, along with others, such as Australia’s Cortical Labs, which trained brain cells to play Pong in 2022, are exploring biocomputing with living nerve cells as a potential solution to these challenges.

These brain organoids, self-organized three-dimensional tissue cultures resembling mini-brains, emerge from stem cells under specific growth conditions.

They can grow to a few millimeters in diameter and contain up to 100 million nerve cells. By comparison, a human brain has approximately 100 billion nerve cells. The organoids are positioned atop a microelectrode array, which both stimulates the organoid and records neuronal activity. Guo’s team refers to this setup as “Brainoware.”

Essentially, Brainoware is a new form of AI quite different from what we usually see in computers and smartphones.

Instead of using regular chips, researchers have created a small cluster of human brain cells – the brain organoid. This tiny ‘mini-brain’ is grown in a lab from stem cells, and it can perform some basic tasks that we usually associate with AI, like recognizing speech patterns.


cells.png

A) A diagram of the “Brainoware” system, which shows a brain organoid (a lab-grown mini-brain) connected to a device that records and stimulates its electrical activity. B) A microscopic image of the brain organoid, stained to highlight its different cell types, such as mature neurons, astrocytes, early-stage neurons, and progenitor cells, showing its complex 3D structure. Source: Nature Electronics.


How it works​

The brain organoid is placed on a special device that can send and read electrical signals.

By doing this, the researchers can communicate with the organoid, kind of like teaching it to respond to certain patterns or inputs. In the study, they trained it to recognize different voices from audio clips.

One of the most remarkable aspects of Brainoware is that it learns and adapts. Just like a human brain gets better at tasks with practice, the organoid improves its ability to recognize voices the more it’s exposed to them.

This brings us a step closer to creating AI that works more like the human brain, which is super efficient and doesn’t need a lot of energy.

However, there are challenges. Growing these brain organoids is tricky – they’re hard to create, tough to replicate consistently, and don’t last long, but the team is working on solutions.


Brainoware performance​

In an unsupervised speech recognition experiment, the organoids were trained to distinguish a single voice from 240 audio recordings of eight individuals uttering Japanese vowel sounds. These sounds were converted into signal sequences and spatial patterns for the organoids.

Initially, the organoids showed an accuracy rate of approximately 30 to 40%, which improved to 70 to 80% after two days of training.


More about the study​

Bio-inspired AI takes a few different forms, such as neuromorphic chips based on biological neurons. This goes a step further by creating computational architecture from biological organoids.

Here’s more detail about how it works:

  1. Bio-inspired AI hardware: The study, published in Nature Electronics, introduces Brainoware, a novel AI hardware that employs biological neural networks within a brain organoid. This marks a fundamental shift from traditional brain-inspired silicon chips, offering a more authentic emulation of brain function.
  2. Brainoware’s structure and functionality: Brainoware operates by interfacing a brain organoid, grown from human pluripotent stem cells, with a high-density multielectrode array. This setup allows for both the transmission of electrical signals to the organoid and the detection of neural responses. The organoid exhibits properties like nonlinear dynamics, memory, and the ability to process spatial information.
  3. Applications demonstrated in the study: The team successfully applied Brainoware in practical scenarios, such as speech recognition and predicting nonlinear chaotic equations (like the Hénon map); a software analogue of this reservoir-style setup is sketched after this list. This shows Brainoware’s ability to improve its computing performance through training, emphasizing its potential for tasks requiring adaptive learning.
  4. Challenges and limitations: Despite its innovative approach, Brainoware faces several technical challenges, including the generation and maintenance of brain organoids. Additionally, the hardware’s reliance on peripheral equipment hinders its potential. In other words, you need a lot of supporting equipment to enable the brain organoids to work correctly.
  5. Future directions and potential: The study suggests that with advancements in organoid cultivation and solving practical issues associated with organoids, Brainoware could evolve into a more efficient and sophisticated system. This could lead to AI hardware that more closely mimics human brain function, potentially lowering energy consumption.
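As referenced in item 3, the sketch below is a purely software analogue of the reservoir-style setup: an echo state network in which a fixed random recurrent network plays the role of the organoid and only a linear readout is trained, here to predict the next step of the Hénon map. All sizes and constants are illustrative assumptions, not values from the study.

```python
# Hypothetical software analogue (not the Brainoware code): an echo state
# network. A fixed random recurrent "reservoir" transforms the input; only a
# linear readout is trained, here to predict the next step of the Henon map:
# x_{n+1} = 1 - 1.4*x_n^2 + y_n,  y_{n+1} = 0.3*x_n.
import numpy as np

rng = np.random.default_rng(0)

def henon(n, x=0.1, y=0.3):
    out = np.empty((n, 2))
    for i in range(n):
        out[i] = (x, y)
        x, y = 1 - 1.4 * x * x + y, 0.3 * x
    return out

data = henon(3000)
inputs, targets = data[:-1], data[1:]            # predict the next state

N = 300                                          # reservoir size (arbitrary)
W_in = rng.uniform(-0.5, 0.5, (N, 2))
W = rng.normal(0, 1, (N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # scale spectral radius below 1

states = np.zeros((len(inputs), N))
s = np.zeros(N)
for i, u in enumerate(inputs):
    s = np.tanh(W_in @ u + W @ s)                # fixed, untrained reservoir dynamics
    states[i] = s

# Train only the linear readout by ridge regression on early time steps.
train = slice(200, 2000)                         # drop a warm-up transient
ridge = 1e-6 * np.eye(N)
W_out = np.linalg.solve(states[train].T @ states[train] + ridge,
                        states[train].T @ targets[train])

pred = states[2000:] @ W_out
print(f"test MSE: {np.mean((pred - targets[2000:]) ** 2):.4f}")
```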

In the future, these types of biocomputing systems might eventually perform AI tasks more energy-efficiently than traditional silicon-based chips.

Developments in bio-inspired AI from this year show immense promise in helping the AI industry overcome the confines of brute-force computing and create energy-efficient technologies as elegant as nature.
 


It’s time for developers and enterprises to build with Gemini Pro


Dec 13, 2023
3 min read
Learn more about how to integrate Gemini Pro into your app or business at ai.google.dev.

Jeanine Banks

VP & GM, Developer X and DevRel

Burak Gokturk
VP & GM, Cloud AI & Industry Solutions




Last week, we announced Gemini, our largest and most capable AI model and the next step in our journey to make AI more helpful for everyone. It comes in three sizes: Ultra, Pro and Nano. We've already started rolling out Gemini in our products: Gemini Nano is in Android, starting with Pixel 8 Pro, and a specifically tuned version of Gemini Pro is in Bard.


Today, we’re making Gemini Pro available for developers and enterprises to build for your own use cases, and we’ll be further fine-tuning it in the weeks and months ahead as we listen and learn from your feedback.

Gemini Pro is available today

The first version of Gemini Pro is now accessible via the Gemini API and here’s more about it:

  • Gemini Pro outperforms other similarly-sized models on research benchmarks.
  • Today’s version comes with a 32K context window for text, and future versions will have a larger context window.
  • It’s free to use right now, within limits, and it will be competitively priced.
  • It comes with a range of features: function calling, embeddings, semantic retrieval and custom knowledge grounding, and chat functionality.
  • It supports 38 languages across 180+ countries and territories worldwide.
  • In today’s release, Gemini Pro accepts text as input and generates text as output. We’ve also made a dedicated Gemini Pro Vision multimodal endpoint available today that accepts text and imagery as input, with text output.
  • SDKs are available for Gemini Pro to help you build apps that run anywhere. Python, Android (Kotlin), Node.js, Swift and JavaScript are all supported.



Gemini Pro has SDKs that help you build apps that run anywhere.​
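For example, a minimal text-only call with the Python SDK looks roughly like the snippet below; the package and model names reflect the launch documentation at ai.google.dev and may change, and the API key is a placeholder you obtain from Google AI Studio.

```python
# Minimal sketch of a Gemini Pro call with the Python SDK announced in the
# post; package and model names follow the launch docs and may change.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")          # placeholder key from Google AI Studio
model = genai.GenerativeModel("gemini-pro")      # text in, text out
response = model.generate_content("Explain what a 32K context window means.")
print(response.text)
```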

Google AI Studio: The fastest way to build with Gemini

Google AI Studio is a free, web-based developer tool that enables you to quickly develop prompts and then get an API key to use in your app development. You can sign into Google AI Studio with your Google account and take advantage of the free quota, which allows 60 requests per minute — 20x more than other free offerings. When you’re ready, you can simply click on “Get code” to transfer your work to your IDE of choice, or use one of the quickstart templates available in Android Studio, Colab or Project IDX. To help us improve product quality, when you use the free quota, your API and Google AI Studio input and output may be accessible to trained reviewers. This data is de-identified from your Google account and API key.


Google AI Studio is a free, web-based developer tool that enables you to quickly develop prompts and then get an API key to use in your app development.​

Build with Vertex AI on Google Cloud

When it's time for a fully-managed AI platform, you can easily transition from Google AI Studio to Vertex AI, which allows for customization of Gemini with full data control and benefits from additional Google Cloud features for enterprise security, safety, privacy and data governance and compliance.

With Vertex AI, you will have access to the same Gemini models, and will be able to:
  • Tune and distill Gemini with your own company’s data, and augment it with grounding to include up-to-the-minute information and extensions to take real-world actions.
  • Build Gemini-powered search and conversational agents in a low code / no code environment, including support for retrieval-augmented generation (RAG), blended search, embeddings, conversation playbooks and more.
  • Deploy with confidence. We never train our models on inputs or outputs from Google Cloud customers. Your data and IP are always your data and IP.
To read more about our new Vertex AI capabilities, visit the Google Cloud blog.

Gemini Pro pricing

Right now, developers have free access to Gemini Pro and Gemini Pro Vision through Google AI Studio, with up to 60 requests per minute, making it suitable for most app development needs. Vertex AI developers can try the same models, with the same rate limits, at no cost until general availability early next year, after which there will be a charge per 1,000 characters or per image across Google AI Studio and Vertex AI.



Big impact, small price: Because of our investments in TPUs, Gemini Pro can be served more efficiently.​

Looking ahead

We’re excited that Gemini is now available to developers and enterprises. As we continue to fine-tune it, your feedback will help us improve. You can learn more and start building with Gemini on ai.google.dev, or use Vertex AI’s robust capabilities on your own data with enterprise-grade controls.

Early next year, we’ll launch Gemini Ultra, our largest and most capable model for highly complex tasks, after further fine-tuning, safety testing and gathering valuable feedback from partners. We’ll also bring Gemini to more of our developer platforms like Chrome and Firebase.

We’re excited to see what you build with Gemini.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,835

Imagen 2​

Our most advanced text-to-image technology

Imagen 2 is our most advanced text-to-image diffusion technology, delivering high-quality, photorealistic outputs that are closely aligned and consistent with the user’s prompt. It can generate more lifelike images by using the natural distribution of its training data, instead of adopting a pre-programmed style.

Imagen 2’s powerful text-to-image technology is available for developers and Cloud customers via the Imagen API in Google Cloud Vertex AI.
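A minimal sketch of calling the Imagen API from Python on Vertex AI is shown below. The model ID and SDK classes are assumptions based on the preview Vertex AI vision SDK at the time of writing, and the project ID is a placeholder; check the current Vertex AI documentation before relying on them.

```python
# Illustrative only: generating an image with Imagen on Vertex AI. The model
# ID and SDK surface are assumptions from the preview Vertex AI vision SDK;
# verify against the current documentation.
import vertexai
from vertexai.preview.vision_models import ImageGenerationModel

vertexai.init(project="your-gcp-project", location="us-central1")  # placeholders

model = ImageGenerationModel.from_pretrained("imagegeneration@005")  # assumed Imagen 2 ID
response = model.generate_images(
    prompt="A jellyfish on a dark blue background",
    number_of_images=1,
)
response.images[0].save(location="jellyfish.png")
```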

The Google Arts and Culture team is also deploying our Imagen 2 technology in their Cultural Icons experiment, allowing users to explore, learn and test their cultural knowledge with the help of Google AI.


Prompt: A shot of a 32-year-old female, up and coming conservationist in a jungle; athletic with short, curly hair and a warm smile


Prompt: A jellyfish on a dark blue background


Prompt: Small canvas oil painting of an orange on a chopping board. Light is passing through orange segments, casting an orange light across part of the chopping board. There is a blue and white cloth in the background. Caustics, bounce light, expressive brush strokes



Improved image-caption understanding​

Text-to-image models learn to generate images that match a user’s prompt from details in their training datasets’ images and captions. But the quality of detail and accuracy in these pairings can vary widely for each image and caption.

To help create higher-quality and more accurate images that better align to a user’s prompt, further description was added to image captions in Imagen 2’s training dataset, helping Imagen 2 learn different captioning styles and generalize to better understand a broad range of user prompts.

These enhanced image-caption pairings help Imagen 2 better understand the relationship between images and words — increasing its understanding of context and nuance.

Here are examples of Imagen 2’s prompt understanding:


Prompt: “Soft purl the streams, the birds renew their notes, And through the air their mingled music floats.” (A Hymn to the Evening by Phillis Wheatley)

AI generated image of a painted underwater scene.

Prompt: “Consider the subtleness of the sea; how its most dreaded creatures glide under water, unapparent for the most part, and treacherously hidden beneath the loveliest tints of azure.” (Moby-Dick by Herman Melville)

AI generated photo-realistic image of a singing robin

Prompt: ”The robin flew from his swinging spray of ivy on to the top of the wall and he opened his beak and sang a loud, lovely trill, merely to show off. Nothing in the world is quite as adorably lovely as a robin when he shows off - and they are nearly always doing it." (The Secret Garden by Frances Hodgson Burnett)



More realistic image generation​

Imagen 2’s dataset and model advances have delivered improvements in many of the areas that text-to-image tools often struggle with, including rendering realistic hands and human faces and keeping images free of distracting visual artifacts.


Examples of Imagen 2 generating realistic hands and human faces.

We trained a specialized image aesthetics model based on human preferences for qualities like good lighting, framing, exposure, sharpness, and more. Each image was given an aesthetics score which helped condition Imagen 2 to give more weight to images in its training dataset that align with qualities humans prefer. This technique improves Imagen 2’s ability to generate higher-quality images.


AI-generated images using the prompt “Flower”, with lower aesthetics scores (left) to higher scores (right).



Fluid style conditioning​

Imagen 2’s diffusion-based techniques provide a high degree of flexibility, making it easier to control and adjust the style of an image. By providing reference style images in combination with a text prompt, we can condition Imagen 2 to generate new imagery that follows the same style.


A visualization of how Imagen 2 makes it easier to control the output style by using reference images alongside a text prompt.



Advanced inpainting and outpainting​

Imagen 2 also enables image editing capabilities like ‘inpainting’ and ‘outpainting’. By providing a reference image and an image mask, users can generate new content directly into the original image with a technique called inpainting, or extend the original image beyond its borders with outpainting. This technology is planned for Google Cloud’s Vertex AI in the new year.


Imagen 2 can generate new content directly into the original image with inpainting.


Imagen 2 can extend the original image beyond its borders with outpainting.



Responsible by design​

To help mitigate the potential risks and challenges of our text-to-image generative technology, we set robust guardrails in place, from design and development to deployment in our products.

Imagen 2 is integrated with SynthID, our cutting-edge toolkit for watermarking and identifying AI-generated content, enabling allowlisted Google Cloud customers to add an imperceptible digital watermark directly into the pixels of the image, without compromising image quality. This allows the watermark to remain detectable by SynthID, even after applying modifications like filters, cropping, or saving with lossy compression schemes.

Before we release capabilities to users, we conduct robust safety testing to minimize the risk of harm. From the outset, we invested in training data safety for Imagen 2, and added technical guardrails to limit problematic outputs like violent, offensive, or sexually explicit content. We apply safety checks to training data, input prompts, and system-generated outputs at generation time. For example, we’re applying comprehensive safety filters to avoid generating potentially problematic content, such as images of named individuals. As we are expanding the capabilities and launches of Imagen 2, we are also continuously evaluating them for safety.



How Imagen 2 is powering text-to-image products across Google​




Resources​




Acknowledgements​

This work was made possible by key research and engineering contributions from:

Aäron van den Oord, Ali Razavi, Benigno Uria, Çağlar Ünlü, Charlie Nash, Chris Wolff, Conor Durkan, David Ding, Dawid Górny, Evgeny Gladchenko, Felix Riedel, Hang Qi, Jacob Kelly, Jakob Bauer, Jeff Donahue, Junlin Zhang, Mateusz Malinowski, Mikołaj Bińkowski, Pauline Luc, Robert Riachi, Robin Strudel, Sander Dieleman, Tobenna Peter Igwe, Yaroslav Ganin, Zach Eaton-Rosen.

Thanks to: Ben Bariach, Dawn Bloxwich, Ed Hirst, Elspeth White, Gemma Jennings, Jenny Brennan, Komal Singh, Luis C. Cobo, Miaosen Wang, Nick Pezzotti, Nicole Brichtova, Nidhi Vyas, Nina Anderson, Norman Casagrande, Sasha Brown, Sven Gowal, Tulsee Doshi, Will Hawkins, Yelin Kim, Zahra Ahmed for driving delivery; Douglas Eck, Nando de Freitas, Oriol Vinyals, Eli Collins, Demis Hassabis for their advice.

Thanks also to many others who contributed across Google DeepMind, including our partners in Google.
 


AI as good as doctors at checking X-rays - study


11th December 2023, 01:30 EST

Lee Bottomley
BBC News West Midlands


University of Warwick
The software for checking X-rays was trained using 2.8m images and is highly accurate, researchers said


Artificial Intelligence (AI) can analyse X-rays and diagnose medical issues just as well as doctors, a study has claimed.

Software was trained using chest X-rays from more than 1.5m patients, and scanned for 37 possible conditions.

It was just as accurate or more accurate than doctors' analysis at the time the image was taken for 35 out of 37 conditions, the University of Warwick said.

The AI could reduce doctors' workload and delays in diagnosis, and offer radiologists the "ultimate second opinion", researchers added.

The software understood that some abnormalities for which it scanned were more serious than others, and could flag the most urgent to medics, the university said.
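To illustrate the kind of system being described (a generic sketch, not X-Raydar itself), a chest X-ray model that screens for 37 conditions is typically a multi-label classifier: it outputs an independent probability for each condition, so a single scan can be flagged for several findings and the most urgent surfaced first. The tiny stand-in below only shows that structure; the backbone, threshold, and input size are arbitrary assumptions.

```python
# Illustrative multi-label chest X-ray classifier (not X-Raydar): one
# independent probability per condition, so a scan can carry several flags.
import torch
import torch.nn as nn

NUM_CONDITIONS = 37
model = nn.Sequential(                       # stand-in backbone, not a real one
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, NUM_CONDITIONS),
)

xray = torch.randn(1, 1, 224, 224)           # one grayscale scan (random data here)
probs = torch.sigmoid(model(xray))[0]        # per-condition probabilities
flagged = (probs > 0.5).nonzero().flatten()  # conditions to surface to a radiologist
print(flagged.tolist())
```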

To check the results were accurate, more than 1,400 X-rays analysed by the software were cross-examined by senior radiologists.

They then compared the diagnoses made by the AI with those made by radiologists at the time.


'Future of medicine'


The software, called X-Raydar, removed human error and bias, said lead author, Dr Giovanni Montana, Professor of Data Science at Warwick University.

"If a patient is referred for an X-ray with a heart problem, doctors will inevitably focus on the heart over the lungs," he said.

“This is totally understandable but runs the risk of undetected problems in other areas.”

AI such as this would be the "future of medicine" and act as a "co-pilot for busy doctors", said co-author, Professor Vicky Goh of King’s College London.

The AI X-ray tool was a collaboration between Warwick University, King’s College London and the NHS, and was funded by the Wellcome Trust.

The software was available open source for non-commercial use to increase the pace of research development, the university added.

 