bnew

Veteran
Joined
Nov 1, 2015
Messages
51,004
Reputation
7,865
Daps
147,240


Intron Health gets backing for its speech recognition tool that recognizes African accents​





Annie Njanja



4:04 AM PDT • July 25, 2024







Intron Health raises $1.6 million pre-seed funding
Image Credits: Intron Health

Voice recognition is being integrated into nearly every facet of modern life, but a big gap remains: speakers of minority languages, and people with strong accents or speech disorders such as stuttering, are typically less able to use the speech recognition tools that control applications, transcribe speech or automate tasks, among other functions.

Tobi Olatunji, founder and CEO of clinical speech recognition startup Intron Health, wants to bridge this gap. He claims that Intron has built Africa’s largest clinical speech database, with its algorithm trained on 3.5 million audio clips (16,000 hours) from over 18,000 contributors, mainly healthcare practitioners, representing 29 countries and 288 accents. Olatunji says that drawing most of its contributors from the healthcare sector ensures that medical terms are pronounced and captured correctly for his target markets.


“Because we’ve already trained on many African accents, it’s very likely that the baseline performance of their accents will be much better than any other service they use,” he said, adding that data from Ghana, Uganda and South Africa is growing, and that the startup is confident about deploying the model there.

Olatunji’s interest in health-tech stems from two strands of his experience. First, he trained and practiced as a medical doctor in Nigeria, where he saw first-hand the inefficiencies of that market’s systems, including how much paperwork had to be filled out and how hard it was to keep track of it all.

“When I was a doctor in Nigeria a couple years ago, even during medical school and even now, I get irritated easily doing a repetitive task that is not deserving of human efforts,” he said. “An easy example is we had to write a patient’s name on every lab order you do. And just something that’s simple, let’s say I’m seeing the patients, and they need to get some prescriptions, they need to get some labs. I have to manually write out every order for them. It’s just frustrating for me to have to repeat the patient name over and over on each form, the age, the date, and all that… I’m always asking, how can we do things better? How can we make life easier for doctors? Can we take some tasks away and offload them to another system so that the doctor can spend their time doing things that are very valuable?”

Those questions propelled him into the next phase of his life. Olatunji moved to the U.S. to pursue, initially, a master’s degree in medical informatics at the University of San Francisco, and then another in computer science at Georgia Tech.

He then cut his teeth at a number of tech companies. As a clinical natural language processing (NLP) scientist and researcher at Enlitic, a San Francisco Bay Area company, he built models to automate the extraction of information from radiology text reports. He also worked at Amazon Web Services as a machine learning scientist. At both Enlitic and Amazon, he focused on natural language processing for healthcare, shaping systems that help hospitals run better.



Throughout those experiences, he started to form ideas around how what was being developed and used in the U.S. could be used to improve healthcare in Nigeria and other emerging markets like it.



The original aim of Intron Health, launched in 2020, was to digitize hospital operations in Africa through an Electronic Medical Record (EMR) System. But take-up was challenging: it turned out physicians preferred writing to typing, said Olatunji.

That led him to tackle the more basic problem: making physicians’ data entry and note-writing work better. At first the company looked at third-party solutions for automating tasks such as note-taking, embedding existing speech-to-text technologies into its EMR program.

There were a lot of issues, however, because of constant mis-transcription. It became clear to Olatunji that thick African accents and the pronunciation of complicated medical terms and names made the adoption of existing foreign transcription tools impractical.

This marked the genesis of Intron Health’s speech recognition technology, which can recognize African accents, and can also be integrated in existing EMRs. The tool has to date been adopted in 30 hospitals across five markets, including Kenya and Nigeria.

There have been some immediate positive outcomes. In one case, Olatunji said, Intron Health has helped reduce the waiting time for radiology results at one of West Africa’s largest hospitals from 48 hours to 20 minutes. Such efficiencies are critical in healthcare provision, especially in Africa, where the doctor to patient ratio remains one of the lowest in the world.


“Hospitals have already spent so much on equipment and technology…Ensuring that they apply these tech is important. We’re able to provide value to help them improve the adoption of the EMR system,” he said.

Looking ahead, the startup is exploring new growth frontiers backed by a $1.6 million pre-seed round, led by Microtraction, with participation from Plug and Play Ventures, Jaza Rift Ventures, Octopus Ventures, Africa Health Ventures, OpenseedVC, Pi Campus, Alumni Angel, Baker Bridge Capital and several angel investors.



On the technology front, Intron Health is working to perfect noise cancellation and to ensure the platform performs well even on low-bandwidth connections. This is in addition to enabling the transcription of multi-speaker conversations and integrating text-to-speech capabilities.

The plan, Olatunji says, is to add intelligence systems or decision-support tools for tasks such as prescriptions and lab orders. These tools, he adds, can help reduce doctor errors and ensure adequate patient care, in addition to speeding up doctors’ work.

Intron Health is among the growing number of generative AI startups in the medical space, including Microsoft’s DAX Express, which are reducing administrative tasks for clinicians by generating notes within seconds. The emergence and adoption of these technologies come as the global speech and voice recognition market is projected to be valued at $84.97 billion by 2032, following a CAGR of 23.7% from 2024, according to Fortune Business Insights.



Beyond building voice technologies, Intron is also playing a pivotal role in speech research in Africa, having recently partnered with Google Research, the Bill & Melinda Gates Foundation, and Digital Square at PATH to evaluate popular Large Language Models (LLMs) such as OpenAI’s GPT-4o, Google’s Gemini, and Anthropic’s Claude across 15 countries, to identify strengths, weaknesses, and risks of bias or harm in LLMs. This is all in a bid to ensure that culturally attuned models are available for African clinics and hospitals.
 



Oversight Board wants Meta to refine its policies around AI-generated explicit images​


Ivan Mehta

3:00 AM PDT • July 25, 2024



Meta housing
Image Credits: Bryce Durbin / TechCrunch

Following investigations into how Meta handles AI-generated explicit images, the company’s semi-independent observer body, the Oversight Board, is now urging the company to refine its policies around such images. The Board wants Meta to change the terminology it uses from “derogatory” to “non-consensual,” and move its policies on such images to the “Sexual Exploitation Community Standards” section from the “Bullying and Harassment” section.

Right now, Meta’s policies around explicit AI-generated images branch out from a “derogatory sexualized photoshop” rule in its Bullying and Harassment section. The Board also urged Meta to replace the word “photoshop” with a generalized term for manipulated media.



Additionally, Meta prohibits non-consensual imagery if it is “non-commercial or produced in a private setting.” The Board suggested that this clause shouldn’t be a requirement for removing or banning images that are AI-generated or manipulated without consent.

These recommendations come in the wake of two high-profile cases where explicit, AI-generated images of public figures posted on Instagram and Facebook landed Meta in hot water.

One of these cases involved an AI-generated nude image of an Indian public figure that was posted on Instagram. Several users reported the image but Meta did not take it down, and in fact closed the ticket within 48 hours with no further review. Users appealed that decision but the ticket was closed again. The company only acted after the Oversight Board took up the case, removed the content, and banned the account.

The other AI-generated image resembled a public figure from the U.S. and was posted on Facebook. Meta already had the image in its Media Matching Service (MMS) repository (a bank of images that violate its terms of service that can be used to detect similar images) due to media reports, and it quickly removed the picture when another user uploaded it on Facebook.

Notably, Meta only added the image of the Indian public figure to the MMS bank after the Oversight Board nudged it to. The company apparently told the Board the repository didn’t have the image before then because there were no media reports around the issue.


“This is worrying because many victims of deepfake intimate images are not in the public eye and are either forced to accept the spread of their non-consensual depictions or report every instance,” the Board said in its note.



Breakthrough Trust, an Indian organization that campaigns to reduce online gender-based violence, noted that these issues and Meta’s policies have cultural implications. In comments submitted to the Oversight Board, Breakthrough said non-consensual imagery is often trivialized as an identity theft issue rather than gender-based violence.

“Victims often face secondary victimization while reporting such cases in police stations/courts (“why did you put your picture out etc.” even when it’s not their pictures such as deepfakes). Once on the internet, the picture goes beyond the source platform very fast, and merely taking it down on the source platform is not enough because it quickly spreads to other platforms,” Barsha Charkorborty, the head of media at the organization, wrote to the Oversight Board.

Over a call, Charkorborty told TechCrunch that users often don’t know that their reports have been automatically marked as “resolved” in 48 hours, and Meta shouldn’t apply the same timeline for all cases. Plus, she suggested that the company should also work on building more user awareness around such issues.

Devika Malik, a platform policy expert who previously worked in Meta’s South Asia policy team, told TechCrunch earlier this year that platforms largely rely on user reporting for taking down non-consensual imagery, which might not be a reliable approach when tackling AI-generated media.

“This places an unfair onus on the affected user to prove their identity and the lack of consent (as is the case with Meta’s policy). This can get more error-prone when it comes to synthetic media, and to say, the time taken to capture and verify these external signals enables the content to gain harmful traction,” Malik said.



Aparajita Bharti, Founding Partner of Delhi-based think tank The Quantum Hub (TQH), said that Meta should allow users to provide more context when reporting content, as they might not be aware of the different categories of rule violations under Meta’s policy.

“We hope that Meta goes over and above the final ruling [of the Oversight Board] to enable flexible and user-focused channels to report content of this nature,” she said.

“We acknowledge that users cannot be expected to have a perfect understanding of the nuanced difference between different heads of reporting, and advocated for systems that prevent real issues from falling through the cracks on account of technicalities of Meta content moderation policies.”
 



‘Model collapse’: Scientists warn against letting AI eat its own tail​


Devin Coldewey

8:01 AM PDT • July 24, 2024




Ouroboros
Image Credits: mariaflaya / Getty Images

When you see the mythical Ouroboros, it’s perfectly logical to think, “Well, that won’t last.” A potent symbol — swallowing your own tail — but difficult in practice. It may be the case for AI as well, which, according to a new study, may be at risk of “model collapse” after a few rounds of being trained on data it generated itself.

In a paper published in Nature, British and Canadian researchers led by Ilia Shumailov at Oxford show that today’s machine learning models are fundamentally vulnerable to a syndrome they call “model collapse.” As they write in the paper’s introduction:



We discover that indiscriminately learning from data produced by other models causes “model collapse” — a degenerative process whereby, over time, models forget the true underlying data distribution …

How does this happen, and why? The process is actually quite easy to understand.

AI models are pattern-matching systems at heart: They learn patterns in their training data, then match prompts to those patterns, filling in the most likely next dots on the line. Whether you ask, “What’s a good snickerdoodle recipe?” or “List the U.S. presidents in order of age at inauguration,” the model is basically just returning the most likely continuation of that series of words. (It’s different for image generators, but similar in many ways.)

But the thing is, models gravitate toward the most common output. It won’t give you a controversial snickerdoodle recipe but the most popular, ordinary one. And if you ask an image generator to make a picture of a dog, it won’t give you a rare breed it only saw two pictures of in its training data; you’ll probably get a golden retriever or a Lab.

Now, combine these two things with the fact that the web is being overrun by AI-generated content and that new AI models are likely to be ingesting and training on that content. That means they’re going to see a lot of goldens!



And once they’ve trained on this proliferation of goldens (or middle-of-the-road blogspam, or fake faces, or generated songs), that is their new ground truth. They will think that 90% of dogs really are goldens, and therefore when asked to generate a dog, they will raise the proportion of goldens even higher — until they basically have lost track of what dogs are at all.
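
To make that feedback loop concrete, here is a minimal, illustrative simulation (not from the Nature paper) of what happens when each generation of a "model" is fit only to samples drawn from the previous one. The breed names and probabilities are invented; the point is that finite sampling loses rare categories, and once they are gone they never come back.

```python
import random
from collections import Counter

# Toy illustration of distribution collapse: each "generation" of a model is
# fit to samples drawn from the previous generation's output. Rare categories
# get lost to sampling noise and never return, so probability mass concentrates
# on the most common outputs over time.

random.seed(0)

true_dist = {"golden": 0.35, "lab": 0.30, "poodle": 0.20,
             "husky": 0.10, "basenji": 0.04, "otterhound": 0.01}

def sample(dist, n):
    """Draw n breeds from a categorical distribution given as {breed: prob}."""
    breeds, weights = zip(*dist.items())
    return random.choices(breeds, weights=weights, k=n)

def refit(samples):
    """'Train' the next model: estimate breed frequencies from the samples."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {breed: count / total for breed, count in counts.items()}

dist = true_dist
for generation in range(10):
    data = sample(dist, 500)   # the next model only ever sees generated data
    dist = refit(data)
    survivors = {b: round(p, 3) for b, p in sorted(dist.items(), key=lambda kv: -kv[1])}
    print(generation, survivors)
# Typically the rarest breeds vanish within a few generations, and the surviving
# probability mass concentrates further on the most common ones as the loop continues.
```

This toy loop captures only the sampling side of the effect; the study's experiments layer actual model training, and its own errors, on top of it.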



This wonderful illustration from Nature’s accompanying commentary article shows the process visually:

Image Credits: Nature

A similar thing happens with language models and others that, essentially, favor the most common data in their training set for answers — which, to be clear, is usually the right thing to do. It’s not really a problem until it meets up with the ocean of chum that is the public web right now.

Basically, if the models continue eating each other’s data, perhaps without even knowing it, they’ll progressively get weirder and dumber until they collapse. The researchers provide numerous examples and mitigation methods, but they go so far as to call model collapse “inevitable,” at least in theory.

Though it may not play out exactly as their experiments suggest, the possibility should scare anyone in the AI space. Diversity and depth of training data are increasingly considered the single most important factor in the quality of a model. If you run out of data, but generating more risks model collapse, does that fundamentally limit today’s AI? If it does begin to happen, how will we know? And is there anything we can do to forestall or mitigate the problem?

The answer to the last question at least is probably yes, although that should not alleviate our concerns.



Qualitative and quantitative benchmarks of data sourcing and variety would help, but we’re far from standardizing those. Watermarks of AI-generated data would help other AIs avoid it, but so far no one has found a suitable way to mark imagery that way (well … I did).

In fact, companies may be disincentivized from sharing this kind of information, and instead hoard all the hyper-valuable original and human-generated data they can, retaining what Shumailov et al. call their “first mover advantage.”

[Model collapse] must be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of LLM-generated content in data crawled from the Internet.

It may become increasingly difficult to train newer versions of LLMs without access to data that were crawled from the Internet before the mass adoption of the technology or direct access to data generated by humans at scale.



Add it to the pile of potentially catastrophic challenges for AI models — and arguments against today’s methods producing tomorrow’s superintelligence.
 



After AgentGPT’s success, Reworkd pivots to web-scraping AI agents​


Maxwell Zeff

8:00 AM PDT • July 24, 2024



Image Credits: Reworkd

Reworkd’s founders went viral on GitHub last year with AgentGPT, a free tool to build AI agents that acquired more than 100,000 daily users in a week. This earned them a spot in Y Combinator’s summer 2023 cohort, but the co-founders quickly realized building general AI agents was too broad. So now Reworkd is a web-scraping company, specifically building AI agents to extract structured data from the public web.

AgentGPT provided a simple interface in a browser where users could create autonomous AI agents. Soon, everyone was raving about how agents were the future of computing.



When the tool took off, Asim Shrestha, Adam Watkins, and Srijan Subedi were still living in Canada and Reworkd didn’t exist. The massive user influx caught them off guard; Subedi, now Reworkd’s COO, said the tool was costing them $2,000 a day in API calls. For that reason, they had to create Reworkd and get funded fast. One of the most popular use cases for AgentGPT was creating web scrapers, a relatively simple but high-volume task, so Reworkd made this its singular focus.

Web scrapers have become invaluable in the AI era. The number one reason organizations use public web data in 2024 is to build AI models, according to Bright Data’s latest report. The problem is that web scrapers are traditionally built by humans and must be customized for specific web pages, making them expensive. But Reworkd’s AI agents can scrape more of the web with fewer humans in the loop.

Customers can give Reworkd a list of hundreds, or even thousands, of websites to scrape and then specify the types of data they’re interested in. Then Reworkd’s AI agents use multimodal code generation to turn this into structured data. Agents generate unique code to scrape each website and extract that data for customers to use as they please.

For example, say you want stats on every NFL player, but every team’s website has a different layout. Instead of building a scraper for each website, Reworkd’s agents do that for you given just links and a description of the data you want to extract. With 32 teams, that could save you hours — but if there were 1,000 teams, it could save you weeks.
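
Reworkd has not published its pipeline, but the general pattern it describes, in which a model writes a site-specific scraper and that generated code is then run to produce structured rows, can be sketched roughly as follows. The generate_scraper_code function is a hypothetical stand-in for the code-generating model, and the table selector inside it is an assumption about one site's layout; only requests and BeautifulSoup are real dependencies.

```python
import requests
from bs4 import BeautifulSoup

def generate_scraper_code(html: str, fields: list[str]) -> str:
    # Hypothetical stand-in for the code-generating model: a real system would
    # send the page's HTML plus the desired schema to an LLM and get back
    # site-specific extraction code. One extractor is hard-coded here to show the shape.
    return '''
def extract(soup):
    rows = []
    for tr in soup.select("table.roster tr")[1:]:      # assumed table layout
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 2:
            rows.append({"name": cells[0], "position": cells[1]})
    return rows
'''

def scrape(url: str, fields: list[str]) -> list[dict]:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    namespace = {}
    exec(generate_scraper_code(html, fields), namespace)   # run the generated code
    return namespace["extract"](soup)

# Illustrative usage: one call per team site, same schema for all of them.
# players = scrape("https://example.com/team/roster", ["name", "position"])
```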

Reworkd raised a fresh $2.75 million in seed funding from Paul Graham, AI Grant (Nat Friedman and Daniel Gross’ startup accelerator), SV Angel, General Catalyst and Panache Ventures, among others, the startup exclusively told TechCrunch. Combined with a $1.25 million pre-seed investment last year from Panache Ventures and Y Combinator, this brings Reworkd’s total funding raised to date to $4 million.




AI that can use the internet​


Shortly after forming Reworkd and moving to San Francisco, the team hired Rohan Pandey as a founding research engineer. He currently lives in AGI House SF, one of the Bay Area’s most popular hacker houses for the AI era. One investor described Pandey as a “one person research lab within Reworkd.”


“We see ourselves as the culmination of this 30-year dream of the Semantic Web,” said Pandey in an interview with TechCrunch, referring to a vision of world wide web inventor Tim Berners-Lee in which computers can read the entire internet. “Even though some websites don’t have markup, LLMs can understand the websites in the same ways that humans can, in such that we can expose basically any website as an API. So in some sense, Reworkd is like the universal API layer for the internet.”

Reworkd says it’s able to capture the long tail end of customer data needs, meaning its AI agents are specifically good for scraping thousands of smaller public websites that large competitors often skip over. Others, such as Bright Data, have scrapers for large websites like LinkedIn or Amazon already built out, but it may not be worth the trouble for a human to build a scraper for every small website. Reworkd addresses this concern, but potentially raises others.


What exactly is “public” web data?​


Though web scrapers have existed for decades, they have attracted controversy in the AI era. Unfettered scraping of huge swathes of data has thrown OpenAI and Perplexity into legal trouble: News and media organizations allege the AI companies extracted intellectual property from behind a paywall, reproducing it widely without payment. Reworkd is taking precautions to avoid these issues.

“We look at it as uplifting the accessibility of publicly available information,” said Shrestha, co-founder and CEO of Reworkd, in an interview with TechCrunch. “We’re only allowing information that’s publicly available; we’re not going through sign-in walls or anything like that.”

To go a step further, Reworkd says it’s avoiding scraping news altogether, and being selective about who they work with. Watkins, the company’s CTO, says there are better tools for aggregating news content elsewhere, and it is not their focus.



As an example of what is in scope, Reworkd described its work with Axis, a company that helps policy teams comply with government regulations. Axis uses Reworkd’s AI to extract data from thousands of government regulation documents for many countries across the European Union. Axis then trains and fine-tunes an AI model based on this data and offers it to clients as a product.

Starting a web-scraping company these days could be considered wading into dangerous territory, according to Aaron Fiske, partner at Silicon Valley-based law firm Gunderson Dettmer. The landscape is somewhat fluid right now, and the jury is still out on how “public” web data really is for AI models. However, Fiske says Reworkd’s approach, in which customers decide which websites to scrape, may insulate it from legal liability.

“It’s like they invented the copying machine, and there’s this one use case for making copies that turned out to be hugely economically valuable, but also legally, really questionable,” said Fiske in an interview with TechCrunch. “It’s not like web scrapers servicing AI companies is necessarily risky, but working with AI companies that are really interested in harvesting copyrighted content is maybe an issue.”

That’s why Reworkd is being careful about who it works with. Web scrapers have so far escaped much of the blame in potential copyright infringement cases related to AI. In the OpenAI case, Fiske points out that The New York Times did not sue the web scraper that collected its articles, but rather the company that allegedly reproduced its work. And even there, it has yet to be decided whether what OpenAI did was truly copyright infringement.

There’s more evidence that web scrapers are legally in the clear during the AI boom. A court recently ruled in favor of Bright Data after it scraped Facebook and Instagram profiles via the web. One example in the court case was a dataset of 615 million records of Instagram user data, which Bright Data sells for $860,000. Meta sued the company, alleging this violated its terms of service. But a court ruled that this data is public and therefore available to scrape.


Investors think Reworkd scales with the big guys​


Reworkd has attracted big names as early investors, from Y Combinator and Paul Graham to Daniel Gross and Nat Friedman. Some investors say this is because Reworkd’s technology stands to improve, and get cheaper, alongside new models. The startup says OpenAI’s GPT-4o is currently the best for its multimodal code generation and that a lot of Reworkd’s technology wasn’t possible until just a few months ago.


“If you try to compete with the rate of technology progress — not building on top of it — then I think that you’ll have a hard time as a founder,” General Catalyst’s Viet Le told TechCrunch. “Reworkd has the mindset of basing its solution on the rate of progress.”



Reworkd is creating AI agents that address a particular gap in the market; companies need more data because AI is advancing quickly. As more companies build custom AI models specific to their business, Reworkd stands to gain more customers. Fine-tuning models necessitates quality, structured data, and lots of it.

Reworkd says its approach is “self-healing,” meaning that its web scrapers won’t break down due to a web page update. The startup claims to avoid hallucination issues traditionally associated with AI models because Reworkd’s agents are generating code to scrape a website. It’s possible the AI could make a mistake and grab the wrong data from a website, but Reworkd’s team created Banana-lyzer, an open source evaluation framework, to regularly assess its accuracy.
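
Banana-lyzer's internals aside, the basic shape of such an evaluation is simple: compare what a generated scraper extracted against hand-labeled expected records for each site. The sketch below is illustrative only and is not Banana-lyzer's actual API.

```python
def evaluate_site(extracted: list[dict], expected: list[dict]) -> float:
    """Fraction of expected records that the generated scraper reproduced exactly."""
    extracted_set = {tuple(sorted(record.items())) for record in extracted}
    hits = sum(tuple(sorted(record.items())) in extracted_set for record in expected)
    return hits / max(len(expected), 1)

# Two hand-labeled fixtures: the second site's scraper grabbed the wrong field.
fixtures = {
    "team_a": ([{"name": "A. Smith", "position": "QB"}],
               [{"name": "A. Smith", "position": "QB"}]),
    "team_b": ([{"name": "B. Jones", "position": "TE"}],
               [{"name": "B. Jones", "position": "WR"}]),
}
for site, (extracted, expected) in fixtures.items():
    print(site, evaluate_site(extracted, expected))   # team_a -> 1.0, team_b -> 0.0
```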

Reworkd doesn’t have a large payroll — the team is just four people — but it does have to take on considerable inference costs for running its AI agents. The startup expects its pricing to get increasingly competitive as these costs trend downward. OpenAI just released GPT-4o mini, a smaller version of its industry-leading model with competitive benchmarks. Innovations like these could make Reworkd more competitive.

Paul Graham and AI Grant did not respond to TechCrunch’s request for comment.
 




Bing previews its answer to Google’s AI Overviews​


Kyle Wiggers

12:06 PM PDT • July 24, 2024

Comment

Bing logo
Image Credits: Microsoft/OpenAI

Microsoft this afternoon previewed its answer to Google’s AI-powered search experiences: Bing generative search.

Available for only a “small percentage” of users at the moment, Bing generative search, underpinned by a combo of large and small generative AI models (mum’s the word on which models exactly), aggregates info from around the web and generates a summary in response to search queries.



For example, if a user searches “What is a spaghetti western?” Bing generative search will show information about the film subgenre’s history and origin and top examples, along with links and sources that show where those details came from. As with Google’s similar AI Overviews feature, there’s an option to dismiss AI-generated summaries for traditional search from the same results page.

“This is another important step in evolving the search experience on Bing and we’re eager to get feedback throughout this journey,” Microsoft writes in a post on its official blog. “We are slowly rolling this out and will take our time, garner feedback, test and learn and work to create a great experience before making this more broadly available … We look forward to sharing more updates in the coming months.”

Bing generative search
Image Credits: Bing

Microsoft insists that Bing generative search, which evolves the AI-generated chat answers it launched on Bing in February 2023, “fulfill[s] the intent of the user’s query more effectively.” But much has been written about AI-generated search results gone wrong.

Google’s AI Overviews infamously suggested putting glue on a pizza. Arc Search told one reporter that cut-off toes will eventually grow back. Genspark recommends a few weapons that might be used to kill someone. And Perplexity ripped off news articles written by other outlets, including CNBC, Bloomberg, and Forbes, without giving credit or attribution.

Bing generative search
Image Credits: Bing

AI-generated overviews threaten to cannibalize traffic to the sites from which they source their info. Indeed, they already are, with one study finding that AI Overviews could negatively affect about 25% of publisher traffic due to the de-emphasis on article links.

For its part, Microsoft insists that it’s “maintaining the number of clicks to websites” and “look[ing] closely at how generative search impacts traffic to publishers.” The company volunteers no stats to back this up, however, alluding only to “early data” that it’s choosing to keep private for the time being.



That doesn’t instill a ton of confidence.
 



Researchers are training home robots in simulations based on iPhone scans​


Brian Heater

1:00 PM PDT • July 24, 2024


Image Credits: Screenshot / YouTube

There’s a long list of reasons why you don’t see a lot of non-vacuum robots in the home. At the top of the list is the problem of unstructured and semi-structured environments. No two homes are the same, from layout to lighting to surfaces to humans and pets. Even if a robot can effectively map each home, the spaces are always in flux.

Researchers at MIT CSAIL this week are showcasing a new method for training home robots in simulation. Using an iPhone, someone can scan a part of their home, which can then be uploaded into a simulation.

Simulation has become a bedrock element of robot training in recent decades. It allows robots to try and fail at tasks thousands — or even millions — of times in the same amount of time it would take to do it once in the real world.

The consequences of failing in simulation are also significantly lower than in real life. Imagine for a moment that teaching a robot to put a mug in a dishwasher required it to break 100 real-life mugs in the process.



“Training in the virtual world in simulation is very powerful, because the robot can practice millions and millions of times,” researcher Pulkit Agrawal says in a video tied to the research. “It might have broken a thousand dishes, but it doesn’t matter, because everything was in the virtual world.”

Much like the robots themselves, however, simulation can only go so far when it comes to dynamic environments like the home. Making simulations as accessible as an iPhone scan can dramatically enhance the robot’s adaptability to different environments.

In fact, creating a robust enough database of environments such as these ultimately makes the system more adaptable when something is inevitably out of place, be it moving a piece of furniture or leaving a dish on the kitchen counter.
 


TTT models might be the next frontier in generative AI​


Kyle Wiggers

1:47 PM PDT • July 17, 2024


Many interacting with chat interface.
Image Credits: Natee127 / Getty Images

After years of dominance by the form of AI known as the transformer, the hunt is on for new architectures.

Transformers underpin OpenAI’s video-generating model Sora, and they’re at the heart of text-generating models like Anthropic’s Claude, Google’s Gemini and GPT-4o. But they’re beginning to run up against technical roadblocks — in particular, computation-related roadblocks.

Transformers aren’t especially efficient at processing and analyzing vast amounts of data, at least running on off-the-shelf hardware. And that’s leading to steep and perhaps unsustainable increases in power demand as companies build and expand infrastructure to accommodate transformers’ requirements.

A promising architecture proposed this month is test-time training (TTT), which was developed over the course of a year and a half by researchers at Stanford, UC San Diego, UC Berkeley and Meta. The research team claims that TTT models can not only process far more data than transformers, but that they can do so without consuming nearly as much compute power.

The hidden state in transformers​


A fundamental component of transformers is the “hidden state,” which is essentially a long list of data. As a transformer processes something, it adds entries to the hidden state to “remember” what it just processed. For instance, if the model is working its way through a book, the hidden state values will be things like representations of words (or parts of words).

“If you think of a transformer as an intelligent entity, then the lookup table — its hidden state — is the transformer’s brain,” Yu Sun, a post-doc at Stanford and a co-contributor on the TTT research, told TechCrunch. “This specialized brain enables the well-known capabilities of transformers such as in-context learning.”

The hidden state is part of what makes transformers so powerful. But it also hobbles them. To “say” even a single word about a book a transformer just read, the model would have to scan through its entire lookup table — a task as computationally demanding as rereading the whole book.

So Sun and team had the idea of replacing the hidden state with a machine learning model — like nested dolls of AI, if you will, a model within a model.

It’s a bit technical, but the gist is that the TTT model’s internal machine learning model, unlike a transformer’s lookup table, doesn’t grow and grow as it processes additional data. Instead, it encodes the data it processes into representative variables called weights, which is what makes TTT models highly performant. No matter how much data a TTT model processes, the size of its internal model won’t change.
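
Below is a heavily simplified sketch of that idea. It is not the paper's implementation: the corruption-and-reconstruction objective and the learning rate are assumptions for illustration. What it shows is that the "hidden state" is a fixed-size weight matrix trained a little on every token, rather than a lookup table that grows with the sequence.

```python
import numpy as np

# Minimal sketch of a test-time-training (TTT) style layer, loosely modeled on
# the paper's description: the hidden state is a linear model W. For each token
# embedding x, W takes one gradient step on a self-supervised reconstruction
# loss, then the layer's output is W's prediction for x. W stays d x d no matter
# how long the sequence gets, unlike a transformer's ever-growing "lookup table."

def ttt_linear_layer(tokens: np.ndarray, lr: float = 0.1) -> np.ndarray:
    seq_len, d = tokens.shape
    W = np.zeros((d, d))                  # the "hidden state": a small linear model
    outputs = np.empty_like(tokens)
    for t in range(seq_len):
        x = tokens[t]
        x_corrupted = x + 0.1 * np.random.randn(d)   # assumed corruption objective
        pred = W @ x_corrupted
        grad = np.outer(pred - x, x_corrupted)       # grad of 0.5 * ||W x' - x||^2
        W -= lr * grad                               # one training step per token
        outputs[t] = W @ x                           # layer output for this token
    return outputs

# Toy usage: 32 random 16-dimensional "token embeddings."
out = ttt_linear_layer(np.random.randn(32, 16))
print(out.shape)   # (32, 16): same shape as the input, while W stayed 16 x 16
```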

Sun believes that future TTT models could efficiently process billions of pieces of data, from words to images to audio recordings to videos. That’s far beyond the capabilities of today’s models.

“Our system can say X words about a book without the computational complexity of rereading the book X times,” Sun said. “Large video models based on transformers, such as Sora, can only process 10 seconds of video, because they only have a lookup table ‘brain.’ Our eventual goal is to develop a system that can process a long video resembling the visual experience of a human life.”

Skepticism around the TTT models​


So will TTT models eventually supersede transformers? They could. But it’s too early to say for certain.

TTT models aren’t a drop-in replacement for transformers. And the researchers only developed two small models for study, making TTT as a method difficult to compare right now to some of the larger transformer implementations out there.

“I think it’s a perfectly interesting innovation, and if the data backs up the claims that it provides efficiency gains then that’s great news, but I couldn’t tell you if it’s better than existing architectures or not,” said Mike Cook, a senior lecturer in King’s College London’s department of informatics who wasn’t involved with the TTT research. “An old professor of mine used to tell a joke when I was an undergrad: How do you solve any problem in computer science? Add another layer of abstraction. Adding a neural network inside a neural network definitely reminds me of that.”

Regardless, the accelerating pace of research into transformer alternatives points to growing recognition of the need for a breakthrough.

This week, AI startup Mistral released a model, Codestral Mamba, that’s based on another alternative to the transformer called state space models (SSMs). SSMs, like TTT models, appear to be more computationally efficient than transformers and can scale up to larger amounts of data.

AI21 Labs is also exploring SSMs. So is Cartesia, which pioneered some of the first SSMs and Codestral Mamba’s namesakes, Mamba and Mamba-2.

Should these efforts succeed, it could make generative AI even more accessible and widespread than it is now — for better or worse.



Computer Science > Machine Learning​


[Submitted on 5 Jul 2024]

Learning to (Learn at Test Time): RNNs with Expressive Hidden States​


Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin

Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden state. We propose a new class of sequence modeling layers with linear complexity and an expressive hidden state. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Both TTT-Linear and TTT-MLP match or exceed the baselines. Similar to Transformer, they can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. With preliminary systems optimization, TTT-Linear is already faster than Transformer at 8k context and matches Mamba in wall-clock time. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as: arXiv:2407.04620 [cs.LG]
(or arXiv:2407.04620v1 [cs.LG] for this version)
[2407.04620] Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Submission history​

From: Yu Sun

[v1] Fri, 5 Jul 2024 16:23:20 UTC (897 KB)

 


Meta releases its biggest ‘open’ AI model yet​


Kyle Wiggers

8:00 AM PDT • July 23, 2024


people walking past Meta signage
Image Credits: TOBIAS SCHWARZ/AFP / Getty Images

Meta’s latest open source AI model is its biggest yet.

Today, Meta said it is releasing Llama 3.1 405B, a model containing 405 billion parameters. Parameters roughly correspond to a model’s problem-solving skills, and models with more parameters generally perform better than those with fewer parameters.

At 405 billion parameters, Llama 3.1 405B isn’t the absolute largest open source model out there, but it’s the biggest in recent years. Trained using 16,000 Nvidia H100 GPUs, it also benefits from newer training and development techniques that Meta claims make it competitive with leading proprietary models like OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet (with a few caveats).

As with Meta’s previous models, Llama 3.1 405B is available to download or use on cloud platforms like AWS, Azure and Google Cloud. It’s also being used on WhatsApp and Meta.ai, where it’s powering a chatbot experience for U.S.-based users.

New and improved​


Like other open and closed source generative AI models, Llama 3.1 405B can perform a range of different tasks, from coding and answering basic math questions to summarizing documents in eight languages (English, German, French, Italian, Portuguese, Hindi, Spanish and Thai). It’s text-only, meaning that it can’t, for example, answer questions about an image, but most text-based workloads — think analyzing files like PDFs and spreadsheets — are within its purview.

Meta wants to make it known that it is experimenting with multimodality. In a paper published today, researchers at the company write that they’re actively developing Llama models that can recognize images and videos, and understand (and generate) speech. Still, these models aren’t yet ready for public release.

To train Llama 3.1 405B, Meta used a dataset of 15 trillion tokens dating up to 2024 (tokens are parts of words that models can more easily internalize than whole words, and 15 trillion tokens translates to roughly 11 trillion words). It’s not a new training set per se, since Meta used the base set to train earlier Llama models, but the company claims it refined its curation pipelines for data and adopted “more rigorous” quality assurance and data filtering approaches in developing this model.

The company also used synthetic data (data generated by other AI models) to fine-tune Llama 3.1 405B. Most major AI vendors, including OpenAI and Anthropic, are exploring applications of synthetic data to scale up their AI training, but some experts believe that synthetic data should be a last resort due to its potential to exacerbate model bias.

For its part, Meta insists that it “carefully balance[d]” Llama 3.1 405B’s training data, but declined to reveal exactly where the data came from (outside of webpages and public web files). Many generative AI vendors see training data as a competitive advantage and so keep it and any information pertaining to it close to the chest. But training data details are also a potential source of IP-related lawsuits, another disincentive for companies to reveal much.

Meta Llama 3.1
Image Credits: Meta

In the aforementioned paper, Meta researchers wrote that compared to earlier Llama models, Llama 3.1 405B was trained on an increased mix of non-English data (to improve its performance on non-English languages), more “mathematical data” and code (to improve the model’s mathematical reasoning skills), and recent web data (to bolster its knowledge of current events).

Recent reporting by Reuters revealed that Meta at one point used copyrighted e-books for AI training despite its own lawyers’ warnings. The company controversially trains its AI on Instagram and Facebook posts, photos and captions, and makes it difficult for users to opt out. What’s more, Meta, along with OpenAI, is the subject of an ongoing lawsuit brought by authors, including comedian Sarah Silverman, over the companies’ alleged unauthorized use of copyrighted data for model training.

“The training data, in many ways, is sort of like the secret recipe and the sauce that goes into building these models,” Ragavan Srinivasan, VP of AI program management at Meta, told TechCrunch in an interview. “And so from our perspective, we’ve invested a lot in this. And it is going to be one of these things where we will continue to refine it.”

Bigger context and tools​


Llama 3.1 405B has a larger context window than previous Llama models: 128,000 tokens, or roughly 96,000 words (about 200 pages of text). A model’s context, or context window, refers to the input data (e.g. text) that the model considers before generating output (e.g. additional text).

One of the advantages of models with larger contexts is that they can summarize longer text snippets and files. When powering chatbots, such models are also less likely to forget topics that were recently discussed.

Two other new, smaller models Meta unveiled today, Llama 3.1 8B and Llama 3.1 70B — updated versions of the company’s Llama 3 8B and Llama 3 70B models released in April — also have 128,000-token context windows. The previous models’ contexts topped out at 8,000 tokens, which makes this upgrade fairly substantial — assuming the new Llama models can effectively reason across all that context.

Meta Llama 3.1
Image Credits: Meta

All of the Llama 3.1 models can use third-party tools, apps and APIs to complete tasks, like rival models from Anthropic and OpenAI. Out of the box, they’re trained to tap Brave Search to answer questions about recent events, the Wolfram Alpha API for math- and science-related queries, and a Python interpreter for validating code. In addition, Meta claims the Llama 3.1 models can use certain tools they haven’t seen before — to an extent.
 


Building an ecosystem​


If benchmarks are to be believed (not that benchmarks are the end-all be-all in generative AI), Llama 3.1 405B is a very capable model indeed. That’d be a good thing, considering some of the painfully obvious limitations of previous-generation Llama models.

Llama 3.1 405B performs on par with OpenAI’s GPT-4, and achieves “mixed results” compared to GPT-4o and Claude 3.5 Sonnet, per human evaluators that Meta hired, the paper notes. While Llama 3.1 405B is better at executing code and generating plots than GPT-4o, its multilingual capabilities are overall weaker, and it trails Claude 3.5 Sonnet in programming and general reasoning.

And because of its size, it needs beefy hardware to run. Meta recommends at least a server node.

That’s perhaps why Meta’s pushing its smaller new models, Llama 3.1 8B and Llama 3.1 70B, for general-purpose applications like powering chatbots and generating code. Llama 3.1 405B, the company says, is better reserved for model distillation — the process of transferring knowledge from a large model to a smaller, more efficient model — and generating synthetic data to train (or fine-tune) alternative models.
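
Meta has not prescribed tooling for that synthetic-data workflow, but its general shape, using the 405B model to generate training pairs that a smaller model is later fine-tuned on, looks roughly like the sketch below. The endpoint URL, API key and model identifier are placeholders, and the sketch assumes an OpenAI-compatible chat API of the kind many Llama hosting providers expose.

```python
import json
import requests

# Illustrative sketch of the synthetic-data use case Meta describes for
# Llama 3.1 405B: the large model generates training pairs, which are written
# out in a simple fine-tuning format for a smaller model (e.g. Llama 3.1 8B).
# The URL, API key and model identifier below are placeholders, not real values.
API_URL = "https://your-llama-host.example/v1/chat/completions"
HEADERS = {"Authorization": "Bearer YOUR_KEY", "Content-Type": "application/json"}

def generate_example(topic: str) -> dict:
    payload = {
        "model": "llama-3.1-405b-instruct",   # assumed identifier; varies by host
        "messages": [
            {"role": "system", "content": "Write one question and a concise answer."},
            {"role": "user", "content": f"Topic: {topic}"},
        ],
        "temperature": 0.9,
    }
    resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=60)
    text = resp.json()["choices"][0]["message"]["content"]   # assumed response shape
    return {"prompt": f"Topic: {topic}", "completion": text}

# Build a small synthetic dataset that a smaller Llama model could be fine-tuned on.
with open("synthetic.jsonl", "w") as f:
    for topic in ["tax law basics", "unit testing", "soil drainage"]:
        f.write(json.dumps(generate_example(topic)) + "\n")
```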

To encourage the synthetic data use case, Meta said it has updated Llama’s license to let developers use outputs from the Llama 3.1 model family to develop third-party generative AI models (whether that’s a wise idea is up for debate). Importantly, the license still constrains how developers can deploy Llama models: App developers with more than 700 million monthly users must request a special license from Meta, which the company will grant at its discretion.

Meta Llama 3.1
Image Credits: Meta

That change in licensing around outputs, which allays a major criticism of Meta’s models within the AI community, is a part of the company’s aggressive push for mindshare in generative AI.

Alongside the Llama 3.1 family, Meta is releasing what it’s calling a “reference system” and new safety tools — several of these block prompts that might cause Llama models to behave in unpredictable or undesirable ways — to encourage developers to use Llama in more places. The company is also previewing and seeking comment on the Llama Stack, a forthcoming API for tools that can be used to fine-tune Llama models, generate synthetic data with Llama and build “agentic” applications — apps powered by Llama that can take action on a user’s behalf.

“[What] We have heard repeatedly from developers is an interest in learning how to actually deploy [Llama models] in production,” Srinivasan said. “So we’re trying to start giving them a bunch of different tools and options.”

Play for market share​


In an open letter published this morning, Meta CEO Mark Zuckerberg lays out a vision for the future in which AI tools and models reach the hands of more developers around the world, ensuring people have access to the “benefits and opportunities” of AI.

It’s couched very philanthropically, but implicit in the letter is Zuckerberg’s desire that these tools and models be of Meta’s making.

Meta’s racing to catch up to companies like OpenAI and Anthropic, and it is employing a tried-and-true strategy: give tools away for free to foster an ecosystem and then slowly add products and services, some paid, on top. Spending billions of dollars on models that it can then commoditize also has the effect of driving down Meta competitors’ prices and spreading the company’s version of AI broadly. It also lets the company incorporate improvements from the open source community into its future models.

Llama certainly has developers’ attention. Meta claims Llama models have been downloaded over 300 million times, and more than 20,000 Llama-derived models have been created so far.

Make no mistake, Meta’s playing for keeps. It is spending millions on lobbying regulators to come around to its preferred flavor of “open” generative AI. None of the Llama 3.1 models solve the intractable problems with today’s generative AI tech, like its tendency to make things up and regurgitate problematic training data. But they do advance one of Meta’s key goals: becoming synonymous with generative AI.

There are costs to this. In the research paper, the co-authors — echoing Zuckerberg’s recent comments — discuss energy-related reliability issues with training Meta’s ever-growing generative AI models.

“During training, tens of thousands of GPUs may increase or decrease power consumption at the same time, for example, due to all GPUs waiting for checkpointing or collective communications to finish, or the startup or shutdown of the entire training job,” they write. “When this happens, it can result in instant fluctuations of power consumption across the data center on the order of tens of megawatts, stretching the limits of the power grid. This is an ongoing challenge for us as we scale training for future, even larger Llama models.”

One hopes that training those larger models won’t force more utilities to keep old coal-burning power plants around.
 



With Google in its sights, OpenAI unveils SearchGPT​


Kyle Wiggers

11:52 AM PDT • July 25, 2024


OpenAI logo with spiraling pastel colors (Image Credits: Bryce Durbin / TechCrunch)
Image Credits: Bryce Durbin / TechCrunch

OpenAI may have designs to get into the search game — challenging not only upstarts like Perplexity, but Google and Bing, too.

The company on Thursday unveiled SearchGPT, a search feature designed to give “timely answers” to questions, drawing from web sources.

UI-wise, SearchGPT isn’t too far off from OpenAI’s chatbot platform ChatGPT. You type in a query, and SearchGPT serves up information and photos from the web along with links to relevant sources, at which point you can ask follow-up questions or explore additional, related searches in a sidebar.

Some searches take into account your location. In a support doc, OpenAI writes that SearchGPT “collects and shares” general location information with third-party search providers to improve the accuracy of results (e.g. showing a list of restaurants nearby or the weather forecast). SearchGPT also allows users to share more precise location information via a toggle in the settings menu.

Powered by OpenAI models (specifically GPT-3.5, GPT-4 and GPT-4o), SearchGPT — which OpenAI describes as a prototype — is launching today for “a small group” of users and publishers. (There’s a waitlist here.) OpenAI says that it plans to integrate some features of SearchGPT into ChatGPT in the future.

“Getting answers on the web can take a lot of effort, often requiring multiple attempts to get relevant results,” OpenAI writes in a blog post. “We believe that by enhancing the conversational capabilities of our models with real-time information from the web, finding what you’re looking for can be faster and easier.”

SearchGPT
Image Credits: SearchGPT

It’s long been rumored that OpenAI’s taken an interest in launching a Google killer of sorts. The Information reported in February that a product — or at least a pilot — was in the works. But the launch of SearchGPT comes at an inopportune moment: as AI-powered search tools are under justifiable fire for plagiarism, inaccuracies and content cannibalism.

Google’s take on AI-powered search, AI Overviews, infamously suggested putting glue on a pizza. The Browser Company’s Arc Search told one reporter that cut-off toes will grow back. AI search engine Genspark at one time readily recommended weapons that could be used to kill someone. And Perplexity ripped off news articles written by other outlets, including CNBC, Bloomberg, and Forbes, without giving credit or attribution.

AI-generated overviews threaten to cannibalize traffic to the sites from which they source their info. Indeed, they already are, with one study finding that AI Overviews could negatively affect about 25% of publisher traffic due to the de-emphasis on article links.

OpenAI — taking a page out of Perplexity’s ongoing apology tour — is positioning SearchGPT as a more responsible, measured deployment.

OpenAI says that SearchGPT “prominently cites and links” to publishers in searches with “clear, in-line, named attribution.” It also says that it’s working with publishers to design the experience and providing a way for website owners to manage how their content appears in search results.

“Importantly, SearchGPT is about search and is separate from training OpenAI’s generative AI foundation models. Sites can be surfaced in search results even if they opt out of generative AI training,” OpenAI clarifies in the blog post. “We are committed to a thriving ecosystem of publishers and creators.”

It’s a bit tough to take at its word a company that once scraped millions of YouTube transcripts without permission to train its models. But we’ll see how the SearchGPT saga unfolds.
 



A new Chinese video-generating model appears to be censoring politically sensitive topics​


Kyle Wiggers

4:07 PM PDT • July 24, 2024


Image Credits: Photo by VCG/VCG via Getty Images

A powerful new video-generating AI model became widely available today — but there’s a catch: The model appears to be censoring topics deemed too politically sensitive by the government in its country of origin, China.

The model, Kling, developed by Beijing-based company Kuaishou, launched in waitlisted access earlier in the year for users with a Chinese phone number. Today, it rolled out for anyone willing to provide their email. After signing up, users can enter prompts to have the model generate five-second videos of what they’ve described.

Kling works pretty much as advertised. Its 720p videos, which take a minute or two to generate, don’t deviate too far from the prompts. And Kling appears to simulate physics, like the rustling of leaves and flowing water, about as well as video-generating models like AI startup Runway’s Gen-3 and OpenAI’s Sora.

But Kling outright won’t generate clips about certain subjects. Prompts like “Democracy in China,” “Chinese President Xi Jinping walking down the street” and “Tiananmen Square protests” yield a nonspecific error message.

Kling AI
Image Credits: Kuaishou

The filtering appears to be happening only at the prompt level. Kling supports animating still images, and it’ll uncomplainingly generate a video from a portrait of Xi Jinping, for example, as long as the accompanying prompt doesn’t mention him by name (e.g., “This man giving a speech”).

We’ve reached out to Kuaishou for comment.

Kling AI
Image Credits: Kuaishou

Kling’s curious behavior is likely the result of intense political pressure from the Chinese government on generative AI projects in the region.

Earlier this month, the Financial Times reported that AI models in China will be tested by China’s leading internet regulator, the Cyberspace Administration of China (CAC), to ensure that their responses on sensitive topics “embody core socialist values.” Models are to be benchmarked by CAC officials for their responses to a variety of queries, per the Financial Times report — many related to Xi Jinping and criticism of the Communist Party.

Reportedly, the CAC has gone so far as to propose a blacklist of sources that can’t be used to train AI models. Companies submitting models for review must prepare tens of thousands of questions designed to test whether the models produce “safe” answers.

The result is AI systems that decline to respond on topics that might raise the ire of Chinese regulators. Last year, the BBC found that Ernie, Chinese company Baidu’s flagship AI chatbot model, demurred and deflected when asked questions that might be perceived as politically controversial, like “Is Xinjiang a good place?” or “Is Tibet a good place?”

The draconian policies threaten to slow China’s AI advances. Not only do they require scouring data to remove politically sensitive info, but they also necessitate investing an enormous amount of dev time in creating ideological guardrails — guardrails that might still fail, as Kling exemplifies.

From a user perspective, China’s AI regulations are already leading to two classes of models: some hamstrung by intensive filtering and others decidedly less so. Is that really a good thing for the broader AI ecosystem?
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,004
Reputation
7,865
Daps
147,240


Google makes its Gemini chatbot faster and more widely available​


Kyle Wiggers

9:00 AM PDT • July 25, 2024

Comment

Image Credits: Lorenzo Di Cola/NurPhoto / Getty Images

In its bid to maintain pace with generative AI rivals like Anthropic and OpenAI, Google is rolling out updates to the no-fee tier of Gemini, its AI-powered chatbot. The updates are focused on making the platform more performant — and more widely available.

Starting Thursday, Gemini 1.5 Flash — a lightweight multimodal model Google announced in May — will be available on the web and mobile in more than 40 languages and in over 230 countries and territories. Google claims that Gemini 1.5 Flash delivers upgrades in quality and latency, with especially noticeable improvements in reasoning and image understanding.

In a boon for Google, it might also be cheaper to run on the back end.

At Gemini 1.5 Flash’s unveiling, Google emphasized that the model was a “distilled” and highly efficient version of Gemini 1.5 Pro, built for what the company described as “narrow,” “high-frequency” generative AI workloads. Given the overhead of serving a chatbot platform such as Gemini (see: OpenAI’s ChatGPT bills), Google’s no doubt eager to jump on cost-reducing opportunities, particularly if those opportunities have the fortunate side effect of boosting performance in other areas.

Beyond the new base model, Google says that it’s expanding Gemini’s context window to 32,000 tokens, which amounts to roughly 24,000 words (or 48 pages of text).

Gemini 1.5 Flash
Image Credits: Google

Context, or context window, refers to the input data (e.g., text) that a model considers before generating output (e.g., additional text). A few of the advantages of models with larger contexts are that they can summarize and reason over longer text snippets and files (at least in theory), and that — in a chatbot context — they’re less likely to forget topics that were recently discussed.
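For a rough sense of where the “24,000 words (or 48 pages)” figure comes from, here is a back-of-the-envelope conversion in Python, assuming the common rules of thumb of about 0.75 words per token and roughly 500 words per page rather than Google’s exact accounting.

```python
# Rough conversion behind "32,000 tokens ≈ 24,000 words ≈ 48 pages".
# The ratios are rules of thumb, not Google's published math.
TOKENS = 32_000
WORDS_PER_TOKEN = 0.75   # typical for English text
WORDS_PER_PAGE = 500     # typical single-spaced page

words = TOKENS * WORDS_PER_TOKEN
pages = words / WORDS_PER_PAGE
print(f"{TOKENS:,} tokens ≈ {words:,.0f} words ≈ {pages:.0f} pages")
# 32,000 tokens ≈ 24,000 words ≈ 48 pages
```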

The ability to upload files to Gemini for analysis previously required Gemini Advanced, the paid edition of Gemini gated behind Google’s $20-per-month Google One AI Premium Plan. But Google says that it’ll soon enable file uploads from Google Drive and local devices for all Gemini users.

“You’ll be able to do things like upload your economics study guide and ask Gemini to create practice questions,” Amar Subramanya, VP of engineering at Google, wrote in a blog post shared with TechCrunch. “Gemini will also soon be able to analyze data files for you, allowing you to uncover insights and visualize them through charts and graphics.”
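The consumer Gemini feature itself isn’t scriptable, but for developers the closest analogue is the Gemini API’s file-handling flow. A hedged sketch using the google-generativeai Python SDK — with a placeholder API key, file name and prompt — might look like this:

```python
# Sketch of the API-side equivalent of "upload a study guide and ask for
# practice questions"; file name, key and prompt are illustrative only.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

study_guide = genai.upload_file(path="econ_study_guide.pdf")
model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content(
    [study_guide, "Create ten practice questions based on this study guide."]
)
print(response.text)
```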

To attempt to combat hallucinations — instances where a generative AI model like Gemini 1.5 Flash makes things up — Google is previewing a feature that displays links to related web content beneath certain Gemini-generated answers. English-language Gemini users in select territories will see a “chip” icon at the end of a Gemini-generated paragraph with a link to websites — or emails, if you’ve given Gemini permission to access your Gmail inbox — where you can dive deeper.

The move comes after revelations that Google’s generative AI models are prone to hallucinating quite badly — for example, suggesting nontoxic glue in a pizza recipe and inventing fake book reviews attributed to real people. Google earlier this year released a “double check” feature in Gemini designed to highlight Gemini-originated statements that other online sources corroborate or contradict. But the related content links appear to be an effort to make more transparent which sources of info Gemini might be drawing from.

The question in this reporter’s mind is how often and accurately Gemini will surface related links. TBD.

Google’s not waiting to flood the channels, though.

After debuting Gemini in Messages for select devices earlier in the year, Google is rolling the feature out in the European Economic Area (EEA), the U.K. and Switzerland, with the ability to chat in newly added languages such as French, Polish and Spanish. Users can pull up Gemini in Messages by tapping the “Start chat” button and selecting Gemini as a chat partner.

Google’s also launching the Gemini mobile app in more countries, and expanding Gemini access to teenagers globally.

The company introduced a teen-focused Gemini experience in June, allowing students to sign up using their school accounts — though not in all countries. In the coming week, that’ll change as Gemini becomes available to teens in every country and region where it is normally available to adults.

Coinciding with the rollout, Google says that it’s putting “additional policies and safeguards” in place to protect teens — without going into detail. A new teen-tailored onboarding process is also in tow, along with an “AI literacy guide” to — as Google phrases it — “help teens use AI responsibly.”

It’s the subject of great debate whether kids are leveraging generative AI tools in the ways they were intended, or abusing them. Google is surely eager to avoid headlines suggesting Gemini is a plagiaristic essay generator or capable of giving teens poorly conceived advice on personal problems, and is thus taking what steps it can to prevent the worst from occurring.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,004
Reputation
7,865
Daps
147,240


Stability AI steps into a new gen AI dimension with Stable Video 4D​


Sean Michael Kerner@TechJournalist

July 24, 2024 8:00 AM

Credit: Image generated by VentureBeat using Stable Diffusion 3



Stability AI is expanding its growing roster of generative AI models, quite literally adding a new dimension with the debut of Stable Video 4D.

While there is a growing set of gen AI tools for video generation, including OpenAI’s Sora, Runway, Haiper and Luma AI among others, Stable Video 4D is something a bit different. It builds on the foundation of Stability AI’s existing Stable Video Diffusion model, which converts images into videos, and takes the concept further by accepting video input and generating multiple novel-view videos from eight different perspectives.

“We see Stable Video 4D being used in movie production, gaming, AR/VR, and other use cases where there is a need to view dynamically moving 3D objects from arbitrary camera angles,” Varun Jampani, team lead, 3D Research at Stability AI told VentureBeat.


Stable Video 4D is different from just 3D for gen AI


This isn’t Stability AI’s first foray beyond the flat world of 2D space.

In March, Stability AI announced Stable Video 3D, enabling users to generate short 3D videos from an image or text prompt. Stable Video 4D goes a significant step further. While the concept of 3D, that is, three dimensions, is commonly understood as images or video with depth, 4D perhaps isn’t as universally understood.

Jampani explained that the four dimensions are width (x), height (y), depth (z) and time (t). That means Stable Video 4D is able to view a moving 3D object from various camera angles as well as at different timestamps.

“The key aspects that enabled Stable Video 4D are that we combined the strengths of our previously-released Stable Video Diffusion and Stable Video 3D models, and fine-tuned it with a carefully curated dynamic 3D object dataset,” Jampani explained.

Jampani noted that Stable Video 4D is a first-of-its-kind network where a single network does both novel view synthesis and video generation. Existing works leverage separate video generation and novel view synthesis networks for this task.

He also explained that Stable Video 4D is different from Stable Video Diffusion and Stable Video 3D, in terms of how the attention mechanisms work.

“We carefully design attention mechanisms in the diffusion network which allow generation of each video frame to attend to its neighbors at different camera views or timestamps, thus resulting in better 3D coherence and temporal smoothness in the output videos,” Jampani said.
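Stability AI hasn’t published the model’s internals beyond that description, but as a toy illustration of the idea, the sketch below arranges per-frame latents on a views-by-timestamps grid and lets each frame attend to its immediate neighbors along both axes. Every shape, value and function name here is invented for the example; it is not Stability AI’s code.

```python
# Toy illustration of frames attending to neighbors across camera views and
# timestamps, the mechanism described above for 3D coherence and temporal
# smoothness. Not Stability AI's implementation.
import torch
import torch.nn.functional as F

V, T, D = 8, 5, 64                # 8 camera views, 5 timestamps, latent dim
frames = torch.randn(V, T, D)     # stand-in for per-frame latents

def neighbor_attention(frames: torch.Tensor) -> torch.Tensor:
    V, T, D = frames.shape
    out = torch.empty_like(frames)
    for v in range(V):
        for t in range(T):
            # Gather the frame itself plus its view and time neighbors.
            idx = [(v, t)]
            if v > 0: idx.append((v - 1, t))
            if v < V - 1: idx.append((v + 1, t))
            if t > 0: idx.append((v, t - 1))
            if t < T - 1: idx.append((v, t + 1))
            keys = torch.stack([frames[i, j] for i, j in idx])  # (N, D)
            query = frames[v, t]                                # (D,)
            attn = F.softmax(keys @ query / D ** 0.5, dim=0)    # (N,)
            out[v, t] = attn @ keys                             # weighted mix
    return out

smoothed = neighbor_attention(frames)
print(smoothed.shape)  # torch.Size([8, 5, 64])
```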


How Stable Video 4D works differently from gen AI infill


With gen AI tools for 2D image generation, the concepts of infill and outfill, which fill in missing parts of an image, are well established. That infill/outfill approach, however, is not how Stable Video 4D works.

Jampani explained that the approach is different from generative infill/outfill, where the networks typically complete the partially given information. That is, the output is already partially filled by the explicit transfer of information from the input image.

“Stable Video 4D completely synthesizes the 8 novel view videos from scratch by using the original input video as guidance,” he said. “There is no explicit transfer of pixel information from input to output, all of this information transfer is done implicitly by the network.”

Stable Video 4D is currently available for research evaluation on Hugging Face. Stability AI has not yet announced what commercial options will be available for it in the future.

“Stable Video 4D can already process single-object videos of several seconds with a plain background,” Jampani said. “We plan to generalize it to longer videos and also to more complex scenes.”
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,004
Reputation
7,865
Daps
147,240


Mistral shocks with new open model Mistral Large 2, taking on Llama 3.1​


Shubham Sharma@mr_bumss

July 24, 2024 11:21 AM


Credit: VentureBeat made with Midjourney V6




The AI race is picking up pace like never before. Following Meta’s move just yesterday to launch its new open source Llama 3.1 as a highly competitive alternative to leading closed-source “frontier” models, French AI startup Mistral has also thrown its hat in the ring.

The startup announced the next generation of its flagship open source model with 123 billion parameters: Mistral Large 2. However, in an important caveat, the model is only licensed as “open” for non-commercial research use: the weights are openly available, allowing third parties to fine-tune it to their liking.

Those seeking to use it for commercial/enterprise-grade applications will need to obtain a separate license and usage agreement from Mistral, as the company states in its blog post and in an X post from research scientist Devendra Singh Chaplot.

Super excited to announce Mistral Large 2
– 123B params – fits on a single H100 node
– Natively Multilingual
– Strong code & reasoning
– SOTA function calling
– Open-weights for non-commercial usage

Blog: Large Enough

Weights: mistralai/Mistral-Large-Instruct-2407 · Hugging Face

1/N pic.twitter.com/k2o7FbmYiE

— Devendra Chaplot (@dchaplot) July 24, 2024

While it has far fewer parameters (the internal model settings that guide its behavior) than Llama 3.1’s 405 billion, it still nears that model’s performance.
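Because the weights are posted on Hugging Face, researchers can load them locally under the non-commercial terms. A minimal sketch using the transformers library is below; note that a 123-billion-parameter model realistically requires a multi-GPU node (Mistral cites a single H100 node), and the prompt is just an example.

```python
# Minimal sketch of loading the open weights with Hugging Face transformers.
# Requires substantial GPU memory; non-commercial license terms apply.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-Large-Instruct-2407"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",     # shard across available GPUs
    torch_dtype="auto",
)

prompt = "Explain function calling in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```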


Available on the company’s main platform and via cloud partners, Mistral Large 2 builds on the original Large model and brings advanced multilingual capabilities with improved performance across reasoning, code generation and mathematics.

It is being hailed as a GPT-4-class model, with performance closely matching GPT-4o, Llama 3.1-405B and Anthropic’s Claude 3.5 Sonnet across several benchmarks.

Mistral notes the offering continues to “push the boundaries of cost efficiency, speed and performance” while giving users new features, including advanced function calling and retrieval, to build high-performing AI applications.

However, it’s important to note that this isn’t a one-off move designed to capitalize on the AI hype stirred up by Meta or OpenAI. Mistral has been moving aggressively in the domain, raising large rounds, launching new task-specific models (including those for coding and mathematics) and partnering with industry giants to expand its reach.


Mistral Large 2: What to expect?​


Back in February, when Mistral launched the original Large model with a context window of 32,000 tokens, it claimed that the offering had “a nuanced understanding of grammar and cultural context” and could reason with and generate text with native fluency across different languages, including English, French, Spanish, German and Italian.

The new version of the model builds on this with a larger 128,000-token context window — matching OpenAI’s GPT-4o and GPT-4o mini and Meta’s Llama 3.1.

It further boasts support for dozens of languages, including the original ones as well as Portuguese, Arabic, Hindi, Russian, Chinese, Japanese and Korean.

Mistral says the model has strong reasoning capabilities and can handle complex tasks like synthetic text generation, code generation and retrieval-augmented generation (RAG).


High performance on third-party benchmarks and improved coding capability​


On the Multilingual MMLU benchmark covering different languages, Mistral Large 2 performed on par with Meta’s all-new Llama 3.1-405B while delivering more significant cost benefits due to its smaller size.

“Mistral Large 2 is designed for single-node inference with long-context applications in mind – its size of 123 billion parameters allows it to run at large throughput on a single node,” the company noted in a blog post.

Mistral Large 2 on Multilingual MMLU

But, that’s not the only benefit.

The original Large model did not do well on coding tasks, a weakness Mistral seems to have remedied by training the latest version on large amounts of code.

The new model can generate code in 80+ programming languages, including Python, Java, C, C++, JavaScript and Bash, with a very high level of accuracy, according to its average score on the MultiPL-E benchmark.

On HumanEval and HumanEval Plus benchmarks for code generation, it outperformed Claude 3.5 Sonnet and Claude 3 Opus, while sitting just behind GPT-4o. Similarly, across Mathematics-focused benchmarks – GSM8K and Math Instruct – it grabbed the second spot.




Focus on instruction-following with minimized hallucinations​


Given the rise of AI adoption by enterprises, Mistral has also focused on minimizing hallucinations in Mistral Large 2 by fine-tuning the model to be more cautious and selective when responding. If it doesn’t have sufficient information to back an answer, the model will simply tell the user so.

Further, the company has improved the model’s instruction-following capabilities, making it better at following user guidelines and handling long multi-turn conversations. It has even been tuned to provide succinct and to-the-point answers wherever possible — which can come in handy in enterprise settings.

Currently, the company is providing access to Mistral Large 2 through its own API platform as well as via cloud platforms such as Google Vertex AI, Amazon Bedrock, Azure AI Studio and IBM WatsonX. Users can also test it via the company’s chatbot to see how it works in the real world.
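As a rough sketch of what calling the hosted API looks like, the snippet below posts a chat request to Mistral’s chat-completions endpoint. The API key is a placeholder, and “mistral-large-2407” is assumed to be the platform’s model identifier for Mistral Large 2.

```python
# Hedged example of a chat request against Mistral's hosted API.
import requests

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
    json={
        "model": "mistral-large-2407",  # assumed identifier for Mistral Large 2
        "messages": [
            {"role": "user", "content": "Summarize the new features in Mistral Large 2."}
        ],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```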
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,004
Reputation
7,865
Daps
147,240


Microsoft unveils serverless fine-tuning for its Phi-3 small language model​


Carl Franzen@carlfranzen

July 25, 2024 3:31 PM



Credit: VentureBeat made with Midjourney V6



Microsoft is a major backer and partner of OpenAI, but that doesn’t mean it wants to let the latter company run away with the generative AI ballgame.

As proof of that, today Microsoft announced a new way to fine-tune its Phi-3 small language model without developers having to manage their own servers, and for free (initially).

Fine-tuning refers to the process of adapting an AI model by further training it to adjust its underlying weights (parameters), making it behave in different and more optimal ways for specific use cases and end users, and even adding new capabilities.


What is Phi-3?​


The company unveiled Phi-3, a small language model with 3.8 billion parameters, back in April as a low-cost, enterprise-grade option for third-party developers to build new applications and software atop.

While significantly smaller than most other leading language models (Meta’s Llama 3.1, for instance, comes in a 405-billion-parameter flavor, parameters being the “settings” that guide the neural network’s processing and responses), Phi-3 performed on the level of OpenAI’s GPT-3.5 model, according to comments provided at that time to VentureBeat by Sébastien Bubeck, vice president of generative AI at Microsoft.

Specifically, Phi-3 was designed to offer affordable performance on coding, common sense reasoning, and general knowledge.

It’s now a whole family of six models with different parameter counts and context lengths — the number of tokens (numerical representations of data) a user can provide in a single input — ranging from 4,000 to 128,000 tokens. Input pricing ranges from $0.0003 to $0.0005 per 1,000 tokens, depending on the model.



However, put into the more typical “per million” token pricing, that comes out to $0.30 per million input tokens and $0.90 per million output tokens to start — exactly double OpenAI’s new GPT-4o mini pricing for input and about 1.5 times as expensive for output.
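The conversion is straightforward to check; a quick script using OpenAI’s published GPT-4o mini rates of $0.15 input and $0.60 output per million tokens reproduces the ratios above.

```python
# Verify the price comparison: Phi-3 entry pricing vs. GPT-4o mini.
phi3 = {"input": 0.30, "output": 0.90}          # USD per 1M tokens
gpt4o_mini = {"input": 0.15, "output": 0.60}    # USD per 1M tokens (OpenAI's rates)

for kind in ("input", "output"):
    ratio = phi3[kind] / gpt4o_mini[kind]
    print(f"{kind}: Phi-3 ${phi3[kind]:.2f} vs GPT-4o mini ${gpt4o_mini[kind]:.2f} "
          f"-> {ratio:.1f}x")
# input: Phi-3 $0.30 vs GPT-4o mini $0.15 -> 2.0x
# output: Phi-3 $0.90 vs GPT-4o mini $0.60 -> 1.5x
```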

Phi-3 was designed to be safe for enterprises to use with guardrails to reduce bias and toxicity. Even back when it was first announced, Microsoft’s Bubeck promoted its capability to be fine-tuned for specific enterprise use cases.

“You can bring in your data and fine-tune this general model, and get amazing performance on narrow verticals,” he told us.

But at that point, there was no serverless option to fine-tune it: if you wanted to do it, you had to set up your own Microsoft Azure server or download the model and run it on your own local machine, which may not have enough space.


Serverless fine-tuning unlocks new options​


Today, however, Microsoft announced the general availability of its “Models-as-a-Service (serverless endpoint)” offering in its Azure AI development platform.

It also announced that “Phi-3-small is now available via a serverless endpoint so developers can quickly and easily get started with AI development without having to manage underlying infrastructure.”

Phi-3-vision, which can handle imagery inputs, “will soon be available via a serverless endpoint” as well, according to Microsoft’s blog post.

But those models are simply available “as is” through Microsoft’s Azure AI development platform. Developers can build apps atop them, but they can’t create their own versions of the models tuned to their own use cases.
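Once a model is deployed as a serverless endpoint, querying it is an ordinary HTTPS call. The sketch below is a hedged example only: the endpoint URL, header and prompt are placeholders standing in for whatever Azure AI Studio issues for a given deployment.

```python
# Hedged sketch of querying a Phi-3 serverless ("Models-as-a-Service")
# deployment; URL and key are placeholders, not real values.
import requests

ENDPOINT = "https://YOUR-DEPLOYMENT.YOUR-REGION.models.ai.azure.com/v1/chat/completions"
API_KEY = "YOUR_AZURE_AI_KEY"

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "messages": [
            {"role": "user", "content": "Draft three tutoring questions on fractions."}
        ],
        "max_tokens": 256,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```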

For developers looking to do that, Microsoft says they should turn to the Phi-3-mini and Phi-3-medium, which can be fine-tuned with third-party “data to build AI experiences that are more relevant to their users, safely, and economically.”

“Given their small compute footprint, cloud and edge compatibility, Phi-3 models are well suited for fine-tuning to improve base model performance across a variety of scenarios including learning a new skill or a task (e.g. tutoring) or enhancing consistency and quality of the response (e.g. tone or style of responses in chat/Q&A),” the company writes.

Specifically, Microsoft states that the educational software company Khan Academy is already using a fine-tuned Phi-3 to benchmark the performance of its Khanmigo for Teachers powered by Microsoft’s Azure OpenAI Service.


A new price and capability war for enterprise AI developers​


The pricing for serverless fine-tuning of Phi-3-mini-4k-instruct starts at $0.004 per 1,000 tokens ($4 per 1 million tokens), while no pricing has been listed yet for the medium model.



While it’s a clear win for developers looking to stay in the Microsoft ecosystem, it’s also a notable competitor to the efforts of Microsoft’s own ally, OpenAI, to capture enterprise AI developers.

And OpenAI just days ago announced free fine-tuning of GPT-4o mini up to 2 million tokens per day through September 23rd, for so-called “Tier 4 and 5” users of its application programming interface (API), or those who spend at least $250 or $1000 on API credits.

Coming also on the heels of Meta’s release of the open source Llama 3.1 family and Mistral’s new Mistral Large 2 model, both of which can also be fine-tuned for different uses, it’s clear the race to offer compelling AI options for enterprise development is in full swing, and AI providers are courting developers with both small and large models.
 