bnew

Veteran
Joined
Nov 1, 2015
Messages
51,753
Reputation
7,916
Daps
148,556


3. The gardener​


Midjourney


Gardening image created with Midjourney (Image credit: Midjourney/Future AI image)

Flux AI image


Gardening image created by Flux (Image credit: Flux AI image/Future)

Generating images of older people can often be a struggle for AI image generators because of their more complex skin texture. Here we want a woman in her 80s caring for plants in a rooftop garden.

The prompt describes elements of the scene, including climbing vines and golden evening light, with the city skyline looming large behind our gardener.
An elderly woman in her early 80s is tenderly caring for plants in her rooftop garden, set against a backdrop of a crowded city. Her silver hair is tied back in a loose bun, with wispy strands escaping to frame her kind, deeply wrinkled face. Her blue eyes twinkle with contentment as she smiles at a ripe tomato cradled gently in her soil-stained gardening gloves. She's wearing a floral print dress in soft pastels, protected by a well-worn, earth-toned apron. Comfortable slip-on shoes and a wide-brimmed straw hat complete her gardening outfit. A pair of reading glasses hangs from a beaded chain around her neck, ready for when she needs to consult her gardening journal. The rooftop around her is transformed into a green oasis. Raised beds burst with a variety of vegetables and flowers, creating a colorful patchwork. Trellises covered in climbing vines stand tall, and terracotta pots filled with herbs line the edges. A small greenhouse is visible in one corner, its glass panels reflecting the golden evening light. In the background, the city skyline looms large - a forest of concrete and glass that stands in stark contrast to this vibrant garden. The setting sun casts a warm glow over the scene, highlighting the lush plants and the serenity on the woman's face as she finds peace in her urban Eden.

Winner: Midjourney

Once again Midjourney wins because of the texture quality. It struggled a little with the gloved fingers but still handled them better than Flux. That doesn't mean Flux's image isn't good, it just isn't as good as Midjourney's.


4. Paramedic in an emergency​


Midjourney


Paramedic image generated by Midjourney (Image credit: Midjourney/Future AI image)

Flux AI image


Paramedic image generated by Flux (Image credit: Flux AI image/Future)

For this prompt I went with something more action-heavy, focusing on a paramedic in the moment of rushing to the ambulance on a rainy night. This included a description of water droplets clinging to eyelashes and reflective strips.

This was a more challenging prompt for AI image generators as they have to capture the darker environment. 'Golden hour' light is easier for AI than night or twilight.
A young paramedic in her mid-20s is captured in a moment of urgent action as she rushes out of an ambulance on a rainy night. Her short blonde hair is plastered to her forehead by the rain, and droplets cling to her eyelashes. Her blue eyes are sharp and focused, reflecting the flashing lights of the emergency vehicles. Her expression is one of determination and controlled urgency. She's wearing a dark blue uniform with reflective strips that catch the light, the jacket partially unzipped to reveal a light blue shirt underneath. A stethoscope hangs around her neck, bouncing slightly as she moves. Heavy-duty black boots splash through puddles, and a waterproof watch is visible on her wrist, its face illuminated for easy reading in the darkness. In her arms, she carries a large red medical bag, gripping it tightly as she navigates the wet pavement. Behind her, the ambulance looms large, its red and blue lights casting an eerie glow over the rain-slicked street. Her partner can be seen in the background, wheeling out a gurney from the back of the vehicle. In the foreground, blurred by the rain and motion, concerned onlookers gather under umbrellas near what appears to be a car accident scene just out of frame. The wet street reflects the emergency lights, creating a dramatic kaleidoscope of color against the dark night. The entire scene pulses with tension and the critical nature of the unfolding emergency.

Winner: Draw

I don't think either AI image generator won this round. Both have washed-out and overly 'plastic' face textures, likely caused by the lighting issues. Midjourney does a slightly better job of matching the description of the scene.


5. The retired astronaut​


Midjourney


Retired astronaut image by Midjourney (Image credit: Midjourney/Future AI image)

Flux AI image


Retired astronaut image by Flux (Image credit: Flux AI image/Future)

Finally we have a scene at a science museum. Here I've asked the AI models to generate a retired astronaut in his late 60s giving a presentation about space.

He is well presented and in good health, with a NASA pin on his lapel. The background is well described, with posters, quotes and people watching as he speaks.
A retired astronaut in his late 60s is giving an animated presentation at a science museum. His silver hair is neatly trimmed, and despite his age, he stands tall and straight, a testament to years of rigorous physical training. His blue eyes sparkle with enthusiasm as he gestures towards a large scale model of the solar system suspended from the ceiling. He's dressed in a navy blue blazer with a small, subtle NASA pin on the lapel. Underneath, he wears a light blue button-up shirt and khaki slacks. On his left wrist is a watch that looks suspiciously like the ones worn on space missions. His hands, though showing signs of age, move with the precision and control of someone used to operating in zero gravity. Around him, a diverse group of students listen with rapt attention. Some furiously scribble notes, while others have their hands half-raised, eager to ask questions. The audience is a mix of ages and backgrounds, all united by their fascination with space exploration. The walls of the presentation space are adorned with large, high-resolution photographs of galaxies, nebulae, and planets. Inspirational quotes about exploration and discovery are interspersed between the images. In one corner, a genuine space suit stands in a glass case, adding authenticity to the presenter's words. Sunlight streams through large windows, illuminating particles of dust floating in the air, reminiscent of stars in the night sky. The entire scene is bathed in a sense of wonder and possibility, as the retired astronaut bridges the gap between Earth and the cosmos for his eager audience.

Winner: Flux

I am giving this one to Flux. It won because its skin texture and human realism were on par with or slightly better than Midjourney's, with a much better overall image structure, including more realistic background people.

Flux vs Midjourney: Which model wins​

| Prompt | Midjourney | Flux |
| A chef in the kitchen | 🌅 | |
| A street musician | 🌅 | |
| The gardener | 🌅 | |
| Paramedic in an emergency | 🌅 | 🌅 |
| The retired astronaut | | 🌅 |

This was almost a clean sweep for Midjourney and it was mainly driven by the improvements Midjourney has made in skin texture rendering with v6.1.

I don't think it was as clear-cut as it looks on paper, though, as in many images Flux had a better overall image structure and was better at backgrounds. I've also found Flux is more consistent with text rendering than Midjourney — but this test was about people and creating realistic digital humans.

What it does show is that even at the bleeding edge of AI image generation there are still tells in every image that give it away as AI generated.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,753
Reputation
7,916
Daps
148,556


TikTok could get an AI-powered video generator after ByteDance drops new AI model — here's what we know​

News

By Ryan Morrison
published August 8, 2024

Only in China for now


Jimeng AI

(Image credit: Jimeng AI/ByteDance)

ByteDance, the Chinese company behind TikTok and viral video editor CapCut, has released its first AI text-to-video model designed to compete with the yet-to-be-released Sora from OpenAI — but for now, it's only available in China.

Jimeng AI was built by Faceu Technology, a ByteDance-owned company that produces the CapCut video editing app, and is available for iPhone and Android as well as online.

To get access you have to log in with a Douyin account, the Chinese version of TikTok, which suggests that if it does come to other regions it will be linked to TikTok or CapCut. It is possible, but purely speculative, that a version of Jimeng will be built into CapCut in the future.

ByteDance isn’t the only Chinese company building out AI video models. Kuaishou is one of China's largest video apps and last month it made Kling AI video available outside of China for the first time. It is one of my favorite AI tools with impressive motion quality and video realism.


What is Jimeng AI?​


Jimeng AI

(Image credit: Jimeng AI/ByteDance)

Jimeng AI is a text-to-video model trained and operated by Faceu Technology, the Chinese company behind the CapCut video editor. Like Kling, Sora, Runway and Luma Labs Dream Machine it takes a text input and generates a few seconds of realistic video content.

Branded as the "one-stop AI creation platform", it lets you generate video from text or images and gives you control over camera movement and first- and last-frame input. The latter is something most modern AI video generators offer: you give the model two images and it fills in the moments between them.

The focus for Faceu has been on ensuring its model can understand and accurately follow Chinese text prompts and convert abstract ideas into visual works.

How does Jimeng AI compare?​

From the video clips I’ve seen on social media and the Jimeng website, it appears to be closer to Runway Gen-2 or Pika Labs than Sora, Gen-3 or even Kling. Video motion appears slightly blurred or shaky and the output is more comic than realistic.

What I haven't been able to confirm, as it isn't available outside of China, is how long each video clip is at initial generation or whether you can extend a clip.

Most tools, including Kling, start at 5 seconds, whereas Runway is 10 seconds and Sora is reportedly 15 seconds. Many of them also allow for multiple extensions to that initial clip.

I think Jimeng being mobile-first and tied to apps like Douyin and CapCut puts it in a different category to the likes of Kling and Dream Machine. It is better compared to the likes of the Captions App or Diffuse, in that its content is aimed more at social video than production.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,753
Reputation
7,916
Daps
148,556


Midjourney has competition — I got access to Google Imagen 3 and it is impressive​

Features

By Ryan Morrison
last updated August 2, 2024

Available in ImageFX


ImageFX

(Image credit: ImageFX)



Imagen 3 is a text-to-image artificial intelligence model built by Google's advanced AI lab DeepMind. It was announced at Google I/O and is finally rolling out to users.

The model is currently only available through the Google AI Test Kitchen experiment ImageFX and only to a small group of “trusted users” but that pool is being expanded regularly.

With Imagen 3, Google promises better detail, richer lighting and fewer artifacts than previous generations. It also has better prompt understanding and text rendering.

ImageFX is available for any Google user in the U.S., Kenya, New Zealand and Australia. I’ve been given access to Imagen 3 and created a series of prompts to put it to the test.


Creating Imagen 3 prompts​


Google DeepMind promises higher-quality images across a range of styles including photorealism, oil paintings and graphic art. It can also understand natural language prompts and complex camera angles.

I fed all this into Claude and had it come up with a bullet list of promised features. I then refined each bullet into a prompt to cover as many areas as possible. The one I’m most excited for is its ability to accurately render text on an image — something few do very well.

1. The wildcard (I’m feeling lucky)​


ImageFX Imagen 3

(Image credit: ImageFX Imagen 3/AI)

The first prompt is one that ImageFX generated on its own. It automatically suggests an idea and you hit ‘tab’ to see the prompt in full and either adapt or use it to make an image.

This is also a good way to test one of the most powerful features of ImageFX — its chips. These turn keywords or phrases into menu items where you can quickly adapt elements of an image.

It offered me: “A macro photograph of a colorful tiny gnome riding a snail through a thick green forest, magical, fantasy.” Once generated, you can edit any single element of an image with inpainting; this will generate four new versions but only change the area you selected.

I love the way it rendered the background and captured the concept of a macro photograph. It was also incredibly easy to adapt the color of the hat.

2. Dewdrop Web Macro​


ImageFX Imagen 3

(Image credit: ImageFX Imagen 3/AI)

This prompt aims to test Imagen 3's ability to render microscopic details and complex light interactions in a natural setting. This is similar to the first test but with an additional degree of complexity in the foreground with the dewdrop.

Prompt: "A macro photograph of a dewdrop on a spider's web, capturing the intricate details of the web and the refraction of light through the water droplet. The background should be a soft focus of a lush green forest."

As an arachnophobe, I was worried it would generate a spider but it followed the prompt well enough to just show a portion of the web.

3. Hummingbird Style Contrast​


ImageFX Imagen 3

(Image credit: ImageFX Imagen 3/AI)

Here the aim is to test the model's versatility in generating contrasting artistic styles within a single image. I initially used the prompt: "Create a split-screen image: on the left, a photorealistic close-up of a hummingbird feeding from a flower; on the right, the same scene reimagined as a vibrant, stylized oil painting in the style of Van Gogh."

This would have worked with Midjourney or similar but not Google. I had to revise this prompt as ImageFX won’t create work in the style of a named artist, even one whose work is long out of copyright.

So I used: “Create a split-screen image: on the left, a photorealistic close-up of a hummingbird feeding from a flower; on the right, the same scene reimagined as a vibrant, stylized painting with bold, swirling brushstrokes, intense colors, and a sense of movement and emotion in every element. The sky should have a turbulent, dream-like quality with exaggerated stars or swirls.”

This is a style I plan to experiment with more as it looks stunning. I'd have adapted the background on one side to better match the 'Van Gogh' style but otherwise it was what I asked for and did a good job at integrating the conflicting styles.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,753
Reputation
7,916
Daps
148,556

4. Steampunk Market Scene​


ImageFX Imagen 3

(Image credit: ImageFX Imagen 3/AI)

With this prompt, the aim is to challenge Imagen 3's ability to compose a complex, detailed scene with multiple elements and specific lighting conditions. I gave it some complex elements and descriptions to see how many it would produce.

Prompt: "A bustling steampunk-themed marketplace at dusk. In the foreground, a merchant is demonstrating a brass clockwork automaton to amazed onlookers. The background should feature airships docking at floating platforms, with warm lantern light illuminating the scene."

The first of the four images it generated exactly matched the prompt and the lighting is what you'd expect, suggesting Imagen has a good understanding of the real world.

5. Textured Reading Nook​


ImageFX Imagen 3

(Image credit: ImageFX Imagen 3/AI)

Generating accurate or at least compelling textures can be a challenge for models, sometimes resulting in a plastic-style effect. Here we test the model's proficiency in accurately rendering a variety of textures and materials.

Prompt: "A cozy reading nook with a plush velvet armchair, a chunky knit blanket draped over it, and a weathered leather-bound book on the seat. Next to it, a rough-hewn wooden side table holds a delicate porcelain teacup with intricate floral patterns."

Not much to say about this beyond the fact it looks great. What I loved about this one were the options I was offered in 'chips'. It allowed me to easily swap cozy for spacious, airy or bright. I could even change the reading nook to a study, library or living room.

Obviously, you can just rewrite the whole prompt, but these are ideas that work as subtle changes to fit the style.

6. Underwater Eclipse Diorama​


ImageFX Imagen 3

(Image credit: ImageFX Imagen 3/AI)

The idea behind this prompt was to test Imagen 3's ability to interpret and execute a long, detailed prompt with multiple complex elements and lighting effects.

Prompt: "An underwater scene of a vibrant coral reef during a solar eclipse. The foreground shows diverse marine life reacting to the dimming light, including bioluminescent creatures beginning to glow. In the background, the eclipsed sun is visible through the water's surface, creating eerie light rays that illuminate particles floating in the water."

This is probably the worst of the outputs. It looks fine but the solar eclipse feels out of place and the texture feels 'aquarium' rather than ocean.

7. Lunar Resort Poster​


ImageFX Imagen 3

(Image credit: ImageFX Imagen 3/AI)

The last few tests all target the model's improved text rendering capabilities. This one asks for a poster and requires Imagen 3 to generate an image within a stylized graphic design context.

Prompt: "Design a vintage-style travel poster for a fictional lunar resort. The poster should feature retro-futuristic art deco styling with the text 'Visit Luna Luxe: Your Gateway to the Stars' prominently displayed. Include imagery of a gleaming moon base with Earth visible in the starry sky above."

Text rendering was as good as I've seen, especially across multiple elements rather than just the headline. The style was OK but not perfect. It did fit the requirement but leaned more art deco than futuristic.

8. Eco-Tech Product Launch​


ImageFX Imagen 3

(Image credit: ImageFX Imagen 3/AI)

Midjourney is very good at creating product images in the real world. DALL-E is also doing that to a degree with its recent update. Here I'm asking Imagen 3 to create a modern, sleek advertisement with integrated product information.

Prompt: "Design a cutting-edge digital billboard for the launch of 'EcoCharge', a new eco-friendly wireless charging pad. The design should feature a minimalist, high-tech aesthetic with a forest green and silver color scheme. Include a 3D render of the slim, leaf-shaped device alongside the text 'EcoCharge: Power from Nature' and 'Charge your device, Save the planet - 50% more efficient'. Incorporate subtle leaf patterns and circuit board designs in the background."

It did exactly what we asked in the prompt in terms of the style, render, and design. It got the title and subhead perfectly, and even rendered most of the rest of the text, but that wasn't nearly as clear.

9. Retro Gaming Festival​


ImageFX Imagen 3

(Image credit: ImageFX Imagen 3/AI)

The final test is something I've actively used AI for — making a poster or flyer. Here we're testing its ability to handle a range of styles with multiple text elements.

Prompt: "Create a vibrant poster for 'Pixel Blast: Retro Gaming Festival'. The design should feature a collage of iconic 8-bit and 16-bit era video game characters and elements. The title 'PIXEL BLAST' should be in large, colorful pixel art font at the top. Include the text 'Retro Gaming Festival' in a chrome 80s style font below. At the bottom, add 'June 15-17 • City Arena • Tickets at pixelblast.com' in a simple, readable font. Incorporate scan lines and a CRT screen effect over the entire image."

I'd give it an 8 out of 10. It was mostly accurate, with the occasional additional word; every single word was rendered correctly, there were sometimes just too many of them.


Final thoughts​


Google Imagen 3's text rendering and realism have matched Midjourney levels. It does refuse to generate more often than Midjourney, but that is understandable for a Google product.

Imagen 3 is a huge step up from Imagen 2, which was already a good model. The biggest upgrade seems to be in overall image quality. Generated pictures are better looking, and have fewer artifacts and a lot more detail than I’ve seen from Imagen 2 or other companies' models.

It will be interesting to see how this works once it is rolled out to other platforms such as Gemini or built into third-party software as a developer API.

However it is finally deployed, DeepMind has done it again with an impressive real-world application of advanced generative AI, created in a way that is user-friendly, adaptable and powerful enough for even the pickiest of users.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,753
Reputation
7,916
Daps
148,556


Runway just dropped image-to-video in Gen3 — I tried it and it changes everything​

Features

By Ryan Morrison
published July 30, 2024

Character consistency is now possible


Runway Gen-3

(Image credit: Runway/Future AI)

Runway’s Gen-3 is one of the best artificial intelligence video models currently available and it just got a lot better with the launch of the highly anticipated image-to-video feature.

While Gen-3 has a surprisingly good image generation model, making its text-to-video one of the best available, it struggles with character consistency and hyperrealism. Both problems are solved by giving it a starting image instead of just using text.

Image-to-video using Gen-3 also allows for motion or text prompts to steer how the AI model should generate the initial 10-second video, starting with the image. The starting image can be AI-generated or a photo taken with a camera — the AI can then make it move.

Gen-3 also works with Runway’s lip-sync feature, meaning you can give it an image of a character, animate that image and then add accurate speech to the animated clip.

Why is image-to-video significant?​



Until AI video tools get the same character consistency features found in tools like Leonardo, Midjourney, and Ideogram, their use for longer storytelling is limited. This doesn’t just apply to people but also to environments and objects.

While you can in theory use text-to-video to create a short film, using descriptive language to get as close to consistency across frames as possible, there will always be discrepancies.

Starting with an image ensures, at least for the most part, that the generated video follows your aesthetic and keeps the same scenes and characters across multiple videos. It also means you can make use of different AI video tools and keep the same visual style.

In my own experiments, I’ve also found that when you start with an image the overall quality of both the image and the motion is better than if you just use text. The next step is for Runway to upgrade its video-to-video model to allow for motion transfer with style changes.


Putting Runway Gen-3 image-to-video to the test​

Get started with Gen-3 Alpha Image to Video. Learn how with today’s Runway Academy. pic.twitter.com/Mbw0eqOjto (July 30, 2024)


To put Runway’s Gen-3 image-to-video to the test I used Midjourney to create a character, in this case a middle-aged geek.

I then created a series of images of our geek doing different activities using Midjourney's consistent character feature, and animated each image using Runway.

Some of the animations were made without a text prompt; others did use a prompt to steer the motion, but it didn’t always make a massive difference. In the one video where I needed to work to properly steer the motion — my character playing basketball — adding a text prompt made it worse.

Runway Gen-3

(Image credit: Runway Gen-3/Future AI)

Overall, Gen-3 image-to-video worked incredibly well. Its understanding of motion was as close to realistic as I've seen, and one video, where the character is giving a talk at a conference, made me do a double take it was so close to real.

Gen-3 is still in Alpha and there will be continual improvements before its general release. We haven't even seen video-to-video yet and it is already generating near-real video.

I love how natural the camera motion feels, and the fact it seems to have solved some of the human movement issues, especially when you start with an image.

Other models, including previous versions of Runway, put characters in slow motion when they move. Gen-3 solves some of that problem.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,753
Reputation
7,916
Daps
148,556

1/1
Researchers at MIT's CSAIL and The AI Institute have created a new algorithm called "Estimate, Extrapolate, and Situate" (EES). This algorithm helps robots adapt to different environments by enhancing their ability to learn autonomously.

The EES algorithm improves robot efficiency in settings like factories, homes, hospitals, and coffee shops by using a vision system to monitor surroundings and assist in task performance.

The EES algorithm assesses how well a robot is performing a task and decides if more practice is needed. It was tested on Boston Dynamics's Spot robot at The AI Institute, where it successfully completed tasks after a few hours of practice.

For example, the robot learned to place a ball and ring on a slanted table in about three hours and improved its toy-sweeping skills within two hours.

#MIT #algorithm #ees #robot #TechNews #AI










1/8
The phrase "practice makes perfect" is great advice for humans — and also a helpful maxim for robots 🧵

Instead of requiring a human expert to guide such improvement, MIT & The AI Institute’s "Estimate, Extrapolate, and Situate" (EES) algorithm enables these machines to practice on their own, potentially helping them improve at useful tasks in factories, households, and hospitals: Helping robots practice skills independently to adapt to unfamiliar environments

2/8
EES first works w/a vision system that locates & tracks the machine’s surroundings. Then, the algorithm estimates how reliably the robot executes an action & if it should practice more.

3/8
EES forecasts how well the robot could perform the overall task if it refines that skill & finally it practices. The vision system then checks if that skill was done correctly after each attempt.

4/8
This algorithm could come in handy in places like a hospital, factory, house, or coffee shop. For example, if you wanted a robot to clean up your living room, it would need help practicing skills like sweeping.

EES could help that robot improve w/o human intervention, using only a few practice trials.

5/8
EES's knack for efficient learning was evident when implemented on Boston Dynamics’ Spot quadruped during research trials at The AI Institute.

In one demo, the robot learned how to securely place a ball and ring on a slanted table in ~3 hours.

6/8
In another, the algorithm guided the machine to improve at sweeping toys into a bin w/i about 2 hours.

Both results appear to be an upgrade from previous methods, which would have likely taken >10 hours per task.

7/8
Featured authors in article: Nishanth Kumar (@nishanthkumar23), Tom Silver (@tomssilver), Tomás Lozano-Pérez, and Leslie Pack Kaelbling
Paper: Practice Makes Perfect: Planning to Learn Skill Parameter Policies
MIT research group: @MITLIS_Lab

8/8
Is there transfer learning? Each robot’s optimisation for the same task performance may be different, so how do they share experiences and optimise learning?






Helping robots practice skills independently to adapt to unfamiliar environments​


A new algorithm helps robots practice skills like sweeping and placing objects, potentially helping them improve at important tasks in houses, hospitals, and factories.

Alex Shipps | MIT CSAIL
Publication Date:
August 8, 2024
Press Inquiries

The phrase “practice makes perfect” is usually reserved for humans, but it’s also a great maxim for robots newly deployed in unfamiliar environments.

Picture a robot arriving in a warehouse. It comes packaged with the skills it was trained on, like placing an object, and now it needs to pick items from a shelf it’s not familiar with. At first, the machine struggles with this, since it needs to get acquainted with its new surroundings. To improve, the robot will need to understand which skills within an overall task it needs improvement on, then specialize (or parameterize) that action.

A human onsite could program the robot to optimize its performance, but researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and The AI Institute have developed a more effective alternative. Presented at the Robotics: Science and Systems Conference last month, their “Estimate, Extrapolate, and Situate” (EES) algorithm enables these machines to practice on their own, potentially helping them improve at useful tasks in factories, households, and hospitals.

Sizing up the situation

To help robots get better at activities like sweeping floors, EES works with a vision system that locates and tracks the machine’s surroundings. Then, the algorithm estimates how reliably the robot executes an action (like sweeping) and whether it would be worthwhile to practice more. EES forecasts how well the robot could perform the overall task if it refines that particular skill, and finally, it practices. The vision system subsequently checks whether that skill was done correctly after each attempt.
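
To make that practice loop concrete, here is a minimal Python sketch of the estimate, extrapolate and situate steps as described above. The class, the numbers and the success-checking callback are hypothetical illustrations, not the researchers' implementation, which relies on a real vision system and task planner.

```python
import random

class EESPractitioner:
    """Toy sketch of the Estimate, Extrapolate, and Situate (EES) loop.
    All names and numbers are hypothetical; the real system uses a vision
    system to label attempts and a planner to value task success."""

    def __init__(self, skills, task_value_fn):
        # skills: dict mapping skill name -> list of past success flags (True/False)
        self.skills = skills
        # task_value_fn(skill, reliability): forecast of overall task success
        # if this skill reached the given reliability
        self.task_value_fn = task_value_fn

    def estimate(self, skill):
        """Estimate how reliably the robot currently executes a skill."""
        history = self.skills[skill]
        return sum(history) / len(history) if history else 0.0

    def extrapolate(self, skill):
        """Forecast the task-level gain if this skill were refined by practice."""
        current = self.estimate(skill)
        improved = min(1.0, current + 0.2)  # assumed improvement per practice round
        return self.task_value_fn(skill, improved) - self.task_value_fn(skill, current)

    def situate(self):
        """Pick the skill whose practice is forecast to help the task most."""
        return max(self.skills, key=self.extrapolate)

    def practice(self, skill, execute_and_check, attempts=5):
        """Practice a skill; a vision check labels each attempt success/failure."""
        for _ in range(attempts):
            self.skills[skill].append(execute_and_check(skill))


# Hypothetical usage: sweeping is unreliable, so EES should select it.
def task_value(skill, reliability):
    return reliability  # stand-in for a planner's task-success forecast

agent = EESPractitioner(
    {"sweep": [False, True, False], "place": [True, True, True]},
    task_value,
)
skill_to_practice = agent.situate()
agent.practice(skill_to_practice, lambda s: random.random() < 0.6)
print(skill_to_practice, agent.estimate(skill_to_practice))
```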

EES could come in handy in places like a hospital, factory, house, or coffee shop. For example, if you wanted a robot to clean up your living room, it would need help practicing skills like sweeping. According to Nishanth Kumar SM ’24 and his colleagues, though, EES could help that robot improve without human intervention, using only a few practice trials.

“Going into this project, we wondered if this specialization would be possible in a reasonable amount of samples on a real robot,” says Kumar, co-lead author of a paper describing the work, PhD student in electrical engineering and computer science, and a CSAIL affiliate. “Now, we have an algorithm that enables robots to get meaningfully better at specific skills in a reasonable amount of time with tens or hundreds of data points, an upgrade from the thousands or millions of samples that a standard reinforcement learning algorithm requires.”

See Spot sweep

EES’s knack for efficient learning was evident when implemented on Boston Dynamics’ Spot quadruped during research trials at The AI Institute. The robot, which has an arm attached to its back, completed manipulation tasks after practicing for a few hours. In one demonstration, the robot learned how to securely place a ball and ring on a slanted table in roughly three hours. In another, the algorithm guided the machine to improve at sweeping toys into a bin within about two hours. Both results appear to be an upgrade from previous frameworks, which would have likely taken more than 10 hours per task.

“We aimed to have the robot collect its own experience so it can better choose which strategies will work well in its deployment,” says co-lead author Tom Silver SM ’20, PhD ’24, an electrical engineering and computer science (EECS) alumnus and CSAIL affiliate who is now an assistant professor at Princeton University. “By focusing on what the robot knows, we sought to answer a key question: In the library of skills that the robot has, which is the one that would be most useful to practice right now?”

EES could eventually help streamline autonomous practice for robots in new deployment environments, but for now, it comes with a few limitations. For starters, they used tables that were low to the ground, which made it easier for the robot to see its objects. Kumar and Silver also 3D printed an attachable handle that made the brush easier for Spot to grab. The robot didn’t detect some items and identified objects in the wrong places, so the researchers counted those errors as failures.

Giving robots homework

The researchers note that the practice speeds from the physical experiments could be accelerated further with the help of a simulator. Instead of physically working at each skill autonomously, the robot could eventually combine real and virtual practice. They hope to make their system faster with less latency, engineering EES to overcome the imaging delays the researchers experienced. In the future, they may investigate an algorithm that reasons over sequences of practice attempts instead of planning which skills to refine.

“Enabling robots to learn on their own is both incredibly useful and extremely challenging,” says Danfei Xu, an assistant professor in the School of Interactive Computing at Georgia Tech and a research scientist at NVIDIA AI, who was not involved with this work. “In the future, home robots will be sold to all sorts of households and expected to perform a wide range of tasks. We can't possibly program everything they need to know beforehand, so it’s essential that they can learn on the job. However, letting robots loose to explore and learn without guidance can be very slow and might lead to unintended consequences. The research by Silver and his colleagues introduces an algorithm that allows robots to practice their skills autonomously in a structured way. This is a big step towards creating home robots that can continuously evolve and improve on their own.”

Silver and Kumar’s co-authors are The AI Institute researchers Stephen Proulx and Jennifer Barry, plus four CSAIL members: Northeastern University PhD student and visiting researcher Linfeng Zhao, MIT EECS PhD student Willie McClinton, and MIT EECS professors Leslie Pack Kaelbling and Tomás Lozano-Pérez. Their work was supported, in part, by The AI Institute, the U.S. National Science Foundation, the U.S. Air Force Office of Scientific Research, the U.S. Office of Naval Research, the U.S. Army Research Office, and MIT Quest for Intelligence, with high-performance computing resources from the MIT SuperCloud and Lincoln Laboratory Supercomputing Center.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,753
Reputation
7,916
Daps
148,556

1/1
🎨✨Just read about VideoDoodles, a new system that lets artists easily create video doodles—hand-drawn animations integrated with video! It uses 3D scene-aware canvases and custom tracking to make animations look natural. Game-changer for creatives! #SIGGRAPH2023 #AI #GROK 2





1/7
VideoDoodles: a great combination of modern CV techniques and clever HCI by @emxtyu & al @adobe (+ is now open source)

project: VideoDoodles: Hand-Drawn Animations on Videos with Scene-Aware Canvases
paper: https://www-sop.inria.fr/reves/Basilic/2023/YBNWKB23/VideoDoodles.pdf
code: GitHub - adobe-research/VideoDoodles

2/7
interesting & fun paper/repo by adobe💪

adding it to http://microlaunch.net august spotlights

3/7
boop @kelin_online

4/7
Oh this one is from last year, fantastic that is now oss!

5/7
yes we have VIDEO DOODLES!

https://invidious.poast.org/watch?v=hsykpStD2yw

6/7
Haha, very cool!

7/7





VideoDoodles: Hand-Drawn Animations on Videos with Scene-Aware Canvases​

Emilie Yu Kevin Blackburn-Matzen Cuong Nguyen Oliver Wang Rubaiat Habib Kazi Adrien Bousseau
ACM Transactions on Graphics (SIGGRAPH) - 2023
VideoDoodles: Hand-Drawn Animations on Videos with Scene-Aware Canvases

Video doodles combine hand-drawn animations with video footage. Our interactive system eases the creation of this mixed media art by letting users place planar canvases in the scene which are then tracked in 3D. In this example, the inserted rainbow bridge exhibits correct perspective and occlusions, and the character’s face and arms follow the tram as it runs towards the camera.
Paper Code Supplemental webpage Video

Abstract​

We present an interactive system to ease the creation of so-called video doodles – videos on which artists insert hand-drawn animations for entertainment or educational purposes. Video doodles are challenging to create because to be convincing, the inserted drawings must appear as if they were part of the captured scene. In particular, the drawings should undergo tracking, perspective deformations and occlusions as they move with respect to the camera and to other objects in the scene – visual effects that are difficult to reproduce with existing 2D video editing software. Our system supports these effects by relying on planar canvases that users position in a 3D scene reconstructed from the video. Furthermore, we present a custom tracking algorithm that allows users to anchor canvases to static or dynamic objects in the scene, such that the canvases move and rotate to follow the position and direction of these objects. When testing our system, novices could create a variety of short animated clips in a dozen of minutes, while professionals praised its speed and ease of use compared to existing tools.
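
To make the planar-canvas idea concrete, the toy sketch below warps a 2D doodle onto a quadrilateral in a video frame with a homography using OpenCV. It only illustrates the perspective deformation the paper describes; the actual VideoDoodles system also reconstructs the 3D scene, tracks the canvas over time and handles occlusions, and the corner coordinates here are made-up placeholders.

```python
# Toy illustration of the planar-canvas idea: warp a 2D doodle onto a
# quadrilateral in a video frame using a homography. This is not the
# VideoDoodles implementation.
import cv2
import numpy as np

doodle = cv2.imread("doodle.png")          # the hand-drawn animation frame
frame = cv2.imread("video_frame.png")      # the current video frame
h, w = doodle.shape[:2]

# Corners of the doodle image and the (assumed, tracked) canvas corners in
# the frame, ordered top-left, top-right, bottom-right, bottom-left.
src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
dst = np.float32([[320, 180], [560, 200], [540, 400], [300, 380]])

H = cv2.getPerspectiveTransform(src, dst)
warped = cv2.warpPerspective(doodle, H, (frame.shape[1], frame.shape[0]))

# Composite: paste the warped doodle wherever it has non-zero pixels.
mask = warped.sum(axis=2) > 0
composite = frame.copy()
composite[mask] = warped[mask]
cv2.imwrite("composited_frame.png", composite)
```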
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,753
Reputation
7,916
Daps
148,556



1/11
Deep-Live-Cam is trending #1 on github. It enables anyone to convert a single image into a LIVE stream deepfake, instant and immediately.

2/11
The github link states it was created to "help artists with tasks such as animating a custom character or using the character as a model for clothing etc"

GitHub - hacksider/Deep-Live-Cam: real time face swap and one-click video deepfake with only a single image (uncensored)

3/11
It’s so over

4/11
When it's a simple cellphone app, that's when we're really fukked.

5/11
whats the purpose of this shyt

6/11
The Laughing Man already knows the answer to that

7/11
This needs to be banned

8/11
@kyledunnigan

9/11
That's not Billy Butcher that's Chilli Chopper

10/11
So everything you’ll see online, everything you’ll read online is going to be fake. That’s a good way to know how to avoid things from now on.

Focus on your life instead of online

11/11
This is going to cause chaos.

Members of The Real World know what this means.







1/11
#1 trending github repo right now looks INSANE

Single image to live stream deep fake.

Look at that quality!!

It's called Deep-Live-Cam (link in replies)

2/11
GitHub - hacksider/Deep-Live-Cam: real time face swap and one-click video deepfake with only a single image (uncensored)

3/11
Some experiments - it works almost flawlessly and it's totally real-time. Took me 5 minutes to install.

4/11
Incredible

5/11
Gonna do a tutorial?

6/11
Oh yeah

7/11
I can finally not attend meetings

8/11
Chat is this real?

9/11
I hope photorealistic ai-generated video becomes common and easily accessible. If every man can claim video of them to be fake an unprecedented age of privacy would follow. Totalitarian surveillance states will become impossible. Real human witnesses will become essential again.

10/11
Everyone will live in their own little 3D Dopaverse.

Bye bye shared reality

11/11
Election season about to get interesting. lol


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,753
Reputation
7,916
Daps
148,556

Artists celebrate as copyright infringement case against AI image generators moves forward​


Carl Franzen@carlfranzen

August 12, 2024 6:45 PM

Cartoon of artists celebrating outside a courthouse


Credit: VentureBeat made with ChatGPT



Visual artists who joined together in a class action lawsuit against some of the most popular AI image and video generation companies are celebrating today after a judge ruled their copyright infringement case against the AI companies can move forward toward discovery.

Disclosure: VentureBeat regularly uses AI art generators to create article artwork, including some named in this case.

The case, recorded under the number 3:23-cv-00201-WHO, was originally filed back in January of 2023. It has since been amended several times and parts of it struck down, including today.


Which artists are involved?​


Artists Sarah Andersen, Kelly McKernan, Karla Ortiz, Hawke Southworth, Grzegorz Rutkowski, Gregory Manchess, Gerald Brom, Jingna Zhang, Julia Kaye, and Adam Ellis have, on behalf of all artists, accused Midjourney, Runway, Stability AI, and DeviantArt of copying their work by offering AI image generator products based on the open source Stable Diffusion AI model, which Runway and Stability AI collaborated on and which the artists alleged was trained on their copyrighted works in violation of the law.


What the judge ruled today​


While Judge William H. Orrick of the Northern District Court of California, which oversees San Francisco and the heart of the generative AI boom, didn’t yet rule on the final outcome of the case, he wrote in his decision issued today that “the allegations of induced infringement are sufficient” for the case to move forward toward a discovery phase — which could allow the lawyers for the artists to peer inside and examine documents from within the AI image generator companies, revealing to the world more details about their training datasets, mechanisms, and inner workings.

“This is a case where plaintiffs allege that Stable Diffusion is built to a significant extent on copyrighted works and that the way the product operates necessarily invokes copies or protected elements of those works,” Orrick’s decision states. “Whether true and whether the result of a glitch (as Stability contends) or by design (plaintiffs’ contention) will be tested at a later date. The allegations of induced infringement are sufficient.”


Artists react with applause​

“The judge is allowing our copyright claims through & now we get to find out allll the things these companies don’t want us to know in Discovery,” wrote one of the artists filing the suit, Kelly McKernan, on her account on the social network X. “This is a HUGE win for us. I’m SO proud of our incredible team of lawyers and fellow plaintiffs!”

Very exciting news on the AI lawsuit! The judge is allowing our copyright claims through & now we get to find out allll the things these companies don’t want us to know in Discovery. This is a HUGE win for us. I’m SO proud of our incredible team of lawyers and fellow plaintiffs! pic.twitter.com/jD6BjGWMoQ

— Kelly McKernan (@Kelly_McKernan) August 12, 2024
“Not only do we proceed on our copyright claims, this order also means companies who utilize SD [Stable Diffusion] models for and/or LAION like datasets could now be liable for copyright infringement violations, amongst other violations,” wrote another plaintiff artist in the case, Karla Ortiz, on her X account.

1/3 HUGE update on our case!

We won BIG as the judge allowed ALL of our claims on copyright infringement to proceed and we historically move on The Lanham Act (trade dress) claims! We can now proceed onto discovery!

The implications on this order is huge on so many fronts! pic.twitter.com/ZcoeFtPtQb

— Karla Ortiz (@kortizart) August 12, 2024


Technical and legal background​


Stable Diffusion was allegedly trained on LAION-5B, a dataset of more than 5 billion images scraped from across the web by researchers and posted online back in 2022.

However, as the case itself notes, that database only contained URLs or links to the images and text descriptions, meaning that the AI companies would have had to separately go and scrape or screenshot copies of the images to train Stable Diffusion or other derivative AI model products.


A silver lining for the AI companies?​


Orrick did hand the AI image generator companies a victory by denying and tossing out with prejudice claims filed against them by the artists under the Digital Millennium Copyright Act of 1998, which prohibits companies from offering products designed to circumvent controls on copyrighted materials offered online and through software (also known as “digital rights management” or DRM).

Midjourney tried to reference older court cases “addressing jewelry, wooden cutouts, and keychains” which found that resemblances between different jewelry products and those of prior artists could not constitute copyright infringement because they were “functional” elements, that is, necessary in order to display certain features or elements of real life or that the artist was trying to produce, regardless of their similarity to prior works.

The artists claimed that Stable Diffusion models use “CLIP-guided diffusion” that relies on prompts including artists’ names to generate an image.

CLIP, an acronym for “Contrastive Language-Image Pre-training,” is a neural network and AI training technique developed by OpenAI back in 2021, more than a year before ChatGPT was unleashed on the world, which can identify objects in images and label them with natural language text captions — greatly aiding in compiling a dataset for training a new AI model such as Stable Diffusion.
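
As an aside, the publicly released CLIP checkpoints make it easy to see what the model actually does. The sketch below uses OpenAI's open CLIP weights via the Hugging Face transformers library to score how well candidate captions match an image; it is only an illustration of CLIP itself, not code from the case or from any of the companies involved, and the image path and captions are placeholders.

```python
# Illustration only: scoring caption/image similarity with OpenAI's public
# CLIP checkpoint via Hugging Face transformers.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any local image (placeholder path)
captions = ["a watercolor landscape", "a photo of a cat", "an oil portrait"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into a probability over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.2f}")
```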

“The CLIP model, plaintiffs assert, works as a trade dress database that can recall and recreate the elements of each artist’s trade dress,” writes Orrick in a section of the ruling about Midjourney, later stating: “the combination of identified elements and images, when considered with plaintiffs’ allegations regarding how the CLIP model works as a trade dress database, and Midjourney’s use of plaintiffs’ names in its Midjourney Name List and showcase, provide sufficient description and plausibility for plaintiffs’ trade dress claim.”

In other words: the fact that Midjourney used artists’ names as well as labeled elements of their works to train its model may constitute copyright infringement.

But, as I’ve argued before — from my perspective as a journalist, not a copyright lawyer nor expert on the subject — it’s already possible and legally permissible for me to commission a human artist to create a new work in the style of a copyrighted artist’s work, which would seem to undercut the plaintiffs’ claims.

We’ll see how well the AI art generators can defend their training practices and model outputs as the case moves forward. Read the full document embedded below:

gov.uscourts.cand_.407208.223.0_2 (Download)
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,753
Reputation
7,916
Daps
148,556


LongWriter AI breaks 10,000-word barrier, challenging human authors​


Michael Nuñez@MichaelFNunez

August 15, 2024 6:00 AM

Credit: VentureBeat made with Midjourney


Credit: VentureBeat made with Midjourney



Researchers at Tsinghua University in Beijing have created a new artificial intelligence system that can produce coherent texts of more than 10,000 words, a significant advance that could transform how long-form writing is approached across various fields.

The system, described in a paper called “LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs,” tackles a persistent challenge in AI technology: the ability to generate lengthy, high-quality written content. This development could have far-reaching implications for tasks ranging from academic writing to fiction, potentially altering the landscape of content creation in the digital age.

The research team, led by Yushi Bai, discovered that an AI model’s output length directly correlates with the length of texts it encounters during training. “We find that the model’s effective generation length is inherently bounded by the sample it has seen during supervised fine-tuning,” the researchers explain. This insight led them to create “LongWriter-6k,” a dataset of 6,000 writing samples ranging from 2,000 to 32,000 words.

By feeding this data-rich diet to their AI model during training, the team scaled up the maximum output length from around 2,000 words to over 10,000 words. Their 9-billion parameter model outperformed even larger proprietary models in long-form text generation tasks.
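
A toy sketch of that data-side idea is shown below: keep only fine-tuning examples whose target outputs fall in the 2,000 to 32,000 word range, so the model actually sees long outputs during training. The field names and the word-count heuristic are assumptions for illustration, not the LongWriter-6k pipeline itself.

```python
# Toy sketch: filter a supervised fine-tuning set so it only contains
# examples with long target outputs. Field names are assumptions.
def count_words(text: str) -> int:
    return len(text.split())

def build_long_output_sft_set(examples, min_words=2000, max_words=32000):
    """examples: iterable of dicts with 'prompt' and 'response' keys."""
    kept = []
    for ex in examples:
        n = count_words(ex["response"])
        if min_words <= n <= max_words:
            kept.append(ex)
    return kept

# Hypothetical usage with an in-memory dataset:
dataset = [
    {"prompt": "Write a short note.", "response": "word " * 150},
    {"prompt": "Write a 10,000-word travel guide.", "response": "word " * 10000},
]
long_set = build_long_output_sft_set(dataset)
print(len(long_set))  # -> 1; only the long-output example survives
```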

LongWriter-glm4-9b from @thukeg is capable of generating 10,000+ words at once!?

Paper identifies a problem with current long context LLMs — they can process inputs up to 100,000 tokens, yet struggle to generate outputs exceeding lengths of 2,000 words.

Paper proposes that an… pic.twitter.com/2jfKyIpShK

— Gradio (@Gradio) August 14, 2024


A double-edged pen: Opportunities and challenges​


This breakthrough could transform industries reliant on long-form content. Publishers might use AI to generate first drafts of books or reports. Marketing agencies could create in-depth white papers or case studies more efficiently. Education technology companies might develop AI tutors capable of producing comprehensive study materials.

However, the technology also raises significant challenges. The ability to generate vast amounts of human-like text could exacerbate issues of misinformation and spam. Content creators and journalists may face increased competition from AI-generated articles. Academic institutions will need to refine plagiarism detection tools to identify AI-written papers.

Comparative performance of leading AI language models, including proprietary and open-source options, alongside Tsinghua University’s new LongWriter models. The table shows LongWriter-9B-DPO outperforming other models in overall scores and excelling in generating longer texts of 4,000 to 20,000 words. (credit: github.com)

The ethical implications are equally profound. As AI-generated text becomes indistinguishable from human-written content, questions of authorship, creativity, and intellectual property become more complex. The development of long-form AI writing capabilities may also influence human language skills, potentially enhancing creativity or leading to atrophy of writing abilities.


Rewriting the future: Implications for society and industry​


The researchers have open-sourced their code and models on GitHub, enabling other developers to build on their work. They’ve also released a demonstration video showing their model generating a coherent 10,000-word travel guide to China from a simple prompt, highlighting the technology’s potential for producing detailed, structured content.
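
Because the models are open-sourced, they can be tried locally. The sketch below is a minimal example of loading the 9-billion parameter model with Hugging Face transformers and prompting it for a long travel guide; the model id and generation settings are assumptions on my part, so check the LongWriter GitHub repository for the exact identifiers and recommended setup.

```python
# Minimal sketch of loading the open-sourced LongWriter model for long-form
# generation. The model id and generation settings are assumptions; consult
# the LongWriter repo for the exact identifiers and recommended parameters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "THUDM/LongWriter-glm4-9b"  # assumed Hugging Face id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Write a 10,000-word travel guide to China."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32768, do_sample=True, temperature=0.5)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```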


A side-by-side comparison shows the output of two AI language models. On the left, LongWriter generates a 7,872-word story, while on the right, the standard GLM-4-9B-Chat model produces 1,896 words. (credit: github.com)

As AI continues to advance, the line between human and machine-generated text blurs further. This breakthrough in long-form text generation represents not just a technical achievement, but a turning point that may reshape our relationship with written communication.

The challenge now lies in harnessing this technology responsibly. Policymakers, ethicists, and technologists must collaborate to develop frameworks for the ethical use of AI-generated content. Education systems may need to evolve, emphasizing skills that complement rather than compete with AI capabilities.

As we enter this new era of AI-assisted writing, the written word, long considered a uniquely human domain, ventures into uncharted territory. The implications of this shift will likely resonate across society, influencing how we create, consume, and value written content in the years to come.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,753
Reputation
7,916
Daps
148,556

Pindrop claims to detect AI audio deepfakes with 99% accuracy​


Shubham Sharma@mr_bumss

August 15, 2024 5:39 AM

How adversarial AI is creating shallow trust in deepfake world




Today, Pindrop, a company offering solutions for voice security, identity verification and fraud detection, announced the release of Pulse Inspect, a web-based tool for detecting AI-generated speech in any digital audio or video file with what it claims is a significantly high degree of accuracy: 99%.

The feature is available in preview as part of Pindrop’s Pulse suite of products, and offers detection regardless of the tool or AI model the audio was generated from.

This is a notable and ambitious departure from general industry practice, where AI vendors release AI classifiers that only detect synthetic content generated by their own tools.

Pindrop is offering Pulse Inspect on a yearly subscription to organizations looking to combat the risk of audio deepfakes at scale. However, CEO Vijay Balasubramaniyan tells VentureBeat that they may launch more affordable pricing tiers – with a limited number of media checks – for consumers as well.

“Our pricing is designed for organizations with a recurring need for deepfake detection. However, based on future market demand, we may consider launching pricing options better suited for casual users in the future,” he said.


Pindrop addressing the rise of audio deepfakes​


While deepfakes have been around for a long time, the rise of text-based generative AI systems has made them more prevalent on the internet. Popular gen AI tools, like those from Microsoft and ElevenLabs, have been exploited to mimic the audio and video of celebrities, businesspeople and politicians to spread misinformation and scams — affecting their public image.

According to Pindrop’s internal report, over 12 million American adults know someone who has personally had deepfakes created without their consent. These duplicates could be anything from images to video to audio, but they all have one thing in common: they thrive on virality, spreading like wildfires on social media.

To address this evolving problem, Pindrop announced the Pulse suite of products earlier this year. The first offering in the portfolio helped enterprises detect deepfake calls coming to their call centers. Now, with Pulse Inspect, the company is going beyond calls to help organizations check any audio/video file for AI-generated synthetic artifacts.


Upload questionable audio files for analysis​


At the core, the offering comes as a web application, where an enterprise user can upload the questionable file for analysis.

Previously, the whole process of checking for synthetic artifacts in existing media files required time-consuming forensic examination. However, in this case, the tool processes the audio in a matter of seconds and comes up with a “deepfake score,” highlighting the sections that contain AI-generated speech.

This quick response can then enable organizations to take proactive actions to prevent the spread of misinformation and maintain their brand credibility.


Training and analysis process​


Pindrop says it has trained a proprietary deepfake detection model on more than 350 deepfake generation tools, 20 million unique utterances and over 40 languages, resulting in a 99% deepfake audio detection rate, based on the company’s internal analysis of a dataset of about 200,000 samples.

The model checks media files for synthetic artifacts every four seconds, ensuring it classifies deepfakes accurately, especially in the cases of mixed media containing both AI-generated and genuine elements.

“Pindrop’s technology leverages recent breakthroughs in deep neural networks (DNN) and sophisticated spectro-temporal analysis to identify synthetic artifacts using multiple approaches,” Balasubramaniyan explained.


No vendor-specific detection limits​


Since Pindrop has trained its detection model on several hundred generation tools, Pulse Inspect has no tool-specific restriction for detection.

“There are over 350 deepfake generator systems, with many prolific audio deepfakes on social media likely coming from open-source tools rather than commercial ones like ElevenLabs. Customers need comprehensive tools like Pindrop’s, which are not limited to detecting deepfakes from a single system but can identify synthetic audio across all generation systems,” Balasubramaniyan added.

However, it is important to note that the tool might fail to identify deepfakes in some cases, especially when a file has less than two seconds of net speech or a very high level of background noise. The CEO said the company is working continuously to address these gaps and further improve detection accuracy.

Currently, Pindrop is targeting Pulse Inspect at organizations such as media companies, nonprofits, government agencies, celebrity management firms, legal firms and social media networks. Balasubramaniyan did not share the exact number of customers using the tool, but he said “a number of partners” are already paying for a volume-based annual subscription. These include TrueMedia.org, a free-to-use service that lets critical election audiences detect deepfakes.

In addition to the web app supporting manual uploads, Pulse Inspect can also be integrated into custom forensic workflows via an API. This can power bulk use cases such as that of a social media network flagging and removing harmful AI-generated videos.
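As a rough illustration of what such a bulk workflow could look like, the sketch below posts files to a placeholder HTTP endpoint and flags anything above a score threshold; the URL, auth header and response fields are assumptions for illustration, not Pindrop’s documented API.

```python
# Hypothetical sketch of bulk scanning via an HTTP API -- the endpoint path,
# auth header and response fields are placeholders, not Pindrop's real API.
import pathlib
import requests

API_URL = "https://api.example.com/pulse/inspect"   # placeholder endpoint
API_KEY = "YOUR_API_KEY"                             # placeholder credential

def scan_file(path: pathlib.Path) -> dict:
    with path.open("rb") as f:
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"media": f},
            timeout=60,
        )
    resp.raise_for_status()
    return resp.json()   # assumed to contain something like {"deepfake_score": 0.97}

def flag_uploads(folder: str, threshold: float = 0.8) -> list[pathlib.Path]:
    flagged = []
    for path in pathlib.Path(folder).glob("*.mp4"):
        result = scan_file(path)
        if result.get("deepfake_score", 0.0) >= threshold:
            flagged.append(path)   # hand off to a human reviewer / takedown queue
    return flagged
```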

Moving ahead, Balasubramaniyan said, the company plans to bolster the Pulse suite by improving the tools’ explainability, including a feature to trace a deepfake back to the generator that produced it, and by supporting more modalities.
 



Google’s AI surprise: Gemini Live speaks like a human, taking on ChatGPT Advanced Voice Mode​


Carl Franzen@carlfranzen

August 13, 2024 11:17 AM

Hand holding smartphone displaying video call with robot as other robot stands in background


Credit: VentureBeat made with ChatGPT



Google can sometimes seem to be playing catch-up to rivals such as Meta, OpenAI, Anthropic and Mistral in the generative AI race, but not this time.

Today, the company leapfrogged most of them by announcing Gemini Live, a new voice mode for its Gemini AI model in the Gemini mobile app. Users can speak to the model in plain, conversational language, interrupt it mid-response, and hear it reply in a humanlike voice and cadence. Or as Google put it in a post on X: “You can now have a free-flowing conversation, and even interrupt or change topics just like you might on a regular phone call.”

We’re introducing Gemini Live, a more natural way to interact with Gemini. You can now have a free-flowing conversation, and even interrupt or change topics just like you might on a regular phone call. Available to Gemini Advanced subscribers. #MadeByGoogle pic.twitter.com/eNjlNKubsv

— Google (@Google) August 13, 2024

If that sounds familiar, it’s because OpenAI demoed its own “Advanced Voice Mode” for ChatGPT in May, openly comparing it to the talking AI operating system from the movie Her, only to delay the feature and begin rolling it out selectively to alpha participants late last month.

Gemini Live is now available in English on the Google Gemini app for Android devices through a Gemini Advanced subscription ($19.99 USD per month), with an iOS version and support for more languages to follow in the coming weeks.

In other words, even though OpenAI showed off a similar feature first, Google is set to make its version available to a much wider potential audience (Android counts more than 3 billion active users, and there are some 2.2 billion iOS devices) much sooner than ChatGPT’s Advanced Voice Mode.



Yet part of the reason OpenAI may have delayed ChatGPT Advanced Voice Mode was its own internal “red-teaming,” or controlled adversarial security testing, which showed that the voice mode sometimes engaged in odd, disconcerting and even potentially dangerous behavior, such as mimicking the user’s own voice without consent, a capability that could be exploited for fraud or other malicious purposes.

How is Google addressing the potential harms caused by this type of tech? We don’t really know yet, but VentureBeat reached out to the company to ask and will update when we hear back.


What is Gemini Live good for?​


Google pitches Gemini Live as offering free-flowing, natural conversation that’s good for brainstorming ideas, preparing for important conversations, or simply chatting casually about “various topics.” Gemini Live is designed to respond and adapt in real-time.

Additionally, this feature can operate hands-free, allowing users to continue their interactions even when their device is locked or running other apps in the background.

Google further announced that the Gemini AI model is now fully integrated into the Android user experience, providing more context-aware assistance tailored to the device.

Users can access Gemini by long-pressing the power button or saying, “Hey Google.” This integration allows Gemini to interact with the content on the screen, such as providing details about a YouTube video or generating a list of restaurants from a travel vlog to add directly into Google Maps.

In a blog post, Sissie Hsiao, Vice President and General Manager of Gemini Experiences and Google Assistant, emphasized that the evolution of AI has led to a reimagining of what it means for a personal assistant to be truly helpful. With these new updates, Gemini is set to offer a more intuitive and conversational experience, making it a reliable sidekick for complex tasks.
 



Move over, Devin: Cosine’s Genie takes the AI coding crown​


Carl Franzen@carlfranzen

August 12, 2024 9:53 AM

Oil lamp abstract artwork emitting code cloud


Credit: VentureBeat made with Midjourney V6



It wasn’t long ago that the startup Cognition was blowing minds with Devin, an AI software engineer powered on the backend by OpenAI’s GPT-4 foundation large language model (LLM) that could autonomously write and edit code when given instructions in natural-language text.

But Devin emerged in March 2024 — five months ago — an eternity in the fast-moving generative AI space.

Now another “C”-named startup, Cosine, founded through the esteemed Y Combinator startup accelerator in San Francisco, has announced its own autonomous AI engineer, Genie. The company says Genie handily outperforms Devin, scoring 30% on the third-party SWE-Bench benchmark compared to Devin’s 13.8%, and even surpasses the 19% scored by Amazon’s Q and Factory’s Code Droid.

Screenshot from Cosine’s website showing Genie’s performance on SWE-Bench compared to other AI coding engineer models. Credit: Cosine
“This model is so much more than a benchmark score: it was trained from the start to think and behave like a human SWE [software engineer],” wrote Cosine’s co-founder and CEO Alistair Pullen in a post on his account on the social network X.

I'm excited to share that we've built the world's most capable AI software engineer, achieving 30.08% on SWE-Bench – ahead of Amazon and Cognition. This model is so much more than a benchmark score: it was trained from the start to think and behave like a human SWE. pic.twitter.com/OyvqKLxcGV

— Alistair (@AlistairPullen) August 12, 2024



What is Genie and what can it do?​


Genie is an advanced AI software engineering model designed to autonomously tackle a wide range of coding tasks, from bug fixing to feature building, code refactoring and validation through comprehensive testing, as instructed by human engineers or managers.

It operates either fully autonomously or in collaboration with users and aims to provide the experience of working alongside a skilled colleague.

“We’ve been chasing the dream of building something that can genuinely automatically perform end-to-end programming tasks with no intervention and a high degree of reliability – an artificial colleague. Genie is the first step in doing exactly that,” wrote Pullen in the Cosine blog post announcing Genie’s performance and limited, invitation-only availability.



The AI can write software in a multitude of languages; its technical report lists 15 as sources of training data:


  1. JavaScript

  2. Python
  3. TypeScript
  4. TSX
  5. Java
  6. C#
  7. C++
  8. C
  9. Rust
  10. Scala
  11. Kotlin
  12. Swift
  13. Golang
  14. PHP
  15. Ruby

Cosine claims Genie can emulate the cognitive processes of human engineers.

“My thesis on this is simple: make it watch how a human engineer does their job, and mimic that process,” Pullen explained in the blog post.

The code Genie generates is stored in a user’s GitHub repo, meaning Cosine does not retain a copy, nor take on any of the attendant security risks.

Furthermore, Cosine’s software platform is already integrated with Slack and system notifications, which it can use to alert users of its state, ask questions, or flag issues as a good human colleague would.

”Genie also can ask users clarifying questions as well as respond to reviews/comments on the PRs [pull requests] it generates,” Pullen wrote to VentureBeat. “We’re trying to get Genie to behave like a colleague, so getting the model to use the channels a colleague would makes the most sense.”
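For a sense of how an agent can surface status updates through the channels a colleague would use, here is a minimal sketch that posts a message to a Slack incoming webhook; the webhook URL and message text are placeholders, and this is not Cosine’s actual integration code.

```python
# Minimal sketch of an agent surfacing status updates in Slack via an incoming
# webhook -- the URL and message are placeholders, not Cosine's integration.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def notify(message: str) -> None:
    """Post a short status message (e.g. 'opened a PR, awaiting review')."""
    resp = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
    resp.raise_for_status()

notify("Agent update: pushed a fix branch and opened a pull request for review.")
```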


Powered by a long context OpenAI model​


Unlike many AI products that rely on foundation models supplemented with a few tools, Genie was developed through a proprietary process that involves training and fine-tuning a long-token-output AI model from OpenAI.

“In terms of the model we’re using, it’s a (currently) non-general availability GPT-4o variant that OpenAI have allowed us to train as part of the experimental access program,” Pullen wrote to VentureBeat via email. “The model has performed well and we’ve shared our learnings with the OpenAI finetuning team and engineering leadership as a result. This was a real turning point for us as it convinced them to invest resources and attention in our novel techniques.”

While Cosine doesn’t specify the particular model, OpenAI recently announced limited availability of a new GPT-4o Long Output model that can produce up to 64,000 tokens of output instead of GPT-4o’s initial 4,000, a 16-fold increase.
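For context, a fine-tuning job against the OpenAI API looks roughly like the sketch below; the long-output GPT-4o variant Cosine describes is not generally available, so the model name and training file here are stand-ins rather than what the company actually used.

```python
# Rough sketch of a fine-tuning job with the OpenAI Python SDK. The non-GA
# GPT-4o variant Cosine used is not public, so the model name below is a
# stand-in for a publicly fine-tunable model; the JSONL file is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a JSONL file of {"messages": [...]} chat-format training examples.
training_file = client.files.create(
    file=open("swe_reasoning_traces.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",   # stand-in; Cosine used a non-GA GPT-4o variant
)
print(job.id, job.status)
```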


The training data was key​

“For its most recent training run Genie was trained on billions of tokens of data, the mix of which was chosen to make the model as competent as possible on the languages our users care about the most at the current time,” wrote Pullen in Cosine’s technical report on the agent.

With its extensive context window and a continuous loop of improvement, Genie iterates and refines its solutions until they meet the desired outcome.

Cosine says in its blog post that it spent nearly a year curating a dataset with a wide range of software development activities from real engineers.

“In practice, however, getting such [data] and then effectively utilising that data is extremely difficult, because essentially it doesn’t exist,” Pullen elaborated in his blog post, adding: “Our data pipeline uses a combination of artefacts, static analysis, self-play, step-by-step verification, and fine-tuned AI models trained on a large amount of labelled data to forensically derive the detailed process that must have happened to have arrived at the final output. The impact of the data labelling can’t be understated, getting hold of very high-quality data from competent software engineers is difficult, but the results were worth it as it gave so much insight as to how developers implicitly think about approaching problems.”

In an email to VentureBeat, Pullen clarified that: “We started with artefacts of SWEs doing their jobs like PRs, commits, issues from OSS repos (MIT licensed) and then ran that data through our pipeline to forensically derive the reasoning, to reconstruct how the humans came to the conclusions they did. This proprietary dataset is what we trained the v1 on, and then we used self-play and self-improvement to get us the rest of the way.”

This dataset not only represents perfect information lineage and incremental knowledge discovery but also captures the step-by-step decision-making process of human engineers.

“By actually training our models with this dataset rather than simply prompting base models which is what everyone else is doing, we have seen that we’re no longer just generating random code until some works, it’s tackling problems like a human,” Pullen asserted.
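A heavily simplified sketch of what deriving a training example from an open-source PR might look like is shown below; the field names, the `derive_reasoning` placeholder and the output format are assumptions for illustration, not Cosine’s proprietary pipeline.

```python
# Hedged sketch of turning a merged open-source PR into a chat-format training
# example (issue -> reasoning -> diff). Field names and `derive_reasoning` are
# illustrative assumptions, not Cosine's actual pipeline.
import json

def derive_reasoning(issue_text: str, diff: str) -> str:
    """Placeholder for the 'forensic' step that reconstructs why the change was made."""
    return "Identified the failing code path, then applied the minimal change below."

def build_example(issue_text: str, diff: str) -> dict:
    reasoning = derive_reasoning(issue_text, diff)
    return {
        "messages": [
            {"role": "user", "content": f"Fix this issue:\n{issue_text}"},
            {"role": "assistant", "content": f"{reasoning}\n\n{diff}"},
        ]
    }

issue = "Crash when the config file is missing."
diff = (
    "-    cfg = load_config(path)\n"
    "+    cfg = load_config(path) if os.path.exists(path) else {}"
)

with open("swe_reasoning_traces.jsonl", "a") as f:
    f.write(json.dumps(build_example(issue, diff)) + "\n")
```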


Pricing​


In a follow-up email, Pullen described how Genie’s pricing structure will work.

He said it will initially be broken into two tiers:

“1. An accessible option priced competitively with existing AI tools, around the $20 mark. This tier will have some feature and usage limitations but will showcase Genie’s capabilities for individuals and small teams.

2. An enterprise-level offering with expanded features, virtually unlimited usage and the ability to create a perfect AI colleague who’s an expert in every line [of] code ever written internally. This tier will be priced more substantially, reflecting its value as a full AI engineering colleague.”


Implications and Future Developments​


Genie’s launch has far-reaching implications for software development teams, particularly those looking to enhance productivity and reduce the time spent on routine tasks. With its ability to autonomously handle complex programming challenges, Genie could potentially transform the way engineering resources are allocated, allowing teams to focus on more strategic initiatives.

“The idea of engineering resource no longer being a constraint is a huge driver for me, particularly since starting a company,” wrote Pullen. “The value of an AI colleague that can jump into an unknown codebase and solve unseen problems in timeframes orders of magnitude quicker than a human is self-evident and has huge implications for the world.”

Cosine has ambitious plans for Genie’s future development. The company intends to expand its model portfolio to include smaller models for simpler tasks and larger models capable of handling more complex challenges. Additionally, Cosine plans to extend its work into open-source communities by context-extending one of the leading open-source models and pre-training on a vast dataset.


Availability and Next Steps​


While Genie is already being rolled out to select users, broader access is still being managed.

Interested parties can apply for early access to try Genie on their projects by filling out a web form on the Cosine website.

Cosine remains committed to continuous improvement, with plans to ship regular updates to Genie’s capabilities based on customer feedback.

“SWE-Bench recently changed their submission requirements to include the full working process of AI models, which poses a challenge for us as it would require revealing proprietary methodologies,” noted Pullen. “For now, we’ve decided to keep these internal processes confidential, but we’ve made Genie’s final outputs publicly available for independent verification on GitHub.”


More on Cosine​


Cosine is a human reasoning lab focused on researching and codifying how humans perform tasks, intending to teach AI to mimic, excel at, and expand on these tasks.

Founded in 2022 by Pullen, Sam Stenner, and Yang Li, the company’s mission is to push the boundaries of AI by applying human reasoning to solve complex problems, starting with software engineering.

Cosine has already raised $2.5 million in seed funding from Uphonest and SOMA Capital, with participation from Lakestar, Focal and others.

With a small but highly skilled team, Cosine has already made significant strides in the AI field, and Genie is just the beginning.

“We truly believe that we’re able to codify human reasoning for any job and industry,” Pullen stated in the announcement blog post. “Software engineering is just the most intuitive starting point, and we can’t wait to show you everything else we’re working on.”
 


Falcon Mamba 7B’s powerful new AI architecture offers alternative to transformer models​


Shubham Sharma@mr_bumss

August 12, 2024 1:02 PM

A painting of an AI falcon


Image Credit: Venturebeat, via Ideogram



Today, Abu Dhabi-backed Technology Innovation Institute (TII), a research organization working on new-age technologies across domains like artificial intelligence, quantum computing and autonomous robotics, released a new open-source model called Falcon Mamba 7B.

Available on Hugging Face, the causal decoder-only offering uses the novel Mamba State Space Language Model (SSLM) architecture to handle various text-generation tasks and outperform leading models in its size class, including Meta’s Llama 3 8B, Llama 3.1 8B and Mistral 7B, on select benchmarks.

It comes as the fourth open model from TII after Falcon 180B, Falcon 40B and Falcon 2 but is the first in the SSLM category, which is rapidly emerging as a new alternative to transformer-based large language models (LLMs) in the AI domain.

The institute is offering the model under the ‘Falcon License 2.0,’ a permissive license based on Apache 2.0.


What does the Falcon Mamba 7B bring to the table?​


While transformer models continue to dominate the generative AI space, researchers have noted that the architecture can struggle when dealing with longer pieces of text.

Essentially, transformers’ attention mechanism, which works by comparing every word (or token) with every other word in the text to understand context, demands more computing power and memory to handle growing context windows.

If the resources are not scaled accordingly, inference slows down and eventually the model can’t handle texts beyond a certain length.
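Some rough, back-of-the-envelope arithmetic shows why: with a plain implementation (one attention head, fp16 scores, no memory-saving tricks like FlashAttention), the attention score matrix alone grows with the square of the context length.

```python
# Back-of-the-envelope arithmetic (assumptions: one attention head, fp16 scores,
# no optimizations): the n x n attention score matrix grows quadratically with
# the context length n.
BYTES_PER_SCORE = 2  # fp16

for n in (4_096, 32_768, 131_072):
    gib = n * n * BYTES_PER_SCORE / 2**30
    print(f"context {n:>7,} tokens -> ~{gib:,.2f} GiB for one head's score matrix")
```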

To overcome these hurdles, the state space language model (SSLM) architecture, which works by continuously updating a “state” as it processes words, has emerged as a promising alternative. It has already been deployed by some organizations, with TII being the latest adopter.

According to TII, its all-new Falcon model uses the Mamba SSM architecture originally proposed by researchers at Carnegie Mellon and Princeton Universities in a paper dated December 2023.

The architecture uses a selection mechanism that allows the model to dynamically adjust its parameters based on the input. This way, the model can focus on or ignore particular inputs, similar to how attention works in transformers, while processing long sequences of text, such as an entire book, without requiring additional memory or computing resources.
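The toy recurrence below is a didactic sketch of that idea, not the real Mamba kernel: the state update depends on the current input (the “selective” part), and memory stays constant no matter how long the sequence is.

```python
# Toy illustration of a *selective* state-space recurrence: the state update
# depends on the current input, and memory stays constant regardless of
# sequence length. A didactic simplification, not the actual Mamba kernel.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state = 8, 16
W_gate = rng.normal(size=(d_model,))        # input-dependent "selection" weights
A = -np.abs(rng.normal(size=(d_state,)))    # decay rates (negative for stability)
B = rng.normal(size=(d_state, d_model))
C = rng.normal(size=(d_model, d_state))

def run(sequence: np.ndarray) -> np.ndarray:
    h = np.zeros(d_state)                    # fixed-size state, independent of length
    outputs = []
    for x in sequence:                       # one token at a time
        dt = np.log1p(np.exp(x @ W_gate))    # softplus: input-dependent step size
        h = np.exp(A * dt) * h + dt * (B @ x)  # selective update: forget/write per input
        outputs.append(C @ h)
    return np.stack(outputs)

tokens = rng.normal(size=(1000, d_model))    # works the same for 1e3 or 1e6 tokens
print(run(tokens).shape)                     # (1000, 8)
```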

The approach makes the model suitable for enterprise-scale machine translation, text summarization, computer vision and audio processing tasks as well as tasks like estimation and forecasting, TII noted.


Taking on Meta, Google and Mistral​


To see how Falcon Mamba 7B fares against leading transformer models in the same size class, the institute ran a test to determine the maximum context length the models can handle when using a single 24GB Nvidia A10 GPU.

The results revealed Falcon Mamba can “fit larger sequences than SoTA transformer-based models while theoretically being able to fit infinite context length if one processes the entire context token by token, or by chunks of tokens with a size that fits on the GPU, denoted as sequential parallel.”

Falcon Mamba 7B context-length test results
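The “chunks of tokens” idea can be sketched as follows, using a toy recurrent model as a stand-in for Falcon Mamba’s real interface: each chunk is processed with the carried-over state, so peak memory depends on the chunk size rather than the document length.

```python
# Sketch of chunked ("sequential parallel") processing: feed a long input in
# fixed-size chunks and carry the recurrent state across chunks. ToyRecurrentLM
# is a stand-in, not Falcon Mamba's actual interface.
import numpy as np

class ToyRecurrentLM:
    def __init__(self, d_state: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.decay = rng.uniform(0.5, 0.99, size=d_state)
        self.d_state = d_state

    def init_state(self) -> np.ndarray:
        return np.zeros(self.d_state)

    def forward(self, chunk: np.ndarray, state: np.ndarray) -> np.ndarray:
        for x in chunk:                       # constant memory per token
            state = self.decay * state + x
        return state                          # carried forward to the next chunk

CHUNK = 2048                                  # sized to fit the GPU in practice
model = ToyRecurrentLM()
tokens = np.random.randn(100_000)             # arbitrarily long "document"
state = model.init_state()
for start in range(0, len(tokens), CHUNK):
    state = model.forward(tokens[start:start + CHUNK], state)
print(state[:4])                              # fixed-size summary of the whole context
```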

In a separate throughput test, it outperformed Mistral 7B’s efficient sliding-window attention architecture, generating all tokens at a constant speed and without any increase in CUDA peak memory.

Even in standard industry benchmarks, the new model’s performance was better than or nearly similar to that of popular transformer models as well as pure and hybrid state space models.

For instance, on the ARC, TruthfulQA and GSM8K benchmarks, Falcon Mamba 7B scored 62.03%, 53.42% and 52.54%, respectively, convincingly outperforming Llama 3 8B, Llama 3.1 8B, Gemma 7B and Mistral 7B.

However, in the MMLU and Hellaswag benchmarks, it sat closely behind all these models.

That said, this is just the beginning. As the next step, TII plans to further optimize the design of the model to improve its performance and cover more application scenarios.

“This release represents a significant stride forward, inspiring fresh perspectives and further fueling the quest for intelligent systems. At TII, we’re pushing the boundaries of both SSLM and transformer models to spark further innovation in generative AI,” Dr. Hakim Hacid, the acting chief researcher of TII’s AI cross-center unit, said in a statement.

Overall, TII’s Falcon family of language models has been downloaded more than 45 million times, making it one of the most successful LLM releases from the UAE.
 



Study suggests that even the best AI models hallucinate a bunch​


Kyle Wiggers

11:29 AM PDT • August 14, 2024


Robots work on a contract and review a legal book to illustrate AI usage in law.
Image Credits: mathisworks / Getty Images

All generative AI models hallucinate, from Google’s Gemini to Anthropic’s Claude to the latest stealth release of OpenAI’s GPT-4o. In other words, the models are unreliable narrators, sometimes to hilarious effect and other times problematically so.

But not all models make things up at the same rate. And the kinds of mistruths they spout depend on which sources of info they’ve been exposed to.

A recent study from researchers at Cornell, the universities of Washington and Waterloo and the nonprofit research institute AI2 sought to benchmark hallucinations by fact-checking models like GPT-4o against authoritative sources on topics ranging from law and health to history and geography. They found that no model performed exceptionally well across all topics, and that models that hallucinated the least did so partly because they refused to answer questions they’d otherwise get wrong.

“The most important takeaway from our work is that we cannot yet fully trust the outputs of model generations,” Wenting Zhao, a doctoral student at Cornell and a co-author on the research, told TechCrunch. “At present, even the best models can generate hallucination-free text only about 35% of the time.”

There have been other academic attempts at probing the “factuality” of models, including one by a separate AI2-affiliated team. But Zhao notes that these earlier tests asked models questions with answers easily found on Wikipedia, not exactly the toughest ask, considering most models are trained on Wikipedia data.

To make their benchmark more challenging, and to more accurately reflect the types of questions people ask of models, the researchers identified topics around the web that don’t have a Wikipedia reference. Just over half the questions in their test can’t be answered using Wikipedia (they included some Wikipedia-sourced ones for good measure), and they touch on topics including culture, geography, astronomy, pop culture, finance, medicine, computer science and celebrities.
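One rough way to approximate this kind of filtering, and it is only an assumption about the approach rather than the researchers’ actual method, is to check whether a question’s topic returns any hit from the public MediaWiki search API:

```python
# Rough sketch (not the study's actual method) of filtering out questions whose
# topic has an obvious Wikipedia match, using the public MediaWiki search API.
# The example questions, topic strings and the hit-or-miss criterion are assumptions.
import requests

WIKI_API = "https://en.wikipedia.org/w/api.php"

def has_wikipedia_page(topic: str) -> bool:
    resp = requests.get(
        WIKI_API,
        params={"action": "query", "list": "search", "srsearch": topic,
                "srlimit": 1, "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    return bool(resp.json()["query"]["search"])

questions = {
    "Who founded a small regional 1990s zine about ham radio?": "regional ham radio zine",
    "When was the Eiffel Tower completed?": "Eiffel Tower",
}
non_wiki = [q for q, topic in questions.items() if not has_wikipedia_page(topic)]
print(non_wiki)   # questions with no obvious Wikipedia reference
```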

For their study, the researchers evaluated over a dozen different popular models, many of which were released in the past year. In addition to GPT-4o, they tested “open” models such as Meta’s Llama 3 70B, Mistral’s Mixtral 8x22B and Cohere’s Command R+, as well as gated-behind-API models like Perplexity’s Sonar Large (which is based on Llama), Google’s Gemini 1.5 Pro and Anthropic’s Claude 3 Opus.

The results suggest that models aren’t hallucinating much less these days, despite claims to the contrary from OpenAI, Anthropic and the other big generative AI players.

GPT-4o and OpenAI’s much older flagship GPT-3.5 performed about the same in terms of the percentage of questions they answered factually correctly on the benchmark. (GPT-4o was marginally better.) OpenAI’s models were the least hallucinatory overall, followed by Mixtral 8x22B, Command R and Perplexity’s Sonar models.

Questions pertaining to celebrities and finance gave the models the hardest time, but questions about geography and computer science were the easiest for them to answer (perhaps because their training data contained more references to those topics). In cases where the source of an answer wasn’t Wikipedia, every model answered less factually on average (especially GPT-3.5 and GPT-4o), suggesting that they are all heavily informed by Wikipedia content.

Even models that can search the web for information, like Command R and Perplexity’s Sonar models, struggled with “non-Wiki” questions in the benchmark. Model size didn’t matter much; smaller models (e.g. Anthropic’s Claude 3 Haiku) hallucinated roughly as frequently as larger, ostensibly more capable models (e.g. Claude 3 Opus).

So what does all this mean — and where are the improvements that vendors promised?

Well, we wouldn’t put it past vendors to exaggerate their claims. But a more charitable take is that the benchmarks they’re using aren’t fit for this purpose. As we’ve written about before, many, if not most, AI evaluations are transient and devoid of important context, doomed to fall victim to Goodhart’s law.

Regardless, Zhao says that she expects the issue of hallucinations to “persist for a long time.”

“Empirical results in our paper indicate that, despite the promise of certain methods to reduce or eliminate hallucinations, the actual improvement achievable with these methods is limited,” she said. “Additionally, our analysis reveals that even the knowledge found on the internet can often be conflicting, partly because the training data — authored by humans — can also contain hallucinations.”

An interim solution could be simply programming models to refuse to answer more often — the technical equivalent to telling a know-it-all to knock it off.

In the researchers’ testing, Claude 3 Haiku answered only around 72% of the questions it was asked, choosing to abstain from the rest. When accounting for the abstentions, Claude 3 Haiku was in fact the most factual model of them all — at least in the sense that it lied least often.
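A tiny worked example with made-up numbers (not figures from the study) shows how abstention changes the denominator, and therefore which model looks most factual:

```python
# Made-up numbers to illustrate the abstention effect (not data from the study):
# a model that declines more questions can score higher among the answers it
# does give, even if it answers fewer questions overall.
def accuracy(correct: int, answered: int, total: int) -> tuple[float, float]:
    over_answered = correct / answered   # factuality among attempted answers
    over_all = correct / total           # coverage-weighted factuality
    return over_answered, over_all

# Model A answers everything; Model B abstains 28% of the time (cf. ~72% answer rate).
a = accuracy(correct=60, answered=100, total=100)   # 60.0% of answers, 60% of all
b = accuracy(correct=58, answered=72, total=100)    # ~80.6% of answers, 58% of all
print(f"Model A: {a[0]:.1%} of answers correct, {a[1]:.1%} of all questions")
print(f"Model B: {b[0]:.1%} of answers correct, {b[1]:.1%} of all questions")
```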

But will people use a model that doesn’t answer many questions? Zhao thinks not and says vendors should focus more of their time and efforts on hallucination-reducing research. Eliminating hallucinations entirely may not be possible, but they can be mitigated through human-in-the-loop fact-checking and citation during a model’s development, she asserts.

“Policies and regulations need to be developed to ensure that human experts are always involved in the process to verify and validate the information generated by generative AI models,” Zhao added. “There are still numerous opportunities to make significant impacts in this field, such as developing advanced fact-checking tools for any free text, providing citations for factual content and offering corrections for hallucinated texts.”
 