bnew

Veteran
Joined
Nov 1, 2015
Messages
55,219
Reputation
8,195
Daps
156,117

[Submitted on 9 Jan 2024]

Large Language Models for Robotics: Opportunities, Challenges, and Perspectives​

Jiaqi Wang, Zihao Wu, Yiwei Li, Hanqi Jiang, Peng Shu, Enze Shi, Huawen Hu, Chong Ma, Yiheng Liu, Xuhui Wang, Yincheng Yao, Xuan Liu, Huaqin Zhao, Zhengliang Liu, Haixing Dai, Lin Zhao, Bao Ge, Xiang Li, Tianming Liu, Shu Zhang
Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains. Notably, in the realm of robot task planning, LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions. However, for embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception. This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks. Additionally, we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions. Our results, based on diverse datasets, indicate that GPT-4V effectively enhances robot performance in embodied tasks. This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights toward bridging the gap in Human-Robot-Environment interaction.
Subjects:Robotics (cs.RO); Artificial Intelligence (cs.AI)
Cite as:arXiv:2401.04334 [cs.RO]
(or arXiv:2401.04334v1 [cs.RO] for this version)
[2401.04334] Large Language Models for Robotics: Opportunities, Challenges, and Perspectives


Submission history​


From: Yiwei Li [view email]

[v1] Tue, 9 Jan 2024 03:22:16 UTC (5,034 KB)

 


NIST releases a tool for testing AI model risk​

Kyle Wiggers

8:25 AM PDT • July 27, 2024


The National Institute of Standards and Technology (NIST), the U.S. Commerce Department agency that develops and tests tech for the U.S. government, companies and the broader public, has re-released a testbed designed to measure how malicious attacks — particularly attacks that “poison” AI model training data — might degrade the performance of an AI system.

Called Dioptra (after the classical astronomical and surveying instrument), the modular, open source web-based tool, first released in 2022, seeks to help companies training AI models — and the people using these models — assess, analyze and track AI risks. Dioptra can be used to benchmark and research models, NIST says, as well as to provide a common platform for exposing models to simulated threats in a “red-teaming” environment.

“Testing the effects of adversarial attacks on machine learning models is one of the goals of Dioptra,” NIST wrote in a press release. “The open source software, available for free download, could help the community, including government agencies and small to medium-sized businesses, conduct evaluations to assess AI developers’ claims about their systems’ performance.”
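The kind of label-flipping data poisoning Dioptra simulates can be illustrated with a toy example. This is a generic sketch of the attack idea, not Dioptra's actual interface: corrupting a fraction of training labels degrades a simple nearest-neighbour classifier's test accuracy.

```python
import random

random.seed(0)

def make_data(n):
    """Two 1-D Gaussian clusters: class 0 near -2, class 1 near +2."""
    data = []
    for _ in range(n):
        label = random.randint(0, 1)
        data.append((random.gauss(-2.0 if label == 0 else 2.0, 1.0), label))
    return data

def predict(train, x):
    """1-nearest-neighbour classifier: label of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(train, test):
    return sum(predict(train, x) == y for x, y in test) / len(test)

def poison_labels(train, frac):
    """Label-flipping attack: corrupt a random fraction of training labels."""
    poisoned = list(train)
    for i in random.sample(range(len(poisoned)), int(len(poisoned) * frac)):
        x, y = poisoned[i]
        poisoned[i] = (x, 1 - y)
    return poisoned

train, test = make_data(400), make_data(200)
clean_acc = accuracy(train, test)
poisoned_acc = accuracy(poison_labels(train, 0.45), test)
print(f"clean accuracy: {clean_acc:.2f}, poisoned accuracy: {poisoned_acc:.2f}")
```

A testbed like Dioptra runs attacks of this family against real models and reports the measured accuracy drop.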

A screenshot of Dioptra’s interface. Image Credits: NIST

Dioptra debuted alongside documents from NIST and NIST’s recently created AI Safety Institute that lay out ways to mitigate some of the dangers of AI, like how it can be abused to generate nonconsensual pornography. It follows the launch of the U.K. AI Safety Institute’s Inspect, a toolset similarly aimed at assessing the capabilities of models and overall model safety. The U.S. and U.K. have an ongoing partnership to jointly develop advanced AI model testing, announced at the U.K.’s AI Safety Summit in Bletchley Park in November of last year.

Dioptra is also the product of President Joe Biden’s executive order (EO) on AI, which mandates (among other things) that NIST help with AI system testing. The EO, relatedly, also establishes standards for AI safety and security, including requirements for companies developing models (e.g. Apple) to notify the federal government and share results of all safety tests before they’re deployed to the public.

As we’ve written about before, AI benchmarks are hard — not least of which because the most sophisticated AI models today are black boxes whose infrastructure, training data and other key details are kept under wraps by the companies creating them. A report out this month from the Ada Lovelace Institute, a U.K.-based nonprofit research institute that studies AI, found that evaluations alone aren’t sufficient to determine the real-world safety of an AI model in part because current policies allow AI vendors to selectively choose which evaluations to conduct.

NIST doesn’t assert that Dioptra can completely de-risk models. But the agency does propose that Dioptra can shed light on which sorts of attacks might make an AI system perform less effectively and quantify this impact on performance.

In a major limitation, however, Dioptra only works out-of-the-box on models that can be downloaded and used locally, like Meta’s expanding Llama family. Models gated behind an API, such as OpenAI’s GPT-4o, are a no-go — at least for the time being.
 








1/11
>>Continuous Learning Model (CLM) by Topology<<

The CLM is a new model that remembers interactions, learns skills autonomously, and thinks in its free time, just like humans.

The CLM just wants to learn.

Try it at Topology Chat

2/11
LLMs are stateless.
>CLM remembers and references all chats

LLMs don’t have an inner-life.
>CLM forms ideas by mulling over memories in its free time

LLMs have no soul.
>CLM actively organizes memories/ideas, granting it an emergent personality

3/11
CLM is a drop-in replacement for existing LLMs. Change one line of code and get continuous learning.

It simply learns content provided in user messages. Topology’s CLM eliminates RAG, GPTs, reranking, simple agents, and fine-tuning.

Prices start at $2/$10 per 1M input/output.

4/11
Unlike ChatGPT’s memory (which stores occasional short snippets), CLM supports billions of memories.

A novel algorithm bridges the gap between pre-training and in-context learning. Not all memories are created equal: CLM prioritizes the most novel, central, and useful ideas.

5/11
Check out our docs to learn more!
CLM Docs | Notion

6/11
Yo what sick! the endpoint is OpenAI compatible?? gonna get this in backrooms

7/11
yes drop-in replacement for gpt-4. you can provide an optional 'partition_id' to control the memory
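Since the thread only says the endpoint is OpenAI-compatible and accepts an optional `partition_id` to scope memory, a request might be shaped like the sketch below; the base URL, model name, and exact field placement are assumptions, not documented values.

```python
import json

# Hypothetical values -- the thread confirms only OpenAI compatibility
# and the 'partition_id' field; everything else is a placeholder.
BASE_URL = "https://api.topology.example/v1/chat/completions"

def build_clm_request(messages, partition_id=None):
    """Build an OpenAI-style chat payload; partition_id scopes the memory."""
    payload = {
        "model": "clm",  # placeholder model name
        "messages": messages,
    }
    if partition_id is not None:
        payload["partition_id"] = partition_id
    return json.dumps(payload)

req = build_clm_request(
    [{"role": "user", "content": "Remember that my favorite color is teal."}],
    partition_id="user-42",
)
print(req)
```

The "change one line of code" claim would then amount to swapping the base URL and model name in an existing OpenAI client.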

8/11
sirs you can't just casually drop AGI like this 😭😭😭

9/11
just did 💅🏼

10/11
Tried to sign up it said email rate limit exceeded

11/11
working on it haha. try google for now


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

 


Introducing SAM 2: The next generation of Meta Segment Anything Model for videos and images

July 29, 2024


Takeaways:


  • Following up on the success of the Meta Segment Anything Model (SAM) for images, we’re releasing SAM 2, a unified model for real-time promptable object segmentation in images and videos that achieves state-of-the-art performance.

  • In keeping with our approach to open science, we’re sharing the code and model weights with a permissive Apache 2.0 license.

  • We’re also sharing the SA-V dataset, which includes approximately 51,000 real-world videos and more than 600,000 masklets (spatio-temporal masks).

  • SAM 2 can segment any object in any video or image—even for objects and visual domains it has not seen previously, enabling a diverse range of use cases without custom adaptation.

  • SAM 2 has many potential real-world applications. For example, the outputs of SAM 2 can be used with a generative video model to create new video effects and unlock new creative applications. SAM 2 could also aid in faster annotation tools for visual data to build better computer vision systems.

A preview of the SAM 2 web-based demo, which allows segmenting and tracking objects in video and applying effects.

Today, we’re announcing the Meta Segment Anything Model 2 (SAM 2), the next generation of the Meta Segment Anything Model, now supporting object segmentation in videos and images. We’re releasing SAM 2 under an Apache 2.0 license, so anyone can use it to build their own experiences. We’re also sharing SA-V, the dataset we used to build SAM 2 under a CC BY 4.0 license and releasing a web-based demo experience where everyone can try a version of our model in action.

Object segmentation—identifying the pixels in an image that correspond to an object of interest—is a fundamental task in the field of computer vision. The Meta Segment Anything Model (SAM) released last year introduced a foundation model for this task on images.

Our latest model, SAM 2, is the first unified model for real-time, promptable object segmentation in images and videos, enabling a step-change in the video segmentation experience and seamless use across image and video applications. SAM 2 exceeds previous capabilities in image segmentation accuracy and achieves better video segmentation performance than existing work, while requiring three times less interaction time. SAM 2 can also segment any object in any video or image (commonly described as zero-shot generalization), which means that it can be applied to previously unseen visual content without custom adaptation.

Before SAM was released, creating an accurate object segmentation model for specific image tasks required highly specialized work by technical experts with access to AI training infrastructure and large volumes of carefully annotated in-domain data. SAM revolutionized this space, enabling application to a wide variety of real-world image segmentation and out-of-the-box use cases via prompting techniques—similar to how large language models can perform a range of tasks without requiring custom data or expensive adaptations.

In the year since we launched SAM, the model has made a tremendous impact across disciplines. It has inspired new AI-enabled experiences in Meta’s family of apps, such as Backdrop and Cutouts on Instagram, and catalyzed diverse applications in science, medicine, and numerous other industries. Many of the largest data annotation platforms have integrated SAM as the default tool for object segmentation annotation in images, saving millions of hours of human annotation time. SAM has also been used in marine science to segment Sonar images and analyze coral reefs, in satellite imagery analysis for disaster relief, and in the medical field, segmenting cellular images and aiding in detecting skin cancer.

As Mark Zuckerberg noted in an open letter last week, open source AI “has more potential than any other modern technology to increase human productivity, creativity, and quality of life,” all while accelerating economic growth and advancing groundbreaking medical and scientific research. We’ve been tremendously impressed by the progress the AI community has made using SAM, and we envisage SAM 2 unlocking even more exciting possibilities.

SAM 2 can be applied out of the box to a diverse range of real-world use cases—for example, tracking objects to create video effects (left) or segmenting moving cells in videos captured from a microscope to aid in scientific research (right).

In keeping with our open science approach, we’re sharing our research on SAM 2 with the community so they can explore new capabilities and use cases. The artifacts we’re sharing today include:


  • The SAM 2 code and weights, which are being open sourced under a permissive Apache 2.0 license. We’re sharing our SAM 2 evaluation code under a BSD-3 license.

  • The SA-V dataset, which has 4.5 times more videos and 53 times more annotations than the existing largest video segmentation dataset. This release includes ~51k real-world videos with more than 600k masklets. We’re sharing SA-V under a CC BY 4.0 license.

  • A web demo, which enables real-time interactive segmentation of short videos and applies video effects on the model predictions.

As a unified model, SAM 2 can power use cases seamlessly across image and video data and be extended to previously unseen visual domains. For the AI research community and others, SAM 2 could be a component as part of a larger AI system for a more general multimodal understanding of the world. In industry, it could enable faster annotation tools for visual data to train the next generation of computer vision systems, such as those used in autonomous vehicles. SAM 2’s fast inference capabilities could inspire new ways of selecting and interacting with objects in real time or live video. For content creators, SAM 2 could enable creative applications in video editing and add controllability to generative video models. SAM 2 could also be used to aid research in science and medicine—for example, tracking endangered animals in drone footage or localizing regions in a laparoscopic camera feed during a medical procedure. We believe the possibilities are broad, and we’re excited to share this technology with the AI community to see what they build and learn.

How we built SAM 2

SAM was able to learn a general notion of what objects are in images. However, images are only a static snapshot of the dynamic real world in which visual segments can exhibit complex motion. Many important real-world use cases require accurate object segmentation in video data, for example in mixed reality, robotics, autonomous vehicles, and video editing. We believe that a universal segmentation model should be applicable to both images and video.

An image can be considered a very short video with a single frame. We adopt this perspective to develop a unified model that supports both image and video input seamlessly. The only difference in handling video is that the model needs to rely on memory to recall previously processed information for that video in order to accurately segment an object at the current timestep.

Successful segmentation of objects in video requires an understanding of where entities are across space and time. Compared to segmentation in images, videos present significant new challenges. Object motion, deformation, occlusion, lighting changes, and other factors can drastically change from frame to frame. Videos are often lower quality than images due to camera motion, blur, and lower resolution, adding to the difficulty. As a result, existing video segmentation models and datasets have fallen short in providing a comparable “segment anything” capability for video. We solved many of these challenges in our work to build SAM 2 and the new SA-V dataset.

Similar to the methodology we used for SAM, our research on enabling video segmentation capabilities involves designing a new task, a model, and a dataset. We first develop the promptable visual segmentation task and design a model (SAM 2) capable of performing this task. We use SAM 2 to aid in creating a video object segmentation dataset (SA-V), which is an order of magnitude larger than anything that exists currently, and use this to train SAM 2 to achieve state-of-the-art performance.

Promptable visual segmentation

SAM 2 supports selecting and refining objects in any video frame.

We design a promptable visual segmentation task that generalizes the image segmentation task to the video domain. SAM was trained to take as input points, boxes, or masks in an image to define the target object and predict a segmentation mask. With SAM 2, we train it to take input prompts in any frame of a video to define the spatio-temporal mask (i.e. a “masklet”) to be predicted. SAM 2 makes an immediate prediction of the mask on the current frame based on the input prompt and temporally propagates it to generate the masklet of the target object across all video frames. Once an initial masklet has been predicted, it can be iteratively refined by providing additional prompts to SAM 2 in any frame. This can be repeated as many times as required until the desired masklet is obtained.
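The click-in, mask-out interface described above can be mimicked with a toy example. The flood fill below is only a stand-in for SAM 2's learned mask decoder: a point prompt selects the connected region containing that pixel.

```python
# Toy illustration of point-prompted segmentation: a click selects the
# connected region containing that pixel. Real SAM 2 uses a learned model;
# this flood fill only mimics the prompt -> mask interface.
def segment_from_click(frame, click):
    h, w = len(frame), len(frame[0])
    target = frame[click[0]][click[1]]
    mask, stack = set(), [click]
    while stack:
        r, c = stack.pop()
        if (r, c) in mask or not (0 <= r < h and 0 <= c < w):
            continue
        if frame[r][c] != target:
            continue
        mask.add((r, c))
        stack += [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
    return mask

frame = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
mask = segment_from_click(frame, (0, 2))  # click lands on the "object" (the 1s)
print(sorted(mask))
```

In the video setting, SAM 2 additionally propagates the predicted mask forward in time, and later clicks refine the resulting masklet rather than restarting it.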

Image and video segmentation in a unified architecture

The evolution of the architecture from SAM to SAM 2.
 

The SAM 2 architecture can be seen as a generalization of SAM from the image to the video domain. SAM 2 can be prompted by clicks (positive or negative), bounding boxes, or masks to define the extent of the object in a given frame. A lightweight mask decoder takes an image embedding for the current frame and encoded prompts to output a segmentation mask for the frame. In the video setting, SAM 2 propagates this mask prediction to all video frames to generate a masklet. Prompts can then be iteratively added on any subsequent frame to refine the masklet prediction.

To predict masks accurately across all video frames, we introduce a memory mechanism consisting of a memory encoder, a memory bank, and a memory attention module. When applied to images, the memory components are empty and the model behaves like SAM. For video, the memory components enable storing information about the object and previous user interactions in that session, allowing SAM 2 to generate masklet predictions throughout the video. If there are additional prompts provided on other frames, SAM 2 can effectively correct its predictions based on the stored memory context of the object.

Memories of frames are created by the memory encoder based on the current mask prediction and placed in the memory bank for use in segmenting subsequent frames. The memory bank consists of both memories from previous frames and prompted frames. The memory attention operation takes the per-frame embedding from the image encoder and conditions it on the memory bank to produce an embedding that is then passed to the mask decoder to generate the mask prediction for that frame. This is repeated for all subsequent frames.

We adopt a streaming architecture, which is a natural generalization of SAM to the video domain, processing video frames one at a time and storing information about the segmented objects in the memory. On each newly processed frame, SAM 2 uses the memory attention module to attend to the previous memories of the target object. This design allows for real-time processing of arbitrarily long videos, which is important not only for annotation efficiency in collecting the SA-V dataset but also for real-world applications—for example, in robotics.
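The streaming memory design can be sketched as a toy loop: each frame's embedding attends over a bounded bank of past memories before the frame is itself stored. Dimensions and the FIFO bank size here are illustrative; SAM 2's real memory encoder and memory attention are learned modules.

```python
import math
from collections import deque

def attend(query, memory):
    """Single-head dot-product attention of one query over a memory bank."""
    if not memory:
        return query  # empty memory: behaves like the image-only case
    scores = [sum(q * k for q, k in zip(query, m)) for m in memory]
    mx = max(scores)
    weights = [math.exp(s - mx) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * m[i] for w, m in zip(weights, memory))
        for i in range(len(query))
    ]

# Streaming loop: condition each frame's embedding on past memories,
# then push the frame into a bounded FIFO memory bank.
memory_bank = deque(maxlen=4)
frames = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]
for emb in frames:
    conditioned = attend(emb, list(memory_bank))
    memory_bank.append(emb)  # a real system stores an encoded memory instead
print(len(memory_bank), conditioned)
```

The bounded bank is what makes arbitrarily long videos tractable: per-frame cost depends on the bank size, not the video length.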

SAM introduced the ability to output multiple valid masks when faced with ambiguity about the object being segmented in an image. For example, when a person clicks on the tire of a bike, the model can interpret this click as referring to only the tire or the entire bike and output multiple predictions. In videos, this ambiguity can extend across video frames. For example, if in one frame only the tire is visible, a click on the tire might relate to just the tire, or as more of the bike becomes visible in subsequent frames, this click could have been intended for the entire bike. To handle this ambiguity, SAM 2 creates multiple masks at each step of the video. If further prompts don’t resolve the ambiguity, the model selects the mask with the highest confidence for further propagation in the video.
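The fallback rule described here amounts to an argmax over candidate confidences (the masks and scores below are made up for illustration):

```python
# If further prompts never resolve the tire-vs-bike ambiguity, propagate
# the candidate mask with the highest confidence. Names/scores are invented.
candidates = [
    {"name": "tire", "confidence": 0.62},
    {"name": "bike", "confidence": 0.87},
]
best = max(candidates, key=lambda m: m["confidence"])
print(best["name"])
```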

The occlusion head in the SAM 2 architecture is used to predict if an object is visible or not, helping segment objects even when they become temporarily occluded.

In the image segmentation task, there is always a valid object to segment in a frame given a positive prompt. In video, it’s possible for no valid object to exist on a particular frame, for example due to the object becoming occluded or disappearing from view. To account for this new output mode, we add an additional model output (“occlusion head”) that predicts whether the object of interest is present on the current frame. This enables SAM 2 to effectively handle occlusions.

SA-V: Building the largest video segmentation dataset

Videos and masklet annotations from the SA-V dataset.

One of the challenges of extending the “segment anything” capability to video is the limited availability of annotated data for training the model. Current video segmentation datasets are small and lack sufficient coverage of diverse objects. Existing dataset annotations typically cover entire objects (e.g., person), but lack object parts (e.g., person’s jacket, hat, shoes), and datasets are often centered around specific object classes, such as people, vehicles, and animals.

To collect a large and diverse video segmentation dataset, we built a data engine, leveraging an interactive model-in-the-loop setup with human annotators. Annotators used SAM 2 to interactively annotate masklets in videos, and then the newly annotated data was used to update SAM 2 in turn. We repeated this cycle many times to iteratively improve both the model and dataset. Similar to SAM, we do not impose semantic constraints on the annotated masklets and focus on both whole objects (e.g., a person) and object parts (e.g., a person’s hat).

With SAM 2, collecting new video object segmentation masks is faster than ever before. Annotation with our tool and SAM 2 in the loop is approximately 8.4 times faster than using SAM per frame and also significantly faster than combining SAM with an off-the-shelf tracker.

Our released SA-V dataset contains over an order of magnitude more annotations and approximately 4.5 times more videos than existing video object segmentation datasets.

Highlights of the SA-V dataset include:


  • More than 600,000 masklet annotations on approximately 51,000 videos.

  • Videos featuring geographically diverse, real-world scenarios, collected across 47 countries.

  • Annotations that cover whole objects, object parts, and challenging instances where objects become occluded, disappear, and reappear.

Results

Both models are initialized with the mask of the t-shirt in the first frame. For the baseline, we use the mask from SAM. SAM 2 is able to track object parts accurately throughout a video, compared to the baseline which over-segments and includes the person’s head instead of only tracking the t-shirt.

To create a unified model for image and video segmentation, we jointly train SAM 2 on image and video data by treating images as videos with a single frame. We leverage the SA-1B image dataset released last year as part of the Segment Anything project, the SA-V dataset, and an additional internal licensed video dataset.



SAM 2 (right) improves on SAM’s (left) object segmentation accuracy in images.

Key highlights that we detail in our research paper include:


  • SAM 2 significantly outperforms previous approaches on interactive video segmentation across 17 zero-shot video datasets and requires approximately three times fewer human-in-the-loop interactions.

  • SAM 2 outperforms SAM on its 23 dataset zero-shot benchmark suite, while being six times faster.

  • SAM 2 excels at existing video object segmentation benchmarks (DAVIS, MOSE, LVOS, YouTube-VOS) compared to prior state-of-the-art models.

  • Inference with SAM 2 feels real-time at approximately 44 frames per second.

  • SAM 2 in the loop for video segmentation annotation is 8.4 times faster than manual per-frame annotation with SAM.

It’s important that we work to build AI experiences that work well for everyone. In order to measure the fairness of SAM 2, we conducted an evaluation on model performance across certain demographic groups. Our results show that the model has minimal performance discrepancy in video segmentation on perceived gender and little variance among the three perceived age groups we evaluated: ages 18 – 25, 26 – 50, and 50+.

Limitations

While SAM 2 demonstrates strong performance for segmenting objects in images and short videos, the model performance can be further improved—especially in challenging scenarios.

SAM 2 may lose track of objects across drastic camera viewpoint changes, after long occlusions, in crowded scenes, or in extended videos. We alleviate this issue in practice by designing the model to be interactive and enabling manual intervention with correction clicks in any frame so the target object can be recovered.

SAM 2 can sometimes confuse multiple similar looking objects in crowded scenes.

When the target object is only specified in one frame, SAM 2 can sometimes confuse objects and fail to segment the target correctly, as shown with the horses in the above video. In many cases, with additional refinement prompts in future frames, this issue can be entirely resolved and the correct masklet can be obtained throughout the video.

While SAM 2 supports the ability to segment multiple individual objects simultaneously, the efficiency of the model decreases considerably. Under the hood, SAM 2 processes each object separately, utilizing only shared per-frame embeddings, without inter-object communication. While this simplifies the model, incorporating shared object-level contextual information could aid in improving efficiency.

SAM 2 predictions can miss fine details in fast moving objects.

For complex fast-moving objects, SAM 2 can sometimes miss fine details, and the predictions can be unstable across frames (as shown in the video of the cyclist above). Adding further prompts to refine the prediction in the same frame or additional frames can only partially alleviate this problem. During training we do not enforce any penalty on the model predictions if they jitter between frames, so temporal smoothness is not guaranteed. Improving this capability could facilitate real-world applications that require detailed localization of fine structures.

While our data engine uses SAM 2 in the loop and we’ve made significant strides in automatic masklet generation, we still rely on human annotators for some steps such as verifying masklet quality and selecting frames that require correction. Future developments could include further automating the data annotation process to enhance efficiency.

There’s still plenty more work to be done to propel this research even further. We hope the AI community will join us by building with SAM 2 and the resources we’ve released. Together, we can accelerate open science to build powerful new experiences and use cases that benefit people and society.

Putting SAM 2 to work

While many of Meta FAIR’s models used in public demos are hosted on Amazon SageMaker, the session-based requirements of the SAM 2 model pushed up against the boundaries of what our team believed was previously possible on AWS AI Infra. Thanks to the advanced model deployment and managed inference capabilities offered by Amazon SageMaker, we’ve been able to make the SAM 2 release possible—focusing on building state of the art AI models and unique AI demo experiences.



In the future, SAM 2 could be used as part of a larger AI system to identify everyday items via AR glasses that could prompt users with reminders and instructions.

We encourage the AI community to download the model, use the dataset, and try our demo. By sharing this research, we hope to contribute to accelerating progress in universal video and image segmentation and related perception tasks. We look forward to seeing the new insights and useful experiences that will be created by releasing this research to the community.
Download the model

Get the dataset

Read the paper

Try the demo

Visit the SAM 2 website


 










1/11
Today we are releasing Gen-3 Alpha Image to Video. This update allows you to use any image as the first frame of your video generation, either on its own or with a text prompt for additional guidance.

Image to Video is a major update that greatly improves the artistic control and consistency of your generations. See more below.

(1/10)

2/11
2/10

3/11
3/10

4/11
4/10

5/11
5/10

6/11
6/10

7/11
7/10

8/11
8/10

9/11
9/10

10/11
10/10

11/11
Image created with Radiant Creative EasyAi software. DM for more info


 







1/12
SearchGPT widgets look nice

It tends to focus on the results strongly, so sometimes I need to tell it exactly what I want it to do rather than what I want from the web

2/12
Some random prompts

3/12
Btw - does it support maps and news? 🤔

“Latest news from TestingCatalog” 👀👀👀

4/12
It can give you local knowledge via your IP or precise location (the latter is kept off by default and optional in settings) - stuff like "cinema near me" works nice

5/12
Mind trying out its agentic-search-explanatory capacity, e.g. like this:

6/12


7/12
Can you try searching for some articles behind paywall, whose companies have signed partnerships with OAI recently

8/12


9/12
Can you try searching "Yandex monthly active users"

I want to see, when it can't find the exact answer I want, if it will acknowledge that it found DAU but not MAU, or if it will just act obtuse like copilot and copy and paste the whole search results and ignore my actual query

10/12


11/12
Why is it searching for best places instead of time?

12/12
tends to focus on the results more from most of my testing, I'm representing actual experience right now as opposed to cherry picking results

It did say "a weekday" at the end I guess lol, that and I did ask for locations


 

bnew



OpenAI releases ChatGPT’s hyperrealistic voice to some paying users​


Maxwell Zeff

11:30 AM PDT • July 30, 2024


OpenAI unveils ChatGPT Advanced Voice Mode in May 2024.
Image Credits: OpenAI

OpenAI began rolling out ChatGPT’s Advanced Voice Mode on Tuesday, giving users their first access to GPT-4o’s hyperrealistic audio responses. The alpha version will be available to a small group of ChatGPT Plus users today, and OpenAI says the feature will gradually roll out to all Plus users in the fall of 2024.

When OpenAI first showcased GPT-4o’s voice in May, the feature shocked audiences with quick responses and an uncanny resemblance to a real human’s voice – one in particular. The voice, Sky, resembled that of Scarlett Johansson, the actress behind the artificial assistant in the movie “Her.” Soon after OpenAI’s demo, Johansson said she refused multiple inquiries from CEO Sam Altman to use her voice, and after seeing GPT-4o’s demo, hired legal counsel to defend her likeness. OpenAI denied using Johansson’s voice, but later removed the voice shown in its demo. In June, OpenAI said it would delay the release of Advanced Voice Mode to improve its safety measures.

One month later, and the wait is over (sort of). OpenAI says the video and screensharing capabilities showcased during its Spring Update will not be part of this alpha, launching at a “later date.” For now, the GPT-4o demo that blew everyone away is still just a demo, but some premium users will now have access to ChatGPT’s voice feature shown there.


ChatGPT can now talk and listen​


You may have already tried out the Voice Mode currently available in ChatGPT, but OpenAI says Advanced Voice Mode is different. ChatGPT’s old solution to audio used three separate models: one to convert your voice to text, GPT-4 to process your prompt, and then a third to convert ChatGPT’s text into voice. But GPT-4o is multimodal, capable of processing these tasks without the help of auxiliary models, creating significantly lower latency conversations. OpenAI also claims GPT-4o can sense emotional intonations in your voice, including sadness, excitement or singing.
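The cascaded design the article describes can be sketched with three stubbed stages. The function names and stub bodies below are illustrative stand-ins, not OpenAI's actual components; the point is that each hop adds latency and discards whatever the transcript can't capture (tone, singing) before the language model ever sees it.

```python
# Hypothetical stand-ins for the three separate models described above:
# speech-to-text, a text-only LLM, and text-to-speech.

def speech_to_text(audio: bytes) -> str:
    # stub: pretend transcription of the user's audio
    return audio.decode("utf-8")

def llm_respond(prompt: str) -> str:
    # stub: pretend text-only GPT-4 reply
    return f"Echo: {prompt}"

def text_to_speech(text: str) -> bytes:
    # stub: pretend audio synthesis of the reply
    return text.encode("utf-8")

def cascaded_voice_mode(audio: bytes) -> bytes:
    """Old Voice Mode pipeline: three models chained, latency compounding
    at each stage."""
    return text_to_speech(llm_respond(speech_to_text(audio)))
```

A single multimodal model like GPT-4o consumes the audio directly, which is why OpenAI can claim both lower latency and sensitivity to emotional intonation.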

In this pilot, ChatGPT Plus users will get to see firsthand how hyperrealistic OpenAI's Advanced Voice Mode really is. TechCrunch was unable to test the feature before publishing this article, but we will review it when we get access.

OpenAI says it’s releasing ChatGPT’s new voice gradually to closely monitor its usage. People in the alpha group will get an alert in the ChatGPT app, followed by an email with instructions on how to use it.

In the months since OpenAI’s demo, the company says it tested GPT-4o’s voice capabilities with more than 100 external red teamers who speak 45 different languages. OpenAI says a report on these safety efforts is coming in early August.

The company says Advanced Voice Mode will be limited to ChatGPT’s four preset voices – Juniper, Breeze, Cove and Ember – made in collaboration with paid voice actors. The Sky voice shown in OpenAI’s May demo is no longer available in ChatGPT. OpenAI spokesperson Lindsay McCallum says “ChatGPT cannot impersonate other people’s voices, both individuals and public figures, and will block outputs that differ from one of these preset voices.”

OpenAI is trying to avoid deepfake controversies. In January, AI startup ElevenLabs’s voice cloning technology was used to impersonate President Biden, deceiving primary voters in New Hampshire.

OpenAI also says it introduced new filters to block certain requests to generate music or other copyrighted audio. In the last year, AI companies have landed themselves in legal trouble for copyright infringement, and audio models like GPT-4o open up a whole new category of companies that can file complaints. Record labels in particular have a history of being litigious, and have already sued the AI song generators Suno and Udio.
 

bnew


[Submitted on 12 Jul 2024]

Human-like Episodic Memory for Infinite Context LLMs​


Zafeirios Fountas, Martin A Benfeghoul, Adnan Oomerjee, Fenia Christopoulou, Gerasimos Lampouras, Haitham Bou-Ammar, Jun Wang

Large language models (LLMs) have shown remarkable capabilities, but still struggle with processing extensive contexts, limiting their ability to maintain coherence and accuracy over long sequences. In contrast, the human brain excels at organising and retrieving episodic experiences across vast temporal scales, spanning a lifetime. In this work, we introduce EM-LLM, a novel approach that integrates key aspects of human episodic memory and event cognition into LLMs, enabling them to effectively handle practically infinite context lengths while maintaining computational efficiency. EM-LLM organises sequences of tokens into coherent episodic events using a combination of Bayesian surprise and graph-theoretic boundary refinement in an on-line fashion. When needed, these events are retrieved through a two-stage memory process, combining similarity-based and temporally contiguous retrieval for efficient and human-like access to relevant information. Experiments on the LongBench dataset demonstrate EM-LLM's superior performance, outperforming the state-of-the-art InfLLM model with an overall relative improvement of 4.3% across various tasks, including a 33% improvement on the PassageRetrieval task. Furthermore, our analysis reveals strong correlations between EM-LLM's event segmentation and human-perceived events, suggesting a bridge between this artificial system and its biological counterpart. This work not only advances LLM capabilities in processing extended contexts but also provides a computational framework for exploring human memory mechanisms, opening new avenues for interdisciplinary research in AI and cognitive science.

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Cite as:arXiv:2407.09450 [cs.AI]
(or arXiv:2407.09450v1 [cs.AI] for this version)
[2407.09450] Human-like Episodic Memory for Infinite Context LLMs

Submission history​

From: Zafeirios Fountas PhD [view email]

[v1] Fri, 12 Jul 2024 17:34:03 UTC (313 KB)




A.I generated summary:

### Human-like Episodic Memory for Infinite Context LLMs

Imagine you have a super-smart computer program that can understand and remember lots of information, just like humans do. This program is called a Large Language Model (LLM). However, these LLMs have a big problem: they struggle to remember and use lots of information over long periods of time. This makes it hard for them to stay coherent and accurate when dealing with long sequences of data.

In contrast, humans are great at organizing and recalling personal experiences across their entire lives. Our brains break down continuous experiences into discrete events, which are stored in long-term memory. When we need to remember something, our brains use these event boundaries as access points.

This research introduces a new approach called EM-LLM (Episodic Memory Large Language Model), which integrates key aspects of human episodic memory into LLMs. EM-LLM helps these models handle practically infinite context lengths while staying computationally efficient.

#### How EM-LLM Works

1. **Event Segmentation**: EM-LLM breaks down sequences of tokens (pieces of text) into coherent episodic events using two main steps:
- **Surprise-Based Segmentation**: It identifies important moments where the model is "surprised" by new information.
- **Graph-Theoretic Boundary Refinement**: It refines these boundaries to ensure that each event is cohesive and distinct from others.

2. **Memory Retrieval**: When needed, EM-LLM retrieves these events through a two-stage process:
- **Similarity-Based Retrieval**: It selects relevant events based on how similar they are to the current query.
- **Temporal Contiguity Buffer**: It also considers the temporal order of events to maintain context.
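The surprise-based step can be sketched in a few lines, assuming we already have per-token log-probabilities from the model. The global mean/std threshold here is a simplification of the paper's method, and the graph-theoretic boundary refinement is omitted entirely:

```python
import math

def surprise_boundaries(token_logprobs, gamma=1.0):
    """Mark an event boundary wherever token surprise (-log p) exceeds the
    mean surprise by gamma standard deviations. Simplified sketch: EM-LLM
    uses online statistics and then refines these boundaries."""
    surprises = [-lp for lp in token_logprobs]
    mean = sum(surprises) / len(surprises)
    std = math.sqrt(sum((s - mean) ** 2 for s in surprises) / len(surprises))
    return [i for i, s in enumerate(surprises) if s > mean + gamma * std]

# A run of predictable tokens with one very unlikely token in the middle:
# the surprising token is where a new "event" begins.
```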
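The two-stage retrieval can likewise be sketched with plain lists standing in for event embeddings. The `buffer` parameter below is a simplified stand-in for the paper's temporal contiguity buffer:

```python
def retrieve_events(query_vec, event_vecs, k=2, buffer=1):
    """Two-stage sketch: (1) rank stored events by dot-product similarity
    to the query and keep the top k; (2) also pull in each hit's temporal
    neighbours, mimicking contiguity effects in human recall."""
    sims = [(sum(q * e for q, e in zip(query_vec, v)), i)
            for i, v in enumerate(event_vecs)]
    hits = set()
    for _, i in sorted(sims, reverse=True)[:k]:
        for j in range(max(0, i - buffer), min(len(event_vecs), i + buffer + 1)):
            hits.add(j)
    return sorted(hits)
```

With `buffer=0` this degenerates to pure similarity search; the neighbours are what give the retrieval its "episodic" flavour.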

#### Experiments

The researchers tested EM-LLM on various tasks using the LongBench dataset and compared its performance with another state-of-the-art model called InfLLM. The results showed that EM-LLM outperformed InfLLM by an overall relative improvement of 4.3%, with significant improvements in specific tasks like PassageRetrieval (33% improvement).

#### Key Findings

- The segmentation method used in EM-LLM correlates strongly with how humans perceive events.
- The combination of surprise-based segmentation and boundary refinement significantly enhances performance on long-context tasks.
- The use of both similarity-based and temporally contiguous retrieval mechanisms mimics human memory retrieval patterns.

#### Future Directions

This work opens up new avenues for interdisciplinary research between AI and cognitive science:
- Exploring more sophisticated segmentation algorithms for better performance.
- Extending this approach to enable imagination and future thinking in LLMs.
- Integrating modality-specific buffers for multi-modal tasks.

In summary, EM-LLM represents a significant advancement in language models by integrating human-like episodic memory mechanisms, enabling them to handle extended contexts efficiently and effectively.
 

bnew


1/7
DeepSeek is the level of cracked when a quant finance firm from China is forced to work on AI cuz the government says they're not contributing to society
Imagine if all the perf guys at US quants worked on useful important stuff instead of making markets 0.1 second more efficient

2/7
Liang Wenfeng(deepseek/High flyer founder) was AGI pilled since 2008
Read this interview:
揭秘DeepSeek:一个更极致的中国技术理想主义故事 ("Demystifying DeepSeek: an even more extreme story of Chinese technological idealism")

They also talk about you

3/7
Ye I read it. P cool. Llama 3 paper is better now though maybe

4/7
Renaissance makes too much. Probably higher ROI for them to keep doing quant math

5/7
do you believe that efficient markets are important for facilitating investment in the US? most of my CN friends just invest into real estate which seems like a massive inefficiency in their private sector.

6/7
Is the American financial industry wasting intelligence?

7/7
lol probably it’s because there’s not much money to make in China’s financial market. Derivatives are limited and the equity market never rises. The government always blames quant firms to cover their incompetence. Maybe they are just better off doing AI research …




1/11
🎉Exciting news! DeepSeek API now launches context caching on disk, with no code changes required! This new feature automatically caches frequently referenced contexts on distributed storage, slashing API costs by up to 90%. For a 128K prompt with high reference, the first token latency is cut from 13s to just 500ms.

Benefit Scenarios:
- Multi-turn conversations where subsequent rounds hit the previous context cache.
- Data analysis with recurring queries on the same documents/files.
- Code analysis/debugging with repeated repository references.

DeepSeek API disk caching is now live with unlimited concurrency. Billing is automatic based on actual cache hits. Learn more at DeepSeek API introduces Context Caching on Disk, cutting prices by an order of magnitude | DeepSeek API Docs
#DeepSeek #ContextCaching
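The claimed order-of-magnitude savings follow from billing cache-hit tokens at a fraction of the normal input rate. The per-million-token price and the 0.1× cache discount below are illustrative assumptions, not DeepSeek's actual rate card:

```python
def prompt_cost(tokens: int, cached_tokens: int,
                price_per_m: float = 0.14, cache_discount: float = 0.1) -> float:
    """Estimated input cost in dollars when cached tokens are billed at a
    discounted rate. Prices here are hypothetical placeholders."""
    uncached = tokens - cached_tokens
    cached_rate = price_per_m * cache_discount
    return (uncached * price_per_m + cached_tokens * cached_rate) / 1e6

# A 128K prompt that hits the cache for 120K tokens costs a small
# fraction of a fully cold prompt.
cold = prompt_cost(128_000, 0)
warm = prompt_cost(128_000, 120_000)
```

Since caching is automatic and billed on actual hits, multi-turn conversations and repeated document analysis get this discount without any code changes.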

2/11
It seems that a response issue has appeared after the update. 500s is crazy.

3/11
The issue has been fixed, and service is now restored.

4/11
interesting feature for cost savings

i've started using gpt4 mini to reduce costs but would love to stick with more expensive models on http://microlaunch.net

eventually w/ context caching

does deepseek also handle prompt compression?

5/11
This is awesome 👏

6/11
I ❤️‍🔥you (I'm literally just a whale) 🐋

7/11
Are the OpenRouter API versions of DeepSeek V2 and Coder V2 running the 0628 update?

8/11
I don't have a Chinese phone number, so how can I use it?

9/11
wait so... this just automatically happens if you're using the deepseek models now? only through deepseek, or through openrouter/deekseek too? i'd love more clarification - amazing nonetheless!

10/11
The pride of homegrown tech! So proud of you! 🥳🥳🥳

11/11
This is amazing. Will it work for openrouter also?


 