Large Language Models News & Discussions

bnew · Feb 26, 2024

Google Colab

colab.research.google.com

𝐖𝐢𝐭𝐡 𝐬𝐨 𝐦𝐚𝐧𝐲 𝐦𝐨𝐝𝐞𝐥𝐬 𝐜𝐨𝐦𝐢𝐧𝐠 𝐨𝐮𝐭 𝐢𝐧 𝐭𝐡𝐞 𝐩𝐚𝐬𝐭 𝐟𝐞𝐰 𝐝𝐚𝐲𝐬, 𝐰𝐞 𝐜𝐨𝐮𝐥𝐝𝐧'𝐭 𝐩𝐚𝐬𝐬 𝐨𝐧 𝐚 𝐜𝐡𝐚𝐧𝐜𝐞 𝐭𝐨 𝐣𝐨𝐢𝐧 𝐭𝐡𝐞 𝐩𝐚𝐫𝐭𝐲!

We’re excited to share DeciLM-7B, Deci’s new Apache 2.0-licensed model, the most accurate and fastest 7B LLM to date. With DeciLM-7B combined with Infery-LLM, you no longer need to decide between good performance and affordability!

Here are 4 things you need to know about DeciLM-7B:

With an average score of 61.55 on the Open LLM Leaderboard, DeciLM-7B outperforms all other base LLMs in the 7 billion-parameter class!

Direct PyTorch benchmarks show DeciLM-7B surpassing Mistral 7B and Llama 2 7B, with 1.83x and 2.39x higher throughput, respectively.

DeciLM-7B + Infery-LLM boosts speed by 4.4x over Mistral 7B with vLLM, enabling cost-effective, high-volume user interactions.

Developed with the assistance of our NAS-powered engine, AutoNAC, DeciLM-7B employs variable Grouped Query Attention.

We can’t wait for the community to try our DeciLM-7B!

Deci/DeciLM-7B-instruct · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

bnew · Feb 26, 2024

bnew · Feb 26, 2024

Introducing YOLO-NAS-Sat: Small Object Detection at the Edge

Tailored for the demands of small object detection, YOLO-NAS-Sat delivers SOTA accuracy and speed, serving a range of uses across industries.

deci.ai

COMPUTER VISION

Introducing YOLO-NAS-Sat for Small Object Detection at the Edge

By Deci
Algorithms Team

February 20, 2024
5 min read

Deci is excited to present YOLO-NAS-Sat, the latest in our lineup of ultra-performant foundation models, which includes YOLO-NAS, YOLO-NAS Pose, and DeciSegs. Tailored for the accuracy demands of small object detection, YOLO-NAS-Sat serves a wide array of vital uses, from monitoring urban infrastructure and assessing changes in the environment to precision agriculture. Available in four sizes—Small, Medium, Large, and X-large—this model is designed for peak performance in accuracy and speed on edge devices like NVIDIA’s Jetson Orin series.

YOLO-NAS-Sat sets itself apart by delivering an exceptional accuracy-latency trade-off, outperforming established models like YOLOv8 in small object detection. For instance, when evaluated on the DOTA 2.0 dataset, YOLO-NAS-Sat L achieves a 2.02x lower latency and a 6.99 higher mAP on the NVIDIA Jetson AGX ORIN with FP16 precision over YOLOV8.

YOLO-NAS-Sat’s superior performance is attributable to its innovative architecture, generated by AutoNAC, Deci’s Neural Architecture Search engine.

If your application requires small object detection on edge devices, YOLO-NAS-Sat provides an off-the-shelf solution that can significantly accelerate your development process. Its specialized design and fine-tuning for small object detection ensure rapid deployment and optimal performance.

Continue reading to explore YOLO-NAS-Sat’s architectural design, training process, and performance.

DOTA 2.0 images with YOLO-NAS-Sat XL predictions

YOLO-NAS-Sat’s Specialized Architecture

The Need for A Specialized Architecture for Small Object Detection

Small object detection is the task of detecting objects that take up minimal space in an image, sometimes a mere handful of pixels. This task inherently involves significant challenges, including scant detail for recognition and increased vulnerability to background noise. To meet these challenges, a specialized architecture is required – one that captures and retains small, local details in an image.

From YOLO-NAS to YOLO-NAS-Sat

YOLO-NAS-Sat is based on the YOLO-NAS architecture, which is renowned for its robust performance in standard object detection tasks. The macro-level architecture remains consistent with YOLO-NAS, but we’ve made strategic modifications to better address small object detection challenges:

Backbone Modifications: The number of layers in the backbone has been adjusted to optimize the processing of small objects, enhancing the model’s ability to discern minute details.
Revamped Neck Design: A newly designed neck, inspired by the U-Net-style decoder, focuses on retaining more small-level details. This adaptation is crucial for preserving fine feature maps that are vital for detecting small objects.
Context Module Adjustment: The original “context” module in YOLO-NAS, intended to capture global context, has been replaced. We discovered that for tasks like processing large satellite images, a local receptive window is more beneficial, improving both accuracy and network latency.

These architectural innovations ensure that YOLO-NAS-Sat is uniquely equipped to handle the intricacies of small object detection, offering an unparalleled accuracy-speed trade-off.

Leveraging AutoNAC for Architecture Optimization

At the heart of YOLO-NAS-Sat’s development is Deci’s AutoNAC, an advanced Neural Architecture Search engine. AutoNAC’s efficiency in navigating the vast search space of potential architectures allows us to tailor models specifically to the requirements of tasks, datasets, and hardware. YOLO-NAS-Sat is part of a broader family of highly efficient foundation models, including YOLO-NAS for standard object detection, YOLO-NAS Pose for pose estimation, and DeciSegs for semantic segmentation, all generated through AutoNAC.

For YOLO-NAS-Sat, we utilized AutoNAC to identify a neural network within the pre-defined search space that achieves our target latency while maximizing accuracy.

YOLO-NAS-Sat’s Training

We trained YOLO-NAS-Sat from scratch on the COCO dataset, followed by fine-tuning on the DOTA 2.0 dataset. The DOTA 2.0 dataset is an extensive collection of aerial images designed for object detection and analysis, featuring diverse objects across multiple categories. For the fine-tuning phase, we segmented the input scenes into 1024×1024 tiles, employing a 512px step for comprehensive coverage. Additionally, we scaled each scene to 75% and 50% of its original size to enhance detection robustness across various scales.

During the training process, we extracted random 640×640 crops from these tiles to introduce variability and enhance model resilience. For the validation phase, we divided the input scenes into uniform 1024×1024 tiles.

YOLO-NAS-Sat’s State-of-the-Art Performance

Accuracy Compared to Fine-tuned YOLOv8

To benchmark YOLO-NAS-Sat against YOLOv8, we subjected YOLOv8 to the same fine-tuning process previously outlined for YOLO-NAS-Sat.

Comparing the accuracy of the fine-tuned models, we see that each YOLO-NAS-Sat variant has a higher mAP@0.5 score than the corresponding YOLOv8 variant. YOLO-NAS-Sat-S has 4.79% higher mAP compared to YOLOv8 S; YOLO-NAS-Sat-M mAP’s is 4.36% higher compared to YOLOv8 M , and YOLO-NAS-Sat-L is 6.99% more accurate than YOLOv8 L.

Latency Compared to YOLOv8

While higher accuracy typically demands a trade-off in speed, the finely tuned YOLO-NAS-Sat models break this convention by achieving lower latency compared to their YOLOv8 counterparts. However, on NVIDIA AGX Orin at TRT FP16 Precision, each YOLO-NAS-Sat variant outpaces the corresponding YOLOv8 variant.

As can be seen in the above graph, YOLO-NAS-Sat S outpaces its YOLOv8 counterpart by 1.52x, YOLO-NAS-Sat M by 2.38x, and YOLO-NAS-Sat L by 2.02x. Notably, YOLO-NAS-Sat XL is not only more accurate but also 1.42x faster than YOLOv8 L.

Adopting YOLO-NAS-Sat for small object detection not only enhances accuracy but also substantially improves processing speeds. Such improvements are especially valuable in real-time applications, where rapid data processing is paramount.

Why We Chose mAP@0.5 for Small Object Detection Evaluation

In assessing these small object detection models, we opted for the mAP@0.5 metric over the broader mAP@0.5-0.95. The commonly adopted mAP@0.5-0.95, which evaluates performance across IOU thresholds from 0.5 to 0.95, may not accurately reflect the nuances of small object detection. Minor discrepancies, even as slight as a 1 or 2-pixel shift between the predicted and actual bounding boxes, can significantly impact the IOU score, plummeting it from perfect to 0.8 and thus affecting the overall mAP score adversely.

Considering the inherent sensitivities of the IOU metric when detecting small objects, it becomes imperative to adopt an alternative evaluation strategy. Utilizing a singular IOU threshold, such as 0.5, presents a viable solution by diminishing the metric’s vulnerability to small prediction errors, thereby offering a more stable and reliable measure of model performance in the context of small object detection.

YOLO-NAS-Sat’s Potential: Beyond Aerial Images

While YOLO-NAS-Sat excels in satellite imagery analysis, its specialized architecture is also ideally tailored for a wide range of applications involving other types of images:

Satellite Images: Used for environmental monitoring, urban development tracking, agricultural assessment, and security applications.
Microscopic Images: Essential in medical research for detecting cells, bacteria, and other microorganisms, as well as in material science.
Radar Images: Applied in meteorology for weather prediction, in aviation for aircraft navigation, and in maritime for ship detection.
Thermal Images: Thermal imaging finds applications in a variety of fields, including security, wildlife monitoring, and industrial maintenance, as well as in building and energy audits. The unique information provided by thermal images, especially in night-time or low-visibility conditions, underlines its importance and the volume of use.

Gaining Access to YOLO-NAS-Sat

YOLO-NAS-Sat is ready to revolutionize your team’s small object detection projects. If you’re looking to leverage this model’s cutting-edge performance for your applications, we invite you to connect with our experts. You’ll learn how you can use Deci’s platform and foundation models to streamline your development process, achieving faster time-to-market and unlocking new possibilities in your computer vision projects.

bnew · Feb 26, 2024

Google presents Genie

Generative Interactive Environments

introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It is comprised of a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. Further the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.

🧞 Genie: Generative Interactive Environments

A Foundation Model for Playable Worlds

sites.google.com

Genie: Generative Interactive Environments

Genie Team

We introduce Genie, a foundation world model trained from Internet videos that can generate an endless variety of playable (action-controllable) worlds from synthetic images, photographs, and even sketches.

s1to6kRYXHR40AGFLsKj3IT7hQW96xmB4YffIDj_PJaCc_gZwf5oYY-e8vQycx5OyzHXmbMMs68mSgKbLdBpUjlB1q3446Z91e8Y3XxEFHong5U9XuJiP96maXPYKIWyaA=w1280

Read Paper

A Foundation Model for Playable Worlds

The last few years have seen an emergence of generative AI, with models capable of generating novel and creative content via language, images, and even videos. Today, we introduce a new paradigm for generative AI, generative interactive environments (Genie), whereby interactive, playable environments can be generated from a single image prompt.

Genie can be prompted with images it has never seen before, such as real world photographs or sketches, enabling people to interact with their imagined virtual worlds-–essentially acting as a foundation world model. This is possible despite training without any action labels. Instead, Genie is trained from a large dataset of publicly available Internet videos. We focus on videos of 2D platformer games and robotics but our method is general and should work for any type of domain, and is scalable to ever larger Internet datasets.

Learning to control without action labels

What makes Genie unique is its ability to learn fine-grained controls exclusively from Internet videos. This is a challenge because Internet videos do not typically have labels regarding which action is being performed, or even which part of the image should be controlled. Remarkably, Genie learns not only which parts of an observation are generally controllable, but also infers diverse latent actions that are consistent across the generated environments. Note here how the same latent actions yield similar behaviors across different prompt images.

n0xuvpZdjOGfvF1JdLtvJh6AfhSbCirO5r4pcDFr7KAQIMM3WQa-ACt7b7DO2jlgfD0l4KGNlN9oc5L2n2E7b_BperCGsDg4Htv30_f2pKxsy9E3YG2P89luAia0biRLeA=w1280

He0fz116ry-sXdagesOy6exod2c1NBWQAGY4zUFvdkSPZ5-2qcvEAzqnDKFNKiyibgLlIwxkxzrf4XDph2DpL4WhLUkvEDmPlo7slIzobgIrDtCWtR2nAFrUyAgmca0Oyw=w1280

OrzMPKUFDoN7Uvq_DSZrVO0GylI980jpny1zlzcGvoASc_qW1hB0V7hZYmnAZprnLT7yb9zz4iO2BpbwgVm757-dBBMpy6BXuxdiifsCnLek2lojegy1NI5yL-tHHkXBMw=w1280

latent actions: 6, 6, 7, 6, 7, 6, 5, 5, 2, 7

l_OOKbLOlzZKfhO3WvZV5IgTVa4Bcq3_P9yly0DgvkBinHjGBQpxfavZvLQTAyMZeOxvVmRkxk4Vu8ChXVpIa8SRZvBEFXq6LHIibeQTHi8LXF6YKBxXb5DVq55hPbFkgg=w1280

UXrkppEjvlVHHc2WMVvWcFPmClGZ3N_cDbSDgiVUBUVxThU-Pk_lnjYHEgUdfnd4Cq5dVleMxPPZ1avPWsLQFVY0TdOTyUV2cp_OzDWYG2gM2S0KLjPjbreDruivQKGZ2A=w1280

OENoEZ6NeZtXh1HI1vNc5YbbH2f-C6CmKtGoWumQhZQ-zWDmSAkvO7QE4R6g4LjwozdCZwictyss5hQd0ClC0dv0CEf1uoVMHKYJA8rSWDA26GbbQ-iLpfRGPpq3y6G2nA=w1280

U-gmG1v2otVCBEO5ioZRn_NAq--e78ZSoma_CALNcXqYFSbARi4qBxgoXw8u7MyXmGvGWe9NNTmn4nrKVR7ozyFZ4kxjcIQNGi-MBXh_xkJCXi29zQt0_HAaXzphJfmphA=w1280

latent actions: 5, 6, 2, 2, 6, 2, 5, 7, 7, 7

Enabling a new generation of creators

Amazingly, it only takes a single image to create an entire new interactive environment. This opens the door to a variety of new ways to generate and step into virtual worlds, for instance, we can take a state-of-the-art text-to-image generation model and use it to produce starting frames that we can then bring to life with Genie. Here we generate images with Imagen2 and bring them to life with Genie.

xW6IHNWcVZq9JMn_iEZRDroYNrrYXOjoPQD3-u0DIzItlFV48A4uG5F4nC4lWWSIcNUBwM0BB5gvs_2zALFFc_nEGiK8eH1t5O3FaTPnoJkiCQuyt8LgChUd9Ama20z39w=w1280

zHsIAxHtW9NNCkkNAZAPxjYNYDXISTUcdCVOe5cULGkzduY3Tfg-a1tHsqryG4gUoddagWTwLBjmw9oVwfQOiUgKFiKPrTqj9kaJ51zqT5-6KLzkN1CKjFdEAX2oNNANcQ=w1280

g2UxAYCgi7T3X-ndhe0sNDqRDgYGy_QM9sKTZKxoyBjhIo1FoEGrgizywFmPDehjd1N6_sijJqPHNEfvjO9SpqxYpVJo-W967jCBCdLreolwy5RRQkUAJZTKlMB7fTTHrQ=w1280

zAZXbp-QEdKESxWjUxbarzdWGNahXLLDMlhKkJjixXNXYPX7CFFok-O1odkmZLHwEDnhyNgr_Z_PV5WQfucaofSZpdyttT_eplugAmjajqJ5tjIsOwRRpPgBiZlXYTdQAQ=w1280

TdxA87jerBtOiShjR0GlPiFIzYWfI3hqaKGsfRPcpPkX8seZUZYYutHaixTnn0JFHk3OhHO-FoOQDehEqjGvN7v7qELYjIJcdDNKrr7XNsKBjtKmlQ-xSe7EPbR73w9TFQ=w1280

But it doesn’t stop there, we can even step into human designed creations such as sketches!

bnew · Feb 26, 2024

JUST IN: Mistral announces Mistral Large, a new flagship model.

This is the world's second-ranked model generally available through an API (behind GPT-4).

Here is all you need to know:

• 32K tokens context window
• has native multilingual capacities
• strong abilities in reasoning, knowledge, maths, and coding benchmarks
• function calling and JSON format natively supported
• available through Microsoft Azure
• a low-latency model called Mistral Small was also released

bnew · Feb 26, 2024

bnew · Feb 26, 2024

Ted Xiao
@xiao_ted
I can’t emphasize enough how mind-blowing extremely long token context windows are. For both AI researchers and practitioners, massive context windows will have transformative long-term impact, beyond one or two flashy news cycles.

“More is different”: Just as we saw emergent capabilities when scaling model size, compute, and datasets, I think we’re going to see a similar revolution for in-context learning. The capability shifts we’ll see going from 8k to 32k to 128k to 10M (!!) token contexts and beyond are not just going to be simple X% quantitative improvements, but instead qualitative phase shifts which unlock new abilities altogether and result in rethinking how we approach foundation model reasoning.

Great fundamental research on the relationship between in-context learning (ICL) and in-weight learning is now more relevant than ever, and needs to be extended given that we now operate in an era where the "X-axis" of context length has increased by three orders of magnitude. I highly recommend @scychan_brains 's pioneering work in this area, such as https://arxiv.org/pdf/2205.05055.pdf and https://arxiv.org/pdf/2210.05675.pdf. In fact, there are already data points which suggest our understanding of ICL scaling laws still contains large gaps

(see )

Also exciting is the connection of long-context ICL to alignment and post-training! I'm curious to see how 10M+ contexts disrupt the ongoing debate about whether foundation models truly learn new capabilities and skills during finetuning/RLHF or whether they purely learn stylistic knowledge (the "Superficial Alignment Hypothesis", https://arxiv.org/pdf/2305.11206.pdf and Re-Align |). The Gemini 1.5 technical report brings new evidence to this discussion as well, showing that an entire new language can be learned completely in context. I'm excited to see better empirical understanding of how foundation models can effectively leverage large-context ICL both during inference but also for "learning to learn" during training

And finally, perhaps the most important point: huge context lengths will have a lasting impact because their applications are so broad. There is no part of modern foundation model research that is not changed profoundly in some capacity by huge contexts! From theoretical underpinnings (how we design pre-training and post-training objectives) to system design (how we scale up long-contexts during training and serving) to application domains (such as robotics), massive context ICL is going to have significant impact and move the needle across the board.

bnew · Feb 26, 2024

Meta presents MobileLLM

Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

paper addresses the growing need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns. We focus on designing top-quality LLMs with fewer than a billion parameters, a practical choice for mobile deployment. Contrary to prevailing belief emphasizing the pivotal role of data and parameter quantity in determining model quality, our investigation underscores the significance of model architecture for sub-billion scale LLMs. Leveraging deep and thin architectures, coupled with embedding sharing and grouped-query attention mechanisms, we establish a strong baseline network denoted as MobileLLM, which attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M state-of-the-art models. Additionally, we propose an immediate block-wise weight sharing approach with no increase in model size and only marginal latency overhead. The resultant models, denoted as MobileLLM-LS, demonstrate a further accuracy enhancement of 0.7%/0.8% than MobileLLM 125M/350M. Moreover, MobileLLM model family shows significant improvements compared to previous sub-billion models on chat benchmarks, and demonstrates close correctness to LLaMA-v2 7B in API calling tasks, highlighting the capability of small models for common on-device use cases.

Paper page - MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

Join the discussion on this paper page

huggingface.co

[2402.14905] MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

https://arxiv.org/pdf/2402.14905

bnew · Feb 26, 2024

[2402.14848] Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models

Computer Science > Computation and Language

[Submitted on 19 Feb 2024]

Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models

Mosh Levy, Alon Jacoby, Yoav Goldberg

This paper explores the impact of extending input lengths on the capabilities of Large Language Models (LLMs). Despite LLMs advancements in recent times, their performance consistency across different input lengths is not well understood. We investigate this aspect by introducing a novel QA reasoning framework, specifically designed to assess the impact of input length. We isolate the effect of input length using multiple versions of the same sample, each being extended with padding of different lengths, types and locations. Our findings show a notable degradation in LLMs' reasoning performance at much shorter input lengths than their technical maximum. We show that the degradation trend appears in every version of our dataset, although at different intensities. Additionally, our study reveals that traditional perplexity metrics do not correlate with performance of LLMs' in long input reasoning tasks. We analyse our results and identify failure modes that can serve as useful guides for future research, potentially informing strategies to address the limitations observed in LLMs.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2402.14848 [cs.CL]
	(or arXiv:2402.14848v1 [cs.CL] for this version)

Submission history

From: Mosh Levy [view email]
[v1] Mon, 19 Feb 2024 16:04:53 UTC (938 KB)

https://arxiv.org/pdf/2402.14848.pdf

bnew · Feb 28, 2024

Bezos, Nvidia Join OpenAI in Funding Humanoid Robot Startup

Jeff Bezos, Nvidia Corp. and other big technology names are investing in a business that’s developing human-like robots, according to people with knowledge of the situation, part of a scramble to find new applications for artificial intelligence.

www.bloomberg.com

Bezos, Nvidia Join OpenAI in Funding Humanoid Robot Startup

Figure AI also gets investments from Intel, Samsung and Amazon

Company, valued at roughly $2 billion, is raising $675 million

Robots have emerged as a critical new frontier for the AI industry, letting it apply cutting-edge technology to real-world tasks.
Source: Figure AI Inc.

In this Article
FIGURE AI INC
Private Company
MICROSOFT CORP
410.34USD
–0.32%

By Mark Gurman and Gillian Tan

February 23, 2024 at 12:56 PM EST
Updated on
February 23, 2024 at 1:20 PM EST

Jeff Bezos, Nvidia Corp. and other big technology names are investing in a business that’s developing human-like robots, according to people with knowledge of the situation, part of a scramble to find new applications for artificial intelligence.

The startup Figure AI Inc. — also backed by OpenAI and Microsoft Corp. — is raising about $675 million in a funding round that carries a pre-money valuation of roughly $2 billion, said the people, who asked not to be identified because the matter is private. Through his firm Explore Investments LLC, Bezos has committed $100 million. Microsoft is investing $95 million, while Nvidia and an Amazon.com Inc.-affiliated fund are each providing $50 million.

Robots have emerged as a critical new frontier for the AI industry, letting it apply cutting-edge technology to real-world tasks. At Figure, engineers are working on a robot that looks and moves like a human. The company has said it hopes its machine, called Figure 01, will be able to perform dangerous jobs that are unsuitable for people and that its technology will help alleviate labor shortages.

Other technology companies are involved as well. Intel Corp.’s venture capital arm is pouring in $25 million, and LG Innotek is providing $8.5 million. Samsung’s investment group, meanwhile, committed $5 million. Backers also include venture firms Parkway Venture Capital, which is investing $100 million, and Align Ventures, which is providing $90 million.

The ARK Venture Fund is participating as well, putting in $2.5 million, while Aliya Capital Partners is investing $20 million. Other investors include Tamarack, at $27 million; Boscolo Intervest Ltd., investing $15 million; and BOLD Capital Partners, at $2.5 million.

OpenAI, which at one point considered acquiring Figure, is investing $5 million. Bloomberg News reported in January on the funding round, which kicked off with Microsoft and OpenAI as the initial lead investors. Those big names helped attract the influx of cash from the other entities. The $675 million raised is a significant increase over the $500 million initially sought by Figure.

Representatives for Figure and its investors declined to comment or didn’t immediately respond to requests for comment.

People with knowledge of the matter expect the investors to wire the funds to Figure AI and sign formal agreements on Monday, but the numbers could change as final details are worked out. The roughly $2 billion valuation is pre-money, meaning it doesn’t account for the capital that Figure is raising.

Last May, Figure raised $70 million in funding round led by Parkway. At the time, Chief Executive Officer Brett Adcock said, “We hope that we’re one of the first groups to bring to market a humanoid that can actually be useful and do commercial activities.”

The AI robotics industry has been busy lately. Earlier this year, OpenAI-backed Norwegian robotics startup 1X Technologies AS raised $100 million. Vancouver-based Sanctuary AI is developing a humanoid robot called Phoenix. And Tesla Inc. is working on a robot called Optimus, with Elon Musk calling it one of his most important projects.

Agility Robotics, which Amazon backed in 2022, has bots in testing at one of the retailer’s warehouses. Bezos — the world’s second-richest person, according to the Bloomberg Billionaires Ranking — was Amazon’s chief executive officer until 2021 and remains chairman. His net worth is estimated at $197.1 billion.

— With assistance from Matt Day

(Updates with more on Bezos in final paragraph. A previous version of the story corrected a figure to show it was in millions.)

bnew · Feb 28, 2024

EMO

EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

humanaigc.github.io

EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

Linrui Tian, Qi Wang, Bang Zhang, Liefeng Bo

Institute for Intelligent Computing, Alibaba Group

GitHub arXiv

https://humanaigc.github.io/emote-portrait-alive/content/video/%E8%B5%AB%E6%9C%AC16_9.mp4

Character: Audrey Kathleen Hepburn-Ruston
Vocal Source: Ed Sheeran - Perfect. Covered by Samantha Harvey

https://humanaigc.github.io/emote-portrait-alive/content/video/16%E6%AF%949%E8%A7%86%E9%A2%91%E7%BB%93%E6%9E%9C/talk_sora.mp4

Character: AI Lady from SORA
Vocal Source: Where We Go From Here with OpenAI's Mira Murati

Abstract

We proposed EMO, an expressive audio-driven portrait-video generation framework. Input a single reference image and the vocal audio, e.g. talking and singing, our method can generate vocal avatar videos with expressive facial expressions, and various head poses, meanwhile, we can generate videos with any duration depending on the length of input video.

https://humanaigc.github.io/emote-portrait-alive/content/video/main_page.mp4

Method

Overview of the proposed method. Our framework is mainly constituted with two stages. In the initial stage, termed Frames Encoding, the ReferenceNet is deployed to extract features from the reference image and motion frames. Subsequently, during the Diffusion Process stage, a pretrained audio encoder processes the audio embedding. The facial region mask is integrated with multi-frame noise to govern the generation of facial imagery. This is followed by the employment of the Backbone Network to facilitate the denoising operation. Within the Backbone Network, two forms of attention mechanisms are applied: Reference-Attention and Audio-Attention. These mechanisms are essential for preserving the character's identity and modulating the character's movements, respectively. Additionally, Temporal Modules are utilized to manipulate the temporal dimension, and adjust the velocity of motion.

Various Generated Videos

Singing

Make Portrait Sing

Input a single character image and a vocal audio, such as singing, our method can generate vocal avatar videos with expressive facial expressions, and various head poses, meanwhile, we can generate videos with any duration depending on the length of input audio. Our method can also persist the characters' identifies in a long duration.

Character: AI Mona Lisa generated by dreamshaper XL
Vocal Source: Miley Cyrus - Flowers. Covered by YUQI

https://humanaigc.github.io/emote-portrait-alive/content/video/16%E6%AF%949%E8%A7%86%E9%A2%91%E7%BB%93%E6%9E%9C/song_sora.mp4

Character: AI Lady from SORA
Vocal Source: Dua Lipa - Don't Start Now

Different Language & Portrait Style

Our method supports songs in various languages and brings diverse portrait styles to life. It intuitively recognizes tonal variations in the audio, enabling the generation of dynamic, expression-rich avatars.

Character: AI Girl generated by ChilloutMix
Vocal Source: David Tao - Melody. Covered by NINGNING (mandarin)

Character: AI Ymir from AnyLora & Ymir Fritz Adult
Vocal Source: 『衝撃』Music Video【TVアニメ「進撃の巨人」The Final Season エンディングテーマ曲】 (Japanese)

https://humanaigc.github.io/emote-portrait-alive/content/video/16%E6%AF%949%E8%A7%86%E9%A2%91%E7%BB%93%E6%9E%9C/song_zgr.mp4

Character: Leslie Cheung Kwok Wing
Vocal Source: Eason Chan - Unconditional. Covered by AI (Cantonese)

Character: AI girl generated by WildCardX-XL-Fusion
Vocal Source: JENNIE - SOLO. Cover by Aiana (Korean)

Rapid Rhythm

The driven avatar can keep up with fast-paced rhythms, guaranteeing that even the swiftest lyrics are synchronized with expressive and dynamic character animations.

Character: Leonardo Wilhelm DiCaprio

Vocal Source: EMINEM - GODZILLA (FT. JUICE WRLD) COVER

Character: KUN KUN

Vocal Source: Eminem - Rap God

Talking

Talking With Different Characters

Our approach is not limited to processing audio inputs from singing, it can also accommodate spoken audio in various languages. Additionally, our method has the capability to animate portraits from bygone eras, paintings, and both 3D models and AI generated content, infusing them with lifelike motion and realism.

Character: Audrey Kathleen Hepburn-Ruston
Vocal Source: Interview Clip

Character: AI Chloe: Detroit Become Human
Vocal Source: Interview Clip

Character: Mona Lisa
Vocal Source: Shakespeare's Monologue II As You Like It: Rosalind "Yes, one; and in this manner."

Character: AI Ymir from AnyLora & Ymir Fritz Adult
Vocal Source: NieR: Automata

Cross-Actor Performance

Explore the potential applications of our method, which enables the portraits of movie characters delivering monologues or performances in different languages and styles. we can expanding the possibilities of character portrayal in multilingual and multicultural contexts.

Character: Joaquin Rafael Phoenix - The Jocker - 《Jocker 2019》
Vocal Source: 《The Dark Knight》 2008

https://humanaigc.github.io/emote-portrait-alive/content/video/16%E6%AF%949%E8%A7%86%E9%A2%91%E7%BB%93%E6%9E%9C/talk_gqq.mp4

Character: SongWen Zhang - QiQiang Gao - 《The Knockout》
Vocal Source: Online courses for legal exams

Character: AI girl generated by xxmix_9realisticSDXL
Vocal Source: Videos published by itsjuli4.

bnew · Feb 28, 2024

Gemini image generation got it wrong. We'll do better.

An explanation of how the issues with Gemini’s image generation of people happened, and what we’re doing to fix it.

blog.google

GEMINI

Gemini image generation got it wrong. We'll do better.

Feb 23, 2024

2 min read

We recently made the decision to pause Gemini’s image generation of people while we work on improving the accuracy of its responses. Here is more about how this happened and what we’re doing to fix it.

Prabhakar Raghavan
Senior Vice President

Three weeks ago, we launched a new image generation feature for the Gemini conversational app (formerly known as Bard), which included the ability to create images of people.

It’s clear that this feature missed the mark. Some of the images generated are inaccurate or even offensive. We’re grateful for users’ feedback and are sorry the feature didn't work well.

We’ve acknowledged the mistake and temporarily paused image generation of people in Gemini while we work on an improved version.

What happened

The Gemini conversational app is a specific product that is separate from Search, our underlying AI models, and our other products. Its image generation feature was built on top of an AI model called Imagen 2.

When we built this feature in Gemini, we tuned it to ensure it doesn’t fall into some of the traps we’ve seen in the past with image generation technology — such as creating violent or sexually explicit images, or depictions of real people. And because our users come from all over the world, we want it to work well for everyone. If you ask for a picture of football players, or someone walking a dog, you may want to receive a range of people. You probably don’t just want to only receive images of people of just one type of ethnicity (or any other characteristic).

However, if you prompt Gemini for images of a specific type of person — such as “a Black teacher in a classroom,” or “a white veterinarian with a dog” — or people in particular cultural or historical contexts, you should absolutely get a response that accurately reflects what you ask for.

So what went wrong? In short, two things. First, our tuning to ensure that Gemini showed a range of people failed to account for cases that should clearly not show a range. And second, over time, the model became way more cautious than we intended and refused to answer certain prompts entirely — wrongly interpreting some very anodyne prompts as sensitive.

These two things led the model to overcompensate in some cases, and be over-conservative in others, leading to images that were embarrassing and wrong.

Next steps and lessons learned

This wasn’t what we intended. We did not want Gemini to refuse to create images of any particular group. And we did not want it to create inaccurate historical — or any other — images. So we turned the image generation of people off and will work to improve it significantly before turning it back on. This process will include extensive testing.

One thing to bear in mind: Gemini is built as a creativity and productivity tool, and it may not always be reliable, especially when it comes to generating images or text about current events, evolving news or hot-button topics. It will make mistakes. As we’ve said from the beginning, hallucinations are a known challenge with all LLMs — there are instances where the AI just gets things wrong. This is something that we’re constantly working on improving.

Gemini tries to give factual responses to prompts — and our double-check feature helps evaluate whether there’s content across the web to substantiate Gemini’s responses — but we recommend relying on Google Search, where separate systems surface fresh, high-quality information on these kinds of topics from sources across the web.

I can’t promise that Gemini won’t occasionally generate embarrassing, inaccurate or offensive results — but I can promise that we will continue to take action whenever we identify an issue. AI is an emerging technology which is helpful in so many ways, with huge potential, and we’re doing our best to roll it out safely and responsibly.

bnew · Feb 28, 2024

Paper page - Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping

Join the discussion on this paper page

huggingface.co

Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping

While Transformers have enabled tremendous progress in various application settings, such architectures still trail behind traditional symbolic planners for solving complex decision making tasks. In this work, we demonstrate how to train Transformers to solve complex planning tasks. This is...

arxiv.org

Computer Science > Artificial Intelligence

[Submitted on 21 Feb 2024]

Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping

Lucas Lehnert, Sainbayar Sukhbaatar, Paul Mcvay, Michael Rabbat, Yuandong Tian

While Transformers have enabled tremendous progress in various application settings, such architectures still lag behind traditional symbolic planners for solving complex decision making tasks. In this work, we demonstrate how to train Transformers to solve complex planning tasks and present Searchformer, a Transformer model that optimally solves previously unseen Sokoban puzzles 93.7% of the time, while using up to 26.8% fewer search steps than standard A∗ search. Searchformer is an encoder-decoder Transformer model trained to predict the search dynamics of A∗. This model is then fine-tuned via expert iterations to perform fewer search steps than A∗ search while still generating an optimal plan. In our training method, A∗'s search dynamics are expressed as a token sequence outlining when task states are added and removed into the search tree during symbolic planning. In our ablation studies on maze navigation, we find that Searchformer significantly outperforms baselines that predict the optimal plan directly with a 5-10× smaller model size and a 10× smaller training dataset. We also demonstrate how Searchformer scales to larger and more complex decision making tasks like Sokoban with improved percentage of solved tasks and shortened search dynamics.

Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2402.14083 [cs.AI]
	(or arXiv:2402.14083v1 [cs.AI] for this version)
	[2402.14083] Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping Focus to learn more

Submission history

From: Lucas Lehnert [view email]
[v1] Wed, 21 Feb 2024 19:17:28 UTC (758 KB)

https://arxiv.org/pdf/2402.14083.pdf

Sure, let's break this down:

Transformers are a type of AI model that have been really useful in many areas. But when it comes to making complex decisions, they're not as good as the old-school methods.

In this research, the authors trained a Transformer (which they named "Searchformer") to solve tricky puzzles called Sokoban. It was able to solve these puzzles correctly 93.7% of the time, and it did so using fewer steps than the traditional method (A* search).

Searchformer is a special kind of Transformer that's been trained to mimic the way A* search works. It's then fine-tuned to do even better, using fewer steps but still finding the best solution.

The way they trained it was by teaching it how A* search adds and removes possible solutions during the planning process. They represented this as a sequence of tokens (like a sentence made up of words).

They tested Searchformer on navigating mazes and found it did much better than other methods that try to find the best plan directly. It was able to do this even though it was 5-10 times smaller and trained on a dataset 10 times smaller.

Finally, they showed that Searchformer can handle bigger and more complex tasks, like more difficult Sokoban puzzles. It solved more puzzles and did so more efficiently.

bnew · Feb 28, 2024

Paper page - Nemotron-4 15B Technical Report

Join the discussion on this paper page

huggingface.co

Nemotron-4 15B Technical Report

Published on Feb 26
·Featured in Daily Papers on Feb 26

Authors:
Jupinder Parmar , Shrimai Prabhumoye , Joseph Jennings , Mostofa Patwary , Sandeep Subramanian , Dan Su , Chen Zhu , Deepak Narayanan , Aastha Jhunjhunwala , Ayush Dattagupta , Vibhu Jawa , Jiwei Liu , Ameya Mahabaleshwarkar , Osvald Nitski , Annika Brundyn , James Maki , Miguel Martinez , Jiaxuan You , John Kamalu , Patrick LeGresley , Denys Fridman , Jared Casper +5 authors

Abstract

We introduce Nemotron-4 15B, a 15-billion-parameter large multilingual language model trained on 8 trillion text tokens. Nemotron-4 15B demonstrates strong performance when assessed on English, multilingual, and coding tasks: it outperforms all existing similarly-sized open models on 4 out of 7 downstream evaluation areas and achieves competitive performance to the leading open models in the remaining ones. Specifically, Nemotron-4 15B exhibits the best multilingual capabilities of all similarly-sized models, even outperforming models over four times larger and those explicitly specialized for multilingual tasks.

Nemotron-4 15B Technical Report

We introduce Nemotron-4 15B, a 15-billion-parameter large multilingual language model trained on 8 trillion text tokens. Nemotron-4 15B demonstrates strong performance when assessed on English, multilingual, and coding tasks: it outperforms all existing similarly-sized open models on 4 out of 7...

arxiv.org

bnew · Feb 28, 2024

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or...

arxiv.org

Computer Science > Computation and Language

[Submitted on 27 Feb 2024]

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei

Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.

Comments:	Work in progress
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2402.17764 [cs.CL]
	(or arXiv:2402.17764v1 [cs.CL] for this version)
	[2402.17764] The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits Focus to learn more

Submission history

From: Shuming Ma [view email]
[v1] Tue, 27 Feb 2024 18:56:19 UTC (201 KB)

https://arxiv.org/pdf/2402.17764.pdf

Large Language Models News & Discussions

Veteran

Veteran

Veteran

Introducing YOLO-NAS-Sat for Small Object Detection at the Edge​

YOLO-NAS-Sat’s Specialized Architecture​

The Need for A Specialized Architecture for Small Object Detection​

From YOLO-NAS to YOLO-NAS-Sat​

Leveraging AutoNAC for Architecture Optimization​

YOLO-NAS-Sat’s Training​

YOLO-NAS-Sat’s State-of-the-Art Performance​

Accuracy Compared to Fine-tuned YOLOv8​

Latency Compared to YOLOv8​

Why We Chose mAP@0.5 for Small Object Detection Evaluation​

YOLO-NAS-Sat’s Potential: Beyond Aerial Images​

Gaining Access to YOLO-NAS-Sat​

Veteran

Genie: Generative Interactive Environments​

A Foundation Model for Playable Worlds​

Learning to control without action labels​

Enabling a new generation of creators​

Veteran

Veteran

Veteran

Veteran

Veteran

Computer Science > Computation and Language​

Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models​

Submission history​

Veteran

Bezos, Nvidia Join OpenAI in Funding Humanoid Robot Startup​

Figure AI also gets investments from Intel, Samsung and Amazon​

Company, valued at roughly $2 billion, is raising $675 million​

Veteran

EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions​

Abstract​

Method​

Various Generated Videos​

Singing​

Make Portrait Sing​

Different Language & Portrait Style​

Rapid Rhythm​

Talking​

Talking With Different Characters​

Cross-Actor Performance​

Veteran

Gemini image generation got it wrong. We'll do better.​

What happened​

Next steps and lessons learned​

Veteran

Computer Science > Artificial Intelligence​

Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping​

Submission history​

Veteran

Nemotron-4 15B Technical Report​

Abstract​

Veteran

Computer Science > Computation and Language​

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits​

Submission history​

Introducing YOLO-NAS-Sat for Small Object Detection at the Edge

YOLO-NAS-Sat’s Specialized Architecture

The Need for A Specialized Architecture for Small Object Detection

From YOLO-NAS to YOLO-NAS-Sat

Leveraging AutoNAC for Architecture Optimization

YOLO-NAS-Sat’s Training

YOLO-NAS-Sat’s State-of-the-Art Performance

Accuracy Compared to Fine-tuned YOLOv8

Latency Compared to YOLOv8

Why We Chose mAP@0.5 for Small Object Detection Evaluation

YOLO-NAS-Sat’s Potential: Beyond Aerial Images

Gaining Access to YOLO-NAS-Sat

Genie: Generative Interactive Environments

A Foundation Model for Playable Worlds

Learning to control without action labels

Enabling a new generation of creators

Computer Science > Computation and Language

Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models

Submission history

Bezos, Nvidia Join OpenAI in Funding Humanoid Robot Startup

Figure AI also gets investments from Intel, Samsung and Amazon

Company, valued at roughly $2 billion, is raising $675 million

EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

Abstract

Method

Various Generated Videos

Singing

Make Portrait Sing

Different Language & Portrait Style

Rapid Rhythm

Talking

Talking With Different Characters

Cross-Actor Performance

Gemini image generation got it wrong. We'll do better.

What happened

Next steps and lessons learned

Computer Science > Artificial Intelligence

Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping

Submission history

Nemotron-4 15B Technical Report

Abstract

Computer Science > Computation and Language

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Submission history