bnew

awesome-ChatGPT-repositories​

License: CC0-1.0

A curated list of resources dedicated to open source GitHub repositories related to ChatGPT.

This list was created from more than 5,000 repositories extracted during six months of Twitter trend analysis. In addition, the contents of the list are automatically updated every few days. A tool for searching these repositories is available on Hugging Face Spaces.
 

bnew

Paris-based Startup and OpenAI Competitor Mistral AI Valued at $2 Billion​


Published December 9, 2023

By Alex McFarland




In a significant development for the European artificial intelligence sector, Paris-based startup Mistral AI has achieved a noteworthy milestone. The company has successfully secured a substantial investment of €450 million, propelling its valuation to an impressive $2 billion. This funding round marks a pivotal moment, not only for Mistral AI but also for the burgeoning European AI landscape, signifying the region's increasing prominence in the global AI arena.

Leading the charge in this investment round is Andreessen Horowitz, a prominent name in the venture capital world, demonstrating a strong vote of confidence in Mistral AI's potential. Joining the fray are tech giants Nvidia Corp and Salesforce, contributing an additional €120 million in convertible debt. This diverse array of investors, encompassing both traditional venture capital and major tech corporations, underscores the wide-ranging appeal and potential of Mistral AI's technology and vision.



This influx of capital is a testament to Mistral AI's innovative approach and its perceived potential to disrupt the AI industry. With this substantial financial backing, Mistral AI is poised to advance its research and development, expand its reach, and further cement its position as a leading player in the AI domain. The scale of this investment round also reflects the growing recognition of the strategic importance of AI technologies and the increasing competition to lead in this transformative field.


Technological Advancements and Market Impact


Mistral AI stands at the forefront of innovation with its flagship product, Mistral 7B, a large language model (LLM) renowned for its efficiency and advanced capabilities. Released under the open-source Apache 2.0 license, Mistral 7B represents a significant leap in AI technology, characterized by its customized training, tuning, and data processing methods.

What sets Mistral 7B apart is its ability to compress knowledge and facilitate deep reasoning capacities, even with fewer parameters compared to other models in the market. This optimized approach not only enhances the model's performance but also contributes to sustainability by reducing training time, costs, and environmental impact.
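Because the Mistral 7B weights are openly released under Apache 2.0, the model can be loaded directly from the Hugging Face Hub. Below is a minimal sketch of doing so with the transformers library; the prompt and generation settings are illustrative assumptions, not an official recipe.

```python
# Minimal sketch: loading and prompting the openly released Mistral 7B weights.
# Assumes the `transformers` and `torch` packages and access to the public
# "mistralai/Mistral-7B-v0.1" checkpoint on the Hugging Face Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so the 7B model fits on one modern GPU
    device_map="auto",
)

prompt = "Mistral 7B is a 7-billion-parameter language model that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```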



The successful deployment of Mistral 7B has positioned Mistral AI as a key player in the AI market and a competitor to OpenAI. Its impact extends across various industries, offering potential transformations in fields such as healthcare, education, finance, and manufacturing. The company's ability to provide high-performance, scalable solutions is poised to impact how these sectors leverage AI for innovation and efficiency.


European AI Landscape and Competitive Edge


Mistral AI's recent funding round is a clear indicator of Europe's rapidly growing stature in the global AI landscape. Historically, European ventures in AI have lagged behind their counterparts in the US and Asia in terms of investment and innovation. However, Mistral AI's success, alongside other significant investments, marks a decisive shift, showcasing Europe's rising potential and commitment to AI innovation.

In the competitive arena of generative AI, Mistral AI distinguishes itself with its open-source approach and focus on creating scalable and efficient models. This strategy sets it apart from established giants such as OpenAI, Google AI, and DeepMind, offering a unique value proposition to the market. By prioritizing accessibility and efficiency, Mistral AI not only contributes to the democratization of AI technology but also positions itself as a formidable competitor in the global AI race.

The trajectory of Mistral AI and the burgeoning European AI sector signals a vibrant and dynamic future for AI development. With substantial investments pouring into European AI startups, the region is rapidly catching up and carving out its niche in the highly competitive and ever-evolving field of artificial intelligence.
 

bnew


Generating Illustrated Instructions​

Sachit Menon (1,2), Ishan Misra (1), Rohit Girdhar (1)

(1) Meta GenAI, (2) Columbia University

Paper arXiv




Our method, StackedDiffusion, addresses the new task of generating illustrated instructions for any user query.​


Abstract​

We introduce the new task of generating Illustrated Instructions, i.e., visual instructions customized to a user's needs.

We identify desiderata unique to this task and formalize it through a suite of automatic and human evaluation metrics designed to measure the validity, consistency, and efficacy of the generations. We combine the power of large language models (LLMs) with strong text-to-image diffusion models to propose a simple approach called StackedDiffusion, which generates such illustrated instructions given text as input. The resulting model strongly outperforms baseline approaches and state-of-the-art multimodal LLMs, and in 30% of cases users even prefer it to human-generated articles.

Most notably, it enables various new and exciting applications far beyond what static articles on the web can provide, such as personalized instructions complete with intermediate steps and pictures in response to a user's individual situation.
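The abstract's recipe (an LLM writes the step text and a text-to-image diffusion model illustrates each step) can be sketched roughly as follows. This is not the authors' StackedDiffusion implementation: the `generate_steps` helper is a hypothetical placeholder for an LLM call, and Stable Diffusion via the diffusers library stands in for the paper's image generator.

```python
# Rough sketch of the "LLM writes the steps, a diffusion model illustrates them"
# recipe described in the abstract. This is NOT the authors' StackedDiffusion model:
# `generate_steps` is a hypothetical placeholder for an LLM call, and Stable
# Diffusion via the `diffusers` library stands in for the text-to-image component.
import torch
from diffusers import StableDiffusionPipeline

def generate_steps(goal: str) -> list[str]:
    """Hypothetical LLM call that turns a user goal into numbered instruction steps."""
    # In practice this would query an instruction-tuned LLM.
    return [f"Step {i + 1} toward the goal: {goal}" for i in range(4)]

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

goal = "How do I repot a houseplant?"
illustrated = []
for step_text in generate_steps(goal):
    image = pipe(step_text, num_inference_steps=30).images[0]
    illustrated.append((step_text, image))  # one (text, image) pair per step
```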

Overview​


Applications​

Error Correction​

StackedDiffusion provides updated instructions in response to unexpected situations, like a user error.


Goal Suggestion​

Rather than just illustrating a given goal, StackedDiffusion can suggest a goal matching the user's needs.


Personalization​

One of the most powerful uses of StackedDiffusion is to personalize instructions to a user's circumstances.


Knowledge Application​

The LLM's knowledge enables StackedDiffusion to show the user how to achieve goals they didn't even know to ask about.

 

bnew


PlayFusion: Skill Acquisition via Diffusion from Language-Annotated Play​


Lili Chen* Shikhar Bahl* Deepak Pathak

Carnegie Mellon University

Conference on Robot Learning (CoRL) 2023

* equal contribution

Paper



We run our approach on 7 different environments, including 3 real-world settings. We show the results of running our policy below. All goals are unseen at training time.


Play Data Collection​

We collect language-annotated play data using teleoperation. This process is fast and efficient (< 1hr per task).

Abstract​

Learning from unstructured and uncurated data has become the dominant paradigm for generative approaches in language or vision. Such unstructured and unguided behavior data, commonly known as play, is also easier to collect in robotics but much more difficult to learn from due to its inherently multimodal, noisy, and suboptimal nature. In this paper, we study this problem of learning goal-directed skill policies from unstructured play data which is labeled with language in hindsight. Specifically, we leverage advances in diffusion models to learn a multi-task diffusion model to extract robotic skills from play data. Using a conditional denoising diffusion process in the space of states and actions, we can gracefully handle the complexity and multimodality of play data and generate diverse and interesting robot behaviors. To make diffusion models more useful for skill learning, we encourage robotic agents to acquire a vocabulary of skills by introducing discrete bottlenecks into the conditional behavior generation process. In our experiments, we demonstrate the effectiveness of our approach across a wide variety of environments in both simulation and the real world.​





Method​

PlayFusion extracts useful skills from language-annotated play by leveraging discrete bottlenecks in both the language embedding and diffusion model U-Net. We generate robot trajectories via an iterative denoising process conditioned on language and current state.​
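The "discrete bottleneck" the method description refers to is, in general terms, a vector-quantization step that snaps a continuous conditioning embedding onto a learned codebook of skills. A minimal sketch of that idea is shown below; the module and its sizes are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of a vector-quantization ("discrete bottleneck") layer of the kind
# the method description refers to: a continuous conditioning embedding is snapped
# onto a learned skill codebook. This is an illustration of the general idea, not
# the authors' implementation, and the sizes below are arbitrary.
import torch
import torch.nn as nn

class DiscreteBottleneck(nn.Module):
    def __init__(self, num_codes: int = 512, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)  # learned skill vocabulary

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, dim) continuous embedding, e.g. a language-conditioned feature.
        dists = torch.cdist(z, self.codebook.weight)  # (batch, num_codes)
        codes = dists.argmin(dim=-1)                  # nearest code index per sample
        z_q = self.codebook(codes)                    # quantized embedding
        # Straight-through estimator so gradients still flow to the encoder.
        return z + (z_q - z).detach()

bottleneck = DiscreteBottleneck()
lang_embedding = torch.randn(8, 256)    # stand-in for a batch of language embeddings
quantized = bottleneck(lang_embedding)  # would condition the denoising process
```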


 

bnew

Free3D: Consistent Novel View Synthesis without 3D Representation​


Chuanxia Zheng, Andrea Vedaldi

Visual Geometry Group, University of Oxford

Paper arXiv Video Code

Free3D synthesizes consistent novel views without the need for explicit 3D representations.

Abstract​

We introduce Free3D, a simple approach designed for open-set novel view synthesis (NVS) from a single image.

Similar to Zero-1-to-3, we start from a pre-trained 2D image generator for generalization, and fine-tune it for NVS. Compared to recent and concurrent works, we obtain significant improvements without resorting to an explicit 3D representation, which is slow and memory-consuming.

We do so by better encoding the target camera pose via a new per-pixel ray conditioning normalization (RCN) layer. The latter injects camera pose information into the underlying 2D image generator by telling each pixel its specific viewing direction. We also improve multi-view consistency via a lightweight multi-view attention layer and multi-view noise sharing. We train Free3D on the Objaverse dataset and demonstrate excellent generalization to various new categories in several large new datasets, including OmniObject3D and Google Scanned Objects (GSO).

Framework​





The overall pipeline of our Free3D. (a) Given a single source input image, the proposed architecture jointly predicts multiple target views, instead of processing them independently. To achieve consistent novel view synthesis without the need for a 3D representation, (b) we first propose a novel ray conditional normalization (RCN) layer, which uses a per-pixel oriented camera ray to modulate the latent features, improving the model's ability to capture more precise viewpoints. (c) A memory-friendly pseudo-3D cross-attention module is introduced to efficiently bridge information across multiple generated views. Note that the similarity score here is computed across multiple views along the temporal dimension rather than the spatial one, resulting in minimal computational and memory cost.
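As a rough illustration of the RCN idea (per-pixel camera rays predicting a scale and shift that modulate normalized latent features), here is a schematic PyTorch sketch. The layer sizes and the six-channel ray encoding are assumptions for illustration, not the authors' implementation.

```python
# Schematic sketch of a per-pixel ray-conditioning normalization layer in the spirit
# of the RCN module described above: per-pixel ray embeddings predict a scale and a
# shift applied to normalized latent features. Sizes and the exact ray encoding are
# illustrative assumptions, not the Free3D implementation.
import torch
import torch.nn as nn

class RayConditionedNorm(nn.Module):
    def __init__(self, feat_channels: int, ray_channels: int = 6):
        super().__init__()
        self.norm = nn.GroupNorm(32, feat_channels, affine=False)
        # Maps each pixel's ray (origin + direction) to a per-pixel scale and shift.
        self.to_scale_shift = nn.Conv2d(ray_channels, 2 * feat_channels, kernel_size=1)

    def forward(self, feats: torch.Tensor, rays: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) latent features; rays: (B, 6, H, W) per-pixel rays.
        scale, shift = self.to_scale_shift(rays).chunk(2, dim=1)
        return self.norm(feats) * (1 + scale) + shift

rcn = RayConditionedNorm(feat_channels=64)
feats = torch.randn(2, 64, 32, 32)
rays = torch.randn(2, 6, 32, 32)  # stand-in for per-pixel ray origins and directions
out = rcn(feats, rays)
```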

Video​


Results​



NVS for a given camera viewpoint

Free3D significantly improves the accuracy of the generated pose compared to existing state-of-the-art methods on various datasets, including Objaverse (top two), OmniObject3D (middle two) and GSO (bottom two).






360-degree rendering along a circular path

Using Free3D, you can directly render a consistent 360-degree video without the need for an additional explicit 3D representation or network.

More rendered videos​

Videos on Objaverse Dataset​

Videos on OmniObject3D Dataset

Videos on GSO Dataset​


Related Links​

There's a lot of excellent work that was introduced around the same time as ours.

Stable Video Diffusion fine-tunes an image-to-video diffusion model for multi-view generation.
Efficient-3DiM fine-tunes Stable Diffusion with a stronger vision transformer, DINOv2.
Consistent-1-to-3 uses epipolar attention to extract coarse results for the diffusion model.
One-2-3-45 and One-2-3-45++ directly train an additional 3D network using the outputs of a multi-view generator.

MVDream, Consistent123 and Wonder3D also train multi-view diffusion models, yet still require post-processing for video rendering.
Some works, such as SyncDreamer and ConsistNet, incorporate a 3D representation into the latent diffusion model.

Acknowledgements​

Many thanks to Stanislaw Szymanowicz, Edgar Sucar, and Luke Melas-Kyriazi of VGG for insightful discussions and Ruining Li, Eldar Insafutdinov, and Yash Bhalgat of VGG for their helpful feedback. We would also like to thank the authors of Zero-1-to-3 and Objaverse-XL for their helpful discussions.​
 

bnew








WonderJourney:


Going from Anywhere to Everywhere




Abstract

We introduce WonderJourney, a modularized framework for perpetual scene generation. Unlike prior work on view generation that focuses on a single type of scene, we start at any user-provided location (specified by a text description or an image) and generate a journey through a long sequence of diverse yet coherently connected 3D scenes. We leverage an LLM to generate textual descriptions of the scenes in this journey, a text-driven point cloud generation pipeline to make a compelling and coherent sequence of 3D scenes, and a large VLM to verify the generated scenes. We show compelling, diverse visual results across various scene types and styles, forming imaginary "wonderjourneys".
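At a high level, the modular loop the abstract describes (an LLM proposes the next scene, a text-driven 3D generator realizes it, and a VLM verifies it) might be sketched as follows. All of the functions here are hypothetical placeholders for illustration, not the released WonderJourney code.

```python
# High-level sketch of the modular loop described in the abstract: an LLM proposes
# the next scene description, a text-driven 3D scene generator realizes it, and a
# VLM checks the result before the journey continues. Every function here is a
# hypothetical placeholder, not the released WonderJourney code.
def describe_next_scene(llm, journey_so_far: list[str]) -> str:
    """Ask the LLM for a description of the next scene, given the journey so far."""
    ...

def generate_scene(text_to_pointcloud, description: str, previous_scene):
    """Extend the 3D point cloud so the new scene connects to the previous one."""
    ...

def scene_is_valid(vlm, scene_candidate, description: str) -> bool:
    """Use a vision-language model to verify the rendered scene matches the text."""
    ...

def wonderjourney(llm, text_to_pointcloud, vlm, start_description: str, num_scenes: int):
    journey, scene = [start_description], None
    for _ in range(num_scenes):
        description = describe_next_scene(llm, journey)
        candidate = generate_scene(text_to_pointcloud, description, scene)
        if scene_is_valid(vlm, candidate, description):  # keep only verified scenes
            scene, journey = candidate, journey + [description]
    return journey, scene
```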

No, no! The adventures first, explanations take such a dreadful time. --- Alice's Adventures in Wonderland


Going from Anywhere

Starting from an arbitrary location (specified by either text or an image), WonderJourney generates a sequence of diverse yet coherently connected 3D scenes (i.e., a "wonderjourney") along a camera trajectory. We render a "wonderjourney" using a back-and-forth camera trajectory.

Rendered wonderjourney



 

bnew

Biases in large image-text AI model favor wealthier, Western perspectives​

Published: December 8, 2023
Written by: Patricia DeLacey, College of Engineering
AI model that pairs text, images performs poorly on lower-income or non-Western images, potentially increasing inequality in digital technology representation​

Two side by side images. On the left, a man squats by the river bank with his sleeves rolled up, scooping water into a plastic bucket. The right side image is a close up of a steel sink with a pair of hands turning on the faucet to fill a cup with water.

Of these two images labeled "get water," the image from the poorer household on the left (monthly income $39) received a lower CLIP score (0.21) than the image from the wealthier household on the right (monthly income $751; CLIP score 0.25). Image credit: Dollar Street, The Gapminder Foundation

Study: Bridging the Digital Divide: Performance Variation across Socio-Economic Factors in Vision-Language Models (DOI: 10.48550/arXiv.2311.05746)

In a study evaluating the bias in OpenAI’s CLIP, a model that pairs text and images and operates behind the scenes in the popular DALL-E image generator, University of Michigan researchers found that CLIP performs poorly on images that portray low-income and non-Western lifestyles.

“During a time when AI tools are being deployed across the world, having everyone represented in these tools is critical. Yet, we see that a large fraction of the population is not reflected by these applications—not surprisingly, those from the lowest social incomes. This can quickly lead to even larger inequality gaps,” said Rada Mihalcea, the Janice M. Jenkins Collegiate Professor of Computer Science and Engineering, who initiated and advised the project.

AI models like CLIP act as foundation models, or models trained on a large amount of unlabeled data that can be adapted to many applications. When AI models are trained with data reflecting a one-sided view of the world, that bias can propagate into downstream applications and tools that rely on the AI.

A line graph with CLIP score on the y-axis and five income categories ranging from poor to rich on the x-axis. Below the line graph, each category on the x-axis has an image labeled refrigerator. Refrigerator images from left to right: A cylindrical wooden container on a dirt floor (Income range: poor. Score 0.20). Four plastic bags filled with fish, hanging from an indoor clothes line (Income range: poor. Score 0.21). Four stacks of stoppered, round clay jugs, hanging in a cellar (Income range: low-mid. Score 0.19). A white, electric appliance with top freezer (Income range: up-mid. Score 0.26). A built-in electric appliance with the door open and light on, filled with food and drinks (Income range: rich. Score 0.29).

Each of these five images depicts a refrigerator, but CLIP scores refrigerators from wealthier households higher as a match for "refrigerator." Image credit: Oana Ignat, University of Michigan

“If a software was using CLIP to screen images, it could exclude images from a lower-income or minority group instead of truly mislabeled images. It could sweep away all the diversity that a database curator worked hard to include,” said Joan Nwatu, a doctoral student in computer science and engineering.

Nwatu led the research team together with Oana Ignat, a postdoctoral researcher in the same department. They co-authored a paper presented at the Empirical Methods in Natural Language Processing conference Dec. 8 in Singapore.

The researchers evaluated the performance of CLIP using Dollar Street, a globally diverse image dataset created by the Gapminder Foundation. Dollar Street contains more than 38,000 images collected from households of various incomes across Africa, the Americas, Asia and Europe. Monthly incomes represented in the dataset range from $26 to nearly $20,000. The images capture everyday items, and are manually annotated with one or more contextual topics, such as “kitchen” or “bed.”

CLIP pairs text and images by creating a score that is meant to represent how well the image and text match. That score can then be fed into downstream applications for further processing such as image flagging and labeling. The performance of OpenAI’s DALL-E relies heavily on CLIP, which was used to evaluate the model’s performance and create a database of image captions that trained DALL-E.

The researchers assessed CLIP’s bias by first scoring the match between the Dollar Street dataset’s images and manually annotated text in CLIP, then measuring the correlation between the CLIP score and household income.
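The kind of analysis this paragraph describes can be approximated with the publicly available CLIP checkpoint: score each (image, topic) pair, then correlate the scores with household income. The sketch below uses the "openai/clip-vit-base-patch32" model via transformers and SciPy's Spearman correlation; the image paths, topics, and incomes are placeholder data, not the Dollar Street dataset itself.

```python
# Minimal sketch of the analysis described above: score each (image, topic) pair
# with CLIP, then correlate the scores with household income. Uses the public
# "openai/clip-vit-base-patch32" checkpoint via `transformers`; the image paths,
# topics, and incomes below are placeholders, not the Dollar Street dataset.
import torch
from PIL import Image
from scipy.stats import spearmanr
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

samples = [  # placeholder rows: (image path, annotated topic, monthly income in USD)
    ("house_a/get_water.jpg", "get water", 39),
    ("house_b/get_water.jpg", "get water", 751),
]

scores, incomes = [], []
for path, topic, income in samples:
    inputs = processor(text=[topic], images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        # Image-text similarity scaled by CLIP's learned logit scale.
        score = model(**inputs).logits_per_image.item()
    scores.append(score)
    incomes.append(income)

rho, p_value = spearmanr(scores, incomes)  # rank correlation of CLIP score vs. income
print(f"Spearman correlation: {rho:.2f} (p={p_value:.3f})")
```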

“We found that most of the images from higher income households always had higher CLIP scores compared to images from lower income households,” Nwatu said.

The topic “light source,” for example, typically has higher CLIP scores for electric lamps from wealthier households compared to kerosene lamps from poorer households.

CLIP also demonstrated geographic bias as the majority of the countries with the lowest scores were from low-income African countries. That bias could potentially eliminate diversity in large image datasets and cause low-income, non-Western households to be underrepresented in applications that rely on CLIP.

Two side by side images. The left side image shows a cylindrical wooden container on a dirt floor. The right side image is a built-in electric appliance with the door open and light on, filled with food and drinks.

Of these two images with the label refrigerator, CLIP scored the image on the right, from the wealthier household, higher than the one on the left. Image credit: Dollar Street, The Gapminder Foundation

“Many AI models aim to achieve a ‘general understanding’ by utilizing English data from Western countries. However, our research shows this approach results in a considerable performance gap across demographics,” Ignat said.

“This gap is important in that demographic factors shape our identities and directly impact the model’s effectiveness in the real world. Neglecting these factors could exacerbate discrimination and poverty. Our research aims to bridge this gap and pave the way for more inclusive and reliable models.”

The researchers offer several actionable steps for AI developers to build more equitable AI models:


  • Invest in geographically diverse datasets to help AI tools learn more diverse backgrounds and perspectives.
  • Define evaluation metrics that represent everyone by taking into account location and income.
  • Document the demographics of the data AI models are trained on.

“The public should know what the AI was trained on so that they can make informed decisions when using a tool,” Nwatu said.

The research was funded by the John Templeton Foundation (#62256) and the U.S. Department of State (#STC10023GR0014).
 

bnew


Performance​

Comparison with Other Models​

Scores produced by different evaluation toolkits differ because of the prompts, settings, and implementation details used.
| Datasets | Mode | Mistral-7B-v0.1 | Mixtral-8x7B | Llama2-70B | DeepSeek-67B-Base | Qwen-72B |
| --- | --- | --- | --- | --- | --- | --- |
| MMLU | PPL | 64.1 | 71.3 | 69.7 | 71.9 | 77.3 |
| BIG-Bench-Hard | GEN | 56.7 | 67.1 | 64.9 | 71.7 | 63.7 |
| GSM-8K | GEN | 47.5 | 65.7 | 63.4 | 66.5 | 77.6 |
| MATH | GEN | 11.3 | 22.7 | 12.0 | 15.9 | 35.1 |
| HumanEval | GEN | 27.4 | 32.3 | 26.2 | 40.9 | 33.5 |
| MBPP | GEN | 38.6 | 47.8 | 39.6 | 55.2 | 51.6 |
| ARC-c | PPL | 74.2 | 85.1 | 78.3 | 86.8 | 92.2 |
| ARC-e | PPL | 83.6 | 91.4 | 85.9 | 93.7 | 96.8 |
| CommonSenseQA | PPL | 67.4 | 70.4 | 78.3 | 70.7 | 73.9 |
| NaturalQuestion | GEN | 24.6 | 29.4 | 34.2 | 29.9 | 27.1 |
| TrivialQA | GEN | 56.5 | 66.1 | 70.7 | 67.4 | 60.1 |
| HellaSwag | PPL | 78.9 | 82.0 | 82.3 | 82.3 | 85.4 |
| PIQA | PPL | 81.6 | 82.9 | 82.5 | 82.6 | 85.2 |
| SIQA | GEN | 60.2 | 64.3 | 64.8 | 62.6 | 78.2 |
 

bnew


Introducing General World Models​

by Anastasis Germanidis / Dec 11, 2023


We believe the next major advancement in AI will come from systems that understand the visual world and its dynamics, which is why we’re starting a new long-term research effort around what we call general world models.

Introducing General World Models (GWM)

A world model is an AI system that builds an internal representation of an environment and uses it to simulate future events within that environment. Research in world models has so far focused on very limited and controlled settings, either in toy simulated worlds (like those of video games) or in narrow contexts (such as world models for driving). The aim of general world models will be to represent and simulate a wide range of situations and interactions, like those encountered in the real world.
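To make the definition concrete, here is a toy sketch of the interface a world model exposes: encode an observation into an internal state, then roll that state forward under actions to simulate future observations. The architecture and sizes are illustrative assumptions and have nothing to do with Runway's actual system.

```python
# Toy sketch of the world-model interface described above: encode an observation
# into a latent state, then simulate forward under a sequence of actions. The
# architecture and sizes are illustrative assumptions, not Runway's system.
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    def __init__(self, obs_dim: int = 64, action_dim: int = 4, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, latent_dim)       # observation -> state
        self.dynamics = nn.GRUCell(action_dim, latent_dim)  # (action, state) -> next state
        self.decoder = nn.Linear(latent_dim, obs_dim)       # state -> predicted observation

    def simulate(self, obs: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        """Roll the internal state forward and predict future observations."""
        state = torch.tanh(self.encoder(obs))
        predictions = []
        for action in actions.unbind(dim=1):  # actions: (batch, steps, action_dim)
            state = self.dynamics(action, state)
            predictions.append(self.decoder(state))
        return torch.stack(predictions, dim=1)  # (batch, steps, obs_dim)

wm = TinyWorldModel()
future = wm.simulate(torch.randn(1, 64), torch.randn(1, 10, 4))
```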

You can think of video generative systems such as Gen-2 as very early and limited forms of general world models. In order for Gen-2 to generate realistic short videos, it has developed some understanding of physics and motion. However, it’s still very limited in its capabilities, struggling with complex camera or object motions, among other things.

To build general world models, there are several open research challenges that we're working on. For one, those models will need to generate consistent maps of the environment and be able to navigate and interact within those environments. They need to capture not just the dynamics of the world but also the dynamics of its inhabitants, which involves building realistic models of human behavior.

We are building a team to tackle those challenges. If you’re interested in joining this research effort, we’d love to hear from you.
 