bnew

Veteran
Joined
Nov 1, 2015
Messages
56,113
Reputation
8,239
Daps
157,808
“Open-source AI is a tremendous innovation, and we appreciate that there are open questions and differing legal opinions. We expect them to be resolved over time, as AI becomes more ubiquitous and different groups come to a consensus as to how to balance individual rights and essential AI/ML research,” says Stability.AI’s Mason. “We strive to find the balance between innovating and helping the community.”

[Image: Rutkowski's "Castle Defense, 2018" (credit: Greg Rutkowski) alongside a Stable Diffusion image generated from a prompt using his name (credit: MS Tech via Stable Diffusion). In the AI image, a snaky dragon rises behind a wizard with a malformed face; a glowing dragon-shaped fireball hangs in the background, and something that looks like a cross between a sword and a pterodactyl sits in the foreground.]

Mason encourages any artists who don’t want their works in the data set to contact LAION, which is an independent entity from the startup. LAION did not immediately respond to a request for comment.

Berlin-based artists Holly Herndon and Mat Dryhurst are working on tools to help artists opt out of being in training data sets. They launched a site called Have I Been Trained, which lets artists search to see whether their work is among the 5.8 billion images in the data set that was used to train Stable Diffusion and Midjourney. Some online art communities, such as Newgrounds, are already taking a stand and have explicitly banned AI-generated images.

An industry effort called the Content Authenticity Initiative, which includes the likes of Adobe, Nikon, and the New York Times, is developing an open standard that would create a sort of watermark on digital content to prove its authenticity. It could help fight disinformation as well as ensure that digital creators get proper attribution.

“It could also be a way in which creators or IP holders can assert ownership over media that belongs to them or synthesized media that's been created with something that belongs to them,” says Nina Schick, an expert on deepfakes and synthetic media.


Pay-per-play

AI-generated art poses tricky legal questions. In the UK, where Stability.AI is based, scraping images from the internet without the artist’s consent to train an AI tool could be a copyright infringement, says Gill Dennis, a lawyer at the firm Pinsent Masons. Copyrighted works can be used to train an AI under “fair use,” but only for noncommercial purposes. While Stable Diffusion is free to use, Stability.AI also sells premium access to the model through a platform called DreamStudio.

The UK, which hopes to boost domestic AI development, wants to change laws to give AI developers greater access to copyrighted data. Under these changes, developers would be able to scrape works protected by copyright to train their AI systems for both commercial and noncommercial purposes.

While artists and other rights holders would not be able to opt out of this regime, they would be able to choose where they make their works available. The art community could end up moving to a pay-per-play or subscription model like the one used in the film and music industries.

“The risk, of course, is that rights holders simply refuse to make their works available, which would undermine the very reason for extending fair use in the AI development space in the first place,” says Dennis.

In the US, LinkedIn lost a case in an appeals court, which ruled last spring that scraping publicly available data from sources on the internet is not a violation of the Computer Fraud and Abuse Act. Google also won a case against authors who objected to the company’s scraping their copyrighted works for Google Books.

Rutkowski says he doesn’t blame people who use his name as a prompt. For them, “it’s a cool experiment,” he says. “But for me and many other artists, it’s starting to look like a threat to our careers.”
 

Him

1-1800-IMHIM
Supporter
Joined
Feb 15, 2017
Messages
12,018
Reputation
2,388
Daps
41,062
Reppin
Golden △
A.I should probably have never made it past personal assistants. We gonna regret this one day.
 

bnew




Alpaca-LoRA: Low-Rank LLaMA Instruct-Tuning​

  • 🤗 Try the pretrained model out here, courtesy of a GPU grant from Huggingface!
  • Share custom LoRA adapters, including adapters for the larger models, here
  • Users have created a Discord server for discussion and support here
  • alpaca-lora-30b can be used like ChatGPT; see here
This repository contains code for reproducing the Stanford Alpaca results using low-rank adaptation (LoRA). We provide an Instruct model of similar quality to text-davinci-003 that can run on a Raspberry Pi (for research), and the code is easily extended to the 13b, 30b, and 65b models.

In addition to the training code, which runs within five hours on a single RTX 4090, we publish a script for downloading and inference on the foundation model and LoRA, as well as the resulting LoRA weights themselves. To fine-tune cheaply and efficiently, we use Hugging Face's PEFT as well as Tim Dettmers' bitsandbytes.

Without hyperparameter tuning, the LoRA model produces outputs comparable to the Stanford Alpaca model. (Please see the outputs included below.) Further tuning might be able to achieve better performance; I invite interested users to give it a try and report their results.
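As a rough illustration of why LoRA fine-tuning is so cheap, here is a toy NumPy sketch of the low-rank update. This is illustrative math only, not Alpaca-LoRA's actual code (which uses Hugging Face's PEFT): the pretrained weight stays frozen, and only a pair of small rank-r factors is trained.

```python
import numpy as np

# Toy sketch of low-rank adaptation (LoRA): rather than updating a full
# d x d weight matrix W, train small factors A (r x d) and B (d x r)
# with rank r << d, and apply W + B @ A at inference time.
# All names and shapes here are illustrative.
d, r = 512, 8                      # hidden size and LoRA rank
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))    # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))               # B starts at zero, so W is unchanged at init

def adapted_forward(x):
    # Equivalent to (W + B @ A) @ x, but the trainable path only
    # involves 2*d*r parameters instead of d*d.
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d)
assert np.allclose(adapted_forward(x), W @ x)  # B == 0 => identical output
print(A.size + B.size, d * d)  # far fewer trainable parameters than full fine-tuning
```

With d=512 and r=8 the adapter trains 8,192 parameters against 262,144 for the full matrix; at 7B-model scale the same ratio is what lets the fine-tune fit on a single consumer GPU.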
 

bnew

I just had an awesome idea:

Make a web-page that:

  • Listens when someone speaks
  • Transcribes the words using WASM Whisper
  • Generates a new sentence using WASM GPT-2
  • Uses Web Speech API to synthesise the speech and play it on the speakers.
All of this running locally in the browser - no server required

I have all the ingredients and I think the performance is just enough. I just have to put it together.
The total data that the page will have to load on startup (probably using Fetch API) is:

  • 74 MB for the Whisper tiny.en model
  • 240 MB for the GPT-2 small model
  • Web Speech API is built-in in modern browsers
I think it will be very fun because you could talk to the web-page or even add extra devices that talk to each other only through the mic and the speakers. For example, you simply open the page on your phone and tablet and put them next to each other - listen to them talk about something 😄
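The loop described above can be sketched as a simple pipeline. The functions below are hypothetical stand-ins for the WASM Whisper, WASM GPT-2, and Web Speech API stages (the real thing would run entirely in the browser in JavaScript); this sketch only shows how the stages chain, and why two devices can "converse" through mic and speakers alone.

```python
# Data-flow sketch of the proposed all-local voice loop:
# listen -> Whisper transcribe -> GPT-2 generate -> speak.
# Each function is a stand-in for the corresponding browser component.
def transcribe(audio: bytes) -> str:
    # stand-in for WASM Whisper (tiny.en)
    return "hello there"

def generate(prompt: str) -> str:
    # stand-in for WASM GPT-2 small
    return prompt + " - nice to meet you"

def speak(text: str) -> str:
    # stand-in for the built-in Web Speech API synthesis
    return f"<speaking: {text}>"

def conversation_turn(audio: bytes) -> str:
    # One device's speaker output becomes the other device's mic input,
    # which is what lets two open pages talk to each other.
    return speak(generate(transcribe(audio)))

print(conversation_turn(b"..."))  # <speaking: hello there - nice to meet you>
```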

Any ideas to make this even more fun?

warning: uses a lot of memory​


 

bnew



Introducing GitHub Copilot X

Your AI pair programmer is leveling up​

With chat and terminal interfaces, support for pull requests, and early adoption of OpenAI’s GPT-4, GitHub Copilot X is our vision for the future of AI-powered software development. Integrated into every part of your workflow.



 

bnew


Hello Dolly: Democratizing the magic of ChatGPT with open models​

The Databricks Blog
by Mike Conover, Matt Hayes, Ankit Mathur, Xiangrui Meng, Jianwei Xie, Jun Wan, Ali Ghodsi, Patrick Wendell and Matei Zaharia
March 24, 2023 in Company Blog

Summary

We show that anyone can take a dated off-the-shelf open source large language model (LLM) and give it magical ChatGPT-like instruction following ability by training it in 30 minutes on one machine, using high-quality training data. Surprisingly, instruction-following does not seem to require the latest or largest models: our model is only 6 billion parameters, compared to 175 billion for GPT-3. We open source the code for our model (Dolly) and show how it can be re-created on Databricks. We believe models like Dolly will help democratize LLMs, transforming them from something very few companies can afford into a commodity every company can own and customize to improve their products.

Background

ChatGPT, a proprietary instruction-following model, was released in November 2022 and took the world by storm. The model was trained on trillions of words from the web, requiring massive numbers of GPUs to develop. This quickly led to Google and other companies releasing their own proprietary instruction-following models. In February 2023, Meta released the weights for a set of high-quality (but not instruction-following) language models called LLaMA to academic researchers, trained for over 80,000 GPU-hours each. Then, in March, Stanford built the Alpaca model, which was based on LLaMA, but tuned on a small dataset of 50,000 human-like questions and answers that, surprisingly, made it exhibit ChatGPT-like interactivity.

Introducing Dolly

Today we are introducing Dolly, a cheap-to-build LLM that exhibits a surprising degree of the instruction following capabilities exhibited by ChatGPT. Whereas the work from the Alpaca team showed that state-of-the-art models could be coaxed into high quality instruction-following behavior, we find that even years-old open source models with much earlier architectures exhibit striking behaviors when fine tuned on a small corpus of instruction training data. Dolly works by taking an existing open source 6 billion parameter model from EleutherAI and modifying it ever so slightly to elicit instruction following capabilities such as brainstorming and text generation not present in the original model, using data from Alpaca.
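For a sense of what "data from Alpaca" looks like in practice, here is a rough sketch of Alpaca-style instruction formatting. The template is a simplified version of the one the Stanford Alpaca project published (the real template uses a slightly different header when an "input" field is present), and `format_example` is an illustrative helper, not Dolly's actual training code.

```python
# Sketch of turning an Alpaca-style record into a training prompt.
# Field names (instruction / input / output) follow the Alpaca dataset.
def format_example(ex: dict) -> str:
    header = ("Below is an instruction that describes a task. "
              "Write a response that appropriately completes the request.")
    if ex.get("input"):
        # Records with extra context get an Input section.
        return (f"{header}\n\n### Instruction:\n{ex['instruction']}"
                f"\n\n### Input:\n{ex['input']}\n\n### Response:\n{ex['output']}")
    return (f"{header}\n\n### Instruction:\n{ex['instruction']}"
            f"\n\n### Response:\n{ex['output']}")

sample = {"instruction": "Name three primary colors.",
          "input": "",
          "output": "Red, yellow, and blue."}
print(format_example(sample))
```

Fine-tuning then amounts to standard next-token training over prompts like this one, which is why a small, high-quality instruction corpus can reshape the behavior of an older base model.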

The model underlying Dolly only has 6 billion parameters, compared to 175 billion in GPT-3, and is two years old, making it particularly surprising that it works so well. This suggests that much of the qualitative gains in state-of-the-art models like ChatGPT may owe to focused corpuses of instruction-following training data, rather than larger or better-tuned base models. We’re calling the model Dolly — after Dolly the sheep, the first cloned mammal — because it's an open source clone of an Alpaca, inspired by a LLaMA. We’re in the earliest days of the democratization of AI for the enterprise, and much work remains to be done, but we believe the technology underlying Dolly represents an exciting new opportunity for companies that want to cheaply build their own instruction-following models.

We evaluated Dolly on the instruction-following capabilities described in the InstructGPT paper that ChatGPT is based on and found that it exhibits many of the same qualitative capabilities, including text generation, brainstorming and open Q&A. Of particular note in these examples is not the quality of the generated text, but rather the vast improvement in instruction-following capability that results from fine tuning a years-old open source model on a small, high quality dataset.

Generation​

Write a love letter to Edgar Allan Poe.
Original Model:
HISTORY: Edgar Allan Poe rose to fame in the 19th Century. He was often compared to the great Romantic poets and is considered the father of the modern short story and Poe is thought of as the founder of American detective fiction. He has been dead for 130 years.
PATRICK CAMPOS: You should never try to write a love letter to a dead person.
PATRICK CAMPOS: Poe was more than just the author of The Raven. He was a master of atmosphere.
BUDDY FRITZLER: Edgar Allan Poe lived in Baltimore from 1831 until his death in 1849. He's best known for his short stories, his greatest contributions to the development of literature being the concept of the detective story, the mystery story and the short story.
BUDDY FRITZLER: Poe's most famous story, "The Gold-Bug", was written in 1843. It's about a treasure hunter who searches for buried treasure. In this short story, he provides a blueprint for every mystery ever written since.
BUDDY FRITZLER: The central idea that you have to look for clues is very appealing. It was even used as …
Dolly:
Dearest Edgar Allan Poe,
I met you first in the dimly lit stairwell of the old house, and I know that the moment I saw you, I never wanted to see you again. You seemed so distant and so filled with the shadows of the past that I wanted to stay away forever.
But I couldn't help but be obsessed with you, and my feelings for you have only grown over the years. Even now, I feel a deep connection with you that I can never explain.
I hope that you can find happiness in the life that you now lead. I am sure that you will do great things, and I am forever in your debt.
Sincerely,
[Your Name]

Write a tweet announcing Dolly, a large language model from Databricks.
Original Model:
In your tweet, provide a short summary of the challenge, how you analyzed it, and how Dolly performed.
After your tweet, I will retweet it to Databricks’ Twitter account.
Clean up your social media page for the next week by removing any mentions of Dolly or Databricks.
Data Preparation
Our first step is to generate word vector embeddings for each document. We can obtain these vector representations in two different ways:
Using a tokenized document set (in this case, everything in our download folder)
Loads a word2vec model from the Google Search API (in this case, vector2vec.c)
In our case, we want to use the first method, which is easier, but the second may be more robust. We will use the word2vec model above to create a vocabulary file for any new words found in the file. This process gives us a quick way to identify new words and find contextually related terms.
We can start by copying word2vec.c to a local directory called /tmp/word2vec.
#/tmp/word2vec/word2vec.c FILE_NAME /tmp/word
Dolly:
Databricks just announced Dolly - a new language model for natural language processing with large capacity and improved efficiency!
#DataScience #NaturalLanguageProcessing
[Link to Dolly model release]

I'm selling my Nikon D-750, write a short blurb for my ad.
Original Model:
I'm not sure if you can post a location - anything on the website will work.
I've only had it a week. It is in mint condition with the charging cable and window case intact. Bought it for $1800 for in store pickup at a store in San Francisco. Plus tax and the undisclosed Costco price.
Some one please let me know that this is a scam and not really an offer.
I just received and e-mail that said the camera is at someones house so someone else can pick it up and did I want to message them, well no I'm selling it as I would never …
Dolly:
Looking to get a reliable camera for photo and video work? Check out the Nikon D-750! This sleek, lightweight camera has a 24.2MP sensor and 1080/60p video recording, making it a great choice for capturing those moments in life. The ergonomic grip and nine-point autofocus system make it easy to work with and shoot without hassle. With Wi-Fi, you can upload your pictures easily to your computer or smartphone without having to buy a dedicated external storage. Buy the Nikon D-750 now to experience why it is the best choice for photography and videography!
 

bnew

{continued}

Brainstorming​

Give me a list of 5 science fiction books I should read next.
Original Model:
I know it’s early, and I should feel ashamed of myself for even thinking of it, but I have a terrible tendency to procrastinate and read only when I absolutely have to.
As a result, I have not read anything this year, in any genre.
As soon as I have a set of ideas for an article, I let the ideas and outline do the work of making sure I don’t start writing. There is something special about a blank page and an empty mind.
That said, I need to change that, and reading the above question made me realize just how badly I want ..…
Dolly:
1. 2001: A Space Odyssey by Arthur C. Clarke
2. The Hitchhiker's Guide to the Galaxy by Douglas Adams
3. The Lord of the Rings by J.R.R. Tolkien
4. The Hitchhiker's Guide to the Galaxy by Douglas Adams
5. 1984 by George Orwell.
Why Open Models?

There are many reasons a company would prefer to build its own model rather than send data to a centralized LLM provider that serves a proprietary model behind an API. For many companies, the problems and datasets most likely to benefit from AI represent their most sensitive and proprietary intellectual property, and handing it over to a third party may be unpalatable. Furthermore, organizations may have different tradeoffs in terms of model quality, cost, and desired behavior. We believe that most ML users are best served long term by directly owning their models.

We are open sourcing a simple Databricks notebook that you can use to build Dolly yourself on Databricks. Contact us at hello-dolly@databricks.com if you would like to get access to the trained weights.

What’s Next?

The release of Dolly is the first in a series of announcements Databricks is making that focus on helping every organization harness the power of large language models. We believe in the incredible power of artificial intelligence to transform the productivity of every organization and individual, and welcome you to join us on this journey. Stay tuned for more in this area in the coming weeks!

Acknowledgments

This work owes much to the efforts and insights of many incredible organizations. It would have been impossible without EleutherAI open sourcing and training GPT-J. We are inspired by the incredible ideas and data from the Stanford Center for Research on Foundation Models and specifically the team behind Alpaca. The core idea behind the outsized power of a small dataset comes from the original paper on Self-Instruct. We are also thankful to Hugging Face for hosting, open sourcing, and maintaining countless models and libraries; their contribution to the state of the art cannot be overstated.

Disclaimer: Generative AI is an emerging technology and we're in the early stages of research around how to address factual accuracy, bias, offensive responses, general toxicity, and hallucinations in LLMs. Dolly, like other language models, can sometimes exhibit these behaviors and we urge our users to exercise good judgment in designing applications of this technology.

 