bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,835

How to train your own Large Language Models

Wed Apr 19 2023 by Reza Shabani​


How Replit trains Large Language Models (LLMs) using Databricks, Hugging Face, and MosaicML

Introduction​

Large Language Models, like OpenAI's GPT-4 or Google's PaLM, have taken the world of artificial intelligence by storm. Yet most companies don't currently have the ability to train these models, and are completely reliant on only a handful of large tech firms as providers of the technology.
At Replit, we've invested heavily in the infrastructure required to train our own Large Language Models from scratch. In this blog post, we'll provide an overview of how we train LLMs, from raw data to deployment in a user-facing production environment. We'll discuss the engineering challenges we face along the way, and how we leverage the vendors that we believe make up the modern LLM stack: Databricks, Hugging Face, and MosaicML.
While our models are primarily intended for the use case of code generation, the techniques and lessons discussed are applicable to all types of LLMs, including general language models. We plan to dive deeper into the gritty details of our process in a series of blog posts over the coming weeks and months.

Why train your own LLMs?​

One of the most common questions for the AI team at Replit is "why do you train your own models?" There are plenty of reasons why a company might decide to train its own LLMs, ranging from data privacy and security to increased control over updates and improvements.
At Replit, we care primarily about customization, reduced dependency, and cost efficiency.
  • Customization. Training a custom model allows us to tailor it to our specific needs and requirements, including platform-specific capabilities, terminology, and context that will not be well-covered in general-purpose models like GPT-4 or even code-specific models like Codex. For example, our models are trained to do a better job with specific web-based languages that are popular on Replit, including JavaScript React (JSX) and TypeScript React (TSX).
  • Reduced dependency. While we'll always use the right model based on the task at hand, we believe there are benefits to being less dependent on only a handful of AI providers. This is true not just for Replit but for the broader developer community. It's why we plan to open source some of our models, which we could not do without the means to train them.
  • Cost efficiency. Although costs will continue to go down, LLMs are still prohibitively expensive for use amongst the global developer community. At Replit, our mission is to bring the next billion software creators online. We believe that a student coding on their phone in India should have access to the same AI as a professional developer in Silicon Valley. To make this possible, we train custom models that are smaller, more efficient, and can be hosted with drastically reduced cost.

Data pipelines​

LLMs require an immense amount of data to train. Training them requires building robust data pipelines that are highly optimized and yet flexible enough to easily include new sources of both public and proprietary data.

The Stack​

We begin with The Stack as our primary data source which is available on Hugging Face. Hugging Face is a great resource for datasets and pre-trained models. They also provide a variety of useful tools as part of the Transformers library, including tools for tokenization, model inference, and code evaluation.
The Stack is made available by the BigCode project. Details of the dataset construction are available in Kocetkov et al. (2022). Following de-duplication, version 1.2 of the dataset contains about 2.7 TB of permissively licensed source code written in over 350 programming languages.
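For anyone who wants to poke at the dataset directly, here is a minimal sketch of streaming a single language subset of The Stack with the Hugging Face datasets library. The data_dir layout and the "content" field are taken from the public bigcode/the-stack dataset card, but they are assumptions to verify (and the dataset's terms need to be accepted) before relying on this.

```python
# Minimal sketch: stream one language subset of The Stack from Hugging Face.
# Assumes `pip install datasets` and that the terms on the bigcode/the-stack
# dataset card have been accepted; subset path and field names should be
# double-checked against that card.
from itertools import islice
from datasets import load_dataset

stack_python = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",   # one directory per programming language
    split="train",
    streaming=True,           # avoids downloading ~2.7 TB up front
)

for example in islice(stack_python, 3):
    print(len(example["content"]))  # "content" holds the raw source file
```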
The Transformers library does a great job of abstracting away many of the challenges associated with model training, including working with data at scale. However, we find it insufficient for our process, as we need additional control over the data and the ability to process it in distributed fashion.
llm-training

Data processing​

When it comes time for more advanced data processing, we use Databricks to build out our pipelines. This approach also makes it easy for us to introduce additional data sources (such as Replit or Stack Overflow) into our process, which we plan to do in future iterations.
The first step is to download the raw data from Hugging Face. We use Apache Spark to parallelize the dataset builder process across each programming language. We then repartition the data and rewrite it out in parquet format with optimized settings for downstream processing.
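As a rough illustration of that repartition-and-rewrite step, here is a hedged PySpark sketch. The paths, partition count, and the assumption that the raw per-language dumps are JSON lines are illustrative, not the actual configuration.

```python
# Rough PySpark sketch of repartitioning raw per-language dumps and rewriting
# them as parquet. Paths, partition counts, and file formats are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stack-ingest").getOrCreate()

languages = ["python", "javascript", "typescript"]  # small subset for illustration
for lang in languages:
    raw = spark.read.json(f"/mnt/raw/the_stack/{lang}/")   # raw JSON-lines dump
    (
        raw.repartition(256)                 # even out skewed file sizes
           .write.mode("overwrite")
           .option("compression", "snappy")  # friendly to downstream readers
           .parquet(f"/mnt/clean/the_stack/{lang}/")
    )
```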
Next, we turn to cleaning and preprocessing our data. Normally, it’s important to deduplicate the data and fix various encoding issues, but The Stack has already done this for us using a near-deduplication technique outlined in Kocetkov et al. (2022). We will, however, have to rerun the deduplication process once we begin to introduce Replit data into our pipelines. This is where it pays off to have a tool like Databricks, where we can treat The Stack, Stack Overflow, and Replit data as three sources within a larger data lake, and utilize them as needed in our downstream processes.
An additional benefit of using Databricks is that we can run scalable and tractable analytics on the underlying data. We run all types of summary statistics on our data sources, check long-tail distributions, and diagnose any issues or inconsistencies in the process. All of this is done within Databricks notebooks, which can also be integrated with MLFlow to track and reproduce all of our analyses along the way. This step, which amounts to taking a periodic x-ray of our data, also helps inform the various steps we take for preprocessing.
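A typical notebook cell for this kind of data x-ray might look like the hedged sketch below; the storage path and column names are assumptions (in a Databricks notebook, spark is already defined).

```python
# Sketch of per-language summary statistics over the processed corpus.
# Column names ("lang", "content") and the storage path are assumptions.
from pyspark.sql import functions as F

df = spark.read.parquet("/mnt/clean/the_stack/*")

stats = (
    df.withColumn("n_chars", F.length("content"))
      .groupBy("lang")
      .agg(
          F.count("*").alias("n_files"),
          F.mean("n_chars").alias("avg_chars"),
          F.expr("percentile_approx(n_chars, 0.99)").alias("p99_chars"),
      )
      .orderBy(F.desc("n_files"))
)
stats.show(truncate=False)  # eyeball long-tail languages and outliers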
For preprocessing, we take the following steps (a rough sketch of a couple of these filters appears after the list):
  • We anonymize the data by removing any Personal Identifiable Information (PII), including emails, IP addresses, and secret keys.
  • We use a number of heuristics to detect and remove auto-generated code.
  • For a subset of languages, we remove code that doesn't compile or is not parseable using standard syntax parsers.
  • We filter out files based on average line length, maximum line length, and percentage of alphanumeric characters.
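Here is a rough, self-contained sketch of the PII scrubbing and line-length/alphanumeric filters; the thresholds and regular expressions below are illustrative placeholders, not the exact values used in the real pipeline.

```python
# Illustrative sketch of simple file-level cleaning heuristics like those
# listed above. Thresholds and PII patterns are placeholders, not the exact
# values used in the real pipeline.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def anonymize(text: str) -> str:
    """Replace obvious PII (emails, IPv4 addresses) with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return IPV4_RE.sub("<IP_ADDRESS>", text)

def keep_file(text: str,
              max_avg_line_len: float = 100.0,
              max_line_len: int = 1000,
              min_alnum_frac: float = 0.25) -> bool:
    """Heuristic filter on line lengths and alphanumeric density."""
    lines = text.splitlines() or [""]
    avg_len = sum(len(line) for line in lines) / len(lines)
    longest = max(len(line) for line in lines)
    alnum_frac = sum(ch.isalnum() for ch in text) / max(len(text), 1)
    return (avg_len <= max_avg_line_len
            and longest <= max_line_len
            and alnum_frac >= min_alnum_frac)
```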
the-stack-db-notebook

Tokenization and vocabulary training​

Prior to tokenization, we train our own custom vocabulary using a random subsample of the same data that we use for model training. A custom vocabulary allows our model to better understand and generate code content. This results in improved model performance, and speeds up model training and inference.
This step is one of the most important in the process, since it's used in all three stages of our process (data pipelines, model training, inference). It underscores the importance of having a robust and fully-integrated infrastructure for your model training process.
We plan to dive deeper into tokenization in a future blog post. At a high-level, some important things we have to account for are vocabulary size, special tokens, and reserved space for sentinel tokens.
Once we've trained our custom vocabulary, we tokenize our data. Finally, we construct our training dataset and write it out to a sharded format that is optimized for feeding into the model training process.
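As a concrete, hedged sketch of the vocabulary-training step, the snippet below trains a byte-level BPE tokenizer with the Hugging Face tokenizers library; the vocabulary size, special tokens, and number of reserved sentinel slots are illustrative choices, not production settings.

```python
# Sketch of training a custom byte-level BPE vocabulary on a code subsample.
# Vocabulary size, special tokens, and sentinel count are illustrative only.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

special_tokens = ["<|endoftext|>", "<|pad|>"]
# Reserve a block of sentinel tokens up front, e.g. for fill-in-the-middle
# or other infilling-style objectives later on.
special_tokens += [f"<sentinel_{i}>" for i in range(16)]

tokenizer.train(
    files=["code_subsample.txt"],  # random subsample of the training corpus
    vocab_size=32768,
    min_frequency=2,
    special_tokens=special_tokens,
)
tokenizer.save_model("tokenizer_out")  # writes vocab.json and merges.txt

print(tokenizer.encode("def hello():\n    return 'world'").tokens[:10])
```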

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,835

Model training​

We train our models using MosaicML. Having previously deployed our own training clusters, we found that the MosaicML platform gives us a few key benefits.
  • Multiple cloud providers. Mosaic gives us the ability to leverage GPUs from different cloud providers without the overhead of setting up an account and all of the required integrations.
  • LLM training configurations. The Composer library has a number of well-tuned configurations for training a variety of models and for different types of training objectives.
  • Managed infrastructure. Their managed infrastructure provides us with orchestration, efficiency optimizations, and fault tolerance (i.e., recovery from node failures).
In determining the parameters of our model, we consider a variety of trade-offs between model size, context window, inference time, memory footprint, and more. Larger models typically offer better performance and are more capable of transfer learning. Yet these models have higher computational requirements for both training and inference. The latter is especially important to us. Replit is a cloud native IDE with performance that feels like a desktop native application, so our code completion models need to be lightning fast. For this reason, we typically err on the side of smaller models with a smaller memory footprint and low latency inference.
In addition to model parameters, we also choose from a variety of training objectives, each with its own advantages and drawbacks. The most common training objective is next token prediction. This typically works well for code completion, but fails to take into account the context further downstream in a document. This can be mitigated by using a "fill-in-the-middle" objective, where a sequence of tokens in a document is masked and the model must predict it using the surrounding context. Yet another approach is UL2 (Unifying Language Learning Paradigms), which frames different objective functions for training language models as denoising tasks, where the model has to recover missing sub-sequences of a given input.
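To make the fill-in-the-middle idea concrete, here is a hedged sketch of the common prefix/suffix/middle rearrangement; the sentinel token names are assumptions and would have to exist in the trained vocabulary.

```python
# Sketch of building a "fill-in-the-middle" (FIM) training example via the
# common prefix/suffix/middle rearrangement. Sentinel names are assumptions.
import random

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def to_fim_example(document: str, rng: random.Random) -> str:
    """Split a document at two random points and rearrange it so the model
    learns to predict the middle span from its surrounding context."""
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # The model is still trained with ordinary next-token prediction on this
    # string, so everything after <fim_middle> is conditioned on both the
    # prefix and the suffix.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

rng = random.Random(0)
print(to_fim_example("def add(a, b):\n    return a + b\n", rng))
```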
loss-curves
Once we've decided on our model configuration and training objectives, we launch our training runs on multi-node clusters of GPUs. We're able to adjust the number of nodes allocated for each run based on the size of the model we're training and how quickly we'd like to complete the training process. Running a large cluster of GPUs is expensive, so it’s important that we’re utilizing them in the most efficient way possible. We closely monitor GPU utilization and memory to ensure that we're getting maximum possible usage out of our computational resources.
We use Weights & Biases to monitor the training process, including resource utilization as well as training progress. We monitor our loss curves to ensure that the model is learning effectively throughout each step of the training process. We also watch for loss spikes. These are sudden increases in the loss value and usually indicate issues with the underlying training data or model architecture. Because these occurrences often require further investigation and potential adjustments, we enforce data determinism within our process, so we can more easily reproduce, diagnose, and resolve the potential source of any such loss spike.
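A minimal sketch of what that monitoring might look like with the Weights & Biases Python client is shown below; the project name, metric names, and the moving-average spike rule are illustrative assumptions rather than the actual setup.

```python
# Hedged sketch of logging training metrics to Weights & Biases and flagging
# loss spikes against a short moving average. Names and thresholds are
# illustrative only.
import wandb

wandb.init(project="llm-training", name="example-run")

WINDOW, SPIKE_FACTOR = 50, 1.5
recent_losses = []

def log_step(step: int, loss: float, gpu_util: float) -> None:
    wandb.log({"train/loss": loss, "system/gpu_util": gpu_util}, step=step)
    # Flag a spike when the loss jumps well above the recent moving average.
    if len(recent_losses) == WINDOW:
        moving_avg = sum(recent_losses) / WINDOW
        if loss > SPIKE_FACTOR * moving_avg:
            wandb.alert(title="Loss spike", text=f"step={step} loss={loss:.3f}")
        recent_losses.pop(0)
    recent_losses.append(loss)
```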

Evaluation​

To test our models, we use a variation of the HumanEval framework as described in Chen et al. (2021). We use the model to generate a block of Python code given a function signature and docstring. We then run a test case on the function produced to determine if the generated code block works as expected. We run multiple samples and analyze the corresponding Pass@K numbers.
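For reference, the unbiased pass@k estimator from Chen et al. (2021) can be computed directly from the number of samples n, the number of samples c that pass the tests, and k:

```python
# Unbiased pass@k estimator from Chen et al. (2021): the probability that at
# least one of k samples, drawn from n generated completions of which c pass
# the tests, is correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 23 pass the unit tests.
print(pass_at_k(n=200, c=23, k=1))   # 0.115
print(pass_at_k(n=200, c=23, k=10))  # higher, since any of 10 tries may pass
```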
This approach works best for Python, with ready to use evaluators and test cases. But because Replit supports many programming languages, we need to evaluate model performance for a wide range of additional languages. We've found that this is difficult to do, and there are no widely adopted tools or frameworks that offer a fully comprehensive solution. Two specific challenges include conjuring up a reproducible runtime environment in any programming language, and ambiguity for programming languages without widely used standards for test cases (e.g., HTML, CSS, etc.). Luckily, a "reproducible runtime environment in any programming language" is kind of our thing here at Replit! We're currently building an evaluation framework that will allow any researcher to plug in and test their multi-language benchmarks. We'll be discussing this in a future blog post.
humaneval-results

Deployment to production​

Once we've trained and evaluated our model, it's time to deploy it into production. As we mentioned earlier, our code completion models should feel fast, with very low latency between requests. We accelerate our inference process using NVIDIA's FasterTransformer and Triton Server. FasterTransformer is a library implementing an accelerated engine for the inference of transformer-based neural networks, and Triton is a stable and fast inference server with easy configuration. This combination gives us a highly optimized layer between the transformer model and the underlying GPU hardware, and allows for ultra-fast distributed inference of large models.
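A hedged sketch of a client request to Triton over HTTP is below. The model name and tensor names are assumptions, and a real FasterTransformer GPT configuration typically requires additional inputs (sequence lengths, requested output length, sampling parameters), so treat this as the shape of the interaction rather than a working config.

```python
# Hedged sketch of querying a Triton Inference Server over HTTP with the
# official tritonclient package. Model and tensor names are assumptions; a
# real FasterTransformer GPT model usually needs more inputs (input lengths,
# requested output length, sampling parameters, ...).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

input_ids = np.array([[1, 2, 3, 4]], dtype=np.uint32)  # already-tokenized prompt
inp = httpclient.InferInput("input_ids", list(input_ids.shape), "UINT32")
inp.set_data_from_numpy(input_ids)

result = client.infer(model_name="fastertransformer", inputs=[inp])
print(result.as_numpy("output_ids"))  # generated token ids to detokenize
```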
Upon deploying our model into production, we're able to autoscale it to meet demand using our Kubernetes infrastructure. Though we've discussed autoscaling in previous blog posts, it's worth mentioning that hosting an inference server comes with a unique set of challenges. These include large artifacts (i.e., model weights) and special hardware requirements (i.e., varying GPU sizes/counts). We've designed our deployment and cluster configurations so that we're able to ship rapidly and reliably. For example, our clusters are designed to work around GPU shortages in individual zones and to look for the cheapest available nodes.
Before we place a model in front of actual users, we like to test it ourselves and get a sense of the model's "vibes". The HumanEval test results we calculated earlier are useful, but there’s nothing like working with a model to get a feel for it, including its latency, consistency of suggestions, and general helpfulness. Placing the model in front of Replit staff is as easy as flipping a switch. Once we're comfortable with it, we flip another switch and roll it out to the rest of our users.
monitoring
We continue to monitor both model performance and usage metrics. For model performance, we monitor metrics like request latency and GPU utilization. For usage, we track the acceptance rate of code suggestions and break it out across multiple dimensions including programming language. This also allows us to A/B test different models, and get a quantitative measure for the comparison of one model to another.
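As a sketch of how acceptance rates can be turned into a quantitative A/B comparison, the snippet below runs a two-proportion z-test on made-up counts; the numbers and metric definition are illustrative only.

```python
# Sketch of comparing suggestion acceptance rates between two models with a
# two-proportion z-test. All counts are made up for illustration.
from math import sqrt, erf

def two_proportion_z(accepted_a: int, shown_a: int,
                     accepted_b: int, shown_b: int):
    p_a, p_b = accepted_a / shown_a, accepted_b / shown_b
    p_pool = (accepted_a + accepted_b) / (shown_a + shown_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / shown_a + 1 / shown_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

z, p = two_proportion_z(accepted_a=1800, shown_a=10000,
                        accepted_b=1950, shown_b=10000)
print(f"z = {z:.2f}, p = {p:.4f}")
```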

Feedback and iteration​

Our model training platform gives us the ability to go from raw data to a model deployed in production in less than a day. But more importantly, it allows us to train and deploy models, gather feedback, and then iterate rapidly based on that feedback.
It's also important for our process to remain robust to any changes in the underlying data sources, model training objectives, or server architecture. This allows us to take advantage of new advancements and capabilities in a rapidly moving field where every day seems to bring new and exciting announcements.
Next, we’ll be expanding our platform to enable us to use Replit itself to improve our models. This includes techniques such as Reinforcement Learning from Human Feedback (RLHF), as well as instruction-tuning using data collected from Replit Bounties.

Next steps​

While we've made great progress, we're still in the very early days of training LLMs. We have tons of improvements to make and lots of difficult problems left to solve. This trend will only accelerate as language models continue to advance. There will be an ongoing set of new challenges related to data, algorithms, and model evaluation.
If you’re excited by the many engineering challenges of training LLMs, we’d love to speak with you. We love feedback, and would love to hear from you about what we're missing and what you would do differently.
We're always looking for talented engineers, researchers, and builders on the Replit AI team. Make sure to check out the open roles on our careers page. If you don't see the right role but think you can contribute, get in touch with us; we'd love to hear from you.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,835

You can now change the look of Snap’s My AI. Image: Snap

The OpenAI-powered chatbot is also being added to group chats, gaining the ability to make recommendations for things like AR filters, and will soon be able to even generate photos inside Snapchat.​


By ALEX HEATH / @alexeheath

Apr 19, 2023, 2:00 PM EDT



Snap is releasing its “My AI” chatbot to all of Snapchat’s 750 million monthly users for free, a move that comes less than two months after the OpenAI-powered bot was first made available to the app’s more than 3 million paid subscribers.

My AI is also becoming a more integral part of Snapchat. It can now be added to group chats by mentioning it with an @ symbol, and Snap will let people change the look and name of their bot with a custom Bitmoji avatar. In addition, My AI can now recommend AR filters to use in Snapchat’s camera or places to visit from the app’s map tab. And Snap plans to soon let people visually message My AI and receive generated responses; one example shown during the company’s annual conference today was a photo of tomatoes in a garden, which prompted the bot to respond with a generated image of gazpacho soup.

While Microsoft and Google are racing to integrate generative AI into their search engines, Snap CEO Evan Spiegel sees the technology as “an awesome creative tool.” During a recent interview, he shared personal examples of using My AI to create bedtime stories for his children and plan a birthday itinerary for his wife, Miranda Kerr. More than 2 million chats per day are already happening with My AI, he says.

“Just based on the way that they work, I think they’re much more suited to creative tasks,” Spiegel says of generative AI bots. “And some of the things that make them so creative are also the things that make them not so great at recalling specific information.”



“Just based on the way that they work, I think they’re much more suited to creative tasks.”

He describes Snap’s relationship with OpenAI, which is providing the foundational large language model for My AI, as a “close partnership.” It’s clear Spiegel is personally passionate about the project and sees My AI as a critical part of Snap’s future. While he declined to discuss the cost of running the chatbot, I’ve heard that Snap has been surprised at the affordability of operating it at scale.

Spiegel also remains tight-lipped about My AI’s potential impact on Snap’s advertising business, which has faced considerable growth challenges. He acknowledges that leveraging My AI’s interactions for ad targeting could be an opportunity but refrains from elaborating further, hinting at possible developments in the near future.

When My AI was first released to paying Snapchat Plus subscribers, it didn’t take long for it to misbehave. The Center for Humane Technology, for example, posted examples that included My AI coaching a 13-year-old girl about how to set the mood when having sex with a 31-year-old. Snap responded by adding more safeguards to My AI, including using a user’s self-reported age in Snapchat to inform how the bot responds to prompts.

Despite the problematic interactions that have already surfaced, Spiegel says the overwhelming majority of interactions with My AI have been positive. “The thing that gave us a lot of confidence is that as we monitored the way that people were using the service, we found that 99.5 percent of My AI replies conformed to our community guidelines,” he says.



An example conversation with My AI. Image: Snap

There’s a debate raging in the AI industry about whether companies should anthropomorphize chatbots with human personas. According to Spiegel, the ability to change My AI’s name and customize its appearance was one of the top requests from early users. “To me, that just speaks to the human desire to personalize things and make them feel like they’re their own.”

As for the broader concerns regarding the potential harm of generative AI, Spiegel offers an optimistic perspective: “When I compare this to almost any other technology that has been invented in the last 20 years, it’d be hard to name one where people have been more thoughtful about the way it’s being implemented and rolled out.”
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,835

About​

a simple yet interesting tool for chatting about video with chatGPT, miniGPT4 and StableLM

yinanhe.github.io/projects/chatvideo.html

Ask-Anything​

Currently, Ask-Anything is a simple yet interesting tool for chatting with video. Our team is now working to build a smart and robust chatbot for video understanding.

Open in Spaces

🎥 Online Demo​

Click the image or here to chat with ChatGPT about your videos.
demo






VideoChat with StableLM​

VideoChat is a multifunctional video question answering tool that combines the functions of Action Recognition, Visual Captioning and StableLM. Our solution generates dense, descriptive captions for any object and action in a video, offering a range of language styles to suit different user preferences. It supports conversations of varying lengths, emotional tones, and levels of language authenticity.

  • Video-Text Generation
  • Chat about uploaded video
  • Interactive demo

🔥 Updates​

  • 2023/04/20: Code Release

💬 Example​

(example images)

edit:
there's no online demo using StableLM, only ChatGPT
 
Last edited:

3rdWorld

Veteran
Joined
Mar 24, 2014
Messages
42,661
Reputation
3,297
Daps
124,859

Europe sounds the alarm on ChatGPT​

ChatGPT has recorded over 1.6 billion visits since December.​

Melissa Rossi
Sun, April 23, 2023 at 5:00 AM EDT
ChatGPT sign.

(Getty Images)
BARCELONA — Alarmed by the growing risks posed by generative artificial intelligence (AI) platforms like ChatGPT, regulators and law enforcement agencies in Europe are looking for ways to slow humanity’s headlong rush into the digital future.
With few guardrails in place, ChatGPT, which responds to user queries in the form of essays, poems, spreadsheets and computer code, recorded over 1.6 billion visits since December. Europol, the European Union Agency for Law Enforcement Cooperation, warned at the end of March that ChatGPT, just one of thousands of AI platforms currently in use, can assist criminals with phishing, malware creation and even terrorist acts.

“If a potential criminal knows nothing about a particular crime area, ChatGPT can speed up the research process significantly by offering key information that can then be further explored in subsequent steps,” the Europol report stated. “As such, ChatGPT can be used to learn about a vast number of potential crime areas with no prior knowledge, ranging from how to break into a home to terrorism, cybercrime and child sexual abuse.”

Last month, Italy slapped a temporary ban on ChatGPT after a glitch exposed user files. The Italian privacy rights board Garante threatened the program’s creator, OpenAI, with millions of dollars in fines for privacy violations until it addresses questions of where users’ information goes and establishes age restrictions on the platform. Spain, France and Germany are looking into complaints of personal data violations — and this month the EU's European Data Protection Board formed a task force to coordinate regulations across the 27-country European Union.

“It’s a wake-up call in Europe,” EU legislator Dragos Tudorache, co-sponsor of the Artificial Intelligence Act, which is being finalized in the European Parliament and would establish a central AI authority, told Yahoo News. “We have to discern very clearly what is going on and how to frame the rules.”


Even though artificial intelligence has been a part of everyday life for several years — Amazon’s Alexa and online chess games are just two of many examples — nothing has brought home the potential of AI like ChatGPT, an interactive “large language model” where users can have questions answered, or tasks completed, in seconds.

“ChatGPT has knowledge that even very few humans have,” said Mark Bünger, co-founder of Futurity Systems, a Barcelona-based consulting agency focused on science-based innovation. “Among the things it knows better than most humans is how to program a computer. So, it will probably be very good and very quick to program the next, better version of itself. And that version will be even better and program something no humans even understand.”

The startlingly efficient technology also opens the door for all kinds of fraud, experts say, including identity theft and plagiarism in schools.

“For educators, the possibility that submitted coursework might have been assisted by, or even entirely written by, a generative AI system like OpenAI’s ChatGPT or Google’s Bard, is a cause for concern,” Nick Taylor, deputy director of the Edinburgh Centre for Robotics, told Yahoo News.

OpenAI and Microsoft, which has financially backed OpenAI but has developed a rival chatbot, did not respond to a request for comment for this article.

“AI has been around for decades, but it’s booming now because it’s available for everyone to use,” said Cecilia Tham, CEO of Futurity Systems. Since ChatGPT was introduced as a free trial to the public on Nov. 30, Tham said, programmers have been adapting it to develop thousands of new chatbots, from PlantGPT, which helps to monitor houseplants, to the hypothetical ChaosGPT “that is designed to generate chaotic or unpredictable outputs,” according to its website, and ultimately “destroy humanity.”


Another variation, AutoGPT, short for Autonomous GPT, can perform more complicated goal-oriented tasks. “For instance,” said Tham, “you can say ‘I want to make 1,000 euros a day. How can I do that?’— and it will figure out all the intermediary steps to that goal. But what if someone says ‘I want to kill 1,000 people. Give me every step to do that’?” Even though the ChatGPT model has restrictions on the information it can give, she notes that “people have been able to hack around those.”

The potential hazards of chatbots, and AI in general, prompted the Future of Life Institute, a think tank focused on technology, to publish an open letter last month calling for a temporary halt to AI development. Signed by Elon Musk and Apple co-founder Steve Wozniak, it noted that “AI systems with human-competitive intelligence can pose profound risks to society and humanity,” and “AI labs [are] locked in an out-of-control race to develop and deploy ever more powerful digital minds that no one — not even their creators — can understand, predict, or reliably control.”
 

3rdWorld

Veteran
Joined
Mar 24, 2014
Messages
42,661
Reputation
3,297
Daps
124,859
The signatories called for a six-month pause on the development of AI systems more powerful than GPT-4 so that regulations could be hammered out, and they asked governments to “institute a moratorium” if the key players in the industry did not voluntarily do so.

EU parliamentarian Brando Benifei, co-sponsor of the AI Act, scoffs at that idea. “A moratorium is not realistic,” he told Yahoo News. “What we should do is to continue working on finding the correct rules for the development of AI,” he said. “We also need a global debate on how to address the challenges of this very powerful AI.”

This week, EU legislators working on AI published a “call to action” requesting that President Biden and European Commission President Ursula von der Leyen “convene a high-level global summit” to nail down “a preliminary set of governing principles for the development, control and deployment” of AI.

Tudorache told Yahoo News that the AI Act, which is expected to be enacted next year, “brings new powers to regulators to deal with AI applications” and gives EU regulators the authority to hand out hefty fines. The legislation also includes a risk-ordering of various AI activities, including those that are currently prohibited — such as “social scoring,” a dystopian monitoring scheme that would rate virtually every social interaction on a merit scale.

“Consumers should know what data ChatGPT is using and storing and what it is being used for,” Sébastien Pant, deputy head of communications at the European Consumer Organisation (BEUC), told Yahoo News. “It isn’t clear to us yet what data is being used, or whether data collection respects data protection law.”

The U.S., meanwhile, continues to lag on taking concrete steps to regulate AI, despite concerns recently raised by FTC Commissioner Alvaro Bedoya that “AI is being used right now to decide who to hire, who to fire, who gets a loan, who stays in the hospital and who gets sent home.”

When Biden was recently asked whether AI could be dangerous, he replied, “It remains to be seen — could be.”

The differing attitudes about protecting consumers’ personal data go back decades, Gabriela Zanfir-Fortuna, vice president for global privacy at the Future of Privacy Forum, a think tank focused on data protection, told Yahoo News.


“The EU has placed great importance on how the rights of people are affected by automating their personal data in this new computerized, digital age, to the point in which it included a provision in its Charter of Fundamental Rights,” Zanfir-Fortuna said. European countries such as Germany, Sweden and France adopted data protection laws 50 years ago, she added. “U.S. lawmakers seem to have been less concerned with this issue in previous decades, as the country still lacks a general data protection law at the federal level.”

In the meantime, Gerd Leonhard, author of “Technology vs. Humanity,” and others worry about what will happen when ChatGPT and more advanced forms of AI are used by the military, banking institutions and those working on environmental problems.

“The ongoing joke in the AI community,” said Leonhard, “is that if you ask AI to fix climate change, it would kill all humans. It's inconvenient for us, but it is the most logical answer.”
 

JLova

Veteran
Supporter
Joined
May 6, 2012
Messages
58,239
Reputation
4,042
Daps
175,140
If some important decisions aren’t made, this is it. The future looking extremely bleak. Better have that UBI ready.
 

Mook

We should all strive to be like Mr. Rogers.
Supporter
Joined
Apr 30, 2012
Messages
22,921
Reputation
2,420
Daps
58,567
Reppin
Raleigh

Europe sounds the alarm on ChatGPT​

You can sign up right now and the US army will teach you how to kill 1000 people. That's gotta be the dumbest argument I ever heard.
 