Introduction

The Yi series models are large language models trained from scratch by developers at 01.AI. The first public release contains two bilingual (English/Chinese) base models with parameter sizes of 6B and 34B. Both are trained with a 4K sequence length, which can be extended to 32K at inference time.

News

  • 🎯 2023/11/02: The base models of Yi-6B and Yi-34B are released.

Model Performance

| Model | MMLU (5-shot) | CMMLU (5-shot) | C-Eval (5-shot) | GAOKAO (0-shot) | BBH (3-shot@1) | Common-sense Reasoning | Reading Comprehension | Math & Code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA2-34B | 62.6 | - | - | - | 44.1 | 69.9 | 68.0 | 26.0 |
| LLaMA2-70B | 68.9 | 53.3 | - | 49.8 | 51.2 | 71.9 | 69.4 | 36.8 |
| Baichuan2-13B | 59.2 | 62.0 | 58.1 | 54.3 | 48.8 | 64.3 | 62.4 | 23.0 |
| Qwen-14B | 66.3 | 71.0 | 72.1 | 62.5 | 53.4 | 73.3 | 72.5 | 39.8 |
| Skywork-13B | 62.1 | 61.8 | 60.6 | 68.1 | 41.7 | 72.4 | 61.4 | 24.9 |
| InternLM-20B | 62.1 | 59.0 | 58.8 | 45.5 | 52.5 | 78.3 | - | 30.4 |
| Aquila-34B | 67.8 | 71.4 | 63.1 | - | - | - | - | - |
| Falcon-180B | 70.4 | 58.0 | 57.8 | 59.0 | 54.0 | 77.3 | 68.8 | 34.0 |
| Yi-6B | 63.2 | 75.5 | 72.0 | 72.2 | 42.8 | 72.3 | 68.7 | 19.8 |
| Yi-34B | 76.3 | 83.7 | 81.4 | 82.8 | 54.3 | 80.1 | 76.4 | 37.1 |
While benchmarking open-source models, we have observed a disparity between the results generated by our pipeline and those reported in public sources (e.g. OpenCompass). Upon conducting a more in-depth investigation of this difference, we have discovered that various models may employ different prompts, post-processing strategies, and sampling techniques, potentially resulting in significant variations in the outcomes. Our prompt and post-processing strategy remains consistent with the original benchmark, and greedy decoding is employed during evaluation without any post-processing for the generated content. For scores that were not reported by the original authors (including scores reported with different settings), we try to get results with our pipeline.

To evaluate the model's capability extensively, we adopted the methodology outlined in Llama 2. Specifically, we included PIQA, SIQA, HellaSwag, WinoGrande, ARC, OBQA, and CSQA to assess common-sense reasoning. SQuAD, QuAC, and BoolQ were incorporated to evaluate reading comprehension. CSQA was exclusively tested using a 7-shot setup, while all other tests were conducted with a 0-shot configuration. Additionally, we introduced GSM8K (8-shot@1), MATH (4-shot@1), HumanEval (0-shot@1), and MBPP (3-shot@1) under the category "Math & Code". Due to technical constraints, we did not test Falcon-180B on QuAC and OBQA; its score is derived by averaging the scores on the remaining tasks. Since the scores for these two tasks are generally lower than the average, we believe that Falcon-180B's performance is not underestimated.
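For readers who want to try a comparable setup, below is a minimal sketch of scoring one few-shot prompt with greedy decoding using the Hugging Face transformers library. The model ID 01-ai/Yi-6B and the placeholder prompt are assumptions for illustration; this is not the exact evaluation pipeline described above.

```python
# Minimal sketch of greedy-decoding evaluation, assuming the Hugging Face
# transformers API and the public model ID "01-ai/Yi-6B" (illustrative only;
# not the exact evaluation pipeline described above).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "01-ai/Yi-6B"  # assumed Hub ID for the 6B base model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
)

# A 5-shot style prompt: a few solved exemplars followed by the test question.
prompt = (
    "Question: ...\nAnswer: ...\n\n"   # few-shot exemplars would go here
    "Question: <test question>\nAnswer:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding (do_sample=False), no post-processing of the generated text.
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```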



 

About

Local AI talk with a custom voice based on the Zephyr 7B model. Uses RealtimeSTT with faster_whisper for transcription and RealtimeTTS with Coqui XTTS for synthesis.

Local AI Voice Chat

Talk with an AI in real time, completely locally on your PC, with a customizable AI personality and voice.

About the Project

Integrates the powerful Zephyr 7B language model with real-time speech-to-text and text-to-speech libraries to create a fast and engaging voice-based local chatbot (a rough wiring sketch follows the tech stack list below).

Local.AI.Talkbot.GithubClip.mov

Tech Stack

  • llama_cpp with Zephyr 7B
    • library interface for llama-based language models
  • RealtimeSTT with faster_whisper
    • real-time speech-to-text transcription library
  • RealtimeTTS with Coqui XTTS
    • real-time text-to-speech synthesis library
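As a rough illustration of how these three pieces fit together, here is a minimal sketch of the listen-think-speak loop, assuming the publicly documented entry points of llama-cpp-python, RealtimeSTT, and RealtimeTTS. It is not the project's actual source code, and the Zephyr prompt format shown is an assumption.

```python
# Minimal sketch of the STT -> LLM -> TTS loop, assuming the public APIs of
# llama-cpp-python, RealtimeSTT, and RealtimeTTS (not the project's real code).
from llama_cpp import Llama
from RealtimeSTT import AudioToTextRecorder
from RealtimeTTS import TextToAudioStream, CoquiEngine

llm = Llama(model_path="zephyr-7b-beta.Q5_K_M.gguf", n_gpu_layers=35)
recorder = AudioToTextRecorder()        # microphone -> text (faster_whisper)
tts = TextToAudioStream(CoquiEngine())  # text -> speech (Coqui XTTS)

while True:
    user_text = recorder.text()         # block until an utterance is transcribed
    # Assumed Zephyr chat format; adjust to the model's actual prompt template.
    prompt = f"<|user|>\n{user_text}</s>\n<|assistant|>\n"
    reply = llm(prompt, max_tokens=256, stop=["</s>"])["choices"][0]["text"]
    tts.feed(reply)                     # queue the reply for synthesis
    tts.play()                          # speak it (blocking)
```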

Notes

This software is in an experimental alpha state and does not provide production-ready stability. The current XTTS model used for synthesis still has glitches, and Zephyr, while really good for a 7B model, of course cannot compete with the answer quality of GPT-4, Claude, or Perplexity.

Please take this as a first attempt to provide an early version of a local realtime chatbot.

Updates

  • Bugfix to RealtimeTTS (download of Coqui model did not work properly)

Prerequisites

You will need a GPU with around 8 GB VRAM to run this in real-time.

  • NVIDIA CUDA Toolkit 11.8:
  • NVIDIA cuDNN 8.7.0 for CUDA 11.x:
    • Navigate to NVIDIA cuDNN Archive.
    • Locate and download "cuDNN v8.7.0 (November 28th, 2022), for CUDA 11.x".
    • Follow the provided installation guide.
  • FFmpeg:
    Install FFmpeg according to your operating system:
    • Ubuntu/Debian:
      sudo apt update && sudo apt install ffmpeg

    • Arch Linux:
      sudo pacman -S ffmpeg

    • macOS (Homebrew):
      brew install ffmpeg

    • Windows (Chocolatey):
      choco install ffmpeg

    • Windows (Scoop):
      scoop install ffmpeg

Installation Steps

  1. Clone the repository or download the source code package.
  2. Run the install_win.bat script. This will automatically handle the installation of required dependencies and prepare your environment. There may be warnings about numpy or fsspec incompatibilities, but you can ignore them; it will work nevertheless.
    If you are running UNIX or macOS, you need to adjust this script (if someone with more experience on these platforms could mail me install scripts, I would love to add them for these platforms).
  3. Download zephyr-7b-beta.Q5_K_M.gguf from here.
    • Open creation_params.json and enter the filepath to the downloaded model into model_path.
    • Adjust n_gpu_layers (0-35; raise it if you have more VRAM) and n_threads (the number of CPU threads; I recommend not using all available cores, leaving some for TTS). A sample configuration is sketched after this list.
  4. Implement a temporary workaround for an issue in the Coqui TTS library:
    • Activate your venv (test_env\Scripts\activate.bat under Windows, I think source test_env/bin/activate under Unix/Mac)
    • Execute the command pip show tts to find the installation path of the Coqui TTS library.
    • Navigate to the Coqui installation directory and proceed to TTS/tts/models.
    • Locate and open the xtts.py file in a text editor with administrative or sufficient privileges.
    • Within the handle_chunks method, modify the line if wav_overlap is not None: to if wav_overlap is not None and wav_chunk.shape[0] > 0:.
    • Note: This modification addresses a specific runtime issue I encountered while working with the Coqui library. Although it resolves the problem, it is a provisional solution. I have not yet submitted a pull request to the Coqui TTS repository because I honestly do not understand the underlying cause and the full implications of the change well enough to document it properly. This adjustment ensures functionality but should be approached with caution and technical oversight.
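For reference, a creation_params.json edited as described in step 3 might look roughly like the sketch below. Only the three fields mentioned above are shown, the values are placeholders to adapt to your hardware, and any other fields present in the real file should be left untouched.

```json
{
  "model_path": "models/zephyr-7b-beta.Q5_K_M.gguf",
  "n_gpu_layers": 35,
  "n_threads": 6
}
```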

Running the Application

To start the AI voice chat, run start.bat
 

🔥🔥We propose #SEINE, a video diffusion model that focuses on generative transition and prediction.

#SEINE supports *video transition generation* and *image-to-video animation*


- Project: https://vchitect.github.io/SEINE-project/
- Paper: https://arxiv.org/abs/2310.20700
- Code: https://github.com/Vchitect/SEINE


SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction



Recently video generation has achieved substantial progress with realistic results. Nevertheless, existing AI-generated videos are usually very short clips ("shot-level") depicting a single scene. To deliver a coherent long video ("story-level"), it is desirable to have creative transition and prediction effects across different clips. This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction. The goal is to generate high-quality long videos with smooth and creative transitions between scenes and varying lengths of shot-level videos. Specifically, we propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions. By providing the images of different scenes as inputs, combined with text-based control, our model generates transition videos that ensure coherence and visual quality. Furthermore, the model can be readily extended to various tasks such as image-to-video animation and autoregressive video prediction. To conduct a comprehensive evaluation of this new generative task, we propose three assessing criteria for smooth and creative transition: temporal consistency, semantic similarity, and video-text semantic alignment. Extensive experiments validate the effectiveness of our approach over existing methods for generative transition and prediction, enabling the creation of story-level long videos.
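To make the random-mask conditioning idea more concrete, the sketch below shows one way such a mask could be constructed for a transition task: the first and last frames of a clip are kept as given scene images, and the intermediate frames are masked out for the diffusion model to fill in under text guidance. This is an illustration of the general idea only, not the authors' implementation; all tensor shapes, names, and values are assumptions.

```python
# Illustrative sketch of random-mask conditioning for a transition task:
# keep the first and last frames as scene images, mask the frames in between.
# This is NOT the SEINE implementation; shapes and names are assumptions.
import torch

def build_transition_mask(num_frames: int, keep_first: int = 1, keep_last: int = 1):
    """Return a (num_frames,) mask: 1 = observed conditioning frame, 0 = to generate."""
    mask = torch.zeros(num_frames)
    mask[:keep_first] = 1.0
    mask[num_frames - keep_last:] = 1.0
    return mask

# Example: a 16-frame clip whose first and last frames come from two scenes.
frames = torch.randn(16, 3, 256, 256)              # (T, C, H, W) placeholder video
mask = build_transition_mask(num_frames=16)        # 1, 0, ..., 0, 1
conditioning = frames * mask[:, None, None, None]  # masked frames are zeroed out
# A diffusion model would then be asked to complete the masked frames,
# guided by the text prompt and the unmasked boundary frames.
```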
 

About

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.

Distil-Whisper

[Paper] [Models] [Colab]

Distil-Whisper is a distilled version of Whisper that is 6 times faster, 49% smaller, and performs within 1% word error rate (WER) on out-of-distribution evaluation sets.

| Model | Params / M | Rel. Latency | Short-Form WER | Long-Form WER |
| --- | --- | --- | --- | --- |
| whisper-large-v2 | 1550 | 1.0 | 9.1 | 11.7 |
| distil-large-v2 | 756 | 5.8 | 10.1 | 11.6 |
| distil-medium.en | 394 | 6.8 | 11.1 | 12.4 |
Note: Distil-Whisper is currently only available for English speech recognition. Multilingual support will be provided soon.
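As a quick illustration, a minimal transcription sketch with the transformers pipeline might look like the following. The checkpoint name distil-whisper/distil-large-v2 matches the table above, but treat the exact arguments as assumptions and check the repository's README for the recommended setup.

```python
# Minimal sketch of English transcription with Distil-Whisper via the
# Hugging Face transformers pipeline (arguments are illustrative; see the
# Distil-Whisper README for the recommended setup).
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",  # checkpoint from the table above
    torch_dtype=torch.float16,
    device="cuda:0" if torch.cuda.is_available() else "cpu",
)

# chunk_length_s enables long-form transcription by chunking the audio.
result = asr("sample_audio.wav", chunk_length_s=15)
print(result["text"])
```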
 