
Multi-speaker audio transcription

This project is a CLI for multi-speaker audio transcription using OpenAI Whisper (text transcription), Pyannote-Audio (speaker detection) and Spleeter (voice extraction). It can be used to extract audio segments for each speaker and to create transcriptions in various formats (txt, srt, sami, dfxp, transc).

It's compatible with Windows, Linux and macOS.


Setup

Install system dependencies

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg

Install python dependencies

pip3 install tqdm setuptools-rust pycaption simpleaudio simple-term-menu colour plotly mutagen pydub spleeter pyannote.audio git+https://github.com/openai/whisper.git

Data structure

This project requires a fixed folder structure for your data. Your input data in raw_audio/ or raw_audio_voices/ may be organized in subfolders.

data/
    raw_audio/              Your original audio data (any formats)
    raw_audio_voices/       Preprocessed audio data (only .wav)

    diarization/            Output folder of --audio-to-voices and --set-speakers
    text/                   Output folder of --audio-to-text
    voice_splits/           Output folder of --text-to-splits
    output/                 Output folder for various results
        slices/                 Audio slices ordered by speaker (--slice)
        analysis/               Analysis output (--viewer)
        transcripts/            Transcripts output (--transcribe)

1. Optional: Audio preprocessing / voice extraction

Follow the setup instructions from Spleeter.

Run the voice extraction process to filter out background audio from the audio files located in raw_audio/.

python -m transcripy --audio-extract-voice

Optional arguments:

--model [spleeter:2stems, spleeter:4stems, spleeter:5stems]    \\ Select the Spleeter model (2, 4 or 5 stems)
--data-path [path]              \\ Root directory of the data (without raw_audio/)
--extract-all                   \\ Extract all stems, not only the vocals
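
For reference, the extraction step boils down to a couple of calls into Spleeter's Python API. A minimal sketch, with illustrative file paths (transcripy manages the folder layout itself):

# Hedged sketch of the Spleeter call behind --audio-extract-voice
from spleeter.separator import Separator

separator = Separator("spleeter:2stems")   # vocals + accompaniment
separator.separate_to_file(
    "data/raw_audio/interview.mp3",        # illustrative input file
    "data/raw_audio_voices/",              # output folder, one subfolder per input
)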

Alternatives

  • Use RipX to extract voices from audio files in data/raw_audio. Place the extracted voice tracks in data/raw_audio_voices.

2. Automatic speech recognition

Follow the setup instructions from OpenAI Whisper.

Transcribe the audio files (.wav only!) located in raw_audio_voices/ with

python -m transcripy --audio-to-text 

Optional arguments:

--model [tiny,base,small,medium,large]    \\ Select the Whisper model
--language [lang]               \\ Force a language instead of auto-detecting it
--data-path [path]              \\ Root directory of the data (without raw_audio_voices/)
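
Under the hood this corresponds to Whisper's Python API. A minimal sketch (model size and file name are illustrative, not necessarily what transcripy picks):

# Hedged sketch of the Whisper call behind --audio-to-text
import whisper

model = whisper.load_model("base")
result = model.transcribe("data/raw_audio_voices/interview.wav")
print(result["text"])                # full transcript
for segment in result["segments"]:   # timestamped segments
    print(segment["start"], segment["end"], segment["text"])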

3. Detect individual people

Follow the setup instructions from Pyannote-Audio.

Run the diarization process to detect the individual speakers in audio files located in raw_audio_voices/.

python -m transcripy --audio-to-voices 

Optional arguments:

--model [pyannote/speaker-diarization, pyannote/segmentation, pyannote/speaker-segmentation, pyannote/overlapped-speech-detection, pyannote/voice-activity-detection]    \\ Select the pyannote model
--data-path [path]              \\ Root directory of the data (without raw_audio_voices/)
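
A minimal sketch of such a diarization run with pyannote.audio (the file name is illustrative; recent releases of the pretrained pipeline additionally require a Hugging Face access token):

# Hedged sketch of the pyannote.audio call behind --audio-to-voices
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("data/raw_audio_voices/interview.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")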

Optional: Assign speakers

To rename the detected speakers in the audio files, run

python -m transcripy --set-speakers

4. Create outputs

Important: Make sure that you have completed steps 2 and 3.

Create the data you need.

Transcriptions

Create transcriptions in various formats with

python -m transcripy --transcribe
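
The caption formats match what the pycaption package from the dependency list can read and write. A hedged sketch of converting a finished transcript between two of them (file names are illustrative):

# Convert an srt transcript to sami with pycaption
from pycaption import SRTReader, SAMIWriter

with open("data/output/transcripts/interview.srt") as f:
    captions = SRTReader().read(f.read(), lang="en")
with open("data/output/transcripts/interview.sami", "w") as f:
    f.write(SAMIWriter().write(captions))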

Analysis

Create HTML files for visualization of the results with

python -m transcripy --viewer
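
One way such a visualization can be built is a per-speaker timeline with Plotly (also in the dependency list). A minimal sketch with made-up segment data:

# Hedged sketch of a speaker timeline, written to a standalone HTML file
import pandas as pd
import plotly.express as px

segments = pd.DataFrame({
    "speaker": ["SPEAKER_00", "SPEAKER_01", "SPEAKER_00"],
    "start": pd.to_datetime([0.0, 4.2, 9.8], unit="s"),
    "end": pd.to_datetime([4.2, 9.8, 14.1], unit="s"),
})
fig = px.timeline(segments, x_start="start", x_end="end", y="speaker")
fig.write_html("data/output/analysis/interview.html")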

Slice

Slice the audio files into separate audio slices, ordered by speaker, with

python -m transcripy --slice
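
Cutting one slice out of a diarized segment is essentially a single operation with pydub (from the dependency list). A minimal sketch with illustrative timestamps and paths:

# Export one per-speaker slice from a source file
from pydub import AudioSegment

audio = AudioSegment.from_wav("data/raw_audio_voices/interview.wav")
clip = audio[12_300:15_800]   # pydub slices in milliseconds
clip.export("data/output/slices/SPEAKER_00/interview_000.wav", format="wav")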

Extra: Text-to-speech synthesis

Option 1: Voice Cloning App

  • Download the executable for Voice-Cloning-App
  • Start it
  • Download a model for your language
  • Create a dataset for one speaker with python -m create-dataset <SPEAKER>
  • Load the dataset into Voice-Cloning-App

Option 2: Real Time Voice Cloning

Follow the setup instructions from Real-Time Voice Cloning.

python -m voice-synthesis

Related

See this Jupyter notebook for a different implementation.
