Provider Documentation
This section provides detailed documentation for each supported transcription provider.
Overview
The WTF Transcript Converter supports six transcription providers, each with its own features and capabilities:
Whisper (OpenAI) - High-quality speech recognition with punctuation detection
Deepgram - Real-time and batch transcription with speaker diarization
AssemblyAI - Advanced AI transcription with sentiment analysis and auto-chapters
Rev.ai - Professional transcription services with speaker confidence
Canary (NVIDIA) - Hugging Face integration for local transcription
Parakeet (NVIDIA) - Hugging Face integration with transducer-based processing
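Each provider's payload has a distinct top-level shape (see the Data Format examples in the sections that follow), so a caller can often guess the source before choosing a converter. The sketch below shows one way to do that. `guess_provider` is a hypothetical helper, not part of the library, and its key checks are inferred only from the sample payloads in this document.

```python
from typing import Any, Mapping


def guess_provider(payload: Mapping[str, Any]) -> str:
    """Guess the provider from top-level keys (hypothetical helper, not library API)."""
    if "monologues" in payload:
        return "rev.ai"
    if "channels" in payload.get("results", {}):
        return "deepgram"
    if "utterances" in payload or "language_code" in payload:
        return "assemblyai"
    segments = payload.get("segments") or []
    if segments and "avg_logprob" in segments[0]:
        return "whisper"
    if segments:
        # The Canary and Parakeet samples share one shape, so they
        # cannot be told apart from the payload alone.
        return "canary-or-parakeet"
    return "unknown"
```

When the provider is already known from context, passing it explicitly is more reliable than shape-based guessing.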
Whisper (OpenAI)
Whisper is OpenAI’s automatic speech recognition system trained on 680,000 hours of multilingual and multitask supervised data.
Features
High-quality speech recognition
Punctuation detection
Word-level confidence scores
Log probability scores
Support for 99 languages
Data Format
Whisper returns transcriptions in the following format:
{
"text": "Hello, this is a test transcription.",
"language": "en",
"duration": 3.5,
"segments": [
{
"id": 0,
"start": 0.0,
"end": 3.5,
"text": "Hello, this is a test transcription.",
"avg_logprob": -0.4,
"words": [
{
"word": "Hello,",
"start": 0.0,
"end": 0.5,
"probability": 0.99
}
]
}
]
}
Usage
from wtf_transcript_converter.providers import WhisperConverter
converter = WhisperConverter()
wtf_doc = converter.convert_to_wtf(whisper_data)
# Access Whisper-specific features
print(f"Segment confidence: {wtf_doc.segments[0].confidence}")
print(f"Word probabilities: {[word.confidence for word in wtf_doc.words]}")
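Whisper reports segment quality as `avg_logprob`, a log probability, whereas the other providers report a confidence between 0 and 1. A common way to put the two on the same scale is to exponentiate the log probability. The sketch below assumes that mapping; the converter's actual formula is not documented here and may differ, and `logprob_to_confidence` is a hypothetical name.

```python
import math


def logprob_to_confidence(avg_logprob: float) -> float:
    """Map an average log probability to [0, 1] using exp(), clamped for safety."""
    return max(0.0, min(1.0, math.exp(avg_logprob)))


segment = {"avg_logprob": -0.4}  # value taken from the sample payload
print(f"{logprob_to_confidence(segment['avg_logprob']):.3f}")  # prints 0.670
```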
Deepgram
Deepgram provides real-time and batch transcription services with advanced features like speaker diarization and language detection.
Features
Real-time and batch transcription
Speaker diarization
Language detection
Channel support
Alternative transcriptions
Word-level timestamps
Data Format
Deepgram returns transcriptions in the following format:
{
"metadata": {
"duration": 3.5,
"language": "en-US"
},
"results": {
"channels": [
{
"alternatives": [
{
"transcript": "Hello, this is a test transcription.",
"confidence": 0.95,
"words": [
{
"word": "Hello,",
"start": 0.0,
"end": 0.5,
"confidence": 0.99,
"speaker": 0
}
]
}
]
}
]
}
}
Usage
from wtf_transcript_converter.providers import DeepgramConverter
converter = DeepgramConverter()
wtf_doc = converter.convert_to_wtf(deepgram_data)
# Access Deepgram-specific features
print(f"Speaker count: {len(wtf_doc.speakers)}")
print(f"Channel count: {len(deepgram_data['results']['channels'])}")
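Speaker labels in the Deepgram payload sit on individual words, nested under each channel's first alternative. The sketch below collects the distinct speaker IDs per channel directly from the raw payload. `speakers_per_channel` is a hypothetical helper that assumes the structure shown in the sample and treats the first alternative as the preferred one.

```python
from typing import Any, Dict, List, Set


def speakers_per_channel(deepgram_data: Dict[str, Any]) -> List[Set[int]]:
    """Return the set of speaker IDs found in each channel's first alternative."""
    result: List[Set[int]] = []
    for channel in deepgram_data["results"]["channels"]:
        alternatives = channel.get("alternatives", [])
        words = alternatives[0].get("words", []) if alternatives else []
        # Words carry a "speaker" key only when diarization is present.
        result.append({w["speaker"] for w in words if "speaker" in w})
    return result
```

An empty set for a channel means no word in that channel carried a speaker label.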
AssemblyAI
AssemblyAI provides advanced AI transcription services with features like sentiment analysis, auto-chapters, and IAB content classification.
Features
Advanced AI transcription
Sentiment analysis
Auto-chapters
IAB content classification
Speaker diarization
Utterance-level timestamps
Data Format
AssemblyAI returns transcriptions in the following format:
{
"text": "Hello, this is a test transcription.",
"language_code": "en_us",
"audio_duration": 3.5,
"confidence": 0.96,
"words": [
{
"text": "Hello,",
"start": 0,
"end": 500,
"confidence": 0.99
}
],
"utterances": [
{
"text": "Hello, this is a test transcription.",
"start": 0,
"end": 3500,
"confidence": 0.96,
"speaker": "A"
}
]
}
Usage
from wtf_transcript_converter.providers import AssemblyAIConverter
converter = AssemblyAIConverter()
wtf_doc = converter.convert_to_wtf(assemblyai_data)
# Access AssemblyAI-specific features
print(f"Utterance count: {len(assemblyai_data['utterances'])}")
print(f"Language code: {assemblyai_data['language_code']}")
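In the sample payload, word and utterance timestamps are in milliseconds (`"end": 500`), while `audio_duration` is in seconds. Any code that compares AssemblyAI timings with those of other providers has to normalize the units first. The sketch below is a minimal illustration working on the raw payload; `words_in_seconds` is a hypothetical helper, not part of the library.

```python
from typing import Any, Dict, List


def words_in_seconds(assemblyai_data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Copy the word list, converting start/end from milliseconds to seconds."""
    return [
        {
            "text": w["text"],
            "start": w["start"] / 1000.0,
            "end": w["end"] / 1000.0,
            "confidence": w.get("confidence"),
        }
        for w in assemblyai_data.get("words", [])
    ]
```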
Rev.ai
Rev.ai provides professional transcription services with speaker confidence scores and detailed timing information.
Features
Professional transcription services
Speaker confidence scores
Detailed timing information
Monologue-based structure
Element-level timestamps
Data Format
Rev.ai returns transcriptions in the following format:
{
"duration_seconds": 3.5,
"monologues": [
{
"speaker": 0,
"elements": [
{
"type": "text",
"value": "Hello,",
"ts": 0.0,
"end_ts": 0.5,
"confidence": 0.9
}
]
}
]
}
Usage
from wtf_transcript_converter.providers import RevAIConverter
converter = RevAIConverter()
wtf_doc = converter.convert_to_wtf(rev_ai_data)
# Access Rev.ai-specific features
print(f"Monologue count: {len(rev_ai_data['monologues'])}")
print(f"Duration: {rev_ai_data['duration_seconds']}s")
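Rev.ai groups elements by speaker turn, so building a flat, time-ordered word list means walking each monologue and carrying the speaker down to its elements. The sketch below does this on the raw payload. It is a hypothetical helper that assumes the field names in the sample and keeps only elements whose `type` is `"text"`, since other element types, such as punctuation, may not carry timestamps.

```python
from typing import Any, Dict, List


def flatten_monologues(rev_ai_data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten Rev.ai monologues into a list of timed words tagged with speaker."""
    words: List[Dict[str, Any]] = []
    for monologue in rev_ai_data.get("monologues", []):
        speaker = monologue.get("speaker")
        for element in monologue.get("elements", []):
            if element.get("type") != "text":
                continue  # skip elements without word timing
            words.append(
                {
                    "text": element["value"],
                    "start": element["ts"],
                    "end": element["end_ts"],
                    "confidence": element.get("confidence"),
                    "speaker": speaker,
                }
            )
    # Sort by start time in case monologues are not strictly sequential.
    return sorted(words, key=lambda w: w["start"])
```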
Canary (NVIDIA)
Canary is NVIDIA’s speech recognition model available through Hugging Face, providing local transcription capabilities.
Features
Local transcription (no API required)
Hugging Face integration
High-quality speech recognition
Word-level timestamps
Confidence scores
Data Format
Canary returns transcriptions in the following format:
{
"text": "Hello, this is a test transcription.",
"language": "en",
"duration": 3.5,
"segments": [
{
"start": 0.0,
"end": 3.5,
"text": "Hello, this is a test transcription.",
"confidence": 0.92,
"words": [
{
"word": "Hello,",
"start": 0.0,
"end": 0.5,
"confidence": 0.99
}
]
}
]
}
Usage
from wtf_transcript_converter.providers import CanaryConverter
converter = CanaryConverter()
wtf_doc = converter.convert_to_wtf(canary_data)
# Access Canary-specific features
print(f"Model: {wtf_doc.metadata.model}")
print(f"Provider: {wtf_doc.metadata.provider}")
Parakeet (NVIDIA)
Parakeet is NVIDIA’s transducer-based speech recognition model available through Hugging Face.
Features
Transducer-based processing
Hugging Face integration
Local transcription capabilities
Word-level timestamps
Confidence scores
Data Format
Parakeet returns transcriptions in the following format:
{
"text": "Hello, this is a test transcription.",
"language": "en",
"duration": 3.5,
"segments": [
{
"start": 0.0,
"end": 3.5,
"text": "Hello, this is a test transcription.",
"confidence": 0.91,
"words": [
{
"word": "Hello,",
"start": 0.0,
"end": 0.5,
"confidence": 0.98
}
]
}
]
}
Usage
from wtf_transcript_converter.providers import ParakeetConverter
converter = ParakeetConverter()
wtf_doc = converter.convert_to_wtf(parakeet_data)
# Access Parakeet-specific features
print(f"Model: {wtf_doc.metadata.model}")
print(f"Provider: {wtf_doc.metadata.provider}")
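The Canary and Parakeet sample payloads have the same shape: segments that each contain a list of words with a `confidence` value. One helper can therefore summarize either. The sketch below computes the mean word confidence across all segments; `mean_word_confidence` is a hypothetical name and assumes only the fields shown in the samples.

```python
from typing import Any, Dict, Optional


def mean_word_confidence(payload: Dict[str, Any]) -> Optional[float]:
    """Average the word-level confidence over all segments, or None if no words."""
    confidences = [
        word["confidence"]
        for segment in payload.get("segments", [])
        for word in segment.get("words", [])
        if "confidence" in word
    ]
    if not confidences:
        return None
    return sum(confidences) / len(confidences)
```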
Provider Comparison
Feature Matrix
Performance Comparison
| Metric | Whisper | Deepgram | AssemblyAI | Rev.ai | Canary | Parakeet |
|---|---|---|---|---|---|---|
| Accuracy | High | High | High | High | Medium | Medium |
| Speed | Medium | Fast | Medium | Medium | Slow | Slow |
| Cost | Medium | Low | Medium | High | Free | Free |
| Setup Complexity | Low | Low | Low | Low | High | High |
Choosing a Provider
Use Cases
Whisper
* High-quality transcription needed
* Offline processing preferred
* Multiple languages required
* Budget-conscious projects
Deepgram
* Real-time transcription needed
* Speaker diarization required
* High-volume processing
* Cost optimization important
AssemblyAI
* Advanced features needed (sentiment, chapters)
* High accuracy required
* Speaker diarization needed
* Budget allows for premium features
Rev.ai
* Professional transcription services
* High accuracy required
* Speaker confidence important
* Budget allows for premium pricing
Canary/Parakeet
* Local processing required
* No API costs desired
* Hugging Face integration preferred
* Setup complexity acceptable
Best Practices
Test Multiple Providers: Use cross-provider testing to compare results
Consider Use Case: Match provider features to your specific needs
Monitor Performance: Track accuracy, speed, and cost over time
Handle Errors Gracefully: Implement proper error handling for API failures
Cache Results: Store converted WTF documents to avoid re-processing
Validate Output: Always validate WTF documents after conversion
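Two of the practices above, caching results and validating output, can be sketched without any library-specific calls. The example below derives a stable cache key from the provider name and the raw payload, and runs basic sanity checks on a list of timed words. Both `cache_key` and `check_words` are hypothetical helpers, and the word dictionaries are assumed to use `start`, `end`, and `confidence` keys in seconds and the 0 to 1 range; the library's own validation may check more than this.

```python
import hashlib
import json
from typing import Any, Dict, List


def cache_key(provider: str, payload: Dict[str, Any]) -> str:
    """Build a stable key: same provider and same payload content give the same key."""
    # sort_keys makes the serialization independent of dict key order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{provider}:{canonical}".encode("utf-8")).hexdigest()


def check_words(words: List[Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems: List[str] = []
    for i, word in enumerate(words):
        if word["end"] < word["start"]:
            problems.append(f"word {i}: end precedes start")
        if not 0.0 <= word["confidence"] <= 1.0:
            problems.append(f"word {i}: confidence outside [0, 1]")
    return problems
```

A caller could look up `cache_key(...)` in a dictionary or on disk before converting, and store the converted document only when `check_words(...)` returns no problems.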
Integration Examples
See the Examples and Use Cases section for detailed integration examples with each provider.