Provider Documentation

This section provides detailed documentation for each supported transcription provider.

Overview

The WTF Transcript Converter supports 6 major transcription providers, each with unique features and capabilities:

Whisper (OpenAI) - High-quality speech recognition with punctuation detection
Deepgram - Real-time and batch transcription with speaker diarization
AssemblyAI - Advanced AI transcription with sentiment analysis and auto-chapters
Rev.ai - Professional transcription services with speaker confidence
Canary (NVIDIA) - Hugging Face integration for local transcription
Parakeet (NVIDIA) - Hugging Face integration with transducer-based processing

Whisper (OpenAI)

Whisper is OpenAI’s automatic speech recognition system trained on 680,000 hours of multilingual and multitask supervised data.

Features

High-quality speech recognition
Punctuation detection
Word-level confidence scores
Log probability scores
Support for 99 languages

Data Format

Whisper returns transcriptions in the following format:

{
  "text": "Hello, this is a test transcription.",
  "language": "en",
  "duration": 3.5,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 3.5,
      "text": "Hello, this is a test transcription.",
      "avg_logprob": -0.4,
      "words": [
        {
          "word": "Hello,",
          "start": 0.0,
          "end": 0.5,
          "probability": 0.99
        }
      ]
    }
  ]
}

Usage

from wtf_transcript_converter.providers import WhisperConverter

converter = WhisperConverter()
wtf_doc = converter.convert_to_wtf(whisper_data)

# Access Whisper-specific features
print(f"Log probability: {wtf_doc.segments[0].confidence}")
print(f"Word probabilities: {[word.confidence for word in wtf_doc.words]}")

Deepgram

Deepgram provides real-time and batch transcription services with advanced features like speaker diarization and language detection.

Features

Real-time and batch transcription
Speaker diarization
Language detection
Channel support
Alternative transcriptions
Word-level timestamps

Data Format

Deepgram returns transcriptions in the following format:

{
  "metadata": {
    "duration": 3.5,
    "language": "en-US"
  },
  "results": {
    "channels": [
      {
        "alternatives": [
          {
            "transcript": "Hello, this is a test transcription.",
            "confidence": 0.95,
            "words": [
              {
                "word": "Hello,",
                "start": 0.0,
                "end": 0.5,
                "confidence": 0.99,
                "speaker": 0
              }
            ]
          }
        ]
      }
    ]
  }
}

Usage

from wtf_transcript_converter.providers import DeepgramConverter

converter = DeepgramConverter()
wtf_doc = converter.convert_to_wtf(deepgram_data)

# Access Deepgram-specific features
print(f"Speaker count: {len(wtf_doc.speakers)}")
print(f"Channel count: {len(deepgram_data['results']['channels'])}")

AssemblyAI

AssemblyAI provides advanced AI transcription services with features like sentiment analysis, auto-chapters, and IAB content classification.

Features

Advanced AI transcription
Sentiment analysis
Auto-chapters
IAB content classification
Speaker diarization
Utterance-level timestamps

Data Format

AssemblyAI returns transcriptions in the following format:

{
  "text": "Hello, this is a test transcription.",
  "language_code": "en_us",
  "audio_duration": 3.5,
  "confidence": 0.96,
  "words": [
    {
      "text": "Hello,",
      "start": 0,
      "end": 500,
      "confidence": 0.99
    }
  ],
  "utterances": [
    {
      "text": "Hello, this is a test transcription.",
      "start": 0,
      "end": 3500,
      "confidence": 0.96,
      "speaker": "A"
    }
  ]
}

Usage

from wtf_transcript_converter.providers import AssemblyAIConverter

converter = AssemblyAIConverter()
wtf_doc = converter.convert_to_wtf(assemblyai_data)

# Access AssemblyAI-specific features
print(f"Utterance count: {len(assemblyai_data['utterances'])}")
print(f"Language code: {assemblyai_data['language_code']}")

Rev.ai

Rev.ai provides professional transcription services with speaker confidence scores and detailed timing information.

Features

Professional transcription services
Speaker confidence scores
Detailed timing information
Monologue-based structure
Element-level timestamps

Data Format

Rev.ai returns transcriptions in the following format:

{
  "duration_seconds": 3.5,
  "monologues": [
    {
      "speaker": 0,
      "elements": [
        {
          "type": "text",
          "value": "Hello,",
          "ts": 0.0,
          "end_ts": 0.5,
          "confidence": 0.9
        }
      ]
    }
  ]
}

Usage

from wtf_transcript_converter.providers import RevAIConverter

converter = RevAIConverter()
wtf_doc = converter.convert_to_wtf(rev_ai_data)

# Access Rev.ai-specific features
print(f"Monologue count: {len(rev_ai_data['monologues'])}")
print(f"Duration: {rev_ai_data['duration_seconds']}s")

Canary (NVIDIA)

Canary is NVIDIA’s speech recognition model available through Hugging Face, providing local transcription capabilities.

Features

Local transcription (no API required)
Hugging Face integration
High-quality speech recognition
Word-level timestamps
Confidence scores

Data Format

Canary returns transcriptions in the following format:

{
  "text": "Hello, this is a test transcription.",
  "language": "en",
  "duration": 3.5,
  "segments": [
    {
      "start": 0.0,
      "end": 3.5,
      "text": "Hello, this is a test transcription.",
      "confidence": 0.92,
      "words": [
        {
          "word": "Hello,",
          "start": 0.0,
          "end": 0.5,
          "confidence": 0.99
        }
      ]
    }
  ]
}

Usage

from wtf_transcript_converter.providers import CanaryConverter

converter = CanaryConverter()
wtf_doc = converter.convert_to_wtf(canary_data)

# Access Canary-specific features
print(f"Model: {wtf_doc.metadata.model}")
print(f"Provider: {wtf_doc.metadata.provider}")

Parakeet (NVIDIA)

Parakeet is NVIDIA’s transducer-based speech recognition model available through Hugging Face.

Features

Transducer-based processing
Hugging Face integration
Local transcription capabilities
Word-level timestamps
Confidence scores

Data Format

Parakeet returns transcriptions in the following format:

{
  "text": "Hello, this is a test transcription.",
  "language": "en",
  "duration": 3.5,
  "segments": [
    {
      "start": 0.0,
      "end": 3.5,
      "text": "Hello, this is a test transcription.",
      "confidence": 0.91,
      "words": [
        {
          "word": "Hello,",
          "start": 0.0,
          "end": 0.5,
          "confidence": 0.98
        }
      ]
    }
  ]
}

Usage

from wtf_transcript_converter.providers import ParakeetConverter

converter = ParakeetConverter()
wtf_doc = converter.convert_to_wtf(parakeet_data)

# Access Parakeet-specific features
print(f"Model: {wtf_doc.metadata.model}")
print(f"Provider: {wtf_doc.metadata.provider}")

Provider Comparison

Feature Matrix

Performance Comparison

Metric	Whisper	Deepgram	AssemblyAI	Rev.ai	Canary	Parakeet
Accuracy Speed Cost Setup Complexity	High Medium Medium Low	High Fast Low Low	High Medium Medium Low	High Medium High Low	Medium Slow Free High	Medium Slow Free High

Choosing a Provider

Use Cases

Whisper * High-quality transcription needed * Offline processing preferred * Multiple languages required * Budget-conscious projects

Deepgram * Real-time transcription needed * Speaker diarization required * High-volume processing * Cost optimization important

AssemblyAI * Advanced features needed (sentiment, chapters) * High accuracy required * Speaker diarization needed * Budget allows for premium features

Rev.ai * Professional transcription services * High accuracy required * Speaker confidence important * Budget allows for premium pricing

Canary/Parakeet * Local processing required * No API costs desired * Hugging Face integration preferred * Setup complexity acceptable

Best Practices

Test Multiple Providers: Use cross-provider testing to compare results
Consider Use Case: Match provider features to your specific needs
Monitor Performance: Track accuracy, speed, and cost over time
Handle Errors Gracefully: Implement proper error handling for API failures
Cache Results: Store converted WTF documents to avoid re-processing
Validate Output: Always validate WTF documents after conversion

Integration Examples

See the Examples and Use Cases section for detailed integration examples with each provider.