Transcription & Translation — Technical Details

Pipeline Architecture

The transcription pipeline is event-driven. When a recording file lands in the customer's S3 bucket, an S3 event notification places a job on the SQS transcription queue. Workers on GPU instances consume jobs from the queue, run inference with the PanOps transcription engine, and write the resulting transcript to the customer's Aurora tenant. The pipeline has no synchronous dependencies: it runs asynchronously in the background and scales with queue depth.
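As a rough illustration of the hand-off from the S3 event to the queue, the sketch below builds a job message with the four fields the pipeline uses (customer_id, s3_key, platform, language_config). The function name, the way customer_id and language_config are supplied, and the preferred_language key are assumptions made for the example, not the documented implementation.

```python
import json
from urllib.parse import unquote_plus


def build_job_message(s3_event: dict, customer_id: str, language_config: dict) -> dict:
    """Turn one S3 ObjectCreated event into a transcription job message.

    customer_id and language_config are passed in because the S3 event itself
    does not carry them; how they are looked up is outside this sketch.
    """
    record = s3_event["Records"][0]
    # S3 URL-encodes object keys in event payloads (a space arrives as '+').
    s3_key = unquote_plus(record["s3"]["object"]["key"])

    # Expected layout: recordings/{platform}/{date}/{meeting-id}.mp4
    parts = s3_key.split("/")
    if len(parts) != 4 or parts[0] != "recordings":
        raise ValueError(f"unexpected recording key layout: {s3_key!r}")

    return {
        "customer_id": customer_id,
        "s3_key": s3_key,
        "platform": parts[1],
        "language_config": language_config,
    }


# Hypothetical event, trimmed to the fields the sketch reads.
event = {
    "Records": [
        {"s3": {"bucket": {"name": "customer-bucket"},
                "object": {"key": "recordings/zoom/2024-05-01/abc+123.mp4"}}}
    ]
}
message = build_job_message(event, "cust-1", {"preferred_language": "en"})
print(json.dumps(message, indent=2))
```

Decoding the key before use matters because a raw event key with encoded characters will not match the object name in the bucket.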

Transcription Pipeline

[Platform recording complete (Teams / Zoom / Google Meet / RingCentral / etc.)]
         ↓
[Connector downloads recording to customer S3]
  → s3://customer-bucket/recordings/{platform}/{date}/{meeting-id}.mp4
         ↓
[S3 Event Notification → SQS Transcription Queue]
  → Message: { customer_id, s3_key, platform, language_config }
         ↓
[Auto-scaling group checks SQS queue depth]
  → ApproximateNumberOfMessages > 0 for 2 min → launch GPU spot instance
  → Queue empty for 5 min → terminate instances (scale to zero)
         ↓
[Transcription Worker (Python, EC2 GPU)]
  1. Pull job from SQS
  2. Download recording from S3 (presigned URL)
  3. Run transcription_engine.transcribe(audio, task="translate", language=None)
     → auto-detect source language
     → translate to customer.preferred_language
  4. Parse segments: [{start, end, speaker, text}]
  5. Write transcript segments to Aurora (RLS-scoped to customer_id)
  6. Update job status: complete / failed
  7. Delete message from SQS
         ↓
[Transcript available in Aurora for AI retrieval]
  → Searchable via pgvector semantic search + full-text search
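The worker steps above can be sketched as a single handler with its I/O injected as callables, so the control flow is visible without AWS or database clients. Every name here is hypothetical. The sketch also assumes that a handled failure is recorded and the message is still deleted, so only an interrupted worker leaves the message for redelivery; retrying failed jobs instead would be an equally plausible reading of the flow.

```python
from typing import Callable


def handle_message(
    job: dict,
    download: Callable[[str], bytes],
    transcribe: Callable[[bytes, str], list],
    write_segments: Callable[[str, list], None],
    set_status: Callable[[str, str], None],
    delete_message: Callable[[dict], None],
) -> str:
    """Process one job that has already been pulled from the queue (step 1)."""
    key = job["s3_key"]
    set_status(key, "processing")
    try:
        audio = download(key)                                  # step 2
        target = job["language_config"]["preferred_language"]
        segments = transcribe(audio, target)                   # steps 3-4
        write_segments(job["customer_id"], segments)           # step 5
        status = "complete"
    except Exception:
        status = "failed"
    set_status(key, status)                                    # step 6
    # Step 7: delete last. If the instance is reclaimed before this line,
    # the message reappears after its visibility timeout.
    delete_message(job)
    return status


# In-memory stand-ins for S3, the engine, Aurora, and SQS.
statuses, stored, deleted = {}, [], []
job = {
    "customer_id": "cust-1",
    "s3_key": "recordings/zoom/2024-05-01/abc.mp4",
    "platform": "zoom",
    "language_config": {"preferred_language": "en"},
}
result = handle_message(
    job,
    download=lambda key: b"fake-audio",
    transcribe=lambda audio, target: [
        {"start_ms": 0, "end_ms": 1200, "speaker_label": None, "text": "Hello"}
    ],
    write_segments=lambda customer_id, segs: stored.extend(
        (customer_id, s["text"]) for s in segs
    ),
    set_status=statuses.__setitem__,
    delete_message=deleted.append,
)
print(result, statuses, len(deleted))
```

Keeping the delete as the final step is what lets the queue act as the retry mechanism for interrupted workers.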

Transcription Engine Configuration

Parameter	Value	Notes
Model	PanOps transcription engine (open-weight, state-of-the-art)	Default configuration; tuned for enterprise call/meeting audio
Runtime	Python (GPU-accelerated inference runtime)	CUDA-accelerated on GPU instances
Task	transcribe + translate	Auto-detects the source language; translates to the CEO's preferred language in a single pass
Language Detection	Automatic	Identifies the language from the first 30 s of audio
Output Format	Segments with timestamps	[{start_ms, end_ms, speaker_label, text}] stored per segment
Speaker Diarization	Platform-native where available	Teams and Zoom provide speaker labels via their APIs; applied to transcript segments
Supported Languages	99 languages	Full range of major world languages supported
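To make the output format and the diarization row concrete, here is a minimal sketch that converts engine segments into the stored {start_ms, end_ms, speaker_label, text} shape and attaches platform speaker labels. It assumes the engine reports times in seconds and that a segment takes the label of the platform speaker turn it overlaps most; both are illustrative assumptions, as are the function and field names on the input side.

```python
def to_segment_rows(engine_segments: list, speaker_turns: list) -> list:
    """Convert engine segments (seconds) into stored segment rows (milliseconds).

    engine_segments: [{"start": float, "end": float, "text": str}]
    speaker_turns:   [{"start": float, "end": float, "speaker": str}],
                     as obtained from the platform API (may be empty)
    """
    rows = []
    for seg in engine_segments:
        # Pick the platform speaker turn with the largest time overlap.
        best_label, best_overlap = None, 0.0
        for turn in speaker_turns:
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_label, best_overlap = turn["speaker"], overlap
        rows.append({
            "start_ms": round(seg["start"] * 1000),
            "end_ms": round(seg["end"] * 1000),
            "speaker_label": best_label,  # None when no platform labels exist
            "text": seg["text"].strip(),
        })
    return rows


segments = [
    {"start": 0.0, "end": 2.5, "text": " Good morning everyone."},
    {"start": 2.5, "end": 6.0, "text": " Thanks, let's begin."},
]
turns = [
    {"start": 0.0, "end": 2.4, "speaker": "Alice"},
    {"start": 2.4, "end": 7.0, "speaker": "Bob"},
]
rows = to_segment_rows(segments, turns)
for row in rows:
    print(row)
```

For platforms without native speaker labels, passing an empty list of turns leaves speaker_label unset rather than guessing.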

Single-pass translation: The PanOps transcription engine can transcribe and translate in one inference pass. There is no separate translation step and no call to a translation API. All processing is self-contained on PanOps infrastructure.

GPU Compute Configuration

Parameter	Value
Instance Type	g4dn.xlarge (primary) or g5.xlarge (fallback)
GPU	NVIDIA T4 (g4dn) or NVIDIA A10G (g5)
Pricing	EC2 Spot (up to about 70% below On-Demand pricing)
Scaling Metric	SQS ApproximateNumberOfMessages
Scale-Up Trigger	Queue depth > 0 for 2 minutes
Scale-To-Zero	Queue empty for 5 consecutive minutes → terminate
Max Instances	Configurable per customer; default 3 concurrent
Spot Interruption	An interrupted job's SQS message is never deleted, so it becomes visible again after the visibility timeout and the job is re-queued; no data is lost
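The scaling rules in this table can be modelled as a small pure function. This is only a sketch of the decision logic: in a real deployment the same rules would normally live in alarms and scaling policies rather than application code, and the jobs_per_instance parameter is a hypothetical knob that the table does not define.

```python
from math import ceil

SCALE_UP_AFTER_SEC = 120       # queue depth > 0 for 2 minutes
SCALE_TO_ZERO_AFTER_SEC = 300  # queue empty for 5 consecutive minutes
DEFAULT_MAX_INSTANCES = 3      # default concurrent instances per customer


def desired_instances(
    queue_depth: int,
    current: int,
    nonempty_for_sec: int,
    empty_for_sec: int,
    max_instances: int = DEFAULT_MAX_INSTANCES,
    jobs_per_instance: int = 1,
) -> int:
    """Return the target instance count for the current queue state."""
    if queue_depth > 0:
        if nonempty_for_sec < SCALE_UP_AFTER_SEC:
            return current  # backlog not sustained long enough yet
        wanted = ceil(queue_depth / jobs_per_instance)
        # Never scale down while there is a backlog; never exceed the cap.
        return min(max_instances, max(current, wanted))
    if empty_for_sec >= SCALE_TO_ZERO_AFTER_SEC:
        return 0  # scale to zero
    return current  # queue empty, but not for long enough yet


print(desired_instances(queue_depth=5, current=0, nonempty_for_sec=130, empty_for_sec=0))
print(desired_instances(queue_depth=0, current=2, nonempty_for_sec=0, empty_for_sec=300))
```

The two delays act as hysteresis: a short burst does not launch instances, and a brief gap between jobs does not terminate them.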

Recording Sources & Ingestion

Platform	Recording Type	Trigger	Stage
Microsoft Teams	Meeting recordings (MP4)	OneDrive webhook → connector downloads	Live
Zoom Meetings	Cloud recordings (MP4)	recording.completed webhook → connector downloads	Live
RingCentral	Voice call recordings	call-recording webhook → connector downloads	Live
Google Meet	Meeting recordings	Google Drive webhook → connector downloads	Live
Dialpad	Call recordings	Webhook push with recording URL	Live
Zoom Phone	Call recordings	Zoom Phone webhook	Live
OpenPhone	Call recordings	OpenPhone webhook	Live
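Whatever the source platform, each connector writes to the same key layout, recordings/{platform}/{date}/{meeting-id}.mp4. The helper below is a hypothetical sketch of how a connector might build that key; the UTC YYYY-MM-DD date format and the replacement of slashes in the id are assumptions, since the layout does not specify them.

```python
from datetime import datetime, timezone


def build_recording_key(
    platform: str, recorded_at: datetime, meeting_id: str, ext: str = "mp4"
) -> str:
    """Build the S3 object key for a downloaded recording."""
    if recorded_at.tzinfo is None:
        raise ValueError("recorded_at must be timezone-aware")
    # Assumption: the {date} component is the UTC calendar date.
    date_part = recorded_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    # Keep the id to a single path component so the layout stays predictable.
    safe_id = meeting_id.replace("/", "_")
    return f"recordings/{platform.lower()}/{date_part}/{safe_id}.{ext}"


key = build_recording_key(
    "Zoom", datetime(2024, 5, 1, 23, 30, tzinfo=timezone.utc), "abc123"
)
print(key)
```

Normalizing to UTC avoids the same meeting landing under two dates when connectors run in different time zones.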

Transcript Storage Schema

Transcripts are stored in Aurora PostgreSQL with a normalized schema that preserves segment-level timing and speaker information. The transcript_segments table is indexed for full-text search and has a pgvector embedding column for semantic retrieval by the AI model.

-- Simplified schema

CREATE TABLE recordings (
  id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  customer_id   UUID NOT NULL,           -- RLS enforced
  platform      TEXT NOT NULL,           -- 'teams', 'zoom', 'ringcentral', etc.
  meeting_id    TEXT,
  recorded_at   TIMESTAMPTZ NOT NULL,
  s3_key        TEXT NOT NULL,
  duration_sec  INTEGER,
  status        TEXT DEFAULT 'pending',  -- pending | processing | complete | failed
  created_at    TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE transcript_segments (
  id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  recording_id  UUID REFERENCES recordings(id),
  customer_id   UUID NOT NULL,           -- RLS enforced (denormalized for policy)
  start_ms      INTEGER NOT NULL,
  end_ms        INTEGER NOT NULL,
  speaker       TEXT,
  text          TEXT NOT NULL,
  language_src  TEXT,                    -- detected source language
  language_out  TEXT,                    -- output language (CEO preferred)
  embedding     vector(1536),            -- pgvector: semantic search
  created_at    TIMESTAMPTZ DEFAULT NOW()
);

-- RLS policies enforced on both tables
ALTER TABLE recordings ENABLE ROW LEVEL SECURITY;
ALTER TABLE transcript_segments ENABLE ROW LEVEL SECURITY;
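As an illustration of how the two retrieval modes might be combined against this schema, the sketch below builds a parameterized query that filters by full-text match and ranks by pgvector cosine distance (the `<=>` operator). It only constructs the SQL and parameters, using psycopg-style `%s` placeholders. Combining the two modes this way, the 'simple' text search configuration, and the function name are assumptions, and the query has no customer_id filter because it relies on RLS policies being defined for the connection's role.

```python
def build_segment_search(query_text: str, query_embedding: list, limit: int = 10):
    """Return (sql, params) for a hybrid search over transcript_segments."""
    if len(query_embedding) != 1536:
        raise ValueError("embedding must have 1536 dimensions to match vector(1536)")
    # pgvector accepts vectors as text literals of the form '[x1,x2,...]'.
    vector_literal = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"
    sql = (
        "SELECT id, recording_id, start_ms, end_ms, speaker, text, "
        "embedding <=> %s::vector AS distance "
        "FROM transcript_segments "
        "WHERE to_tsvector('simple', text) @@ plainto_tsquery('simple', %s) "
        "ORDER BY distance "
        "LIMIT %s"
    )
    return sql, (vector_literal, query_text, limit)


sql, params = build_segment_search("quarterly budget", [0.0] * 1536, limit=5)
print(sql)
print(params[1], params[2])
```

Passing values as parameters rather than formatting them into the SQL string keeps user-supplied search text out of the query itself.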

Language Support

The PanOps transcription engine supports transcription in 99 languages and translation to English (or to the configured target language). The following are the most commonly configured CEO languages in enterprise deployments:

English (en)
Spanish (es)
French (fr)
German (de)
Mandarin Chinese (zh)
Japanese (ja)
Portuguese (pt)
Arabic (ar)
Hindi (hi)
Korean (ko)

Language preference is configured per customer at onboarding and stored in the customer configuration record. The transcription worker reads this configuration per job and applies the appropriate output language.
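A minimal sketch of the per-job lookup follows. It assumes the job message's language_config may carry a preferred_language value and that the customer configuration record is the fallback; the key names and the fallback order are hypothetical. The second helper shows how the result could map onto the language_src and language_out columns.

```python
def resolve_output_language(job: dict, customer_config: dict) -> str:
    """Pick the output language for a job, falling back to the customer record."""
    job_lang = (job.get("language_config") or {}).get("preferred_language")
    lang = job_lang or customer_config.get("preferred_language")
    if not lang:
        raise ValueError("no preferred language configured for this customer")
    return lang.strip().lower()


def language_fields(detected_source: str, output_language: str) -> dict:
    """Values for the language_src and language_out columns of a segment."""
    return {"language_src": detected_source, "language_out": output_language}


job = {"customer_id": "cust-1", "language_config": {}}
customer_config = {"preferred_language": "DE"}
target = resolve_output_language(job, customer_config)
print(language_fields("ja", target))
```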
