Specifications — Technical Details

Transcription & Translation

Pipeline architecture, transcription engine configuration, GPU scaling, language support, and data flow from recording ingestion to transcript storage.


Pipeline Architecture

The transcription pipeline is event-driven. When a recording file lands in the customer's S3 bucket, an S3 event notification places a job on the SQS transcription queue. Workers on GPU instances consume jobs from the queue, run PanOps transcription engine inference, and write the resulting transcript to the customer's Aurora tenant. Nothing waits synchronously on the pipeline: it runs in the background and scales with queue depth.

Transcription Pipeline

[Platform recording complete (Teams / Zoom / Google Meet / RingCentral / etc.)]
         ↓
[Connector downloads recording to customer S3]
  → s3://customer-bucket/recordings/{platform}/{date}/{meeting-id}.mp4
         ↓
[S3 Event Notification → SQS Transcription Queue]
  → Message: { customer_id, s3_key, platform, language_config }
         ↓
[Auto-scaling group checks SQS queue depth]
  → ApproximateNumberOfMessages > 0 for 2 min → launch GPU spot instance
  → Queue empty for 5 min → terminate instances (scale to zero)
         ↓
[Transcription Worker (Python, EC2 GPU)]
  1. Pull job from SQS
  2. Download recording from S3 (presigned URL)
  3. Run transcription_engine.transcribe(audio, task="translate", language=None)
     → auto-detect source language
     → translate to customer.preferred_language
  4. Parse segments: [{start, end, speaker, text}]
  5. Write transcript segments to Aurora (RLS-scoped to customer_id)
  6. Update job status: complete / failed
  7. Delete message from SQS
         ↓
[Transcript available in Aurora for AI retrieval]
  → Searchable via pgvector semantic search + full-text search
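
The worker steps above can be sketched as a single function. This is a minimal illustration, not the PanOps worker itself: `Job`, `parse_job`, `process_job`, and the injected callables (`download`, `transcribe`, `store`, `set_status`, `delete_message`) are hypothetical names, and the message fields are taken from the queue message shown in the diagram. It also assumes that the SQS message is deleted only after a successful write, so a failed or interrupted job becomes visible again once the visibility timeout expires; whether failed jobs should instead be dead-lettered is a policy choice the specification does not state.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Fields of the queue message: { customer_id, s3_key, platform, language_config }."""
    customer_id: str
    s3_key: str
    platform: str
    language_config: dict


def parse_job(body: str) -> Job:
    """Parse an SQS message body; raises KeyError if a required field is missing."""
    data = json.loads(body)
    return Job(
        customer_id=data["customer_id"],
        s3_key=data["s3_key"],
        platform=data["platform"],
        language_config=data.get("language_config", {}),
    )


def process_job(body, download, transcribe, store, set_status, delete_message) -> str:
    """Run one job through worker steps 2-7 using injected (hypothetical) callables."""
    job = parse_job(body)
    set_status(job.s3_key, "processing")
    try:
        audio = download(job.s3_key)                       # step 2
        segments = transcribe(audio, job.language_config)  # steps 3-4
        store(job.customer_id, job.s3_key, segments)       # step 5
    except Exception:
        # Assumption: leave the message on the queue so it is redelivered.
        set_status(job.s3_key, "failed")                   # step 6
        return "failed"
    set_status(job.s3_key, "complete")                     # step 6
    delete_message()                                       # step 7
    return "complete"
```

Passing the I/O steps in as callables keeps the control flow testable without AWS credentials or a GPU; in a real worker they would wrap the S3, engine, Aurora, and SQS clients.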

Transcription Engine Configuration

| Parameter | Value | Notes |
|---|---|---|
| Model | PanOps transcription engine (open-weight, state-of-the-art) | Default configuration; tuned for enterprise call/meeting audio |
| Runtime | Python (GPU-accelerated inference runtime) | CUDA-accelerated on GPU instances |
| Task | transcribe + translate | Auto-detects source language; translates to CEO preferred language in single pass |
| Language Detection | Automatic | Auto-identifies language from first 30s of audio |
| Output Format | Segments with timestamps | [{start_ms, end_ms, speaker_label, text}] stored per segment |
| Speaker Diarization | Platform-native where available | Teams and Zoom provide speaker labels via their APIs; applied to transcript segments |
| Supported Languages | 99 languages | Full range of major world languages supported |

Single-pass translation: The PanOps transcription engine can transcribe and translate in one inference pass. There is no separate translation step and no call to a translation API. All processing is self-contained on PanOps infrastructure.
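
To make the Output Format and Speaker Diarization rows concrete, the sketch below converts engine segments into the stored `{start_ms, end_ms, speaker_label, text}` shape and attaches the platform-provided speaker whose turn overlaps each segment the most. The input shapes are assumptions: engine segments are taken to carry `start` and `end` in seconds, and platform speaker turns are taken to carry `start_ms`, `end_ms`, and `speaker`. Matching by maximum overlap is one plausible strategy, not necessarily the one PanOps uses.

```python
def to_ms(seconds: float) -> int:
    """Convert seconds to whole milliseconds."""
    return int(round(seconds * 1000))


def overlap_ms(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Length of the intersection of two [start, end) intervals, in ms."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(engine_segments, speaker_turns):
    """Attach the best-overlapping platform speaker label to each segment.

    engine_segments: [{"start": sec, "end": sec, "text": str}]        (assumed shape)
    speaker_turns:   [{"start_ms": int, "end_ms": int, "speaker": str}] (assumed shape)
    """
    labeled = []
    for seg in engine_segments:
        start_ms, end_ms = to_ms(seg["start"]), to_ms(seg["end"])
        best_label, best_overlap = None, 0
        for turn in speaker_turns:
            ov = overlap_ms(start_ms, end_ms, turn["start_ms"], turn["end_ms"])
            if ov > best_overlap:
                best_label, best_overlap = turn["speaker"], ov
        labeled.append({
            "start_ms": start_ms,
            "end_ms": end_ms,
            "speaker_label": best_label,  # None when the platform gave no labels
            "text": seg["text"].strip(),
        })
    return labeled
```

For platforms that provide no speaker labels, passing an empty `speaker_turns` list leaves `speaker_label` as `None`, which maps to a NULL `speaker` column.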

GPU Compute Configuration

| Parameter | Value |
|---|---|
| Instance Type | g4dn.xlarge (primary) or g5.xlarge (fallback) |
| GPU | NVIDIA T4 (g4dn) or NVIDIA A10G (g5) |
| Pricing | EC2 Spot (up to ~70% discount vs on-demand) |
| Scaling Metric | SQS ApproximateNumberOfMessages |
| Scale-Up Trigger | Queue depth > 0 for 2 minutes |
| Scale-To-Zero | Queue empty for 5 consecutive minutes → terminate |
| Max Instances | Configurable per customer; default 3 concurrent |
| Spot Interruption | SQS message visibility timeout prevents data loss; job re-queued on interruption |
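
The scaling rules in the table can be expressed as a small pure function over recent queue-depth samples. This is an illustrative sketch only: in practice the decision would typically live in an auto-scaling policy rather than application code, and the function names here are hypothetical. The thresholds come from the table; the choice to target one instance per queued message, capped at the maximum, is an assumption, since the specification does not say how many instances to launch.

```python
SCALE_UP_AFTER_SEC = 120    # queue depth > 0 for 2 minutes
SCALE_DOWN_AFTER_SEC = 300  # queue empty for 5 consecutive minutes
DEFAULT_MAX_INSTANCES = 3   # default concurrent instances per customer


def _streak_seconds(samples, predicate) -> int:
    """Seconds the predicate has held continuously up to the latest sample."""
    if not samples:
        return 0
    latest_ts = samples[-1][0]
    streak_start = None
    for ts, depth in reversed(samples):
        if not predicate(depth):
            break
        streak_start = ts
    return 0 if streak_start is None else latest_ts - streak_start


def desired_instances(samples, current, max_instances=DEFAULT_MAX_INSTANCES) -> int:
    """Return the target instance count.

    samples: list of (timestamp_sec, queue_depth) tuples, oldest first.
    current: number of instances running now.
    """
    if not samples:
        return current
    if _streak_seconds(samples, lambda d: d == 0) >= SCALE_DOWN_AFTER_SEC:
        return 0  # scale to zero
    if _streak_seconds(samples, lambda d: d > 0) >= SCALE_UP_AFTER_SEC:
        depth = samples[-1][1]
        # Assumption: one instance per queued message, capped at the maximum.
        return max(current, min(depth, max_instances))
    return current
```

With samples taken once a minute, a queue that has held messages for two minutes scales up, while a short burst that drains within a minute leaves the fleet unchanged.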

Recording Sources & Ingestion

| Platform | Recording Type | Trigger | Stage |
|---|---|---|---|
| Microsoft Teams | Meeting recordings (MP4) | OneDrive webhook → connector downloads | Live |
| Zoom Meetings | Cloud recordings (MP4) | recording.completed webhook → connector downloads | Live |
| RingCentral | Voice call recordings | call-recording webhook → connector downloads | Live |
| Google Meet | Meeting recordings | Google Drive webhook → connector downloads | Live |
| Dialpad | Call recordings | Webhook push with recording URL | Live |
| Zoom Phone | Call recordings | Zoom Phone webhook | Live |
| OpenPhone | Call recordings | OpenPhone webhook | Live |
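
Whatever the source platform, the connector writes each recording under the key layout `recordings/{platform}/{date}/{meeting-id}.mp4` shown in the pipeline diagram. A minimal sketch of building that key follows. The function name is hypothetical, the ISO `YYYY-MM-DD` date format is an assumption (the specification only shows `{date}`), and percent-encoding the meeting ID is a defensive choice for platform identifiers that may contain `/` or other characters that would otherwise alter the key path.

```python
from datetime import date
from urllib.parse import quote


def recording_s3_key(platform: str, recorded_on: date, meeting_id: str, ext: str = "mp4") -> str:
    """Build the object key for a downloaded recording.

    Assumes an ISO date segment and percent-encodes the meeting ID so that
    characters such as '/' cannot introduce extra path levels.
    """
    safe_id = quote(meeting_id, safe="")
    return f"recordings/{platform}/{recorded_on.isoformat()}/{safe_id}.{ext}"
```

For example, `recording_s3_key("teams", date(2024, 3, 5), "m-123")` yields `recordings/teams/2024-03-05/m-123.mp4`.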

Transcript Storage Schema

Transcripts are stored in Aurora PostgreSQL with a normalized schema that preserves segment-level timing and speaker information. The transcript_segments table is indexed for full-text search and has a pgvector embedding column for semantic retrieval by the AI model.

-- Simplified schema (the vector column type requires the pgvector extension)
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE recordings (
  id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  customer_id   UUID NOT NULL,           -- RLS enforced
  platform      TEXT NOT NULL,           -- 'teams', 'zoom', 'ringcentral', etc.
  meeting_id    TEXT,
  recorded_at   TIMESTAMPTZ NOT NULL,
  s3_key        TEXT NOT NULL,
  duration_sec  INTEGER,
  status        TEXT DEFAULT 'pending',  -- pending | processing | complete | failed
  created_at    TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE transcript_segments (
  id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  recording_id  UUID REFERENCES recordings(id),
  customer_id   UUID NOT NULL,           -- RLS enforced (denormalized for policy)
  start_ms      INTEGER NOT NULL,
  end_ms        INTEGER NOT NULL,
  speaker       TEXT,
  text          TEXT NOT NULL,
  language_src  TEXT,                    -- detected source language
  language_out  TEXT,                    -- output language (CEO preferred)
  embedding     vector(1536),            -- pgvector: semantic search
  created_at    TIMESTAMPTZ DEFAULT NOW()
);

-- RLS policies enforced on both tables
ALTER TABLE recordings ENABLE ROW LEVEL SECURITY;
ALTER TABLE transcript_segments ENABLE ROW LEVEL SECURITY;
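
As an illustration of how a worker might write into this schema, the sketch below turns labeled segments into parameter tuples for a single batched insert. The names `INSERT_SEGMENT_SQL` and `segment_rows` are hypothetical, the `%s` placeholders assume a psycopg-style DB-API driver, and the `embedding` column is assumed to be populated by a later step, so it is left NULL here.

```python
INSERT_SEGMENT_SQL = """
INSERT INTO transcript_segments
    (recording_id, customer_id, start_ms, end_ms, speaker, text, language_src, language_out)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
"""


def segment_rows(recording_id, customer_id, segments, language_src, language_out):
    """Build one parameter tuple per segment, in INSERT_SEGMENT_SQL column order.

    segments: [{"start_ms", "end_ms", "speaker_label", "text"}]
    Raises ValueError if a segment ends before it starts.
    """
    rows = []
    for seg in segments:
        if seg["end_ms"] < seg["start_ms"]:
            raise ValueError(f"segment ends before it starts: {seg!r}")
        rows.append((
            recording_id,
            customer_id,               # denormalized so the RLS policy can filter on it
            seg["start_ms"],
            seg["end_ms"],
            seg.get("speaker_label"),  # NULL when no platform speaker label exists
            seg["text"],
            language_src,
            language_out,
        ))
    return rows
```

With a DB-API cursor, the rows can then be written in one call, for example `cursor.executemany(INSERT_SEGMENT_SQL, rows)`, inside a transaction scoped to the customer.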

Language Support

The PanOps transcription engine supports transcription in 99 languages and translation to English (or to the configured target language). The following are the most commonly configured CEO languages in enterprise deployments:

  • English (en)
  • Spanish (es)
  • French (fr)
  • German (de)
  • Mandarin Chinese (zh)
  • Japanese (ja)
  • Portuguese (pt)
  • Arabic (ar)
  • Hindi (hi)
  • Korean (ko)

Language preference is configured per customer at onboarding and stored in the customer configuration record. The transcription worker reads this configuration per job and applies the appropriate output language.
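
A minimal sketch of that per-job lookup is shown below. The `preferred_language` field name follows `customer.preferred_language` in the pipeline diagram; the function name, the English default for customers with no configured preference, and the two-to-three-letter code check are assumptions made for illustration.

```python
DEFAULT_OUTPUT_LANGUAGE = "en"  # assumption: fall back to English when unset


def resolve_output_language(customer_config: dict) -> str:
    """Return the normalized output language code for a job.

    Reads customer_config["preferred_language"], falling back to the default.
    Raises ValueError if the value does not look like a language code.
    """
    raw = customer_config.get("preferred_language") or DEFAULT_OUTPUT_LANGUAGE
    code = raw.strip().lower()
    if not (2 <= len(code) <= 3 and code.isalpha()):
        raise ValueError(f"invalid language code: {raw!r}")
    return code
```

Normalizing and validating the code at read time means a malformed configuration fails the job early instead of producing a transcript in an unintended language.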

