
Building a Voice-Driven Portfolio with ElevenLabs Real-Time API

This guide walks through creating a real-time, voice-driven conversational experience for your portfolio using ElevenLabs' Real-Time Conversational WebSocket API.

How to Create It

Build a simple frontend using React or Next.js with a clean, minimal UI. Add a microphone button that activates getUserMedia and captures live audio. Set up a backend (Node.js or any server) to store your ElevenLabs API key and create a secure session for the client. Then connect the frontend to the ElevenLabs Real-Time Conversational WebSocket API, which handles speech-to-speech interaction.

Frontend Setup

  • Create a React component with a microphone button
  • Use navigator.mediaDevices.getUserMedia({ audio: true }) to capture audio
  • Convert audio to 16-bit PCM format
  • Establish WebSocket connection to your backend
  • Stream audio chunks continuously to the WebSocket
  • Receive and play back TTS audio chunks in real-time
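The conversion step above is the only non-obvious part of the capture pipeline. Browser microphones deliver Float32 samples in [-1, 1], while the API expects signed 16-bit integers. A minimal sketch of that conversion (the `floatTo16BitPCM` name and the worklet wiring in the comment are illustrative, not part of any ElevenLabs SDK):

```javascript
// Convert a Float32 sample buffer (as produced by the Web Audio API,
// samples in [-1, 1]) into signed 16-bit PCM samples.
function floatTo16BitPCM(float32Samples) {
  const out = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

// In the browser, this would be fed from an audio-processing callback
// and streamed over the WebSocket, roughly:
//   processor.onaudioprocess = (e) => {
//     const pcm = floatTo16BitPCM(e.inputBuffer.getChannelData(0));
//     ws.send(pcm.buffer);
//   };
```

Note that `out.buffer` is what actually goes on the wire; WebSockets accept `ArrayBuffer` payloads directly.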
Backend Setup

  • Set up a Node.js server (Express, Fastify, etc.)
  • Store ElevenLabs API key securely in environment variables
  • Create an endpoint to generate temporary session tokens
  • Proxy WebSocket connections to ElevenLabs API
  • Never expose API keys to the client browser
How to Integrate It

The frontend continuously captures microphone audio, converts it to 16-bit PCM, and sends it through a WebSocket. The backend provides a temporary token or proxies the WebSocket so the API key never touches the browser. ElevenLabs receives the user audio, understands it, generates a response, and streams back real-time TTS audio. The frontend immediately plays these audio chunks as they arrive.

Key Integration Points

  • Audio Capture: Use Web Audio API or MediaRecorder
  • Format Conversion: Convert to 16-bit PCM at 16kHz sample rate
  • WebSocket Stream: Bidirectional communication for audio I/O
  • Audio Playback: Use Web Audio API for low-latency playback
  • Session Management: Handle connection lifecycle and errors
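For the low-latency playback point above, the usual trick is to schedule each incoming TTS chunk back-to-back on the audio clock rather than playing it the instant it arrives. A minimal sketch of that scheduling logic (pure JavaScript; `makeChunkScheduler` is an illustrative name, and the `AudioContext` wiring is assumed, not shown):

```javascript
// Schedule incoming TTS chunks on a shared timeline so playback is
// gapless: each chunk starts when the previous one ends, or right away
// if the queue has drained. `nowSec` mirrors AudioContext.currentTime.
function makeChunkScheduler() {
  let nextStart = 0; // seconds on the audio clock
  return function schedule(nowSec, chunkDurationSec) {
    const startAt = Math.max(nowSec, nextStart);
    nextStart = startAt + chunkDurationSec;
    return startAt; // pass to AudioBufferSourceNode.start(startAt)
  };
}
```

Without this, chunks that arrive slightly early or late overlap or leave audible gaps between them.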
How It Works

The system functions as a live, two-way audio stream. When the user speaks, raw audio is streamed directly to the ElevenLabs conversational endpoint. ElevenLabs performs live speech recognition, processes the query, and instantly streams back natural voice output. Each audio chunk is played on the frontend immediately, creating a near-instant response (usually under 1 second). The entire portfolio becomes a conversational, voice-driven experience powered by ElevenLabs' real-time API.

Technical Flow

  • User speaks → Microphone captures audio
  • Audio processing → Convert to 16-bit PCM chunks
  • Stream to backend → WebSocket sends audio data
  • Backend proxy → Forwards to ElevenLabs API
  • ElevenLabs processing → Speech recognition + AI response generation
  • TTS streaming → Voice response sent back in chunks
  • Frontend playback → Audio played immediately as chunks arrive
  • Sub-second latency → Natural conversation flow
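The chunk sizes implied by the flow above are easy to sanity-check: mono 16-bit audio at 16 kHz is 32,000 bytes per second, so a 100 ms chunk is 3,200 bytes. A small helper (illustrative, assuming the mono 16-bit format used throughout this setup):

```javascript
// Duration in seconds of a raw PCM chunk:
// bytes / (sampleRate * bytesPerSample), assuming mono 16-bit audio.
function pcmChunkDurationSec(byteLength, sampleRate = 16000) {
  const bytesPerSample = 2; // 16-bit
  return byteLength / (sampleRate * bytesPerSample);
}
```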
Benefits

  • Real-time interaction: No waiting for complete responses
  • Natural conversation: Voice-to-voice with minimal latency
  • Secure: API keys never exposed to client
  • Scalable: WebSocket connections handle multiple users
  • Immersive: Transforms static portfolio into interactive experience
Technologies Used

  • Frontend: React/Next.js, Web Audio API, WebSocket
  • Backend: Node.js, Express, WebSocket server
  • API: ElevenLabs Real-Time Conversational API
  • Audio: 16-bit PCM, 16kHz sample rate
  • Security: JWT tokens, environment variables, proxy pattern
Enjoyed this piece of writing?

You should definitely subscribe to my Substack to get notified about more posts like this.