
Building a Voice-Driven Portfolio with ElevenLabs Real-Time API

This guide walks through creating a real-time, voice-driven conversational experience for your portfolio using ElevenLabs' Real-Time Conversational WebSocket API.

How to Create It

Build a simple frontend using React or Next.js with a clean, minimal UI. Add a microphone button that activates getUserMedia and captures live audio. Set up a backend (Node.js or any server) to store your ElevenLabs API key and create a secure session for the client. Then connect the frontend to the ElevenLabs Real-Time Conversational WebSocket API, which handles speech-to-speech interaction.

Frontend Setup

  • Create a React component with a microphone button
  • Use navigator.mediaDevices.getUserMedia({ audio: true }) to capture audio
  • Convert audio to 16-bit PCM format
  • Establish WebSocket connection to your backend
  • Stream audio chunks continuously to the WebSocket
  • Receive and play back TTS audio chunks in real-time
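The conversion step above is the only non-obvious part of the capture pipeline. Browser microphones deliver Float32 samples in [-1, 1], while the API expects signed 16-bit integers. A minimal sketch of that conversion (the `floatTo16BitPCM` name and the worklet wiring in the comment are illustrative, not part of any ElevenLabs SDK):

```javascript
// Convert a Float32 sample buffer (as produced by the Web Audio API,
// samples in [-1, 1]) into signed 16-bit PCM samples.
function floatTo16BitPCM(float32Samples) {
  const out = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

// In the browser, this would be fed from an audio-processing callback
// and streamed over the WebSocket, roughly:
//   processor.onaudioprocess = (e) => {
//     const pcm = floatTo16BitPCM(e.inputBuffer.getChannelData(0));
//     ws.send(pcm.buffer);
//   };
```

Note that `out.buffer` is what actually goes on the wire; WebSockets accept `ArrayBuffer` payloads directly.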
Backend Setup

  • Set up a Node.js server (Express, Fastify, etc.)
  • Store ElevenLabs API key securely in environment variables
  • Create an endpoint to generate temporary session tokens
  • Proxy WebSocket connections to ElevenLabs API
  • Never expose API keys to the client browser
How to Integrate It

The frontend continuously captures microphone audio, converts it to 16-bit PCM, and sends it through a WebSocket. The backend provides a temporary token or proxies the WebSocket so the API key never touches the browser. ElevenLabs receives the user audio, understands it, generates a response, and streams back real-time TTS audio. The frontend immediately plays these audio chunks as they arrive.

Key Integration Points

  • Audio Capture: Use Web Audio API or MediaRecorder
  • Format Conversion: Convert to 16-bit PCM at 16kHz sample rate
  • WebSocket Stream: Bidirectional communication for audio I/O
  • Audio Playback: Use Web Audio API for low-latency playback
  • Session Management: Handle connection lifecycle and errors
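For the low-latency playback point above, the usual trick is to schedule each incoming TTS chunk back-to-back on the audio clock rather than playing it the instant it arrives. A minimal sketch of that scheduling logic (pure JavaScript; `makeChunkScheduler` is an illustrative name, and the `AudioContext` wiring is assumed, not shown):

```javascript
// Schedule incoming TTS chunks on a shared timeline so playback is
// gapless: each chunk starts when the previous one ends, or right away
// if the queue has drained. `nowSec` mirrors AudioContext.currentTime.
function makeChunkScheduler() {
  let nextStart = 0; // seconds on the audio clock
  return function schedule(nowSec, chunkDurationSec) {
    const startAt = Math.max(nowSec, nextStart);
    nextStart = startAt + chunkDurationSec;
    return startAt; // pass to AudioBufferSourceNode.start(startAt)
  };
}
```

Without this, chunks that arrive slightly early or late overlap or leave audible gaps between them.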
How It Works

The system functions as a live, two-way audio stream. When the user speaks, raw audio is streamed directly to the ElevenLabs conversational endpoint. ElevenLabs performs live speech recognition, processes the query, and instantly streams back natural voice output. Each audio chunk is played on the frontend immediately, creating a near-instant response (usually under 1 second). The entire portfolio becomes a conversational, voice-driven experience powered by ElevenLabs' real-time API.

Technical Flow

  • User speaks → Microphone captures audio
  • Audio processing → Convert to 16-bit PCM chunks
  • Stream to backend → WebSocket sends audio data
  • Backend proxy → Forwards to ElevenLabs API
  • ElevenLabs processing → Speech recognition + AI response generation
  • TTS streaming → Voice response sent back in chunks
  • Frontend playback → Audio played immediately as chunks arrive
  • Sub-second latency → Natural conversation flow
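The chunk sizes implied by the flow above are easy to sanity-check: mono 16-bit audio at 16 kHz is 32,000 bytes per second, so a 100 ms chunk is 3,200 bytes. A small helper (illustrative, assuming the mono 16-bit format used throughout this setup):

```javascript
// Duration in seconds of a raw PCM chunk:
// bytes / (sampleRate * bytesPerSample), assuming mono 16-bit audio.
function pcmChunkDurationSec(byteLength, sampleRate = 16000) {
  const bytesPerSample = 2; // 16-bit
  return byteLength / (sampleRate * bytesPerSample);
}
```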
Benefits

  • Real-time interaction: No waiting for complete responses
  • Natural conversation: Voice-to-voice with minimal latency
  • Secure: API keys never exposed to client
  • Scalable: WebSocket connections handle multiple users
  • Immersive: Transforms static portfolio into interactive experience
Technologies Used

  • Frontend: React/Next.js, Web Audio API, WebSocket
  • Backend: Node.js, Express, WebSocket server
  • API: ElevenLabs Real-Time Conversational API
  • Audio: 16-bit PCM, 16kHz sample rate
  • Security: JWT tokens, environment variables, proxy pattern
Enjoyed this piece of writing?

You should definitely subscribe to my Substack to get notified about more posts like this.