
In January 2024, a finance worker in Hong Kong wired $25 million after a video call with what he believed was his company's CFO. Every other person on that call was a deepfake: face and voice. The attack used commercially available voice cloning to replicate the CFO's speech patterns from earnings call recordings freely available on YouTube.

That was 2024. The tools have gotten cheaper, faster, and more convincing since.

Bottom Line Up Front

Voice cloning now requires as little as 3 seconds of reference audio and costs under $5 per minute of generated speech. The human ear cannot reliably distinguish high-quality AI voice from real speech. Automated detection exists but is fragmented: no single tool covers all generator types, and phone-line compression destroys most detectable artifacts.

Section 1: The Deepfake Audio Problem

Voice cloning has moved from academic research to a consumer product in under three years. The barrier to entry is now effectively zero for anyone willing to spend a few dollars and upload a voice sample.

How voice cloning works in 2026

Modern voice synthesis uses neural codec models: architectures that learn to decompose speech into discrete tokens (capturing timbre, pitch, rhythm, and phonetics) and then reconstruct audio from those tokens, conditioned on new text input. Here is how the major platforms compare:

| Platform | Min. Sample | Latency | Cost | Detection Difficulty |
|---|---|---|---|---|
| ElevenLabs | ~30 sec | Near real-time | $0.30/min | Very High |
| PlayHT 2.0 | ~15 sec | ~2 sec | $0.15/min | High |
| OpenAI Voice Engine | 15 sec | Near real-time | API pricing | Very High |
| Open-source (VALL-E X, XTTS) | 3–10 sec | Varies | Free (compute) | Moderate–High |

Where voice deepfakes are already causing harm

Scam calls. The FBI reported a 300% increase in AI-assisted voice fraud between 2024 and 2025. The most common pattern: cloning a family member's voice from social media, then calling with an "emergency" demanding money. Victims consistently report that the voice sounded exactly right, down to tone, cadence, and even verbal tics.

Political disinformation. In the 2024 New Hampshire primary, AI-generated robocalls impersonated President Biden urging voters not to vote in the primary. The audio was convincing enough to prompt an FCC ruling declaring AI-generated voices in robocalls illegal without prior consent. But regulations don't stop bad actors; they just make the consequences clearer after the damage is done.

Fake podcasts and audio content. Entire podcast episodes have been generated using cloned host voices, published to platforms, and indexed by search engines before being flagged. The economic incentive is ad revenue; the reputational damage falls on the real host. One podcaster discovered 14 episodes under her name that she never recorded, each accumulating downloads for weeks before takedown.

Corporate fraud. Beyond the Hong Kong case, multiple companies have reported voice-phishing attacks where attackers cloned executive voices from publicly available conference calls, investor presentations, or media interviews. The target is always the same: wire transfers, credential sharing, or sensitive data access authorized by a "trusted" voice.

"The voice was my mother's. Same accent, same rhythm, same way she says my name. I almost sent $3,000 before my actual mother called me back." β€” Victim testimony, FTC report, 2025

Section 2: Technical Tells – What AI Voice Gets Wrong

Despite how convincing deepfake audio sounds to the human ear, synthetic speech has measurable differences from real human voice. The problem: most of these differences are below the threshold of conscious perception. They exist in the signal, not in what you "hear."

Spectral artifacts

Real human speech produces a complex frequency spectrum shaped by the physical resonance of the vocal tract (throat, mouth, nasal cavities). Every person's spectrum is unique and consistent across utterances. AI-generated voice approximates this spectrum but often shows unnaturally smooth formant transitions: the frequencies shift too evenly between vowel sounds, lacking the micro-variations caused by physical articulation. Spectrogram analysis can reveal these patterns, though they require trained interpretation or automated classifiers.
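
To make that concrete, here is a minimal sketch of the kind of inspection a spectrogram-based check starts from, assuming a local 16 kHz recording ("sample.wav" is a placeholder path) and using librosa. The frame-to-frame "flux" statistic is illustrative, not a calibrated forensic feature.

```python
# Minimal sketch: load audio, compute a log-magnitude spectrogram, and measure
# frame-to-frame spectral change ("flux"). "sample.wav" is a placeholder path.
import numpy as np
import librosa

y, sr = librosa.load("sample.wav", sr=16000)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
log_S = librosa.amplitude_to_db(S, ref=np.max)

# Real articulation produces irregular, jumpy transitions; unusually smooth
# trajectories are one weak hint of synthesis, never proof on their own.
flux = np.sqrt(np.sum(np.diff(log_S, axis=1) ** 2, axis=0))
print(f"spectral flux: mean={flux.mean():.2f}, std={flux.std():.2f}")
```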

Breathing patterns and micro-pauses

Humans breathe. This sounds obvious, but it's a critical detection signal. Real speech contains involuntary breath sounds: inhalation before long phrases, slight catches between clauses, audible exhalation during laughter or emphasis. Most voice cloning systems generate speech continuously without realistic breathing patterns. Some newer systems (ElevenLabs, OpenAI) have added synthetic breath sounds, but these tend to be too regular, spaced at consistent intervals rather than varying with emotional state and sentence structure. The absence of breathing, or breathing that follows a mechanical pattern, is one of the strongest perceptual cues that something is off.
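
One rough way to quantify that "too regular" pattern is to look at the spacing of pauses. The sketch below, assuming a placeholder file and an arbitrary silence threshold, treats gaps between non-silent regions as pauses and measures how evenly they are spaced.

```python
# Treat gaps between non-silent regions as pauses and check how evenly they
# are spaced. The 30 dB threshold and the interpretation are illustrative.
import numpy as np
import librosa

y, sr = librosa.load("sample.wav", sr=16000)
voiced = librosa.effects.split(y, top_db=30)        # (start, end) sample indices

gaps = (voiced[1:, 0] - voiced[:-1, 1]) / sr        # pause durations in seconds
if len(gaps) > 1:
    cv = gaps.std() / gaps.mean()                   # coefficient of variation
    print(f"{len(gaps)} pauses, spacing CV = {cv:.2f}")
    # A very low CV (pauses at near-constant intervals) is one weak signal
    # that the "breathing" is synthetic.
```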

Emotional flatness and prosody gaps

Human emotion modulates speech in subtle, context-dependent ways. Genuine surprise involves a pitch spike followed by a specific decay pattern. Real anger compresses vowel duration. Authentic laughter has a chaotic spectral signature that's extremely difficult to synthesize convincingly. Current voice cloning captures the average prosody of a speaker but struggles with emotional transitions: the shift from calm to excited, from explaining to joking. The voice sounds like the person, but not like the person feeling something.
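
A crude proxy for prosodic flatness is pitch variability over time. The sketch below estimates the F0 contour with librosa's pyin and reports its spread; the file name is a placeholder and any "too flat" threshold would need calibration against the real speaker's recordings.

```python
# Estimate the pitch (F0) contour and report its spread. Flat, narrow contours
# line up with the "emotional flatness" cue; no calibrated threshold is given.
import numpy as np
import librosa

y, sr = librosa.load("sample.wav", sr=16000)
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)
f0_voiced = f0[voiced_flag & ~np.isnan(f0)]
if f0_voiced.size:
    print(f"F0 mean = {f0_voiced.mean():.1f} Hz, std = {f0_voiced.std():.1f} Hz")
```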

Pronunciation consistency

Real speakers have pronunciation habits that are deeply consistent but context-dependent. You might drop the "g" in "running" casually but pronounce it fully in a formal sentence. You might say "gonna" to friends and "going to" in a presentation. Voice clones trained on limited samples either normalize these variations (always formal or always casual) or reproduce them inconsistently, using casual pronunciation in formal contexts and vice versa. This is often the first thing close associates notice: "It sounds like them, but not how they'd say it in that context."

Room acoustics and environmental cues

Every real audio recording contains environmental information: room reverb, background noise floor, microphone characteristics. AI-generated voice exists in a conspicuously clean acoustic environment or has reverb applied as a uniform post-processing effect rather than the complex, frequency-dependent reverb of a real room. When a supposed "phone call" has studio-quality clarity with no line noise, compression artifacts, or environmental sound, that's a signal worth investigating.
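
A quick sanity check along these lines is to estimate the background noise floor from the quietest frames; a conspicuously low floor on a supposed phone call is worth a second look. The file name and frame sizes below are placeholders.

```python
# Estimate the background noise floor from the quietest 10% of frames.
# An implausibly low floor on a supposed phone call is a "too clean" signal.
import numpy as np
import librosa

y, sr = librosa.load("sample.wav", sr=16000)
rms = librosa.feature.rms(y=y, frame_length=1024, hop_length=256)[0]
noise_floor_db = 20 * np.log10(np.percentile(rms, 10) + 1e-10)
print(f"estimated noise floor: {noise_floor_db:.1f} dBFS")
```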

The Core Challenge

Most of these tells require either specialized tools (spectrogram analysis, signal processing) or close familiarity with the cloned speaker's habits. The average person receiving a phone call has neither. This is why automated detection matters for voice even more than for text or images: the human ear is a worse detector than the human eye.

Section 3: Detection Methods That Exist Today

Voice deepfake detection is less mature than text or image detection, but the field is advancing rapidly. Here are the three main approaches and their current state.

Spectral and signal analysis

Classical audio forensics techniques analyze the frequency-domain properties of audio: spectral envelope, formant trajectories, harmonic structure, and noise floor characteristics. These methods don't need to know which generator produced the audio; they look for statistical anomalies in the signal itself. Strengths: generator-agnostic, explainable results. Weaknesses: high false positive rates on heavily compressed or low-quality audio (phone calls, voice messages), which is exactly where deepfakes are most commonly deployed.
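
As a rough illustration, the snippet below computes a few of the standard spectral features such pipelines examine. The features themselves are standard; any "synthetic vs. real" decision thresholds would have to be learned from labeled data, and the file name is a placeholder.

```python
# Standard spectral features that classical forensic pipelines inspect.
# The features are real; decision thresholds must be learned, not hard-coded.
import librosa

y, sr = librosa.load("sample.wav", sr=16000)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]   # spectral "brightness"
flatness = librosa.feature.spectral_flatness(y=y)[0]          # noise-like vs. tonal
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)[0]     # where energy tapers off

for name, feat in [("centroid", centroid), ("flatness", flatness), ("rolloff", rolloff)]:
    print(f"{name}: mean={feat.mean():.3f}, std={feat.std():.3f}")
```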

Neural network classifiers

These are deep learning models trained on large datasets of real and synthetic speech. The current state of the art uses architectures trained on both raw audio waveforms and spectrogram features; examples include the models behind Resemble AI's detection API, which is specifically trained to identify artifacts from common voice cloning systems.

| Detection Tool | Approach | Clean Audio Accuracy | Compressed Audio | Real-time? |
|---|---|---|---|---|
| Resemble AI Detect | Neural classifier | 89–94% | ~72% | Yes (API) |
| Pindrop | Spectral + ML hybrid | 91–95% | ~78% | Yes (enterprise) |
| Hiya AI Call Detection | Neural network | 84–88% | ~68% | Yes (phone) |
| Academic models (ASVspoof) | Ensemble classifiers | 92–97% | ~61% | No (research) |
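
To show the general shape of the neural-classifier approach (not any vendor's actual model), here is a toy PyTorch classifier over log-mel spectrograms. The architecture, layer sizes, and input dimensions are placeholders; production systems are far larger and trained on extensive real/synthetic corpora.

```python
# Toy classifier over log-mel spectrograms: real vs. synthetic speech.
# Architecture and sizes are placeholders, not any vendor's actual model.
import torch
import torch.nn as nn

class SpoofClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 2)    # logits: [real, synthetic]

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 1, n_mels, time) log-mel spectrogram
        return self.head(self.features(mel).flatten(1))

model = SpoofClassifier()
dummy = torch.randn(4, 1, 80, 300)      # a batch of 4 spectrogram "clips"
print(model(dummy).shape)               # torch.Size([4, 2])
```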

Watermarking and provenance

Some voice generation platforms now embed inaudible watermarks in their output, imperceptible to the human ear but detectable by verification tools. ElevenLabs embeds provenance markers in all generated audio. The limitation is identical to image watermarking: adversarial actors can strip, alter, or re-encode audio to remove watermarks. And open-source generators don't watermark at all. Watermarking is a trust signal when present, not a detection method when absent.
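
For intuition only, here is a toy spread-spectrum-style watermark: a keyed, low-amplitude noise pattern detected by correlation. This is not ElevenLabs' or anyone's actual scheme; real marks are keyed, psychoacoustically shaped, and designed to survive (some) re-encoding.

```python
# Toy spread-spectrum-style watermark: add a keyed, low-amplitude noise pattern
# and detect it via correlation against the same keyed pattern. Conceptual only.
import numpy as np

def embed_watermark(audio: np.ndarray, key: int, strength: float = 0.02) -> np.ndarray:
    mark = np.random.default_rng(key).standard_normal(audio.shape)
    return audio + strength * mark

def detect_watermark(audio: np.ndarray, key: int) -> float:
    mark = np.random.default_rng(key).standard_normal(audio.shape)
    return float(np.dot(audio, mark) / (np.linalg.norm(audio) * np.linalg.norm(mark) + 1e-12))

clean = np.random.default_rng(0).standard_normal(160_000)   # stand-in for ~10 s of audio
marked = embed_watermark(clean, key=1234)
print(detect_watermark(marked, key=1234))   # noticeably above zero
print(detect_watermark(clean, key=1234))    # close to zero
```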

The Compression Problem

Notice the accuracy drop between "clean audio" and "compressed audio" in every tool. Phone calls use narrow-band codecs (8 kHz sampling, heavy compression) that destroy exactly the high-frequency artifacts detectors rely on. This means deepfake detection is worst where deepfakes are most commonly used: phone calls and voice messages. This is the single biggest unsolved problem in voice deepfake detection.

Section 4: Why Single-Tool Detection Fails for Voice

If you've read our posts on text detection accuracy and image detection, the pattern is familiar: no single detection tool reliably covers all generator types, all content conditions, and all delivery channels. Voice is worse.

Generator diversity

A detector trained primarily on ElevenLabs output may not recognize audio from XTTS or PlayHT. Each generator leaves a different statistical fingerprint, and the open-source ecosystem is evolving fast enough that detectors trained on last quarter's models may not catch this quarter's output. The training data problem is compounding: new generators appear faster than detection models can be retrained.

Channel degradation

Voice deepfakes deployed over phone lines lose the very artifacts that make detection possible. When audio is re-encoded through GSM, VoLTE, or WhatsApp's Opus codec, the spectral anomalies that flag synthetic speech get smoothed out alongside genuine compression noise. A detector seeing this compressed audio faces a signal-to-noise problem where the signal (synthetic artifacts) has been partially destroyed by the channel itself.

Multimodal attacks

The most sophisticated deepfakes combine voice with video (lip-synced deepfake video calls) or with contextual information (the caller knows details about your company, references real meetings, uses internal jargon). Detecting the voice in isolation, even if accurate, misses the broader attack surface. The Hong Kong case succeeded not because the voice was perfect, but because the voice was convincing enough in combination with the video, the meeting context, and the social engineering.

"Voice detection in isolation solves the wrong problem. The threat is multimodal β€” voice, video, and context combined. Detection has to be multimodal too."

Section 5: RealCheck's Approach – Multimodal Context for Voice Detection

RealCheck treats voice detection as part of a broader content authenticity problem, not a standalone audio classifier. Here's what that means in practice.

Multiple detection signals, not one score

When you submit audio to RealCheck, we don't return a single "real or fake" verdict. You see individual scores from spectral analysis, neural classification, and (when available) provenance/watermark verification. If the neural classifier flags the audio at 91% but spectral analysis returns 34%, you see both. The disagreement itself is informative: it may indicate compression artifacts that are confusing one detector but not the other.
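
Purely as an illustration of the idea (not RealCheck's actual API), a multi-signal result might be shaped like this, keeping every detector's score plus a simple disagreement measure instead of collapsing everything into one number:

```python
# Hypothetical result shape (illustrative only, not RealCheck's actual API):
# keep each detector's score and surface how much they disagree.
from dataclasses import dataclass
from typing import Optional

@dataclass
class VoiceAnalysis:
    spectral_score: float             # 0-1 from signal analysis
    neural_score: float               # 0-1 from the learned classifier
    watermark_found: Optional[bool]   # None when no provenance data is available

    def disagreement(self) -> float:
        return abs(self.spectral_score - self.neural_score)

result = VoiceAnalysis(spectral_score=0.34, neural_score=0.91, watermark_found=None)
print(result.disagreement())   # ~0.57: large spread, worth a closer look
```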

Voice in context: cross-modal analysis

Where RealCheck adds the most value is when voice is part of a larger content package. A suspicious video call? We can analyze the voice track and the video independently, then flag inconsistencies: audio quality that doesn't match the video source, lip-sync mismatches, or voice artifacts that only appear when the video shows specific facial movements. A podcast episode that might be cloned? We can cross-reference the text transcript against the speaker's known writing style and the audio against their known voice characteristics.

This cross-modal approach is why voice deepfakes matter to RealCheck even though our current speech detection is still in beta. The value isn't just "is this audio synthetic?" It's "does this voice match this video, this context, this claimed identity?"

What we're honest about

Voice detection on compressed phone audio is unreliable, for us and for everyone else. We don't claim to solve a problem that isn't solved. Our current speech detection performs best on clean, high-quality audio (podcasts, recorded meetings, video narration). On phone-compressed audio, our detection accuracy drops to 65–75%, and we report that honestly rather than burying it behind marketing claims.

We're actively improving compressed-audio detection. The approach involves training on audio that has already been compressed through common codecs (GSM-FR, AMR, Opus) rather than only training on clean samples. This is the same "test on real-world conditions, not lab conditions" principle we apply to text and image detection.
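
Here is a minimal sketch of that augmentation idea, assuming clean 16 kHz training clips ("clean_clip.wav" is a placeholder). Plain down-/up-sampling through 8 kHz is a crude stand-in for narrow-band telephony; a real pipeline would also re-encode through the actual codecs named above.

```python
# Crude phone-band augmentation for training data: down/up-sample through 8 kHz
# to strip content above ~4 kHz. A real pipeline would also re-encode through
# the actual codecs (GSM, AMR, Opus). File name is a placeholder.
import librosa

def phone_band_augment(y, sr=16000):
    narrow = librosa.resample(y, orig_sr=sr, target_sr=8000)   # telephone bandwidth
    return librosa.resample(narrow, orig_sr=8000, target_sr=sr)

y, sr = librosa.load("clean_clip.wav", sr=16000)
y_degraded = phone_band_augment(y, sr)   # train on both clean and degraded versions
```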

Our Approach

RealCheck surfaces raw detection signals across modalities: voice, video, text, and images. We believe the future of deepfake detection is multimodal context, not siloed classifiers. Voice is the third pillar, and we're building it with the same transparency principles that define our text and image detection.

Detect AI across text, images, voice, and more.

Join the waitlist to get early access to RealCheck: multimodal AI detection that shows raw scores, not inflated averages.