In 2023, spotting an AI image meant finding the extra finger. In 2026, that heuristic is dead. Modern diffusion models handle hands, faces, and lighting with an accuracy that would have seemed impossible eighteen months ago. The visual tells that journalists and fact-checkers relied on have largely been patched by the models themselves.
That's not a reason to give up. It's a reason to use better methods.
Visual inspection alone will miss most AI-generated images from modern generators. The gap between what humans can spot and what automated detection can catch is now wide enough that relying on the eye test is a systematic failure mode, especially at scale.
Section 1: The Current State of AI Image Generation
To understand why detection is hard, you need to understand what you're up against. The four dominant generators in 2026 are:
| Generator | Architecture | Known Strengths | Detection Difficulty |
|---|---|---|---|
| DALL-E 3 | Diffusion (OpenAI) | Prompt adherence, text in images | Moderate |
| Midjourney v6 | Diffusion (proprietary) | Photorealism, portrait detail | High |
| Flux.1 | Rectified flow transformer | Anatomy, hands, coherence | Very High |
| Stable Diffusion XL | Latent diffusion | Fine-tuning flexibility, speed | Moderate–High |
What these models still can't do
Even the best generators have systematic weaknesses, though they're shrinking with each model release. The key limitations in 2026 are mostly failures of contextual coherence: a room where windows cast shadows in two incompatible directions; a crowd where no two people's clothing interacts plausibly with the lighting; a "newspaper" headline that reads correctly but has fonts that shift mid-word at high zoom.
These aren't always visible at a glance. They require deliberate scrutiny. And when images are compressed, resized, or screenshot-cycled through social media, even these artifacts often disappear.
Section 2: Visual Tells That Still Work (And When They Fail)
Visual inspection isn't useless; it's just unreliable as a standalone method. Here's what experienced fact-checkers still check, and an honest assessment of when each tell fails.
Hands and fingers
The classic tell. AI models historically generated hands with six fingers, fused knuckles, or anatomically impossible joint angles. Flux.1 and Midjourney v6 have largely solved this for common poses. Where it still holds: complex hand gestures, hands holding objects, two hands interacting, or hands at unusual angles (e.g., extended fingers viewed from above). If the subject's hands aren't visible, that's also a tell: many AI images are composed to avoid showing hands entirely.
Text and lettering
AI-generated text in images is frequently wrong in ways that are hard to describe but easy to spot: letters that look correct at a glance but aren't real words; fonts that drift; text that wraps nonsensically. DALL-E 3 can render short, common words correctly. Longer text, unusual names, or non-English scripts remain unreliable. Check any readable text in the image carefully, and zoom in: compression often hides this.
Background symmetry and repetition
Diffusion models fill backgrounds by sampling from texture distributions. When the model runs out of unique training signal, it tiles. Look for: crowds where multiple people share the same face at different scales; brick walls with identical mortar lines repeating on an unnatural grid; foliage that repeats in a fractal pattern. These artifacts are most visible in high-resolution versions of images, not previews.
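If you want to sanity-check repetition computationally rather than by eye, one rough approach is to look for strong off-center peaks in the image's autocorrelation. The sketch below is illustrative only (it assumes numpy and Pillow are available; the masking window and 0.6 threshold are arbitrary), not a production detector:

```python
import numpy as np
from PIL import Image

def tiling_score(path: str) -> float:
    """Crude repetition check: strong off-center peaks in the normalized
    autocorrelation suggest tiled or duplicated texture."""
    img = np.asarray(Image.open(path).convert("L"), dtype=np.float64)
    img -= img.mean()
    # Autocorrelation via the FFT (Wiener-Khinchin): power spectrum -> inverse FFT.
    spectrum = np.fft.fft2(img)
    autocorr = np.fft.ifft2(spectrum * np.conj(spectrum)).real
    autocorr = np.fft.fftshift(autocorr)
    autocorr /= autocorr.max()
    # Zero out the central peak (every image correlates perfectly with itself).
    h, w = autocorr.shape
    cy, cx = h // 2, w // 2
    autocorr[cy - 10:cy + 10, cx - 10:cx + 10] = 0.0
    # Strongest remaining peak; values near 1.0 mean near-exact repeats.
    return float(autocorr.max())

if tiling_score("suspect.jpg") > 0.6:   # arbitrary illustrative threshold
    print("Strong periodic repetition - worth a closer look")
```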
Lighting and shadow consistency
Single-light-source coherence is something humans are extremely sensitive to, but often don't consciously notice. In AI images, shadows frequently don't match: a subject lit from the left has a shadow falling to the right; reflections in eyes show a window not present in the scene; skin highlights imply a light source that doesn't match the room. This is one of the most reliable tells in photorealistic images, but requires slowing down and explicitly asking "where is the light coming from?" for every shadow in the image.
Texture artifacts in fabric, fur, and hair
AI textures look statistically correct at a medium zoom but break down at high zoom. Fabric weave has inconsistent thread count. Hair at the edges of frames becomes soft in a way that doesn't match real optical blur. Animal fur textures often tile or show discontinuities. This is most visible when the image has been preserved at original resolution, which social media compression destroys.
Every visual tell that works depends on a high-resolution original, deliberate scrutiny, and domain knowledge (about lighting, anatomy, typography). At social media resolution, after JPEG compression, most of these tells are invisible. This is why automated detection matters: it operates on statistical properties of the image data itself, not the visible rendering.
Section 3: Why Visual Inspection Alone Fails at Scale
The practical limitation of visual inspection isn't knowledge; it's time. A journalist or content moderator reviewing 200 images per day cannot spend four minutes on each one. At real-world content moderation scale (think: election misinformation campaigns, financial fraud schemes, academic submission pipelines), the volume of images makes human review a bottleneck.
The adversarial problem
Bad actors specifically optimize for visual inspection. They know that moderators check hands, check text, check lighting. The response is to generate images that avoid these failure modes (subjects with hands hidden, text removed from the composition, controlled studio lighting) or to run generated images through post-processing pipelines that add film grain, JPEG artifacts, and color noise to obscure generator fingerprints.
This isn't theoretical. Research published in early 2026 documented coordinated disinformation campaigns using AI-generated headshots that had been processed through multiple compression and resampling cycles to defeat both visual inspection and first-generation detection tools. The images passed human review 94% of the time in blind tests, including when the reviewers were professional fact-checkers.
Scale requirements vs. human capacity
A platform receiving 10 million images per day cannot employ enough reviewers to manually inspect a meaningful fraction of them. Even sampling strategies fail when the base rate of AI-generated content is rising: a 2% base rate means 200,000 AI images per day at that scale. At three minutes per image, that's 10,000 reviewer-hours every day, well over a thousand full-time moderators, just to cover the fraction that sampling surfaces.
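The arithmetic is worth making explicit. A back-of-envelope sketch (the eight-hour shift length is an assumption added here; the other figures come from the paragraph above):

```python
# Back-of-envelope reviewer load for the scenario above.
images_per_day = 10_000_000
ai_base_rate = 0.02           # 2% of uploads are AI-generated
minutes_per_review = 3
shift_hours = 8               # assumed review hours per moderator per day

ai_images = images_per_day * ai_base_rate              # 200,000 images/day
review_hours = ai_images * minutes_per_review / 60     # 10,000 hours/day
reviewers_needed = review_hours / shift_hours          # 1,250 moderators

print(f"{ai_images:,.0f} AI images/day -> {review_hours:,.0f} reviewer-hours/day "
      f"-> {reviewers_needed:,.0f} full-time moderators")
```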
"The gap between what a human expert can detect with unlimited time and what is practically detectable at production scale is the entire problem. Detection APIs close that gap."
Section 4: How Automated Detection APIs Work
Automated AI image detection works by analyzing statistical properties of image data that are invisible to the human eye but measurable computationally. There are three main approaches, and understanding them helps you evaluate the tools.
Frequency domain analysis
Real photographs have a specific distribution of high-frequency detail (noise, grain, fine texture) that differs from synthetically generated images. Camera sensors introduce noise in specific patterns; lenses introduce specific optical aberrations; JPEG compression creates specific artifacts. Diffusion models generate images with a different statistical fingerprint in the frequency domain: they're "too clean," or they show frequency artifacts that correspond to the upsampling steps in the generation process. Tools like SightEngine use frequency analysis as one signal in their classifier stack.
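As a concrete (and heavily simplified) illustration of the idea, not of any vendor's actual pipeline, the sketch below measures how much of an image's spectral energy sits above a radial frequency cutoff; the cutoff value is arbitrary, and a real classifier would combine many such features:

```python
import numpy as np
from PIL import Image

def high_frequency_ratio(path: str, cutoff: float = 0.25) -> float:
    """Fraction of spectral energy beyond a radial frequency cutoff.
    Sensor noise and grain in real photos tend to behave differently here
    than diffusion outputs, but this one number is not a detector on its own."""
    img = np.asarray(Image.open(path).convert("L"), dtype=np.float64)
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = power.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Normalized radial distance from the center of the shifted spectrum.
    radius = np.hypot((yy - h / 2) / h, (xx - w / 2) / w)
    return float(power[radius > cutoff].sum() / power.sum())

print(high_frequency_ratio("suspect.jpg"))
```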
Neural artifact detection
Large detection models are trained on datasets of known AI-generated and real images, learning to identify the subtle patterns left by specific generator architectures. Midjourney v6 leaves different fingerprints than Stable Diffusion XL; DALL-E 3 images have specific tonal and edge characteristics that a trained detector can identify. The challenge: these fingerprints shift with each model update, requiring continuous retraining. Detection tools that haven't been updated against Flux.1 (released late 2025) perform significantly worse on that generator.
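A toy version of this approach, a supervised classifier over labeled real and generated images, might look like the sketch below (assuming PyTorch and torchvision, a `data/train` folder with `real/` and `ai/` subfolders, and illustrative hyperparameters); production detectors are far larger and are retrained continuously as generators ship updates:

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Binary real-vs-generated classifier, fine-tuned from a pretrained backbone.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("data/train", transform=transform)  # real/, ai/
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)   # two classes: real, ai

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):                          # illustrative epoch count
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```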
Metadata and provenance signals
The C2PA (Coalition for Content Provenance and Authenticity) standard allows images to be signed with cryptographic provenance data at the point of generation. DALL-E 3 and Adobe Firefly already embed C2PA metadata. When present, this is the most reliable signal: it's not probabilistic, it's cryptographically verifiable. The limitation: adversarial actors strip metadata, and most generators don't implement C2PA. It's a best-case tool, not a general solution.
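Checking whether an image even carries a C2PA manifest can be as simple as scanning for the JUMBF/C2PA labels in the file bytes. The sketch below is only a crude presence check: it says nothing about whether the signature is valid (actual verification needs a C2PA implementation such as the c2patool CLI), and a missing manifest proves nothing, since metadata is routinely stripped:

```python
def has_c2pa_marker(path: str) -> bool:
    """Crude presence check for an embedded C2PA/JUMBF manifest.
    Does NOT validate the cryptographic signature, and stripped metadata
    means a negative result is not evidence the image is authentic."""
    with open(path, "rb") as f:
        data = f.read()
    return b"c2pa" in data or b"jumb" in data

print(has_c2pa_marker("suspect.jpg"))
```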
API accuracy comparison
| API | Detection Accuracy (Photorealistic) | Flux.1 Performance | False Positive Rate |
|---|---|---|---|
| SightEngine | 88–92% | ~79% | <4% |
| Hive Moderation | 85–91% | ~81% | <5% |
| Google Gemini Vision | 82–87% | ~71% | <6% |
| Single-model classifiers | 75–85% | ~65% | 6–12% |
These numbers are for unmodified, non-adversarially-processed images. Add post-processing (resampling, grain, compression cycles) and all detection rates drop, typically by 15–25 percentage points. No single API performs consistently across all generator types, which is why single-API detection is a weak baseline.
Section 5: RealCheck's Multimodal Approach to Image Detection
RealCheck applies the same core philosophy to image detection that we apply to text: raw scores from multiple APIs, no inflated accuracy claims, no averaging that buries the signal.
Why single-API detection isn't enough
Every detection API has a training distribution: the set of generators and image types it's been trained to recognize. When a new generator (like Flux.1) is released, single-API tools take weeks or months to retrain. During that window, they're effectively blind to a significant share of AI-generated images. An aggregated approach that includes multiple APIs with different training distributions provides more robust coverage during these gaps.
How we handle disagreement between APIs
When SightEngine returns 91% AI probability and Hive Moderation returns 34%, averaging them produces 62.5%, a number that doesn't reflect either model's actual signal. Instead, RealCheck surfaces both scores individually. The high SightEngine score is a meaningful detection signal. The lower Hive score may mean Hive's model hasn't been updated for this generator, or it may mean the image genuinely has mixed signals and the SightEngine score is a false positive. You see the data and make the judgment call.
For high-stakes decisions (content moderation, journalistic verification, academic integrity), we recommend treating any above-threshold score from any integrated API as requiring human review. This is the min-score rule applied to images: the most conservative interpretation of the data.
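In a moderation pipeline, that rule is a one-liner. The sketch below is illustrative only (the API names, score scale, and 0.7 threshold are placeholders, not RealCheck's actual interface):

```python
def needs_human_review(scores: dict[str, float], threshold: float = 0.7) -> bool:
    """Flag for human review if ANY integrated API reports an AI probability
    at or above the threshold - the most conservative reading of disagreement."""
    return any(score >= threshold for score in scores.values())

# The disagreement case described above: SightEngine 91%, Hive 34%.
scores = {"sightengine": 0.91, "hive_moderation": 0.34}
print(needs_human_review(scores))   # True: the 0.91 alone triggers review
```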
What RealCheck detects (and what it doesn't claim to)
Our current image detection covers: photorealistic AI portraits, AI-generated scenes and landscapes, AI-generated product imagery, and AI-generated composite images. We're actively working on video frame analysis for deepfake detection.
What we don't claim: 100% detection on adversarially processed images. If an image has been through multiple rounds of compression and resampling designed to defeat detectors, no current tool reliably catches it. We're honest about this. The gap between "clean AI-generated image" and "adversarially processed AI image" is the frontier of the detection problem, and any tool that claims to solve it completely is overstating.
RealCheck shows you what each API found, not a single blended number that loses the signal. Journalists, content moderators, and educators get the evidence, not a verdict that hides how it was reached.
Be first to detect AI images, video, speech, and documents.
Join the waitlist to get early access to RealCheck: multimodal AI detection that shows raw scores, not inflated averages.