Can AI Detectors Actually Tell When a Story Is AI-Written?

I tested seven of them on the same flash fiction piece, sentence by sentence. The results were not reassuring.

I've started dipping my toe into AI-assisted writing after a few years of regular creative writing. It's been a fun tool for going straight from the idea stage to the editing stage, but I noticed that whenever I told the LLM to humanise a draft (stripping the formal transitions, loosening the sentence structure, trying to get something that didn't read like a press release), the story still came out bland and overly polished. On top of that, whenever I ran my stories through various AI detectors, the results varied so wildly they may as well have been pulled from a hat. So, as a data nerd, I decided to test seven AI writing detectors in a systematic way. Some performed no better than a coin toss, and only a couple did any meaningful detection. The results are below if you're interested in seeing which detectors worked better for flash fiction.

(I have no affiliation with any AI detector companies — this is not a promotion, just honest results.)

How I Did It

I got ChatGPT to produce a roughly 100-word, four-sentence story and scanned it through all seven detectors to get a baseline. Then I rewrote the first AI sentence in my own words and scanned again. Then the second, and so on, until all four sentences were mine, with a scan at each stage. That gives five data points in total: 100%, 75%, 50%, 25%, and 0% AI content, counted by sentence, so the in-between figures are approximate.
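The substitution procedure can be sketched in a few lines of Python. This is only an illustration of the idea, not necessarily how the test was actually run: `build_stages` is a hypothetical helper, and the sentences are placeholders standing in for the real story.

```python
def build_stages(ai_sentences, human_sentences):
    """Return (ai_fraction, text) pairs, from fully AI to fully human.

    Hypothetical helper: at each stage one more AI sentence, front to back,
    is swapped for its human rewrite.
    """
    if len(ai_sentences) != len(human_sentences):
        raise ValueError("need exactly one human rewrite per AI sentence")
    total = len(ai_sentences)
    stages = []
    for replaced in range(total + 1):
        # The first `replaced` sentences are human; the rest are still AI.
        sentences = human_sentences[:replaced] + ai_sentences[replaced:]
        ai_fraction = (total - replaced) / total
        stages.append((ai_fraction, " ".join(sentences)))
    return stages


# Placeholder sentences (assumed for illustration only).
ai = ["AI one.", "AI two.", "AI three.", "AI four."]
human = ["Human one.", "Human two.", "Human three.", "Human four."]

stages = build_stages(ai, human)
for ai_fraction, text in stages:
    print(f"{ai_fraction:.0%} AI: {text}")
```

Note that the fraction here is counted by sentence, not by word, which is why the in-between stages are marked with a tilde in the tables.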

Stage 1: 4 Out of 4 Sentences Written by AI (100%)

A single father receives a letter and, as he reads it carefully, realises he has been made redundant, a discovery that fills him with quiet dread.

As a result, he begins thinking about what this means for him and his young daughter, reflecting on the responsibilities he carries and the uncertainty ahead.

Ultimately, the weight of the situation becomes overwhelming, and he breaks down in tears as he understands that this letter has changed his stability, his future, and his hopes.

Still, this is not his first setback, and he knows he can fight on.

Predicted AI-written % below (remember, actual % is 100):

Actual AI Content %	QuillBot AI % Prediction	GPTZero AI % Prediction	ZeroGPT AI % Prediction	Copyleaks AI % Prediction	Justdone AI % Prediction	ChatGPT AI % Prediction	Gemini AI % Prediction
100% AI	0%	92%	100%	100%	96%	72%	85%

QuillBot was the only one badly off the mark here, predicting 0% AI content on a story that was entirely AI-written. That became a pattern.

Stage 2: 3 Out of 4 Sentences Written by AI (75%)

A letter arrives for the young widower and, while reading it, he realises his employment contract is being terminated.

As a result, he begins thinking about what this means for him and his young daughter, reflecting on the responsibilities he carries and the uncertainty ahead.

Ultimately, the weight of the situation becomes overwhelming, and he breaks down in tears as he understands that this letter has changed his stability, his future, and his hopes.

Still, this is not his first setback, and he knows he can fight on.

Actual AI Content %	QuillBot AI % Prediction	GPTZero AI % Prediction	ZeroGPT AI % Prediction	Copyleaks AI % Prediction	Justdone AI % Prediction	ChatGPT AI % Prediction	Gemini AI % Prediction
~75% AI	22%	75%	100%	100%	94%	45%	85%

Stage 3: 2 Out of 4 Sentences Written by AI (50%)

A letter arrives for the young widower and, while reading it, he realises his employment contract is being terminated.

He begins to think about the ramifications this will have for him and his young daughter and how he will continue providing for her.

Ultimately, the weight of the situation becomes overwhelming, and he breaks down in tears as he understands that this letter has changed his stability, his future, and his hopes.

Still, this is not his first setback, and he knows he can fight on.

Actual AI Content %	QuillBot AI % Prediction	GPTZero AI % Prediction	ZeroGPT AI % Prediction	Copyleaks AI % Prediction	Justdone AI % Prediction	ChatGPT AI % Prediction	Gemini AI % Prediction
~50% AI	0%	0%	100%	0%	87%	55%	85%

GPTZero and Copyleaks both drop to 0% here — a significant miss at the halfway point.

Stage 4: 1 Out of 4 Sentences Written by AI (25%)

A letter arrives for the young widower and, while reading it, he realises his employment contract is being terminated.

He begins to think about the ramifications this will have for him and his young daughter and how he will continue providing for her.

Lowering his head into his hands he shuts his eyes tightly, remaining there motionless for several minutes, trying — and failing — to steady the thoughts in his mind.

Still, this is not his first setback, and he knows he can fight on.

Actual AI Content %	QuillBot AI % Prediction	GPTZero AI % Prediction	ZeroGPT AI % Prediction	Copyleaks AI % Prediction	Justdone AI % Prediction	ChatGPT AI % Prediction	Gemini AI % Prediction
~25% AI	0%	0%	100%	0%	85%	42%	35%

Stage 5: 0 Out of 4 Sentences Written by AI (0%)

A letter arrives for the young widower and, while reading it, he realises his employment contract is being terminated.

He begins to think about the ramifications this will have for him and his young daughter and how he will continue providing for her.

Lowering his head into his hands he shuts his eyes tightly, remaining there motionless for several minutes, trying — and failing — to steady the thoughts in his mind.

Bit by bit, his head raises again, to reveal the dry and menacingly determined gaze in his eyes.

Actual AI Content %	QuillBot AI % Prediction	GPTZero AI % Prediction	ZeroGPT AI % Prediction	Copyleaks AI % Prediction	Justdone AI % Prediction	ChatGPT AI % Prediction	Gemini AI % Prediction
0% AI	0%	0%	0%	0%	81%	28%	85%

Full Results

Actual AI Content %	QuillBot AI % Prediction	GPTZero AI % Prediction	ZeroGPT AI % Prediction	Copyleaks AI % Prediction	Justdone AI % Prediction	ChatGPT AI % Prediction	Gemini AI % Prediction
0% AI	0%	0%	0%	0%	81%	28%	85%
~25% AI	0%	0%	100%	0%	85%	42%	35%
~50% AI	0%	0%	100%	0%	87%	55%	85%
~75% AI	22%	75%	100%	100%	94%	45%	85%
100% AI	0%	92%	100%	100%	96%	72%	85%
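One rough way to summarise the table is the mean absolute error between each detector's prediction and the actual AI share at each stage. The sketch below copies the scores from the table and assumes the nominal stage values (0, 25, 50, 75, 100) as ground truth. The metric is an illustrative choice, not something the detectors themselves report.

```python
# Actual AI share at each stage (by sentence count), lowest to highest.
ACTUAL = [0, 25, 50, 75, 100]

# Detector predictions copied from the results table, in the same stage order.
PREDICTED = {
    "QuillBot":  [0, 0, 0, 22, 0],
    "GPTZero":   [0, 0, 0, 75, 92],
    "ZeroGPT":   [0, 100, 100, 100, 100],
    "Copyleaks": [0, 0, 0, 100, 100],
    "Justdone":  [81, 85, 87, 94, 96],
    "ChatGPT":   [28, 42, 55, 45, 72],
    "Gemini":    [85, 35, 85, 85, 85],
}


def mean_absolute_error(actual, predicted):
    """Average gap, in percentage points, between predictions and ground truth."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


mae = {name: mean_absolute_error(ACTUAL, scores) for name, scores in PREDICTED.items()}

# Print detectors from smallest to largest average error.
for name, error in sorted(mae.items(), key=lambda item: item[1]):
    print(f"{name:<10} {error:5.1f} points")
```

With only five points per detector this is a coarse summary. A tool that is right at the extremes and wrong in the middle can still come out with a low average error, so the number is best read alongside the stage-by-stage tables.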

Here's the ugly chart. Don't worry if the lines take a moment to untangle; the breakdown below explains what you're looking at:

AI Detector Output vs Actual AI Content — line chart comparing seven detectors across five data points

Blue lines = detectors that tracked AI content; red lines = detectors that didn't.

What the Results Actually Show

The detectors fall into three rough categories.

Tracked the signal — with caveats

Justdone moved in the right direction at every single stage, which makes it the most consistent performer in this test. But "consistent" doesn't mean "accurate." Its score for the fully human-written piece was 81% — meaning it would flag a story you wrote entirely yourself as almost certainly AI-generated. That's a serious problem in practice. If you're using a detector to check your own work before submitting somewhere, a tool that cries wolf on human writing is almost worse than useless — you'd either distrust your own prose or learn to ignore the score entirely. Justdone tracked the direction of AI content well; its calibration is another matter. It may still be the most useful of the seven for identifying trends across drafts, but treat its absolute percentages with real scepticism.

ChatGPT showed a similar directional pattern, more conservatively. It scored 28% on the fully human piece and 72% on the fully AI one. That's a narrower spread than you'd want, and it dipped from 55% to 45% at the 75% stage, but at least the scores moved meaningfully across the five stages.

GPTZero performed well at the upper end and on the fully human piece (92% at 100% AI, 75% at 75% AI, 0% at 0% AI) but collapsed to 0% at both the 50% and 25% marks, suggesting it detects obviously AI-patterned prose rather than tracking blended content in any nuanced way.
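"Moved in the right direction" can be made concrete by counting, for each detector, how many stage-to-stage steps went up, stayed flat, or went down as the AI share increased. The sketch below does that for the three detectors in this category, using scores copied from the results table. `direction_steps` is a hypothetical helper name.

```python
# Scores from the results table, ordered from 0% AI to 100% AI.
SCORES = {
    "Justdone": [81, 85, 87, 94, 96],
    "ChatGPT":  [28, 42, 55, 45, 72],
    "GPTZero":  [0, 0, 0, 75, 92],
}


def direction_steps(scores):
    """Count stage-to-stage moves as (up, flat, down) as AI content increases."""
    up = flat = down = 0
    for previous, current in zip(scores, scores[1:]):
        if current > previous:
            up += 1
        elif current < previous:
            down += 1
        else:
            flat += 1
    return up, flat, down


for name, scores in SCORES.items():
    up, flat, down = direction_steps(scores)
    print(f"{name}: {up} up, {flat} flat, {down} down")
```

Four "up" moves out of four means the score rose at every stage. A "flat" move means the detector did not register the change at all, and a "down" move is a step in the wrong direction. None of this says anything about calibration, which is a separate question.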

Erratic

Gemini scored 85% at most stages but dropped to 35% at the 25% AI mark — which rules out a simple fixed-threshold explanation, but doesn't make the results any more interpretable. It scored 85% on the fully human-written piece, which is a significant false positive, and its pattern across the five stages has no clear logic. It may be reacting to surface features of the prose rather than anything structural about the AI content.

Not fit for purpose

QuillBot scored 0% on the fully AI-written piece and never rose above 22% at any stage. It is not a useful detector in this context.

Gemini is worth singling out again. It scored 85% on the fully human-written piece, the same score it returned at 50%, 75%, and 100% AI content. A detector that flags entirely human writing as 85% AI-generated is not detecting anything useful; it's just producing noise with a confident number attached.
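The false-positive problem is easy to check directly: take each detector's score for the fully human-written version and see which ones would flag it. The scores below come from the Stage 5 table. The 50% cut-off is an assumption made for illustration, since none of the tools in this test were given an official threshold.

```python
# Score each detector gave the fully human-written version (Stage 5).
HUMAN_STAGE_SCORES = {
    "QuillBot": 0,
    "GPTZero": 0,
    "ZeroGPT": 0,
    "Copyleaks": 0,
    "Justdone": 81,
    "ChatGPT": 28,
    "Gemini": 85,
}

# Hypothetical cut-off: treat any score above this as "flagged as AI".
FLAG_THRESHOLD = 50


def false_positives(scores, threshold):
    """Return the detectors that would flag human writing at the given threshold."""
    return sorted(name for name, score in scores.items() if score > threshold)


flagged = false_positives(HUMAN_STAGE_SCORES, FLAG_THRESHOLD)
print(flagged)
```

Changing `FLAG_THRESHOLD` shows how sensitive the verdict is to where the line is drawn, which is part of why a raw percentage from these tools is hard to act on.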

A Note on Sample Size

This test used a four-sentence, ~100-word piece. The sample is intentionally small — that was the point of the controlled substitution method — but it means the results may not generalise to longer work. Detectors may perform differently on 1,000-word stories where statistical patterns are easier to identify across a larger body of text. A follow-up test at that length would be more informative, and is worth doing.

The takeaway, for now: if you're relying on any of these tools to flag AI-assisted writing, most of them will let you down. Only two moved consistently in the right direction, and even those come with significant caveats about false positive rates and calibration.

The deeper question — whether AI-assisted writing that's been substantially rewritten by a human constitutes "AI writing" at all — is harder to answer, and probably more interesting. If you're thinking about what makes writing worth reading regardless of how it was produced, there's something relevant in our piece on what good writing actually is and how it gets judged.