How we test and score AI note-taking apps
Every head-to-head comparison on this site includes scored benchmarks for transcription accuracy, summary quality, and recording reliability. This page explains exactly how those scores are produced, what each number means, and where the methodology has real limitations.
Disclosure: These tests were conducted by the Audionotes team. Audionotes is one of the two products in every comparison. We have a direct interest in the outcome. We have documented our process here so you can evaluate it, replicate it, or weigh it accordingly.
All transcription and summary scores are based on a single shared test file (a 30-minute English recording with two speakers) applied consistently across every product.
Products that do not accept uploaded audio (e.g. meeting-bot-only tools) were tested by running a live session using the same source audio played through speakers. This introduces a small additional noise floor, which is noted where relevant.
Transcription accuracy
Scored /10 by a human evaluator on the Audionotes team. The same evaluator scored all products against the same recording to ensure consistency across comparisons.
What the evaluator assessed
Word accuracy — meaningful errors, dropped words, garbled phrases that change the meaning
Speaker separation — were the two speakers correctly identified and consistently labelled throughout?
Proper noun handling — names, product names, and technical terms
Punctuation and readability — is the transcript usable without manual cleanup?
Scoring scale
9–10: Near-perfect. Rare or minor errors that do not affect meaning. Usable as-is.
7–8: Accurate enough for practical use. Noticeable errors but meaning is preserved.
5–6: Significant errors requiring correction before sharing or acting on.
3–4: Frequent errors that undermine usability. Heavy editing required.
1–2: Largely unusable. Output not reliably connected to source audio.
This is a qualitative evaluation, not a calculated word error rate (WER). Computing WER requires a validated reference transcript and word-level alignment tooling. Human evaluation captures practical usability more directly but introduces subjectivity. The same evaluator scored all 14 comparisons to minimise inconsistency.
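For reference, a calculated WER compares a hypothesis transcript against a validated reference transcript word by word. Below is a minimal sketch of that calculation in Python, assuming plain whitespace tokenisation; it is illustrative only, and the scores on this page were not produced this way.

```python
# Minimal WER sketch: word-level edit distance between a reference
# transcript and a hypothesis transcript. Illustrative only.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming table of edit distances between word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the meeting starts at noon", "the meeting start at new"))  # 0.4
```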
Summary quality
Scored /10 by an LLM judge (GPT-4o) applied against a fixed rubric. The same prompt and rubric were used for every product.
Rubric criteria
Key point capture — did the summary include all major topics discussed in the recording?
Hallucination rate — did the summary introduce facts, names, or conclusions not present in the source audio?
Formatting quality — is the output structured, appropriately concise, and easy to scan?
Actionability — are action items, decisions, and follow-ups clearly surfaced where applicable?
Scoring scale
9–10: All key points captured. No hallucinations. Clean formatting. Actionable output.
7–8: Most key points present. Minor omissions. No significant hallucinations.
5–6: Partial coverage. Some missing points or minor invented detail.
3–4: Significant gaps or hallucinations. Needs substantial correction.
Undetermined: Product does not generate summaries, or a comparable output could not be produced.
A score of "Undetermined" appears when a product does not offer the capability being measured, or when a comparable output could not be produced under consistent conditions. It is not a negative score — it means the comparison does not apply.
LLM judges can produce inconsistent scores across runs. We used a single fixed evaluation run per product rather than averaging multiple runs. This is a known limitation.
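To make the setup concrete, here is a rough sketch of what a fixed-rubric judge call looks like with the OpenAI Python client. The rubric wording, the judge_summary function name, the temperature setting, and the output format are illustrative assumptions, not the exact prompt used to produce the published scores.

```python
# Illustrative fixed-rubric judge call; not the exact prompt or parsing
# behind the published scores.
from openai import OpenAI

client = OpenAI()

RUBRIC = """Score the summary from 1 to 10 against these criteria:
- Key point capture: all major topics from the transcript are present.
- Hallucination rate: no facts, names, or conclusions absent from the transcript.
- Formatting quality: structured, appropriately concise, easy to scan.
- Actionability: action items, decisions, and follow-ups are clearly surfaced.
Reply with a single integer score followed by a one-sentence justification."""

def judge_summary(transcript: str, summary: str) -> str:
    # Same system prompt for every product keeps the rubric fixed across runs.
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # reduces (but does not remove) run-to-run variance
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"TRANSCRIPT:\n{transcript}\n\nSUMMARY:\n{summary}"},
        ],
    )
    return response.choices[0].message.content
```

Averaging several such runs per product would further reduce run-to-run variance; as noted above, the published scores used a single run.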
Recording reliability
Unlike the transcription and summary scores, recording reliability was not derived from the shared test recording. It reflects patterns in App Store and Google Play reviews collected in March 2026.
What this measures
Reliability — do recordings consistently save and process as expected?
Failure modes — how often do users report silent failures, missing recordings, or processing errors?
Consistency across conditions — does reliability hold across mobile, desktop, and different network conditions?
Scoring scale
9–10: Highly reliable. Failures rarely or never mentioned in reviews.
7–8: Mostly reliable. Isolated failures mentioned but not a dominant theme.
5–6: Reliability issues appear in reviews frequently enough to be a real risk.
3–4: Frequent reliability complaints. Data loss is a recurring theme in user reviews.
This metric is more subjective than the test-based scores. It reflects review sentiment, not a controlled experiment. Products with fewer total reviews have a noisier signal.
Limitations
Tester bias. Tests were run by the team that built Audionotes. Independent third-party verification would be more reliable. We have tried to be consistent and document everything, but you should weigh this accordingly.
Single test file. One 30-minute English recording does not capture performance across accents, longer recordings, specialist vocabulary, or non-English speech.
Qualitative transcription scoring. Scores are not calculated word error rates. They reflect a human evaluator's judgment of practical usability, which introduces subjectivity.
LLM judge variability. Summary quality scores were produced by a single LLM evaluation run, not averaged across multiple runs. Different models or prompt phrasings may produce different scores.
Recording reliability from reviews. This metric reflects public review data, not isolated testing. It is more impressionistic than the other scores.
Products update frequently. Scores reflect performance as of March 2026. A product update can materially change accuracy or reliability.
