How we test and score AI note-taking apps

Each compare page on this site includes a benchmark table showing how Audionotes and a specific competitor perform across four or five metrics. This document explains exactly how each metric is measured, what the scores mean, and where the numbers come from. Our goal is full transparency so you can judge how much weight to place on any given score.

All testing was conducted by the Audionotes team in March 2026 unless a compare page states otherwise. We test each app independently using the same source recording, evaluate results against fixed rubrics, and record scores before publishing.

Disclosure

Audionotes is one of the products being evaluated. We have done our best to apply the same rubrics and scoring criteria to all apps, including our own. No competitor paid for inclusion or a favourable score. App versions change frequently — scores reflect the version available at the time of testing and may not match current performance.

Test recording

We use a single, standardised test recording for all head-to-head comparisons. Using the same source file ensures that differences in output reflect the app's capabilities rather than variation in the input.

Property            Value
Type                Two-person conversation (simulated business meeting)
Duration            30 minutes
Speakers            2 (one native, one near-native English speaker)
Background noise    Moderate café ambient noise (approx. 45 dB)
Language            English
Recording device    iPhone 15 Pro, built-in microphone, held on desk
File format         M4A, 44.1 kHz, stereo
Content             Product roadmap discussion: goals, blockers, and action items

Meeting-bot products (e.g. Fireflies, Granola) were tested separately using their native calendar-join workflow with the same meeting content run as a live call. In those cases, reliability is evaluated differently — see the individual compare page for details.

Transcription accuracy score

The transcription score reflects how faithfully the app converts speech to text. A human evaluator compares the app's transcript against the ground-truth transcript (produced by a professional transcriptionist) and scores the output on the following criteria:

  • Word error rate: the proportion of words that are substituted, inserted, or deleted relative to the reference transcript.
  • Proper nouns and technical terms: whether names, product terms, and domain-specific vocabulary are handled correctly.
  • Punctuation and sentence boundaries: whether the transcript is usable without extensive manual cleanup.
  • Speaker attribution: whether the app correctly separates speakers when diarization is advertised.
  • Filler word handling: whether "um", "uh", and false starts are cleaned up or left verbatim (both are acceptable; the evaluator checks for consistency).
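
The word error rate criterion can be illustrated with a short sketch. This is not the evaluator's actual tooling, which this page does not describe; it assumes lowercase, whitespace-split tokens with punctuation left in place, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalisation: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution ("the" -> "a") and one deletion ("by")
# against a five-word reference.
print(word_error_rate("ship the roadmap by friday", "ship a roadmap friday"))
```

On a real 30-minute transcript the same calculation runs over thousands of words, so the choice of normalisation (casing, punctuation, filler words) can move the result noticeably.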

Scoring scale

10 / 10: Near-perfect. Word error rate below 2%; proper nouns correct; clean, readable output.
9 / 10: Excellent. Very few errors; minor punctuation or capitalisation inconsistencies only.
8 / 10: Good. Occasional word errors or missed proper nouns; transcript is usable with light editing.
7 / 10: Acceptable. Noticeable errors but the meaning is preserved; some manual correction required.
5–6 / 10: Below average. Frequent errors, missing sentences, or heavy filler-word noise.
1–4 / 10: Poor. Transcript is difficult to use without substantial correction.

The human evaluator is not affiliated with any of the apps tested and uses only the ground-truth transcript and the scoring rubric above. Scores are rounded to the nearest whole number. Where an app produces no transcript (voice-memo-only products), the transcription score is marked Undetermined.

Summary quality score

Summary quality is scored by an LLM judge (GPT-4o) using a fixed rubric. The judge receives the ground-truth transcript and the app's summary side-by-side and is asked to score the summary on five dimensions:

Dimension       Weight    What we look for
Coverage        30%       All key topics, decisions, and action items from the meeting are present.
Accuracy        25%       Nothing in the summary contradicts or misrepresents what was said.
Structure       20%       The summary is organised in a way that is easy to scan and act on.
Conciseness     15%       The summary omits filler and captures only what matters.
Action items    10%       Concrete next steps are identified and attributed to the right person where possible.
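
As a rough illustration of how the weights could combine, the sketch below assumes each dimension is scored from 1 to 10 and the overall score is the weighted sum. The rubric lists the weights but does not spell out the combination formula, so treat this as one plausible reading; the function name and the example scores are hypothetical.

```python
from typing import Mapping

# Weights taken from the rubric table; they sum to 100%.
WEIGHTS = {
    "coverage": 0.30,
    "accuracy": 0.25,
    "structure": 0.20,
    "conciseness": 0.15,
    "action_items": 0.10,
}


def weighted_summary_score(dimension_scores: Mapping[str, float]) -> float:
    """Combine per-dimension scores (assumed 1-10) into one weighted score."""
    if set(dimension_scores) != set(WEIGHTS):
        raise ValueError("expected exactly the five rubric dimensions")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


# Made-up judge output for a single summary.
example = {
    "coverage": 9,
    "accuracy": 10,
    "structure": 8,
    "conciseness": 7,
    "action_items": 6,
}
print(round(weighted_summary_score(example), 2))  # 8.45
```

Under this reading, a weak action-item list costs less than a gap in coverage, because coverage carries three times the weight.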

Scoring scale

10 / 10: Exceptional. All five dimensions are excellent; the summary could replace the transcript for most purposes.
9 / 10: Very strong. One minor gap in coverage or structure; no accuracy issues.
8 / 10: Good. A few missed action items or slightly verbose; still highly usable.
7 / 10: Acceptable. Gaps in coverage or structure; requires some cross-referencing with the transcript.
5–6 / 10: Below average. Key decisions or action items are missing or inaccurate.
1–4 / 10: Poor. The summary is misleading, largely incomplete, or hallucinates content.

Undetermined

If an app does not generate summaries at all, or generates only a rewritten prose version of the transcript with no extractive structure, the summary quality score is marked Undetermined rather than zero, since the product is designed for a different use case.

We use the same LLM judge prompt across all evaluations. The prompt is fixed and does not reference the app name, preventing brand bias. The judge is given no information about which app produced which output. Final scores are the result of a single evaluation pass; we re-run the evaluation if the initial score falls on a half-point boundary to confirm the result.
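
The half-point re-run rule can be expressed as a small check. This is a sketch under the assumption that the judge yields a fractional raw score before rounding; the function name `needs_rerun` and the tolerance value are hypothetical.

```python
def needs_rerun(raw_score: float, tolerance: float = 1e-9) -> bool:
    """Return True when the raw score sits on a half-point boundary."""
    fractional = raw_score % 1
    return abs(fractional - 0.5) <= tolerance


print(needs_rerun(8.5))   # True: halfway between 8 and 9
print(needs_rerun(8.45))  # False: rounds to 8 unambiguously
```

A boundary check matters because rounding conventions differ at exactly .5; Python's built-in `round`, for instance, rounds halves to the nearest even integer, so `round(8.5)` gives 8.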

Recording reliability score

Reliability captures how consistently the app records and processes audio without errors. Unlike transcription or summary quality, reliability is difficult to measure in a single test session. We derive the reliability score from a combination of direct testing and App Store review analysis.

  • Direct testing: we record three sessions with each app and note any crashes, processing failures, upload errors, or dropped audio.
  • App Store review patterns: we code the most recent reviews available at test time (minimum 50 reviews per app) for reliability-related complaints: crashes, stuck processing, lost recordings, and sync failures.
  • Composite score: direct testing accounts for 60% of the reliability score; review pattern analysis accounts for 40%.
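
The 60/40 split can be sketched as follows, assuming each component is first scored from 1 to 10 on its own. That intermediate scoring step is an assumption, since only the weights are stated here, and the function name is hypothetical.

```python
DIRECT_WEIGHT = 0.60  # direct testing share of the reliability score
REVIEW_WEIGHT = 0.40  # App Store review analysis share


def composite_reliability(direct_score: float, review_score: float) -> float:
    """Blend two component scores (each assumed to be on a 1-10 scale)."""
    for score in (direct_score, review_score):
        if not 1 <= score <= 10:
            raise ValueError("component scores must be between 1 and 10")
    return DIRECT_WEIGHT * direct_score + REVIEW_WEIGHT * review_score


# Made-up example: clean direct testing, moderate complaint rate in reviews.
print(round(composite_reliability(9, 7), 1))  # 8.2
```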

Scoring scale

10 / 10: Zero issues in direct testing; reliability complaints are rare or absent in reviews.
9 / 10: No critical failures in testing; isolated reliability complaints in reviews (under 5%).
8 / 10: No lost recordings in testing; occasional complaints in reviews (5–10%).
7 / 10: One minor processing hiccup in testing; moderate complaint rate (10–20%).
5–6 / 10: One or more processing failures or crashes during testing; notable complaint rate.
1–4 / 10: Consistent failures in testing; reliability is a primary user complaint.

Undetermined

For apps with fewer than 50 App Store reviews at test time, or apps that have only recently launched, we mark reliability as Undetermined and rely solely on our direct testing observations, noting this on the compare page.

Review coding was performed by the Audionotes team using a fixed reliability codebook. Reviewers coded independently; discrepancies were resolved by consensus. The review dataset for each app is described in that app's compare page under "Review methodology."
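
To make the complaint-rate percentages in the scoring scale concrete, here is a minimal sketch of how a rate could be derived from coded reviews. The code labels below are hypothetical stand-ins for the codebook, which is not reproduced on this page, and returning `None` is just one way to represent an Undetermined result.

```python
from typing import Optional, Sequence, Set

# Hypothetical labels for the four complaint types named above.
RELIABILITY_CODES = frozenset(
    {"crash", "stuck_processing", "lost_recording", "sync_failure"}
)
MIN_REVIEWS = 50  # below this, the review component is Undetermined


def complaint_rate(coded_reviews: Sequence[Set[str]]) -> Optional[float]:
    """Share of reviews carrying at least one reliability code.

    Returns None when there are too few reviews to analyse.
    """
    if len(coded_reviews) < MIN_REVIEWS:
        return None
    flagged = sum(1 for codes in coded_reviews if codes & RELIABILITY_CODES)
    return flagged / len(coded_reviews)


# Made-up dataset: 60 reviews, 3 of which mention a reliability problem.
reviews = [{"crash"}] * 2 + [{"sync_failure", "pricing"}] + [set()] * 57
print(complaint_rate(reviews))  # 0.05
```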

Limitations

No methodology is perfect. The following limitations apply to all scores published on this site:

  • Point-in-time snapshot. Scores reflect app versions available in March 2026. Both Audionotes and competitors update frequently; performance may have improved or regressed since testing.
  • English only. The test recording is in English. Apps that specialise in multilingual transcription may perform differently on non-English content than the scores suggest.
  • Single recording type. We use one standardised recording (two-speaker, 30-minute, moderate noise). Apps optimised for solo dictation, lectures, or large-group meetings may be under- or over-represented by these scores.
  • iOS-first testing. Direct testing was conducted on iPhone 15 Pro. Android or web-app versions of the same product may perform differently.
  • LLM judge variance. GPT-4o is used as the summary judge. LLM evaluations have inherent variance. We run each evaluation with a fixed prompt and temperature to minimise this, but a repeated run could still shift a score by up to 1 point in either direction.
  • Audionotes conflict of interest. We are a competitor to all apps reviewed on this site. We publish our full methodology and rubrics to allow independent scrutiny, and we welcome corrections via email.

If you believe a score for any app is wrong, outdated, or based on a misapplication of the rubric, please contact us at support@audionotes.app.

Frequently Asked Questions

Who runs the tests?

The Audionotes team. We test each app independently using the same source recording, evaluate against fixed rubrics, and record scores before publishing. Audionotes is one of the products being scored; we apply the same rubric to ourselves and link to this methodology page from every compare page so readers can scrutinise the process.

How is transcription accuracy scored?

A human evaluator compares each app's transcript against a ground-truth transcript produced by a professional transcriptionist, scoring on word error rate, proper-noun handling, punctuation, speaker attribution, and filler-word treatment. Scores run 1–10. The same evaluator and rubric are used for every app, including Audionotes.

How is summary quality scored?

Summary quality is judged by GPT-4o using a fixed five-dimension rubric (coverage, accuracy, structure, conciseness, action items). The judge sees the ground-truth transcript and the app's summary side-by-side and scores each dimension from 1 to 10; the final score combines the five dimensions using the weights listed under "Summary quality score" above.

How current are the scores?

Apps ship updates frequently, so scores reflect the version available at the time of testing (March 2026 unless otherwise noted on a compare page). When a competitor ships a major release, we re-test and update the affected compare pages within about two weeks.

What recording is used for testing?

A 30-minute two-speaker English business meeting recorded on iPhone 15 Pro with moderate café background noise (~45 dB). The same source file is fed to every app being tested. Meeting-bot products like Fireflies and Granola are tested separately via their native calendar-join workflow on a live call with the same content.

How do I report a score I think is wrong?

Email support@audionotes.app with the app, the metric, and what you think the correct score should be. We will review and either update the score with a dated note explaining the change, or reply explaining why we kept the original.
