How we test and score AI note-taking apps
Every head-to-head comparison on this site includes scored benchmarks for transcription accuracy, summary quality, and recording reliability. This page explains exactly how those scores are produced, what each number means, and where the methodology has real limitations.
Disclosure: These tests were conducted by the Audionotes team. Audionotes is one of the two products in every comparison. We have a direct interest in the outcome. We have documented our process here so you can evaluate it, replicate it, or weigh it accordingly.
All transcription and summary scores are based on a single shared test file (a 30-minute English recording with two speakers) applied consistently across every product.
Products that do not accept uploaded audio (e.g. meeting-bot-only tools) were tested by running a live session using the same source audio played through speakers. This introduces a small additional noise floor, which is noted where relevant.
Transcription accuracy
Scored /10 by a human evaluator on the Audionotes team. The same evaluator scored all products against the same recording to ensure consistency across comparisons.
What the evaluator assessed
Word accuracy — meaningful errors, dropped words, garbled phrases that change the meaning
Speaker separation — were the two speakers correctly identified and consistently labelled throughout?
Proper noun handling — names, product names, and technical terms
Punctuation and readability — is the transcript usable without manual cleanup?
Scoring scale
9–10: Near-perfect. Rare or minor errors that do not affect meaning. Usable as-is.
7–8: Accurate enough for practical use. Noticeable errors but meaning is preserved.
5–6: Significant errors requiring correction before sharing or acting on.
3–4: Frequent errors that undermine usability. Heavy editing required.
1–2: Largely unusable. Output not reliably connected to source audio.
This is a qualitative evaluation, not a calculated word error rate (WER). Computing WER requires a validated reference transcript and word-level alignment tooling. Human evaluation captures practical usability more directly but introduces subjectivity. The same evaluator scored all 14 comparisons to minimise inconsistency.
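For reference, a calculated WER compares a hypothesis transcript against a validated reference transcript word by word. Below is a minimal sketch of that calculation in Python, assuming plain whitespace tokenisation; it is illustrative only, and the scores on this page were not produced this way.

```python
# Minimal WER sketch: word-level edit distance between a reference
# transcript and a hypothesis transcript. Illustrative only.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming table of edit distances between word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the meeting starts at noon", "the meeting start at new"))  # 0.4
```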
Summary quality
Scored /10 by an LLM judge (GPT-4o) applied against a fixed rubric. The same prompt and rubric were used for every product.
Rubric criteria
Key point capture — did the summary include all major topics discussed in the recording?
Hallucination rate — did the summary introduce facts, names, or conclusions not present in the source audio?
Formatting quality — is the output structured, appropriately concise, and easy to scan?
Actionability — are action items, decisions, and follow-ups clearly surfaced where applicable?
Scoring scale
9–10: All key points captured. No hallucinations. Clean formatting. Actionable output.
7–8: Most key points present. Minor omissions. No significant hallucinations.
5–6: Partial coverage. Some missing points or minor invented detail.
3–4: Significant gaps or hallucinations. Needs substantial correction.
Undetermined: Product does not generate summaries, or a comparable output could not be produced.
A score of "Undetermined" appears when a product does not offer the capability being measured, or when a comparable output could not be produced under consistent conditions. It is not a negative score — it means the comparison does not apply.
LLM judges can produce inconsistent scores across runs. We used a single fixed evaluation run per product rather than averaging multiple runs. This is a known limitation.
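To make the setup concrete, here is a rough sketch of what a fixed-rubric judge call looks like with the OpenAI Python client. The rubric wording, the judge_summary function name, the temperature setting, and the output format are illustrative assumptions, not the exact prompt used to produce the published scores.

```python
# Illustrative fixed-rubric judge call; not the exact prompt or parsing
# behind the published scores.
from openai import OpenAI

client = OpenAI()

RUBRIC = """Score the summary from 1 to 10 against these criteria:
- Key point capture: all major topics from the transcript are present.
- Hallucination rate: no facts, names, or conclusions absent from the transcript.
- Formatting quality: structured, appropriately concise, easy to scan.
- Actionability: action items, decisions, and follow-ups are clearly surfaced.
Reply with a single integer score followed by a one-sentence justification."""

def judge_summary(transcript: str, summary: str) -> str:
    # Same system prompt for every product keeps the rubric fixed across runs.
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # reduces (but does not remove) run-to-run variance
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"TRANSCRIPT:\n{transcript}\n\nSUMMARY:\n{summary}"},
        ],
    )
    return response.choices[0].message.content
```

Averaging several such runs per product would further reduce run-to-run variance; as noted above, the published scores used a single run.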
Recording reliability
Unlike the transcription and summary scores, recording reliability was not derived from the shared test recording. It reflects patterns in App Store and Google Play reviews collected in March 2026.
What this measures
Reliability — do recordings consistently save and process as expected?
Failure modes — how often do users report silent failures, missing recordings, or processing errors?
Consistency across conditions — does reliability hold across mobile, desktop, and different network conditions?
Scoring scale
9–10: Highly reliable. Failures rarely or never mentioned in reviews.
7–8: Mostly reliable. Isolated failures mentioned but not a dominant theme.
5–6: Reliability issues appear in reviews frequently enough to be a real risk.
3–4: Frequent reliability complaints. Data loss is a recurring theme in user reviews.
This metric is more subjective than the test-based scores. It reflects review sentiment, not a controlled experiment. Products with fewer total reviews have a noisier signal.
Limitations
Tester bias. Tests were run by the team that built Audionotes. Independent third-party verification would be more reliable. We have tried to be consistent and document everything, but you should weigh this accordingly.
Single test file. One 30-minute English recording does not capture performance across accents, longer recordings, specialist vocabulary, or non-English speech.
Qualitative transcription scoring. Scores are not calculated word error rates. They reflect a human evaluator's judgment of practical usability, which introduces subjectivity.
LLM judge variability. Summary quality scores were produced by a single LLM evaluation run, not averaged across multiple runs. Different models or prompt phrasings may produce different scores.
Recording reliability from reviews. This metric reflects public review data, not isolated testing. It is more impressionistic than the other scores.
Products update frequently. Scores reflect performance as of March 2026. A product update can materially change accuracy or reliability.
