How we test and score AI note-taking apps
Each compare page on this site includes a benchmark table showing how Audionotes and a specific competitor perform across four or five metrics. This document explains exactly how each metric is measured, what the scores mean, and where the numbers come from. Our goal is full transparency so you can judge how much weight to place on any given score.
All testing was conducted by the Audionotes team in March 2026 unless a compare page states otherwise. We test each app independently using the same source recording, evaluate results against fixed rubrics, and record scores before publishing.
Disclosure
Audionotes is one of the products being evaluated. We have done our best to apply the same rubrics and scoring criteria to all apps, including our own. No competitor paid for inclusion or a favourable score. App versions change frequently — scores reflect the version available at the time of testing and may not match current performance.
Test recording
We use a single, standardised test recording for all head-to-head comparisons. Using the same source file ensures that differences in output reflect the app's capabilities rather than variation in the input.
| Property | Value |
|---|---|
| Type | Two-person conversation (simulated business meeting) |
| Duration | 30 minutes |
| Speakers | 2 (one native, one near-native English speaker) |
| Background noise | Moderate café ambient noise (approx. 45 dB) |
| Language | English |
| Recording device | iPhone 15 Pro, built-in microphone, held on desk |
| File format | M4A, 44.1 kHz, stereo |
| Content | Product roadmap discussion: goals, blockers, and action items |
Meeting-bot products (e.g. Fireflies, Granola) were tested separately using their native calendar-join workflow with the same meeting content run as a live call. In those cases, reliability is evaluated differently — see the individual compare page for details.
Transcription accuracy score
The transcription score reflects how faithfully the app converts speech to text. A human evaluator compares the app's transcript against the ground-truth transcript (produced by a professional transcriptionist) and scores the output on the following criteria:
- Word error rate: the proportion of words that are substituted, inserted, or deleted relative to the reference transcript.
- Proper nouns and technical terms: whether names, product terms, and domain-specific vocabulary are handled correctly.
- Punctuation and sentence boundaries: whether the transcript is usable without extensive manual cleanup.
- Speaker attribution: whether the app correctly separates speakers when diarization is advertised.
- Filler word handling: whether "um", "uh", and false starts are cleaned up or left verbatim (both are acceptable; the evaluator checks for consistency).
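As an illustration of the first criterion, word error rate can be computed as the word-level edit distance between the two transcripts divided by the number of words in the reference. The sketch below is a minimal version of that standard calculation, not the evaluator's actual tooling. The function name `word_error_rate` and the normalisation (lowercasing and splitting on whitespace) are assumptions made for the example. How the resulting rate is combined with the other four criteria into a final score is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Normalisation here (lowercase, whitespace split) is an assumption;
    punctuation is NOT stripped, so clean the text first if needed.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("the" -> "a") and one deletion ("by") over 5 reference words.
    print(word_error_rate("ship the roadmap by friday", "ship a roadmap friday"))
```

Because insertions count as errors, the rate can exceed 1.0 when the app's transcript is much longer than the reference.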
Scoring scale
Summary quality score
Summary quality is scored by an LLM judge (GPT-4o) using a fixed rubric. The judge receives the ground-truth transcript and the app's summary side-by-side and is asked to score the summary on five dimensions:
| Dimension | What we look for | Weight |
|---|---|---|
| Coverage | All key topics, decisions, and action items from the meeting are present. | 30% |
| Accuracy | Nothing in the summary contradicts or misrepresents what was said. | 25% |
| Structure | The summary is organised in a way that is easy to scan and act on. | 20% |
| Conciseness | The summary omits filler and captures only what matters. | 15% |
| Action items | Concrete next steps are identified and attributed to the right person where possible. | 10% |
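To show how the weights in the table could be applied, the sketch below combines five per-dimension scores into a single number as a weighted sum. It assumes each dimension is scored on the same numeric scale (for example 1 to 10) and that the weights are combined linearly. The names `WEIGHTS` and `weighted_summary_score` are hypothetical, and the example scores are made up for illustration rather than taken from any tested app.

```python
# Weights copied from the rubric table above; they sum to 1.0.
WEIGHTS = {
    "coverage": 0.30,
    "accuracy": 0.25,
    "structure": 0.20,
    "conciseness": 0.15,
    "action_items": 0.10,
}


def weighted_summary_score(dimension_scores: dict) -> float:
    """Combine per-dimension scores into one score using the rubric weights.

    Assumes every dimension uses the same scale; raises if any is missing.
    """
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Illustrative scores only, not real results.
    example = {
        "coverage": 8,
        "accuracy": 9,
        "structure": 7,
        "conciseness": 6,
        "action_items": 10,
    }
    print(round(weighted_summary_score(example), 2))
```

With these weights, a one-point change in coverage moves the final score three times as much as a one-point change in action items.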
Scoring scale
Undetermined
If an app does not generate summaries at all, or generates only a rewritten prose version of the transcript with no extractive structure, the summary quality score is marked Undetermined rather than zero, since the product is designed for a different use case.
Recording reliability score
Reliability captures how consistently the app records and processes audio without errors. Unlike transcription or summary quality, reliability is difficult to measure in a single test session. We derive the reliability score from a combination of direct testing and App Store review analysis.
- Direct testing: we record three sessions with each app and note any crashes, processing failures, upload errors, or dropped audio.
- App Store review patterns: we code the most recent reviews available at test time (minimum 50 reviews per app) for reliability-related complaints: crashes, stuck processing, lost recordings, and sync failures.
- Composite score: direct testing accounts for 60% of the reliability score; review pattern analysis accounts for 40%.
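The 60/40 split can be sketched as a simple weighted sum. The example below assumes the direct-testing and review-analysis components have each already been converted to a score on the same scale; how that conversion is done is not specified here. It also applies the 50-review minimum by returning `None` (standing in for Undetermined) when too few reviews are available. The function and constant names are hypothetical.

```python
from typing import Optional

DIRECT_WEIGHT = 0.60  # share of the score from direct testing
REVIEW_WEIGHT = 0.40  # share of the score from App Store review analysis
MIN_REVIEWS = 50      # minimum reviews needed for the review component


def composite_reliability(
    direct_score: float,
    review_score: float,
    review_count: int,
) -> Optional[float]:
    """Blend the two reliability components on a shared scale.

    Returns None (treated as Undetermined) when fewer than MIN_REVIEWS
    reviews were available, since the review component is then unusable.
    """
    if review_count < MIN_REVIEWS:
        return None
    return DIRECT_WEIGHT * direct_score + REVIEW_WEIGHT * review_score


if __name__ == "__main__":
    # Illustrative inputs only, not real results.
    print(composite_reliability(direct_score=9.0, review_score=6.5, review_count=120))
    print(composite_reliability(direct_score=9.0, review_score=6.5, review_count=30))
```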
Scoring scale
Undetermined
For apps with fewer than 50 App Store reviews at test time, or apps that have only recently launched, we mark reliability as Undetermined and rely solely on our direct testing observations, noting this on the compare page.
Limitations
No methodology is perfect. The following limitations apply to all scores published on this site:
- Point-in-time snapshot. Scores reflect app versions available in March 2026. Both Audionotes and competitors update frequently; performance may have improved or regressed since testing.
- English only. The test recording is in English. Apps that specialise in multilingual transcription may perform differently on non-English content than the scores suggest.
- Single recording type. We use one standardised recording (two-speaker, 30-minute, moderate noise). Apps optimised for solo dictation, lectures, or large-group meetings may be under- or over-represented by these scores.
- iOS-first testing. Direct testing was conducted on iPhone 15 Pro. Android or web-app versions of the same product may perform differently.
- LLM judge variance. GPT-4o is used as the summary judge. LLM evaluations have inherent variance. We run each evaluation with a fixed prompt and temperature to minimise this, but a repeated run could produce a score that differs by up to 1 point.
- Audionotes conflict of interest. We are a competitor to all apps reviewed on this site. We publish our full methodology and rubrics to allow independent scrutiny, and we welcome corrections via email.
If you believe a score for any app is wrong, outdated, or based on a misapplication of the rubric, please contact us at support@audionotes.app.
Frequently Asked Questions
The Audionotes team. We test each app independently using the same source recording, evaluate against fixed rubrics, and record scores before publishing. Audionotes is one of the products being scored — we apply the same rubric to ourselves and link to this methodology page from every compare page so readers can scrutinise the process.
A human evaluator compares each app's transcript against a ground-truth transcript produced by a professional transcriptionist, scoring on word error rate, proper-noun handling, punctuation, speaker attribution, and filler-word treatment. Scores run 1–10. The same evaluator and rubric are used for every app, including Audionotes.
Summary quality is judged by GPT-4o using a fixed five-dimension rubric (coverage, accuracy, structure, conciseness, and action items). The judge sees the ground-truth transcript and the app's summary side-by-side and scores 1–10 on each dimension; the final score is the weighted mean, using the weights listed in the summary quality rubric (30%, 25%, 20%, 15%, and 10% respectively).
Apps ship updates frequently, so scores reflect the version available at the time of testing (March 2026 unless otherwise noted on a compare page). When a competitor ships a major release, we re-test and update the affected compare pages within ~2 weeks.
A 30-minute, two-speaker English business meeting recorded on an iPhone 15 Pro with moderate café background noise (~45 dB). The same source file is fed to every app being tested. Meeting-bot products like Fireflies and Granola are tested separately via their native calendar-join workflow on a live call with the same content.
Email support@audionotes.app with the app, the metric, and what you think the correct score should be. We will review and either update the score with a dated note explaining the change, or reply explaining why we kept the original.