Dataset
Coverage
This page describes what Perpetua delivers on day one (by field, quality tier, sector, and accounts type) and names the gaps we know about. The live numbers are available at /v1/meta/coverage; this page is the qualitative frame around them.
Data Quality Tiers
Every company carries a data_quality_tier set by our nightly audit. By default, endpoints return gold, silver, and review; toxic is never returned.
Gold
Full financials (revenue, profit, cash flow), a balance sheet that balances, and valid ratios. The tier most API users want.
Silver
Limited data — typically micro-entity or abridged filings. Balance-sheet signals present; P&L may be sparse by filing regime.
Review
Anomalies detected during the nightly quality audit. Surfaced because they may still be useful context, but flagged so you can opt out.
Toxic
Mathematically impossible (e.g. 100% margins from an XBRL scale bug). Never returned from any public endpoint — stripped at the query layer.
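For callers who want only some of the default tiers, a client-side filter is one option. The sketch below is a minimal illustration, assuming each company arrives as a dict carrying the data_quality_tier field described above; the function name and record shape are hypothetical, not part of the Perpetua API.

```python
# Tier names are taken from this page; everything else here is illustrative.
DEFAULT_TIERS = frozenset({"gold", "silver", "review"})  # toxic is never returned


def filter_by_tier(companies, allowed=frozenset({"gold"})):
    """Keep only companies whose data_quality_tier is in `allowed`.

    `companies` is assumed to be a list of dicts (hypothetical record shape).
    """
    unknown = set(allowed) - DEFAULT_TIERS
    if unknown:
        # Fail loudly on typos or on tiers the API never returns (e.g. toxic).
        raise ValueError(f"unsupported tiers: {sorted(unknown)}")
    return [c for c in companies if c.get("data_quality_tier") in allowed]
```

Passing allowed={"gold", "silver"} would drop the review tier while keeping limited-data filings.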
Sector Taxonomy
Companies are bucketed into 17 sectors derived from their UK SIC 2007 division. Mappings are shared with NACE Rev. 2 at the division level. Financial services (SIC 64/65/66) and holding-company SIC 70100 are filtered out at ingestion, so Finance does not appear as a bucket; the API rejects sector=Finance rather than silently returning nothing. Companies whose primary SIC does not match a division appear as “Unclassified” (a very small share).
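The ingestion filter described above can be approximated locally, for example to predict whether a company you hold a SIC code for will be absent from the dataset. This is a sketch under stated assumptions: the helper names are hypothetical, SIC codes are assumed to be five-digit strings, and the real pipeline may apply further rules not listed on this page.

```python
# Exclusions as stated on this page: divisions 64, 65, 66 and the code 70100.
EXCLUDED_DIVISIONS = {"64", "65", "66"}
EXCLUDED_CODES = {"70100"}


def sic_division(sic_code: str) -> str:
    """Return the two-digit division of a UK SIC 2007 code."""
    # Pad to five digits so a code stored without its leading zero still works.
    return sic_code.strip().zfill(5)[:2]


def is_filtered_at_ingestion(sic_code: str) -> bool:
    """True if the code falls under the exclusions named on this page."""
    code = sic_code.strip().zfill(5)
    return code in EXCLUDED_CODES or sic_division(code) in EXCLUDED_DIVISIONS
```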
P&L Addressable Subset
Revenue, profit before tax, staff costs, and cash-flow coverage are reported two ways: against the full stage-3 population (pct) and against the subset of companies whose filing regime actually requires a P&L (FULL, GROUP, MEDIUM; the addressable_pct field). Use the addressable percentage when judging how complete a P&L-dependent metric is; the full-population percentage is diluted by filings that are not legally required to disclose it.
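To make the difference between the two percentages concrete, here is a small sketch of how they could be computed over a list of records. It assumes each record is a dict with an accounts_type key and nullable metric fields; the function name and record shape are hypothetical, and the server-side calculation may differ in detail.

```python
# Filing regimes that require a P&L, as listed on this page.
P_AND_L_REGIMES = {"FULL", "GROUP", "MEDIUM"}


def coverage(companies, field):
    """Return pct and addressable_pct for one field (illustrative only)."""
    addressable = [c for c in companies if c.get("accounts_type") in P_AND_L_REGIMES]

    def share(rows):
        if not rows:
            return None  # avoid dividing by zero on an empty population
        filled = sum(1 for r in rows if r.get(field) is not None)
        return 100.0 * filled / len(rows)

    return {"pct": share(companies), "addressable_pct": share(addressable)}
```

With one populated FULL filing, one unpopulated FULL filing, and two MICRO-ENTITY filings, revenue coverage is 25% of the full population but 50% of the addressable subset.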
Known Gaps
We disclose these rather than paper over them. If you need a silent fill, Perpetua is probably not the right tool for the job; if you want to know exactly what the dataset will and won’t return, this is the list.
FULL filings without a tagged P&L
stream3_no_pnl: A cohort of FULL-type filings ships without a machine-tagged P&L; the turnover is written as untagged HTML. These companies still appear with balance-sheet data. Revenue is not imputed; it is simply reported as null in the response.
Micro-entity and dormant filings
micro_entity_no_pnl: UK filing rules do not require MICRO-ENTITY or DORMANT companies to report revenue, profit before tax, staff costs, or cash flow. These fields will be null by design; balance-sheet coverage remains strong. Use accounts_type filters to opt out.
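Because P&L fields can legitimately be null for the two cohorts above, downstream calculations should treat null as "not disclosed" rather than zero. A minimal sketch, assuming records are dicts and that the fields are named revenue and profit_before_tax (both names are assumptions for illustration):

```python
def profit_margin(company):
    """Return profit before tax divided by revenue, or None if not computable."""
    revenue = company.get("revenue")
    profit = company.get("profit_before_tax")
    # Null means "not disclosed", so propagate it instead of substituting 0.
    if revenue is None or profit is None:
        return None
    if revenue == 0:
        return None  # margin is undefined on zero revenue
    return profit / revenue
```

Returning None keeps undisclosed companies out of averages instead of dragging them toward zero.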
Gold tier calibration: 93.1% vs 98% target
gold_tier_calibration: The calibration rate is the share of stored gold-tier values that round-trip exactly to the raw XBRL filing when independently re-extracted. Currently 94 of 101 sampled fields match (93.1%); the 7 mismatches are a mix of consolidation, context-pick, and a known total-assets scale anomaly. This is not the share of companies tagged gold; it is the fidelity of the values on the companies we do tag gold.
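The headline figure follows directly from the sample counts. The short sketch below reproduces the arithmetic and works out how many matches a sample of the same size would need to reach the 98% target; the function names are illustrative only.

```python
def calibration_rate(matches: int, sampled: int) -> float:
    """Percentage of sampled fields that round-trip exactly."""
    return 100.0 * matches / sampled


def matches_needed(sampled: int, target_pct: int) -> int:
    """Smallest match count that reaches target_pct (integer ceiling division)."""
    return -(-target_pct * sampled // 100)


rate = calibration_rate(94, 101)    # 93.069..., reported as 93.1%
needed = matches_needed(101, 98)    # 99 matches out of 101
```

On a 101-field sample, closing the gap means going from 94 to at least 99 exact matches.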
Small total-assets scale anomalies
total_assets_scale_bug: A small number of companies (single digits) have a known scale bug in total_assets pending forensic repair. Migration 122 removed the bulk of affected rows; the remainder are flagged review-tier.
Freshness
The pipeline runs nightly. Bulk ingest and quality audits complete before 06:00 UTC; the Companies House API enrichment pass follows and finishes well before business hours in the UK. Coverage is snapshotted once a day and is available at last_coverage_snapshot_at on the health endpoint. Individual company updated_at timestamps reflect the last pipeline touch.
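A consumer can use the snapshot timestamp to detect a missed nightly run. The sketch below assumes last_coverage_snapshot_at is an ISO 8601 string in UTC (the format is not specified on this page) and uses a 30-hour threshold, which is an arbitrary allowance of one day plus slack; both helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp, treating naive values as UTC (assumption)."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def snapshot_is_stale(timestamp: str, now: datetime,
                      max_age: timedelta = timedelta(hours=30)) -> bool:
    """True if the coverage snapshot is older than max_age."""
    return now - parse_utc(timestamp) > max_age
```

In practice `now` would be datetime.now(timezone.utc); it is passed in explicitly here so the check is easy to test.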