Dataset

Coverage

This page describes what Perpetua delivers on day one (by field, quality tier, sector, and accounts type) and names the gaps we know about. The live numbers are available at /v1/meta/coverage; this page is the qualitative frame around them.

Data Quality Tiers

Every company carries a data_quality_tier set by our nightly audit. By default, endpoints return gold + silver + review; toxic is never returned.

Gold

Full financials (revenue, profit, cash-flow), balance sheet balances, valid ratios. The tier most API users want.

Silver

Limited data — typically micro-entity or abridged filings. Balance-sheet signals present; P&L may be sparse by filing regime.

Review

Anomalies detected during the nightly quality audit. Surfaced because they may still be useful context, but flagged so you can opt out.

Toxic

Mathematically impossible (e.g. 100% margins from an XBRL scale bug). Never returned from any public endpoint — stripped at the query layer.
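
If you want a stricter set than the default (for example gold and silver only), a client-side filter is one option. The following is a minimal sketch: the record shape, the sample companies, and the helper name filter_by_tier are illustrative assumptions, and only the data_quality_tier field and the four tier names come from this page.

```python
# Hypothetical record shape: a dict with a "data_quality_tier" string.
DEFAULT_TIERS = frozenset({"gold", "silver", "review"})  # default set described above

def filter_by_tier(records, allowed=DEFAULT_TIERS):
    """Keep records whose tier is in `allowed`; toxic is dropped even if requested."""
    allowed = set(allowed) - {"toxic"}
    return [r for r in records if r.get("data_quality_tier") in allowed]

# Made-up sample data for illustration only.
companies = [
    {"name": "A Ltd", "data_quality_tier": "gold"},
    {"name": "B Ltd", "data_quality_tier": "silver"},
    {"name": "C Ltd", "data_quality_tier": "review"},
]

# Opt out of review-tier records.
strict = filter_by_tier(companies, allowed={"gold", "silver"})
```

Here strict keeps A Ltd and B Ltd and leaves out the review-tier record.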

Sector Taxonomy

Companies are bucketed into 17 sectors derived from their UK SIC 2007 division. Mappings are shared with NACE Rev. 2 at the division level. Financial services (SIC 64/65/66) and holding-company SIC 70100 are filtered out at ingestion, so Finance does not appear as a bucket; the API rejects sector=Finance rather than silently returning nothing. Companies whose primary SIC does not match a division appear as “Unclassified” (a very small share).

Agriculture, Energy & Mining, Manufacturing, Utilities, Construction, Retail & Wholesale, Logistics, Hospitality, Technology, Real Estate, Professional Services, Business Services, Public Administration, Education, Healthcare, Arts & Entertainment, Other Services

P&L Addressable Subset

Revenue, profit before tax, staff costs, and cash-flow coverage are reported two ways: against the full stage-3 population (pct) and against the subset of companies whose filing regime actually requires a P&L (FULL, GROUP, MEDIUM; the addressable_pct field). Use the addressable percentage when judging how complete a P&L-dependent metric is; the full-population percentage is diluted by filings that are not legally required to disclose it.
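
To make the difference between the two denominators concrete, here is a small worked sketch. It assumes each record carries an accounts_type and a nullable metric field, and that the addressable numerator counts only populated records inside the addressable subset; the function name coverage and the sample rows are made up for illustration and are not Perpetua's own calculation.

```python
ADDRESSABLE_TYPES = {"FULL", "GROUP", "MEDIUM"}  # regimes that require a P&L

def coverage(records, field):
    """Return (pct, addressable_pct) for `field`, as percentages rounded to 1 dp."""
    populated = [r for r in records if r.get(field) is not None]
    addressable = [r for r in records if r["accounts_type"] in ADDRESSABLE_TYPES]
    addressable_populated = [r for r in addressable if r.get(field) is not None]
    pct = 100 * len(populated) / len(records) if records else 0.0
    addressable_pct = (
        100 * len(addressable_populated) / len(addressable) if addressable else 0.0
    )
    return round(pct, 1), round(addressable_pct, 1)

# Made-up sample: two of five companies report revenue.
sample = [
    {"accounts_type": "FULL", "revenue": 1_200_000},
    {"accounts_type": "GROUP", "revenue": 5_000_000},
    {"accounts_type": "MEDIUM", "revenue": None},
    {"accounts_type": "MICRO-ENTITY", "revenue": None},
    {"accounts_type": "DORMANT", "revenue": None},
]

pct, addressable_pct = coverage(sample, "revenue")
# pct is 40.0 (2 of 5); addressable_pct is 66.7 (2 of 3)
```

The two non-addressable filings pull the full-population figure down to 40.0 even though two of the three companies that must file a P&L have revenue populated.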

Known Gaps

We disclose these rather than paper over them. If you need a silent fill, Perpetua is probably not the right tool for the job; if you want to know exactly what the dataset will and won’t return, this is the list.

FULL filings without a tagged P&L

stream3_no_pnl

A cohort of FULL-type filings ships without a machine-tagged P&L; the turnover is written as untagged HTML. These companies still appear with balance-sheet data. Revenue is not imputed; it is simply reported as null in the response.

Micro-entity and dormant filings

micro_entity_no_pnl

UK filing rules do not require MICRO-ENTITY or DORMANT companies to report revenue, profit before tax, staff costs, or cash flow. These fields will be null by design; balance-sheet coverage remains strong. Use accounts_type filters to opt out.
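
Because a null revenue can come from either of the two gaps above, a consumer may want to label why a value is missing instead of treating every null alike. The sketch below is one way to do that on the client side. The gap keys are the ones listed on this page, but the mapping from a record to a gap key, the record shape, and the function name explain_null_revenue are assumptions made for illustration.

```python
NO_PNL_BY_DESIGN = {"MICRO-ENTITY", "DORMANT"}  # regimes with no P&L requirement

def explain_null_revenue(record):
    """Return a likely gap key for a null revenue, or None if revenue is present."""
    if record.get("revenue") is not None:
        return None  # revenue was reported, nothing to explain
    accounts_type = record.get("accounts_type")
    if accounts_type in NO_PNL_BY_DESIGN:
        return "micro_entity_no_pnl"
    if accounts_type == "FULL":
        # Assumed: a FULL filing with null revenue belongs to the untagged-P&L cohort.
        return "stream3_no_pnl"
    return "unexplained"

reason = explain_null_revenue({"accounts_type": "DORMANT", "revenue": None})
```

A revenue of 0 counts as reported here; only null is treated as missing.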

Gold tier calibration: 93.1% vs 98% target

gold_tier_calibration

The calibration rate is the share of stored gold-tier values that round-trip exactly to the raw XBRL filing when independently re-extracted. Currently 94 of 101 sampled fields match (93.1%); the 7 mismatches are a mix of consolidation issues, context-pick issues, and a known total-assets scale anomaly. This is not the share of companies tagged gold; it is the fidelity of the values on the companies we do tag gold.
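
The headline figure follows directly from the sample counts. The arithmetic below is a worked check using only the numbers stated above; the variable names are illustrative.

```python
import math

matched, sampled = 94, 101   # sampled fields that round-trip exactly, and sample size
target = 98.0                # target calibration rate, in percent

rate = round(100 * matched / sampled, 1)   # 93.1
shortfall = round(target - rate, 1)        # 4.9 percentage points below target

# Matches needed to reach the target at the same sample size.
needed = math.ceil(target * sampled / 100)  # 99
```

At the current sample size, reaching the target would mean 99 of 101 fields matching, that is, resolving at least 5 of the 7 mismatches.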

Small total-assets scale anomalies

total_assets_scale_bug

A small number of companies (single digits) have a known scale bug in total_assets pending forensic repair. Migration 122 removed the bulk of affected rows; the remainder are flagged review-tier.

Freshness

The pipeline runs nightly. Bulk ingest and quality audits complete before 06:00 UTC; the Companies House API enrichment pass follows and finishes well before business hours in the UK. Coverage is snapshotted once a day, and the snapshot time is available as last_coverage_snapshot_at on the health endpoint. Individual company updated_at timestamps reflect the last pipeline touch.
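
A consumer can use the snapshot timestamp to detect a missed nightly run. The following is a minimal sketch, assuming the health payload is a JSON object and that last_coverage_snapshot_at is an ISO 8601 string with a UTC offset or a trailing "Z". The payload shape, the 36-hour threshold, and the function name snapshot_is_stale are assumptions, and the timestamps are made up.

```python
from datetime import datetime, timedelta, timezone

def snapshot_is_stale(health, now=None, max_age=timedelta(hours=36)):
    """Return True if the coverage snapshot is older than `max_age`."""
    now = now or datetime.now(timezone.utc)
    raw = health["last_coverage_snapshot_at"]
    # Normalise a trailing "Z" so older Python versions can parse it.
    taken_at = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    return now - taken_at > max_age

# Illustrative payload, not a real response.
health = {"last_coverage_snapshot_at": "2024-01-02T05:30:00Z"}

same_day = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
two_days_later = datetime(2024, 1, 4, 12, 0, tzinfo=timezone.utc)

fresh = snapshot_is_stale(health, now=same_day)          # False
stale = snapshot_is_stale(health, now=two_days_later)    # True
```

The threshold is deliberately longer than 24 hours so that a nightly run finishing slightly later than the previous one does not trigger a false alarm.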