Datasets — overview
Bilby Quant Data provides a single unified dataset containing all policy documents from Chinese government sources, spanning from 2021 to the present day.
Critical: Date partitioning by insertion date
⚠️ IMPORTANT: All files are partitioned by the inserted_at timestamp, not the published_at timestamp.

inserted_at records the date on which a document entered our processing pipeline, which may differ significantly from when the document was originally published. This design choice has important implications:
Documents are organised by when they were ingested into the system, not when they were published by their original sources. A document published in 2019 but scraped by Bilby in November 2024 will appear in a file dated November 2024, not 2019.
This approach ensures point-in-time reproducibility: the state of the dataset on any given date can be reconstructed precisely, because new data is always appended as new files rather than by modifying existing ones. Documents scraped tomorrow are added as new daily files, preserving the integrity of previously published files that clients may already have downloaded.
However, this choice can lead to unusually large files if Bilby performs
significant backfilling during a particular date range. When historical
documents are batch-processed and inserted on a single day, all those documents
will appear in that day's file, potentially necessitating a split into multiple
parts (indicated by the _p01, _p02 suffixes). A file dated 2024-11-15 might
contain thousands of documents published across many years, all inserted into
the pipeline on that single day.
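The file-selection logic this implies can be sketched as follows. The YYYY-MM-DD[_pNN].parquet naming convention used here is an assumption inferred from the _p01/_p02 suffixes described above; check the actual file listing for the exact pattern.

```python
import re
from datetime import date

# Assumed filename convention: YYYY-MM-DD.parquet, with an optional _pNN
# part suffix when a day's dump was split. Verify against the real listing.
FILE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})(?:_p\d{2})?\.parquet$")

def files_as_of(filenames, cutoff):
    """Return every daily file (all parts) inserted on or before cutoff.

    Because new data is only appended as new files, the set returned for a
    given cutoff never changes: this is the point-in-time snapshot.
    """
    selected = [
        name for name in filenames
        if (m := FILE_RE.match(name)) and date.fromisoformat(m.group(1)) <= cutoff
    ]
    return sorted(selected)

# A backfill day split into parts, plus a later daily file:
names = ["2024-11-15_p01.parquet", "2024-11-15_p02.parquet", "2024-11-16.parquet"]
snapshot = files_as_of(names, date(2024, 11, 15))
# snapshot -> ["2024-11-15_p01.parquet", "2024-11-15_p02.parquet"]
```

Note that both parts of the split 2024-11-15 dump are selected together; a snapshot should always include every part of a day's file or none of them.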
To filter documents by their actual publication date — which is often what you
want for analysis — use the published_at field within the data itself.
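A minimal sketch of that filter, using pandas. In practice the frame would come from pd.read_parquet on a daily file; a small in-memory frame with illustrative values stands in for it here (uuid, published_at, and inserted_at are documented fields; the sample rows are invented).

```python
import pandas as pd

# Stand-in for pd.read_parquet("2024-11-15.parquet"): three documents, all
# ingested on the same day but published years apart.
docs = pd.DataFrame({
    "uuid": ["a", "b", "c"],
    "published_at": pd.to_datetime(["2019-06-01", "2023-03-15", "2024-11-10"]),
    "inserted_at": pd.to_datetime(["2024-11-15"] * 3),
})

# Keep only documents actually published in 2023 or later, regardless of
# which daily file they arrived in.
recent = docs[docs["published_at"] >= "2023-01-01"]
# recent["uuid"] -> "b", "c"
```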
Dataset structure
Each row in the dataset corresponds to one document. Every document appears
exactly once, identified by a unique uuid.
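Since the feed guarantees one row per uuid, a defensive uniqueness check after concatenating several daily or monthly files is cheap. The frames below are stand-ins for read_parquet results; only uuid is a documented field.

```python
import pandas as pd

# Stand-ins for two daily files read with pd.read_parquet.
frames = [
    pd.DataFrame({"uuid": ["a", "b"]}),
    pd.DataFrame({"uuid": ["c", "d"]}),
]
combined = pd.concat(frames, ignore_index=True)

# Holds as long as each document really appears exactly once across files.
assert combined["uuid"].is_unique
```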
The dataset includes:
- Core metadata: Document identifiers, publication information, source details, and timestamps
- Text content: Full document text in both the original language (typically Chinese) and English translation, including titles, subheadings, body text, and summaries
- Entity extraction data: Machine-extracted entities such as people, organisations, locations, currency mentions, events, and initiatives mentioned in each document
- Policy lifecycle classification: Labels indicating each document's stage in the policy development process (not policy, informing, deciding, implementing)
Data sources
Documents are collected from publicly available sources, including:
- Official government websites and portals
- State media and newspapers
- Ministry reports and announcements
- State-Owned Enterprise publications
Coverage & updates
The dataset currently contains documents from 2021 onwards. New documents are scraped several times per day and added to the dataset through a daily dump generation process. Each day's new additions are available as separate parquet files, with monthly and annual aggregations also provided for convenience.
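An incremental sync over the daily dumps can be sketched as below, assuming one file per calendar day named YYYY-MM-DD.parquet (an assumption; split days carry _pNN suffixes, so a real sync should list the available files rather than guess part names).

```python
from datetime import date, timedelta

def daily_filenames(start, end):
    """Yield the expected daily file names from start to end, inclusive."""
    day = start
    while day <= end:
        yield f"{day.isoformat()}.parquet"
        day += timedelta(days=1)

# Files to fetch if the last local sync covered through 2024-11-14:
to_fetch = list(daily_filenames(date(2024, 11, 15), date(2024, 11, 16)))
# to_fetch -> ["2024-11-15.parquet", "2024-11-16.parquet"]
```

Because previously published files are never modified, a sync only ever needs to fetch files newer than the last date already held locally.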
For details on specific fields and their meanings, see the Field Documentation section.