Quick start

Welcome to Bilby Quant Data. Once you've signed up, our team will contact you to grant access to the data portal.

Accessing the data

Once authenticated, you can navigate to the /portal page to browse, filter, and download our datasets. The portal provides a mirror of our GCS bucket structure, organised by date.

Critical: Date partitioning by insertion date

⚠️ IMPORTANT

All files are partitioned by the inserted_at timestamp, not the published_at timestamp.

This is the date on which a document entered our processing pipeline, which may differ significantly from when it was originally published. This design choice has important implications:

Documents are organised by when they were ingested into the system, not when they were published by their original sources. A document published in 2019 but scraped by Bilby in November 2024 will appear in a file dated November 2024, not 2019.

This approach ensures point-in-time reproducibility: The state of the dataset on any given date can be reconstructed precisely, because new data is always appended to new files rather than modifying existing ones. Any documents scraped tomorrow will be added as new daily files, preserving the integrity of previously published files that clients may have already downloaded.

However, this choice can lead to unusually large files if Bilby performs significant backfilling during a particular date range. When historical documents are batch-processed and inserted on a single day, all those documents will appear in that day's file, potentially necessitating a split into multiple parts (indicated by the _p01, _p02 suffixes). A file dated 2024-11-15 might contain thousands of documents published across many years, all inserted into the pipeline on that single day.
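When a day's file is split into parts, you need to collect every part before you have that day's complete data. A minimal sketch of grouping parts for one insertion day — the exact naming pattern (`daily_YYYY-MM-DD_pNN.parquet`) is an assumption inferred from the `_p01`, `_p02` suffixes mentioned above:

```python
import re

# Assumed part-naming pattern, inferred from the _p01/_p02 suffixes:
# daily_2024-11-15.parquet        (single file)
# daily_2024-11-15_p01.parquet    (first of several parts)
PART_RE = re.compile(r"^daily_(\d{4}-\d{2}-\d{2})(?:_p(\d{2}))?\.parquet$")

def parts_for_day(filenames: list[str], day: str) -> list[str]:
    """Return all file parts for one insertion day, in part order."""
    matches = []
    for name in filenames:
        m = PART_RE.match(name)
        if m and m.group(1) == day:
            # An unsuffixed file sorts as part 0.
            matches.append((int(m.group(2) or 0), name))
    return [name for _, name in sorted(matches)]

listing = [
    "daily_2024-11-15_p01.parquet",
    "daily_2024-11-15_p02.parquet",
    "daily_2024-11-16.parquet",
]
print(parts_for_day(listing, "2024-11-15"))
# ['daily_2024-11-15_p01.parquet', 'daily_2024-11-15_p02.parquet']
```

Reading and concatenating the returned parts in order yields the full set of documents inserted on that day.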

To filter documents by their actual publication date — which is often what you want for analysis — use the published_at field within the data itself.
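To illustrate the distinction, here is a small sketch assuming a pandas workflow (the `published_at` and `inserted_at` column names come from the dataset; the sample rows and `doc_id` column are hypothetical):

```python
import pandas as pd

# Hypothetical sample mirroring the dataset's two timestamp columns.
df = pd.DataFrame({
    "doc_id": ["a", "b", "c"],
    "published_at": pd.to_datetime(["2019-06-01", "2024-11-10", "2024-11-14"]),
    "inserted_at": pd.to_datetime(["2024-11-15", "2024-11-15", "2024-11-15"]),
})

# All three rows would land in the file dated 2024-11-15 (partitioned by
# inserted_at), but only two were actually published in 2024.
published_2024 = df[df["published_at"].dt.year == 2024]
print(len(published_2024))  # 2
```

In a real workflow you would load a daily file with `pd.read_parquet(...)` and apply the same `published_at` filter.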

Directory structure

The data is organised hierarchically by year and month:

root/
├── 2021/
│   ├── combined_yearly_2021.parquet
│   ├── 01/
│   │   ├── combined_monthly_2021-01.parquet
│   │   ├── daily_2021-01-01.parquet
│   │   ├── daily_2021-01-02.parquet
│   │   └── ...
│   ├── 02/
│   └── ...
├── 2022/
├── 2023/
├── 2024/
│   ├── combined_yearly_2024.parquet
│   ├── 01/
│   │   ├── combined_monthly_2024-01.parquet
│   │   ├── daily_2024-01-01.parquet
│   │   ├── daily_2024-01-02.parquet
│   │   └── ...
│   ├── 02/
│   └── ...
└── 2025/

Year folders (2021–2025)

Each year folder contains:

  • Annual amalgamated file: A single parquet file containing all documents inserted during that year (e.g., combined_yearly_2024.parquet)
  • Month subdirectories: Folders numbered 01–12.

Month folders (01–12)

Each month folder contains:

  • Monthly amalgamated file: A single parquet file containing all documents inserted during that month (e.g., combined_monthly_2024-10.parquet)

  • Daily files: Individual parquet files for each day of the month (e.g., daily_2024-10-12.parquet)

For the current month, only files up to today's date will be present.
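Given this layout, daily file paths can be constructed directly from dates. A minimal stdlib sketch — `root` is a placeholder for your local mirror or bucket prefix, not an actual path in the dataset:

```python
from datetime import date, timedelta

def daily_file_path(d: date) -> str:
    # Mirrors the layout described above: root/YYYY/MM/daily_YYYY-MM-DD.parquet.
    # "root" is a placeholder prefix; substitute your own mirror location.
    return f"root/{d.year:04d}/{d.month:02d}/daily_{d.isoformat()}.parquet"

# Enumerate the daily files covering an inclusive date range,
# including ranges that span a month boundary.
start, end = date(2024, 10, 30), date(2024, 11, 2)
paths = []
d = start
while d <= end:
    paths.append(daily_file_path(d))
    d += timedelta(days=1)
print(paths)
# ['root/2024/10/daily_2024-10-30.parquet',
#  'root/2024/10/daily_2024-10-31.parquet',
#  'root/2024/11/daily_2024-11-01.parquet',
#  'root/2024/11/daily_2024-11-02.parquet']
```

For a whole month or year, the amalgamated `combined_monthly_*` or `combined_yearly_*` file is usually the more efficient download.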

Understanding date partitioning

IMPORTANT: The parquet files in the data dump are partitioned by the field inserted_at — the date a document entered our processing pipeline — NOT by the field published_at. This means:

  • Historical documents scraped recently will appear in folders corresponding to their insertion date, not their original publication date
  • New articles scraped tomorrow will be appended as new daily files, ensuring point-in-time reproducibility
  • To filter by publication date, use the published_at field within the data itself
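The append-only property above is what makes point-in-time reconstruction simple: the dataset as it stood at the end of any day is exactly the set of daily files dated on or before that day. A stdlib sketch of selecting that set from a file listing (the listing itself is hypothetical):

```python
from datetime import date

def files_as_of(filenames: list[str], cutoff: date) -> list[str]:
    """Select daily files whose insertion date is on or before the cutoff.

    Because existing files are never modified, this set reproduces the
    dataset exactly as it stood at end-of-day `cutoff`.
    """
    keep = []
    for name in filenames:
        # daily_YYYY-MM-DD.parquet -> YYYY-MM-DD
        stamp = name.removeprefix("daily_").removesuffix(".parquet")
        if date.fromisoformat(stamp) <= cutoff:
            keep.append(name)
    return keep

files = [
    "daily_2024-11-14.parquet",
    "daily_2024-11-15.parquet",
    "daily_2024-11-16.parquet",
]
print(files_as_of(files, date(2024, 11, 15)))
# ['daily_2024-11-14.parquet', 'daily_2024-11-15.parquet']
```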